├── README.md
├── assets
│   ├── develope.jpg
│   ├── paper.json
│   ├── timeline.jpg
│   ├── timeline.png
│   └── timeline_2.png
└── src
    ├── list.md
    ├── main.py
    ├── table.md
    └── timeline.png

/README.md:
--------------------------------------------------------------------------------
1 | # Awesome-System2-Reasoning-LLM
2 | 
3 | [![arXiv](https://img.shields.io/badge/arXiv-Slow_Reason_System-b31b1b.svg)](http://arxiv.org/abs/2502.17419)
4 | [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/zzli2022/System2-Reasoning-LLM)
5 | [![Last Commit](https://img.shields.io/github/last-commit/zzli2022/Awesome-System2-Reasoning-LLM)](https://github.com/zzli2022/System2-Reasoning-LLM)
6 | [![Contribution Welcome](https://img.shields.io/badge/Contributions-welcome-blue)]()
7 | 
8 | 
9 | ## 📢 Updates
10 | 
11 | - **2025.02**: We released a survey paper "[From System 1 to System 2: A Survey of Reasoning Large Language Models](http://arxiv.org/abs/2502.17419)". Feel free to cite or open pull requests.
12 | 
13 | 
14 | ## 👀 Introduction
15 | 
16 | Welcome to the repository for our survey paper, "From System 1 to System 2: A Survey of Reasoning Large Language Models". This repository provides resources and updates related to our research. For a detailed introduction, please refer to [our survey paper](http://arxiv.org/abs/2502.17419).
17 | 
18 | Achieving human-level intelligence requires enhancing the transition from System 1 (fast, intuitive) to System 2 (slow, deliberate) reasoning. While foundational Large Language Models (LLMs) have made significant strides, they still fall short of human-like reasoning in complex tasks. Recent reasoning LLMs, like OpenAI’s o1, have demonstrated expert-level performance in domains such as mathematics and coding, resembling System 2 thinking. This survey explores the development of reasoning LLMs, their foundational technologies, benchmarks, and future directions. We maintain an up-to-date GitHub repository to track the latest developments in this rapidly evolving field.
19 | 
20 | 
21 | ![image](./assets/develope.jpg)
22 | 
23 | This image highlights the progression of AI systems, emphasizing the shift from rapid, intuitive approaches to deliberate, reasoning-driven models. It shows how AI has evolved to handle a broader range of real-world challenges.
24 | 
25 | ![image](./assets/timeline_2.png)
26 | The recent timeline of reasoning LLMs, covering core methods and the release of open-source and closed-source reproduction projects.
27 | 28 | 29 | ## 📒 Table of Contents 30 | 31 | - [Awesome-System-2-AI](#awesome-system-2-ai) 32 | - [Part 1: O1 Replication](#part-1-o1-replication) 33 | - [Part 2: Process Reward Models](#part-2-process-reward-models) 34 | - [Part 3: Reinforcement Learning](#part-3-reinforcement-learning) 35 | - [Part 4: MCTS/Tree Search](#part-4-mctstree-search) 36 | - [Part 5: Self-Training / Self-Improve](#part-5-self-training--self-improve) 37 | - [Part 6: Reflection](#part-6-reflection) 38 | - [Part 7: Efficient System2](#part-7-efficient-system2) 39 | - [Part 8: Explainability](#part-8-explainability) 40 | - [Part 9: Multimodal Agent related Slow-Fast System](#part-9-multimodal-agent-related-slow-fast-system) 41 | - [Part 10: Benchmark and Datasets](#part-10-benchmark-and-datasets) 42 | - [Part 11: Reasoning and Safety](#part-11-reasoning-and-safety) 43 | - [Part 12: R1 Driven Multimodal Reasoning Enhancement](#part-12-r1-driven-multimodal-reasoning-enhancement) 44 | 45 | ## Part 1: O1 Replication 46 | 47 | * O1 Replication Journey: A Strategic Progress Report -- Part 1 [[Paper]](https://arxiv.org/abs/2410.18982) ![](https://img.shields.io/badge/arXiv-2024.10-red) 48 | * Enhancing LLM Reasoning with Reward-guided Tree Search [[Paper]](https://arxiv.org/abs/2411.11694) ![](https://img.shields.io/badge/arXiv-2024.11-red) 49 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) ![](https://img.shields.io/badge/arXiv-2024.11-red) 50 | * O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [[Paper]](https://arxiv.org/abs/2411.16489) ![](https://img.shields.io/badge/arXiv-2024.11-red) 51 | * Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [[Paper]](https://arxiv.org/abs/2412.09413) ![](https://img.shields.io/badge/arXiv-2024.12-red) 52 | * o1-Coder: an o1 Replication for Coding [[Paper]](https://arxiv.org/abs/2412.00154) ![](https://img.shields.io/badge/arXiv-2024.12-red) 53 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) ![](https://img.shields.io/badge/arXiv-2024.12-red) 54 | * DRT: Deep Reasoning Translation via Long Chain-of-Thought [[Paper]](https://arxiv.org/abs/2412.17498) ![](https://img.shields.io/badge/arXiv-2024.12-red) 55 | * mini-deepseek-r1 [[Blog]](https://www.philschmid.de/mini-deepseek-r1) ![](https://img.shields.io/badge/blog-2025.01-red) 56 | * Run DeepSeek R1 Dynamic 1.58-bit [[Blog]](https://unsloth.ai/blog/deepseekr1-dynamic) ![](https://img.shields.io/badge/blog-2025.01-red) 57 | * Simple Reinforcement Learning for Reasoning [[Notion]](https://hkust-nlp.notion.site/simplerl-reason) ![](https://img.shields.io/badge/Notion-2025.01-red) 58 | * TinyZero [[github]](https://github.com/Jiayi-Pan/TinyZero) ![](https://img.shields.io/badge/github-2025.01-red) 59 | * Open R1 [[github]](https://github.com/huggingface/open-r1) ![](https://img.shields.io/badge/github-2025.01-red) 60 | * Search-o1: Agentic Search-Enhanced Large Reasoning Models [[Paper]](https://arxiv.org/abs/2501.05366) ![](https://img.shields.io/badge/arXiv-2025.01-red) 61 | * Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [[Paper]](https://arxiv.org/abs/2501.01904) ![](https://img.shields.io/badge/arXiv-2025.01-red) 62 | * The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer [[Paper]](https://arxiv.org/abs/2502.15631) 
![](https://img.shields.io/badge/arXiv-2025.02-red) 63 | * Open-Reasoner-Zero [[Paper]](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf) ![](https://img.shields.io/badge/pdf-2025.02-red) 64 | * X-R1 [[github]](https://github.com/dhcode-cpp/X-R1) ![](https://img.shields.io/badge/github-2025.02-red) 65 | * Unlock-Deepseek [[Blog]](https://mp.weixin.qq.com/s/Z7P61IV3n4XYeC0Et_fvwg) ![](https://img.shields.io/badge/blog-2025.02-red) 66 | * Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [[Paper]](https://arxiv.org/abs/2502.14768) ![](https://img.shields.io/badge/arXiv-2025.02-red) 67 | * LLM-R1 [[github]](https://github.com/TideDra/lmm-r1) ![](https://img.shields.io/badge/github-2025.02-red) 68 | ## Part 2: Process Reward Models 69 | 70 | * Solving Math Word Problems with Process and Outcome-Based Feedback [[Paper]](https://arxiv.org/abs/2211.14275) ![](https://img.shields.io/badge/arXiv-2022.11-red) 71 | * Improve Mathematical Reasoning in Language Models by Automated Process Supervision [[Paper]](https://arxiv.org/abs/2306.05372) ![](https://img.shields.io/badge/arXiv-2023.06-red) 72 | * Making Large Language Models Better Reasoners with Step-Aware Verifier [[Paper]](https://arxiv.org/abs/2206.02336) ![](https://img.shields.io/badge/arXiv-2023.06-red) 73 | * Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [[Paper]](https://aclanthology.org/2024.acl-long.510/) ![](https://img.shields.io/badge/ACL-2024-blue) 74 | * OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning [[Paper]](https://aclanthology.org/2024.findings-naacl.55/) ![](https://img.shields.io/badge/ACL_Findings-2024-blue) 75 | * Let's Verify Step by Step. [[Paper]](https://arxiv.org/abs/2305.20050) ![](https://img.shields.io/badge/arXiv-2024.05-red) 76 | * Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [[Paper]](https://arxiv.org/abs/2406.18629) ![](https://img.shields.io/badge/arXiv-2024.06-red) 77 | * AutoPSV: Automated Process-Supervised Verifier [[Paper]](https://openreview.net/forum?id=eOAPWWOGs9) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 78 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://openreview.net/forum?id=8rcFOqEud5) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 79 | * Free Process Rewards without Process Labels. [[Paper]](https://arxiv.org/abs/2412.01981) ![](https://img.shields.io/badge/arXiv-2024.12-red) 80 | * Outcome-Refining Process Supervision for Code Generation [[Paper]](https://arxiv.org/abs/2412.15118) ![](https://img.shields.io/badge/arXiv-2024.12-red) 81 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. [[Paper]](https://arxiv.org/abs/2501.03124) ![](https://img.shields.io/badge/arXiv-2025.01-red) 82 | * ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [[Paper]](https://arxiv.org/abs/2501.07861) ![](https://img.shields.io/badge/arXiv-2025.01-red) 83 | * The Lessons of Developing Process Reward Models in Mathematical Reasoning. [[Paper]](https://arxiv.org/abs/2501.07301) ![](https://img.shields.io/badge/arXiv-2025.01-red) 84 | * ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark. 
[[Paper]](https://arxiv.org/abs/2501.01290) ![](https://img.shields.io/badge/arXiv-2025.01-red) 85 | * ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [[Paper]](https://arxiv.org/abs/2502.12130) ![](https://img.shields.io/badge/ICLR-2025-blue) 86 | * Uncertainty-Aware Step-wise Verification with Generative Reward Models [[Paper]](https://arxiv.org/abs/2502.11250) ![](https://img.shields.io/badge/arXiv-2025.02-red) 87 | * AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [[Paper]](https://www.arxiv.org/abs/2502.13943) ![](https://img.shields.io/badge/arXiv-2025.02-red) 88 | * Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models [[Paper]](https://www.arxiv.org/abs/2502.08922) ![](https://img.shields.io/badge/arXiv-2025.02-red) 89 | * Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [[Paper]](https://arxiv.org/abs/2502.06703) ![](https://img.shields.io/badge/arXiv-2025.02-red) 90 | * Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [[Paper]](https://arxiv.org/abs/2502.19328) ![](https://img.shields.io/badge/arXiv-2025.02-red) 91 | * Unified Reward Model for Multimodal Understanding and Generation [[Paper]](https://arxiv.org/abs/2503.05236) ![](https://img.shields.io/badge/arXiv-2025.02-red) 92 | * Reward Shaping to Mitigate Reward Hacking in RLHF [[Paper]](https://arxiv.org/abs/2502.18770) ![](https://img.shields.io/badge/arXiv-2025.02-red) 93 | * Multi-head Reward Aggregation Guided by Entropy [[Paper]](https://arxiv.org/abs/2503.20995) ![](https://img.shields.io/badge/arXiv-2025.03-red) 94 | * [[Paper]](https://arxiv.org/abs/2503.21295) ![](https://img.shields.io/badge/arXiv-2025.03-red) 95 | * Better Process Supervision with Bi-directional Rewarding Signals [[Paper]](https://arxiv.org/abs/2503.04618) ![](https://img.shields.io/badge/arXiv-2025.03-red) 96 | * Inference-Time Scaling for Generalist Reward Modeling [[Paper]](https://arxiv.org/abs/2504.02495) ![](https://img.shields.io/badge/arXiv-2025.04-red) 97 | 98 | ## Part 3: Reinforcement Learning 99 | 100 | * Improve Vision Language Model Chain-of-thought Reasoning [[Paper]](https://arxiv.org/abs/2410.16198) ![](https://img.shields.io/badge/arXiv-2024.10-red) 101 | * Does RLHF Scale? 
Exploring the Impacts From Data, Model, and Method [[Paper]](https://arxiv.org/abs/2412.06000) ![](https://img.shields.io/badge/arXiv-2024.12-red) 102 | * Offline Reinforcement Learning for LLM Multi-Step Reasoning [[Paper]](https://arxiv.org/abs/2412.16145) ![](https://img.shields.io/badge/arXiv-2024.12-red) 103 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) ![](https://img.shields.io/badge/ACL-2024-blue) 104 | * InfAlign: Inference-aware language model alignment [[Paper]](https://arxiv.org/abs/2412.19792) ![](https://img.shields.io/badge/arXiv-2024.12-red) 105 | * Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [[Paper]](https://arxiv.org/abs/2501.11651) ![](https://img.shields.io/badge/arXiv-2025.01-red) 106 | * Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies [[Paper]](https://arxiv.org/abs/2501.17030) ![](https://img.shields.io/badge/arXiv-2025.01-red) 107 | * DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [[Paper]](https://arxiv.org/abs/2501.12948) ![](https://img.shields.io/badge/arXiv-2025.01-red) 108 | * Kimi k1.5: Scaling Reinforcement Learning with LLMs [[Paper]](https://arxiv.org/abs/2501.12599) ![](https://img.shields.io/badge/arXiv-2025.01-red) 109 | * Deepseekmath: Pushing the limits of mathematical reasoning in open language models [[Paper]](https://arxiv.org/abs/2402.03300) ![](https://img.shields.io/badge/arXiv-2024.02-red) 110 | * Reasoning with Reinforced Functional Token Tuning [[Paper]](https://arxiv.org/abs/2502.13389) ![](https://img.shields.io/badge/arXiv-2025.02-red) 111 | * Value-Based Deep RL Scales Predictably [[Paper]](https://arxiv.org/abs/2502.04327) ![](https://img.shields.io/badge/arXiv-2025.02-red) 112 | * MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [[Paper]](https://arxiv.org/abs/2502.10391) ![](https://img.shields.io/badge/arXiv-2025.02-red) 113 | * Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [[Paper]](https://arxiv.org/abs/2502.02508) ![](https://img.shields.io/badge/arXiv-2025.02-red) 114 | * DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [[Paper]](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) ![](https://img.shields.io/badge/Notion-2025.02-red) 115 | * LIMR: Less is More for RL Scaling [[Paper]](https://arxiv.org/abs/2502.11886) ![](https://img.shields.io/badge/arXiv-2025.02-red) 116 | * A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics [[Paper]](https://arxiv.org/abs/2502.143) ![](https://img.shields.io/badge/arXiv-2025.02-red) 117 | * Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning [[Paper]](https://arxiv.org/abs/2502.19655) ![](https://img.shields.io/badge/arXiv-2025.02-red) 118 | * QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [[Paper]](https://arxiv.org/abs/2502.02584) ![](https://img.shields.io/badge/arXiv-2025.02-red) 119 | * Process Reinforcement through Implicit Rewards [[Paper]](https://arxiv.org/abs/2502.01456) ![](https://img.shields.io/badge/arXiv-2025.02-red) 120 | * UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [[Paper]](https://arxiv.org/abs/2503.21620) ![](https://img.shields.io/badge/arXiv-2025.03-red) 121 | * All Roads 
Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning [[Paper]](https://arxiv.org/abs/2503.01067) ![](https://img.shields.io/badge/arXiv-2025.03-red) 122 | * R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [[Paper]](https://arxiv.org/abs/2503.05132) ![](https://img.shields.io/badge/arXiv-2025.03-red) 123 | * Visual-RFT: Visual Reinforcement Fine-Tuning [[Paper]](https://arxiv.org/abs/2503.01785) ![](https://img.shields.io/badge/arXiv-2025.03-red) 124 | * GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [[Paper]](https://arxiv.org/abs/2503.08525) ![](https://img.shields.io/badge/arXiv-2025.03-red) 125 | * L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning [[Paper]](https://arxiv.org/abs/2503.04697) ![](https://img.shields.io/badge/arXiv-2025.03-red) 126 | * Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [[Paper]](https://arxiv.org/abs/2503.16219) ![](https://img.shields.io/badge/arXiv-2025.03-red) 127 | * Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement [[Paper]](https://arxiv.org/abs/2503.07065) ![](https://img.shields.io/badge/arXiv-2025.03-red) 128 | * VLAA-Thinker [[github]](https://github.com/UCSC-VLAA/VLAA-Thinking/) ![](https://img.shields.io/badge/github-2025.04-red) 129 | * Concise Reasoning via Reinforcement Learning [[Paper]](https://arxiv.org/abs/2504.05185) ![](https://img.shields.io/badge/arXiv-2025.04-red) 130 | * d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning [[github]](https://dllm-reasoning.github.io/media/preprint.pdf) ![](https://img.shields.io/badge/github-2025.04-red) 131 | * Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning [[Paper]](https://arxiv.org/abs/2504.05108) ![](https://img.shields.io/badge/arXiv-2025.04-red) 132 | * Efficient Reinforcement Finetuning via Adaptive Curriculum Learning [[Paper]](https://arxiv.org/pdf/2504.05520) ![](https://img.shields.io/badge/arXiv-2025.04-red) 133 | * VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning [[Paper]](https://arxiv.org/abs/2504.06958) ![](https://img.shields.io/badge/arXiv-2025.04-red) 134 | * SFT or RL? 
An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [[Paper]](https://arxiv.org/abs/2504.11468) ![](https://img.shields.io/badge/arXiv-2025.04-red)
135 | * RAISE: Reinforced Adaptive Instruction Selection For Large Language Models [[Paper]](https://arxiv.org/abs/2504.07282) ![](https://img.shields.io/badge/arXiv-2025.04-red)
136 | * MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning [[Paper]](https://arxiv.org/abs/2504.10160) ![](https://img.shields.io/badge/arXiv-2025.04-red)
137 | * VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [[Paper]](https://arxiv.org/abs/2503.07523) ![](https://img.shields.io/badge/arXiv-2025.04-red)
138 | 
139 | ## Part 4: MCTS/Tree Search
140 | 
141 | * Reasoning with Language Model is Planning with World Model [[Paper]](https://aclanthology.org/2023.emnlp-main.507/) ![](https://img.shields.io/badge/EMNLP-2023-blue)
142 | * Fine-grained Conversational Decoding via Isotropic and Proximal Search [[Paper]](https://aclanthology.org/2023.emnlp-main.5/) ![](https://img.shields.io/badge/EMNLP-2023-blue)
143 | * Large Language Models as Commonsense Knowledge for Large-Scale Task Planning [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html) ![](https://img.shields.io/badge/NeurIPS-2023-blue)
144 | 
145 | * Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [[Paper]](https://openreview.net/forum?id=PJfc4x2jXY) ![](https://img.shields.io/badge/NeurIPS_WorkShop-2023-blue)
146 | * Making PPO Even Better: Value-Guided Monte-Carlo Tree Search Decoding [[Paper]](https://arxiv.org/abs/2309.15028) ![](https://img.shields.io/badge/arXiv-2023.09-red)
147 | * Look-back Decoding for Open-Ended Text Generation [[Paper]](https://aclanthology.org/2023.emnlp-main.66/) ![](https://img.shields.io/badge/EMNLP-2023-blue)
148 | * Stream of Search (SoS): Learning to Search in Language [[Paper]](https://arxiv.org/abs/2404.03683) ![](https://img.shields.io/badge/arXiv-2024.04-red)
149 | * Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [[Paper]](https://arxiv.org/abs/2404.12253) ![](https://img.shields.io/badge/arXiv-2024.04-red)
150 | * Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models [[Paper]](https://openreview.net/forum?id=CVpuVe1N22&noteId=aTI8PGpO47) ![](https://img.shields.io/badge/NeurIPS-2024-blue)
151 | * AlphaMath Almost Zero: process Supervision without process [[Paper]](https://arxiv.org/abs/2405.03553) ![](https://img.shields.io/badge/arXiv-2024.05-red)
152 | * Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2405.15383) ![](https://img.shields.io/badge/arXiv-2024.05-red)
153 | * MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [[Paper]](https://arxiv.org/abs/2405.16265) ![](https://img.shields.io/badge/arXiv-2024.05-red)
154 | * Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2405.00451) ![](https://img.shields.io/badge/arXiv-2024.05-red)
155 | * Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
[[Paper]](https://arxiv.org/abs/2406.07394) ![](https://img.shields.io/badge/arXiv-2024.06-red) 156 | * Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [[Paper]](https://openreview.net/forum?id=rviGTsl0oy) ![](https://img.shields.io/badge/ICLR_WorkShop-2024-blue) 157 | * LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models [[Paper]](https://openreview.net/forum?id=h1mvwbQiXR) ![](https://img.shields.io/badge/ICLR_WorkShop-2024-blue) 158 | * LiteSearch: Efficacious Tree Search for LLM [[Paper]](https://arxiv.org/abs/2407.00320) ![](https://img.shields.io/badge/arXiv-2024.07-red) 159 | * Tree Search for Language Model Agents [[Paper]](https://arxiv.org/abs/2407.01476) ![](https://img.shields.io/badge/arXiv-2024.07-red) 160 | * Uncertainty-Guided Optimization on Large Language Model Search Trees [[Paper]](https://arxiv.org/abs/2407.03951) ![](https://img.shields.io/badge/arXiv-2024.07-red) 161 | * Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [[Paper]](https://arxiv.org/abs/2408.10635) ![](https://img.shields.io/badge/arXiv-2024.08-red) 162 | * RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2409.09584) ![](https://img.shields.io/badge/arXiv-2024.09-red) 163 | * AFlow: Automating Agentic Workflow Generation [[Paper]](https://arxiv.org/abs/2410.10762) ![](https://img.shields.io/badge/arXiv-2024.10-red) 164 | * Interpretable Contrastive Monte Carlo Tree Search Reasoning [[Paper]](https://arxiv.org/abs/2410.01707) ![](https://img.shields.io/badge/arXiv-2024.10-red) 165 | * LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2410.02884) ![](https://img.shields.io/badge/arXiv-2024.10-red) 166 | * Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning [[Paper]](https://arxiv.org/abs/2410.06508) ![](https://img.shields.io/badge/arXiv-2024.10-red) 167 | * TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [[Paper]](https://arxiv.org/abs/2410.16033) ![](https://img.shields.io/badge/arXiv-2024.10-red) 168 | * Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination [[Paper]](https://arxiv.org/abs/2410.17820) ![](https://img.shields.io/badge/arXiv-2024.10-red) 169 | * CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [[Paper]](https://arxiv.org/abs/2411.04329) ![](https://img.shields.io/badge/arXiv-2024.11-red) 170 | * GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection [[Paper]](https://arxiv.org/abs/2411.04459) ![](https://img.shields.io/badge/arXiv-2024.11-red) 171 | * MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [[Paper]](https://arxiv.org/abs/2411.15645) ![](https://img.shields.io/badge/arXiv-2024.11-red) 172 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) ![](https://img.shields.io/badge/arXiv-2024.11-red) 173 | * SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2411.11053) ![](https://img.shields.io/badge/arXiv-2024.11-red) 174 | * Don’t throw away your value model! 
Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding [[Paper]](https://openreview.net/forum?id=kh9Zt2Ldmn#discussion) ![](https://img.shields.io/badge/CoLM-2024-blue) 175 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816) ![](https://img.shields.io/badge/arXiv-2024.12-red) 176 | * Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.09078) ![](https://img.shields.io/badge/arXiv-2024.12-red) 177 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) ![](https://img.shields.io/badge/arXiv-2024.12-red) 178 | * Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2412.18319) ![](https://img.shields.io/badge/arXiv-2024.12-red) 179 | * Proposing and solving olympiad geometry with guided tree search [[Paper]](https://arxiv.org/abs/2412.10673) ![](https://img.shields.io/badge/arXiv-2024.12-red) 180 | * SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [[Paper]](https://arxiv.org/abs/2412.11605) ![](https://img.shields.io/badge/arXiv-2024.12-red) 181 | * Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2412.17397) ![](https://img.shields.io/badge/arXiv-2024.12-red) 182 | * Control-DAG: Constrained Decoding for Non-Autoregressive Directed Acyclic T5 using Weighted Finite State Automata [[Paper]](https://aclanthology.org/2024.naacl-short.42/) ![](https://img.shields.io/badge/NAACL-2024-blue) 183 | * Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2502.11169) ![](https://img.shields.io/badge/arXiv-2025.02-red) 184 | * PairJudge RM: Perform Best-of-N Sampling with Knockout Tournament [[Paper]](https://arxiv.org/abs/2501.13007) ![](https://img.shields.io/badge/arXiv-2025.01-red) 185 | * ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [[Paper]](https://arxiv.org/abs/2502.12130) ![](https://img.shields.io/badge/ICLR-2025-blue) 186 | * On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes [[Paper]](https://ieeexplore.ieee.org/abstract/document/10870057/) ![](https://img.shields.io/badge/IEEE_TAC-2025-blue) 187 | * Search-o1: Agentic Search-Enhanced Large Reasoning Models [[Paper]](https://arxiv.org/abs/2501.05366) ![](https://img.shields.io/badge/arXiv-2025.01-red) 188 | * rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [[Paper]](https://arxiv.org/abs/2501.04519) ![](https://img.shields.io/badge/arXiv-2025.01-red) 189 | * LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [[Paper]](https://arxiv.org/abs/2502.17925) ![](https://img.shields.io/badge/arXiv-2025.02-red) 190 | * Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking [[Paper]](https://arxiv.org/abs/2502.02339) ![](https://img.shields.io/badge/arXiv-2025.02-red) 191 | * DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking [[Paper]](https://arxiv.org/abs/2502.20730) ![](https://img.shields.io/badge/arXiv-2025.02-red) 192 | * Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models 
[[Paper]](https://arxiv.org/abs/2502.11881) ![](https://img.shields.io/badge/arXiv-2025.02-red)
193 | * VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search [[Paper]](https://arxiv.org/abs/2504.09130) ![](https://img.shields.io/badge/arXiv-2025.04-red)
194 | ## Part 5: Self-Training / Self-Improve
195 | 
196 | * Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search [[Paper]](https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html) ![](https://img.shields.io/badge/NeurIPS-2017-blue)
197 | * STaR: Bootstrapping Reasoning With Reasoning [[Paper]](https://arxiv.org/abs/2203.14465) ![](https://img.shields.io/badge/arXiv-2022.05-red)
198 | * Large Language Models are Better Reasoners with Self-Verification [[Paper]](https://aclanthology.org/2023.findings-emnlp.167/) ![](https://img.shields.io/badge/EMNLP_Findings-2023-blue)
199 | * Self-Evaluation Guided Beam Search for Reasoning [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/81fde95c4dc79188a69ce5b24d63010b-Abstract-Conference.html) ![](https://img.shields.io/badge/NeurIPS-2023-blue)
200 | * Self-Refine: Iterative Refinement with Self-Feedback [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html) ![](https://img.shields.io/badge/NeurIPS-2023-blue)
201 | * ReST: Reinforced Self-Training for Language Modeling [[Paper]](https://arxiv.org/abs/2308.08998) ![](https://img.shields.io/badge/arXiv-2023.08-red)
202 | 
203 | * V-STaR: Training Verifiers for Self-Taught Reasoners [[Paper]](https://arxiv.org/abs/2402.06457) ![](https://img.shields.io/badge/arXiv-2024.02-red)
204 | * Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [[Paper]](https://arxiv.org/abs/2403.09629) ![](https://img.shields.io/badge/arXiv-2024.03-red)
205 | * CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [[Paper]](https://openreview.net/forum?id=Sx038qxjek) ![](https://img.shields.io/badge/ICLR-2024-blue)
206 | * Enhancing Large Vision Language Models with Self-Training on Image Comprehension [[Paper]](https://arxiv.org/abs/2405.19716) ![](https://img.shields.io/badge/arXiv-2024.05-red)
207 | * Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2406.11736) ![](https://img.shields.io/badge/arXiv-2024.06-red)
208 | * SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [[Paper]](https://openreview.net/forum?id=pTHfApDakA) ![](https://img.shields.io/badge/ICLR_WorkShop-2024-blue)
209 | 
210 | * Learning From Correctness Without Prompting Makes LLM Efficient Reasoner [[Paper]](https://openreview.net/forum?id=dcbNzhVVQj#discussion) ![](https://img.shields.io/badge/CoLM-2024-blue)
211 | * Self-Improvement in Language Models: The Sharpening Mechanism [[Paper]](https://arxiv.org/abs/2412.01951) ![](https://img.shields.io/badge/arXiv-2024.12-red)
212 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816)
![](https://img.shields.io/badge/arXiv-2024.12-red)
213 | * Recursive Introspection: Teaching Language Model Agents How to Self-Improve [[Paper]](https://openreview.net/forum?id=DRC9pZwBwR) ![](https://img.shields.io/badge/NeurIPS-2024-blue)
214 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) ![](https://img.shields.io/badge/arXiv-2024.12-red)
215 | * ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [[Paper]](https://openreview.net/forum?id=lNAyUngGFK) ![](https://img.shields.io/badge/TMLR-2024-blue)
216 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) ![](https://img.shields.io/badge/ACL-2024-blue)
217 | * Enabling Scalable Oversight via Self-Evolving Critic [[Paper]](https://arxiv.org/abs/2501.05727) ![](https://img.shields.io/badge/arXiv-2025.01-red)
218 | * S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [[Paper]](https://www.arxiv.org/abs/2502.12853) ![](https://img.shields.io/badge/arXiv-2025.02-red)
219 | * ProgCo: Program Helps Self-Correction of Large Language Models [[Paper]](https://arxiv.org/abs/2501.01264) ![](https://img.shields.io/badge/arXiv-2025.01-red)
220 | * Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math) [[Paper]](https://arxiv.org/abs/2501.04519) ![](https://img.shields.io/badge/arXiv-2025.01-red)
221 | * Self-Training Elicits Concise Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2502.20122) ![](https://img.shields.io/badge/arXiv-2025.02-red)
222 | * Language Models can Self-Improve at State-Value Estimation for Better Search [[Paper]](https://arxiv.org/abs/2503.02878) ![](https://img.shields.io/badge/arXiv-2025.03-red)
223 | * Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning [[Paper]](https://arxiv.org/abs/2504.08672) ![](https://img.shields.io/badge/arXiv-2025.04-red)
224 | * START: Self-taught Reasoner with Tools [[Paper]](https://arxiv.org/abs/2503.04625) ![](https://img.shields.io/badge/arXiv-2025.04-red)
225 | ## Part 6: Reflection
226 | * SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [[Paper]](https://arxiv.org/abs/2308.00436) ![](https://img.shields.io/badge/arXiv-2023.08-red)
227 | * Reflection-Tuning: An Approach for Data Recycling [[Paper]](https://arxiv.org/abs/2310.11716) ![](https://img.shields.io/badge/arXiv-2023.10-red)
228 | * Learning From Mistakes Makes LLM Better Reasoner [[Paper]](https://arxiv.org/abs/2310.20689) ![](https://img.shields.io/badge/arXiv-2023.10-red)
229 | * Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [[Paper]](https://arxiv.org/abs/2408.06195) ![](https://img.shields.io/badge/arXiv-2024.08-red)
230 | * LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2410.02884) ![](https://img.shields.io/badge/arXiv-2024.10-red)
231 | * Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2412.18319) ![](https://img.shields.io/badge/arXiv-2024.12-red)
232 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.18478) ![](https://img.shields.io/badge/arXiv-2024.11-red)
233 | * Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS
[[Paper]](https://arxiv.org/abs/2411.11930) ![](https://img.shields.io/badge/arXiv-2024.11-red) 234 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) ![](https://img.shields.io/badge/arXiv-2024.11-red) 235 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) ![](https://img.shields.io/badge/arXiv-2024.11-red) 236 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) ![](https://img.shields.io/badge/arXiv-2024.11-red) 237 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) ![](https://img.shields.io/badge/arXiv-2024.12-red) 238 | * Refiner: Restructure Retrieved Content Efficiently to Advance Question-Answering Capabilities [[Paper]](https://aclanthology.org/2024.findings-emnlp.500/) ![](https://img.shields.io/badge/EMNLP-2024-blue) 239 | * rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [[Paper]](https://arxiv.org/abs/2501.04519) ![](https://img.shields.io/badge/arXiv-2025.01-red) 240 | * RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? [[Paper]](https://arxiv.org/abs/2501.11284) ![](https://img.shields.io/badge/arXiv-2025.01-red) 241 | * Perception in Reflection [[Paper]](https://arxiv.org/abs/2504.07165) ![](https://img.shields.io/badge/arXiv-2025.04-red) 242 | ## Part 7: Efficient System2 243 | 244 | * Guiding Language Model Reasoning with Planning Tokens [[Paper]](https://arxiv.org/abs/2310.05707) ![](https://img.shields.io/badge/arXiv-2024.10-red) 245 | * AutoReason: Automatic Few-Shot Reasoning Decomposition [[Paper]](https://arxiv.org/abs/2412.06975) ![](https://img.shields.io/badge/arXiv-2024.12-red) 246 | * DynaThink: Fast or Slow? 
A Dynamic Decision-Making Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2407.01009) ![](https://img.shields.io/badge/arXiv-2024.12-red) 247 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) ![](https://img.shields.io/badge/arXiv-2024.12-red) 248 | * Token-Budget-Aware LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.18547) ![](https://img.shields.io/badge/arXiv-2024.12-red) 249 | * Training Large Language Models to Reason in a Continuous Latent Space [[Paper]](https://arxiv.org/abs/2412.06769) ![](https://img.shields.io/badge/arXiv-2024.12-red) 250 | * From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs [[Paper]](https://arxiv.org/abs/2501.16207) ![](https://img.shields.io/badge/arXiv-2025.01-red) 251 | * MALT: Improving Reasoning with Multi-Agent LLM Training [[Paper]](https://arxiv.org/abs/2412.01928) ![](https://img.shields.io/badge/arXiv-2024.12-red) 252 | * Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [[Paper]](https://arxiv.org/abs/2501.18585) ![](https://img.shields.io/badge/arXiv-2025.01-red) 253 | * Efficient Reasoning with Hidden Thinking [[Paper]](https://arxiv.org/abs/2501.19201) ![](https://img.shields.io/badge/arXiv-2025.01-red) 254 | * O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [[Paper]](https://arxiv.org/abs/2501.12570) ![](https://img.shields.io/badge/arXiv-2025.01-red) 255 | * Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [[Paper]](https://arxiv.org/abs/2501.01306) ![](https://img.shields.io/badge/arXiv-2025.01-red) 256 | * Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models [[Paper]](https://www.arxiv.org/abs/2502.13260) ![](https://img.shields.io/badge/arXiv-2025.02-red) 257 | * Titans: Learning to Memorize at Test Time [[Paper]](https://arxiv.org/abs/2501.00663) ![](https://img.shields.io/badge/arXiv-2025.01-red) 258 | * MoBA: Mixture of Block Attention for Long-Context LLMs [[Paper]](https://arxiv.org/abs/2502.13189) ![](https://img.shields.io/badge/arXiv-2025.02-red) 259 | * One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs [[Paper]](https://arxiv.org/abs/2502.10454) ![](https://img.shields.io/badge/arXiv-2025.02-red) 260 | * Small Models Struggle to Learn from Strong Reasoners [[Paper]](https://arxiv.org/abs/2502.12143) ![](https://img.shields.io/badge/arXiv-2025.02-red) 261 | * TokenSkip: Controllable Chain-of-Thought Compression in LLMs [[Paper]](https://arxiv.org/abs/2502.12067) ![](https://img.shields.io/badge/arXiv-2025.02-red) 262 | * SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2502.12134) ![](https://img.shields.io/badge/arXiv-2025.02-red) 263 | * Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning [[Paper]](https://arxiv.org/abs/2502.10428) ![](https://img.shields.io/badge/arXiv-2025.02-red) 264 | * Thinking Preference Optimization [[Paper]](https://arxiv.org/abs/2502.13173) ![](https://img.shields.io/badge/arXiv-2025.02-red) 265 | * Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? 
[[Paper]](https://arxiv.org/abs/2502.12215) ![](https://img.shields.io/badge/arXiv-2025.02-red) 266 | * Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options [[Paper]](https://arxiv.org/abs/2502.12929) ![](https://img.shields.io/badge/arXiv-2025.02-red) 267 | * CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction [[Paper]](https://arxiv.org/abs/2502.07316) ![](https://img.shields.io/badge/arXiv-2025.02-red) 268 | * OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning [[Paper]](https://arxiv.org/abs/2502.11271) ![](https://img.shields.io/badge/arXiv-2025.02-red) 269 | * LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning [[Paper]](https://arxiv.org/abs/2502.11176) ![](https://img.shields.io/badge/arXiv-2025.02-red) 270 | * Atom of Thoughts for Markov LLM Test-Time Scaling [[Paper]](https://arxiv.org/abs/2502.12018) ![](https://img.shields.io/badge/arXiv-2025.02-red) 271 | * Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity [[Paper]](https://arxiv.org/abs/2502.11147) ![](https://img.shields.io/badge/arXiv-2025.02-red) 272 | * Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models [[Paper]](https://arxiv.org/abs/2502.12855) ![](https://img.shields.io/badge/arXiv-2025.02-red) 273 | * Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning [[Paper]](https://arxiv.org/abs/2502.08482) ![](https://img.shields.io/badge/arXiv-2025.02-red) 274 | * Scalable Language Models with Posterior Inference of Latent Thought Vectors [[Paper]](https://arxiv.org/abs/2502.01567) ![](https://img.shields.io/badge/arXiv-2025.02-red) 275 | * Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning [[Paper]](https://arxiv.org/abs/2502.08482) ![](https://img.shields.io/badge/arXiv-2025.02-red) 276 | * Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [[Paper]](https://arxiv.org/abs/2502.03275) ![](https://img.shields.io/badge/arXiv-2025.02-red) 277 | * LightThinker: Thinking Step-by-Step Compression [[Paper]](https://arxiv.org/abs/2502.15589) ![](https://img.shields.io/badge/arXiv-2025.02-red) 278 | * The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities [[Paper]](https://arxiv.org/pdf/2502.17416) ![](https://img.shields.io/badge/ICLR-2025-blue) 279 | * Reasoning with Latent Thoughts: On the Power of Looped Transformers [[Paper]](https://arxiv.org/pdf/2502.17416) ![](https://img.shields.io/badge/ICLR-2025-blue) 280 | * Efficient Reasoning with Hidden Thinking [[Paper]](https://arxiv.org/pdf/2501.19201) ![](https://img.shields.io/badge/arXiv-2025.01-red) 281 | * Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2502.20332) ![](https://img.shields.io/badge/arXiv-2025.02-red) 282 | * Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study [[Paper]](https://arxiv.org/abs/2502.11514) ![](https://img.shields.io/badge/arXiv-2025.02-red) 283 | * Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2502.19918) ![](https://img.shields.io/badge/arXiv-2025.02-red) 284 | * FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [[Paper]](https://arxiv.org/abs/2502.20238) ![](https://img.shields.io/badge/arXiv-2025.02-red) 285 | * MixLLM: 
Dynamic Routing in Mixed Large Language Models [[Paper]](https://arxiv.org/abs/2502.18482) ![](https://img.shields.io/badge/arXiv-2025.02-red) 286 | * PEARL: Towards Permutation-Resilient LLMs [[Paper]](https://arxiv.org/abs/2502.14628) ![](https://img.shields.io/badge/arXiv-2025.02-red) 287 | * Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment [[Paper]](https://www.arxiv.org/abs/2502.07803) ![](https://img.shields.io/badge/arXiv-2025.03-red) 288 | * Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? [[Paper]](https://arxiv.org/abs/2502.19361) ![](https://img.shields.io/badge/arXiv-2025.02-red) 289 | * Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs [[Paper]](https://arxiv.org/abs/2502.19411) ![](https://img.shields.io/badge/arXiv-2025.02-red) 290 | * Training Large Language Models to be Better Rule Followers [[Paper]](https://arxiv.org/abs/2502.11525) ![](https://img.shields.io/badge/arXiv-2025.02-red) 291 | * Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research [[Paper]](https://arxiv.org/abs/2502.04644) ![](https://img.shields.io/badge/arXiv-2025.02-red) 292 | * CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation [[Paper]](https://arxiv.org/abs/2502.21074) ![](https://img.shields.io/badge/arXiv-2025.02-red) 293 | * SIFT: Grounding LLM Reasoning in Contexts via Stickers [[Paper]](https://arxiv.org/abs/2502.14922) ![](https://img.shields.io/badge/arXiv-2025.02-red) 294 | * AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [[Paper]](https://arxiv.org/abs/2502.13943) ![](https://img.shields.io/badge/arXiv-2025.02-red) 295 | * How Well do LLMs Compress Their Own Chain-of-Thought? 
A Token Complexity Approach [[Paper]](https://arxiv.org/abs/2503.01141) ![](https://img.shields.io/badge/arXiv-2025.03-red) 296 | * PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2503.02324) ![](https://img.shields.io/badge/arXiv-2025.03-red) 297 | * DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [[Paper]](https://arxiv.org/abs/2503.04472) ![](https://img.shields.io/badge/arXiv-2025.03-red) 298 | * Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [[Paper]](https://arxiv.org/abs/2503.04691) ![](https://img.shields.io/badge/arXiv-2025.03-red) 299 | * Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models [[Paper]](https://arxiv.org/abs/2503.09567) ![](https://img.shields.io/badge/arXiv-2025.03-red) 300 | * TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation [[Paper]](https://arxiv.org/abs/2503.04872) ![](https://img.shields.io/badge/arXiv-2025.03-red) 301 | * Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning [[Paper]](https://arxiv.org/abs/2503.05641) ![](https://img.shields.io/badge/arXiv-2025.03-red) 302 | * Entropy-based Exploration Conduction for Multi-step Reasoning [[Paper]](https://arxiv.org/abs/2503.15848) ![](https://img.shields.io/badge/arXiv-2025.03-red) 303 | * MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion [[Paper]](https://arxiv.org/abs/2503.16212) ![](https://img.shields.io/badge/arXiv-2025.03-red) 304 | * Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [[Paper]](https://arxiv.org/abs/2503.16419) ![](https://img.shields.io/badge/arXiv-2025.03-red) 305 | * ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs [[Paper]](https://arxiv.org/abs/2503.12918) ![](https://img.shields.io/badge/arXiv-2025.03-red) 306 | * Agent models: Internalizing Chain-of-Action Generation into Reasoning models [[Paper]](https://arxiv.org/abs/2503.06580) ![](https://img.shields.io/badge/arXiv-2025.03-red) 307 | * StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error [[Paper]](https://arxiv.org/abs/2503.10105) ![](https://img.shields.io/badge/arXiv-2025.03-red) 308 | * Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding [[Paper]](https://arxiv.org/abs/2503.10183) ![](https://img.shields.io/badge/arXiv-2025.03-red) 309 | * Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators [[Paper]](https://arxiv.org/abs/2503.19877) ![](https://img.shields.io/badge/arXiv-2025.03-red) 310 | * Shared Global and Local Geometry of Language Model Embeddings [[Paper]](https://arxiv.org/abs/2503.21073) ![](https://img.shields.io/badge/arXiv-2025.03-red) 311 | * Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [[Paper]](https://arxiv.org/abs/2503.13360) ![](https://img.shields.io/badge/arXiv-2025.03-red) 312 | * Effectively Controlling Reasoning Models through Thinking Intervention [[Paper]](https://arxiv.org/abs/2503.24370) ![](https://img.shields.io/badge/arXiv-2025.03-red) 313 | * Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [[Paper]](https://arxiv.org/abs/2503.02318) ![](https://img.shields.io/badge/arXiv-2025.03-red) 314 | * TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning 
[[Paper]](https://arxiv.org/abs/2504.09641) ![](https://img.shields.io/badge/arXiv-2025.04-red) 315 | * Lemmanaid: Neuro-Symbolic Lemma Conjecturing [[Paper]](https://arxiv.org/abs/2504.04942) ![](https://img.shields.io/badge/arXiv-2025.04-red) 316 | * ThoughtProbe: Classifier-Guided Thought Space Exploration Leveraging LLM Intrinsic Reasoning [[Paper]](https://arxiv.org/abs/2504.06650) ![](https://img.shields.io/badge/arXiv-2025.04-red) 317 | * Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought [[Paper]](https://arxiv.org/abs/2504.05599) ![](https://img.shields.io/badge/arXiv-2025.04-red) 318 | * Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification [[Paper]](https://arxiv.org/abs/2504.05419) ![](https://img.shields.io/badge/arXiv-2025.04-red) 319 | * Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill? [[Paper]](https://arxiv.org/abs/2504.06514) ![](https://img.shields.io/badge/arXiv-2025.04-red) 320 | * Decentralizing AI Memory: SHIMI, a Semantic Hierarchical Memory Index for Scalable Agent Reasoning [[Paper]](https://arxiv.org/abs/2504.06135) ![](https://img.shields.io/badge/arXiv-2025.04-red) 321 | * Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning [[Paper]](https://arxiv.org/pdf/2504.05632) ![](https://img.shields.io/badge/arXiv-2025.04-red) 322 | * Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead [[Paper]](https://arxiv.org/abs/2504.00294) ![](https://img.shields.io/badge/arXiv-2025.04-red) 323 | * RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability [[Paper]](https://arxiv.org/abs/2504.10081) ![](https://img.shields.io/badge/arXiv-2025.04-red) 324 | * Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [[Paper]](https://arxiv.org/abs/2504.11741) ![](https://img.shields.io/badge/arXiv-2025.04-red) 325 | 326 | 327 | ## Part 8: Explainability 328 | * Agents Thinking Fast and Slow: A Talker-Reasoner Architecture [[Paper]](https://openreview.net/forum?id=xPhcP6rbI4) ![](https://img.shields.io/badge/NeurIPS_WorkShop-2024-blue) 329 | * Distilling System 2 into System 1 [[Paper]](https://arxiv.org/abs/2407.06023) ![](https://img.shields.io/badge/arXiv-2024.07-red) 330 | * The Impact of Reasoning Step Length on Large Language Models [[Paper]](https://arxiv.org/abs/2401.04925) ![](https://img.shields.io/badge/arXiv-2024.08-red) 331 | * What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [[Paper]](https://arxiv.org/abs/2410.23743) ![](https://img.shields.io/badge/arXiv-2024.10-red) 332 | * When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? 
An Analysis of OpenAI o1 [[Paper]](https://arxiv.org/abs/2410.01792) ![](https://img.shields.io/badge/arXiv-2024.10-red) 333 | * System 2 Attention (is something you might need too) [[Paper]](https://arxiv.org/abs/2311.11829) ![](https://img.shields.io/badge/arXiv-2023.11-red) 334 | * Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought [[Paper]](https://arxiv.org/abs/2501.04682) ![](https://img.shields.io/badge/arXiv-2025.01-red) 335 | * LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [[Paper]](https://arxiv.org/abs/2501.06186) ![](https://img.shields.io/badge/arXiv-2025.01-red) 336 | * Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time [[Paper]](https://arxiv.org/abs/2502.19230) ![](https://img.shields.io/badge/arXiv-2025.02-red) 337 | * Large Reasoning Models in Agent Scenarios: Exploring the Necessity of Reasoning Capabilities [[Paper]](https://arxiv.org/abs/2503.11074) ![](https://img.shields.io/badge/arXiv-2025.03-red) 338 | * Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [[Paper]](https://arxiv.org/abs/2503.15558) ![](https://img.shields.io/badge/arXiv-2025.03-red) 339 | ## Part 9: Multimodal Agent related Slow-Fast System 340 | 341 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.11930) ![](https://img.shields.io/badge/arXiv-2024.11-red) 342 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) ![](https://img.shields.io/badge/arXiv-2024.11-red) 343 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) ![](https://img.shields.io/badge/arXiv-2024.11-red) 344 | * Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension [[Paper]](https://arxiv.org/pdf/2412.03704) ![](https://img.shields.io/badge/arXiv-2024.12-red) 345 | * Slow Perception: Let's Perceive Geometric Figures Step-by-Step [[Paper]](https://arxiv.org/abs/2412.20631) ![](https://img.shields.io/badge/arXiv-2024.12-red) 346 | * Diving into Self-Evolving Training for Multimodal Reasoning [[Paper]](https://arxiv.org/abs/2412.17451) ![](https://img.shields.io/badge/arXiv-2025.01-red) 347 | * Visual Agents as Fast and Slow Thinkers [[Paper]](https://openreview.net/forum?id=ncCuiD3KJQ) ![](https://img.shields.io/badge/ICLR-2025-blue) 348 | * Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [[Paper]](https://arxiv.org/abs/2501.01904) ![](https://img.shields.io/badge/arXiv-2025.01-red) 349 | * I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models [[Paper]](https://arxiv.org/abs/2502.10458) ![](https://img.shields.io/badge/arXiv-2025.02-red) 350 | * RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision [[Paper]](https://arxiv.org/abs/2502.13957) ![](https://img.shields.io/badge/arXiv-2025.02-red) 351 | ## Part 10: Benchmark and Datasets 352 | 353 | * Evaluation of OpenAI o1: Opportunities and Challenges of AGI [[Paper]](https://arxiv.org/abs/2409.18486) ![](https://img.shields.io/badge/arXiv-2024.09-red) 354 | * A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? 
[[Paper]](https://arxiv.org/abs/2409.15277) ![](https://img.shields.io/badge/arXiv-2024.09-red) 355 | * FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI [[Paper]](https://arxiv.org/abs/2411.04872) ![](https://img.shields.io/badge/arXiv-2024.11-red) 356 | * MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [[Paper]](https://openreview.net/forum?id=GN2qbxZlni) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 357 | * Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [[Paper]](https://arxiv.org/abs/2412.21187) ![](https://img.shields.io/badge/arXiv-2024.12-red) 358 | * EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [[Paper]](https://arxiv.org/abs/2502.12466) ![](https://img.shields.io/badge/arXiv-2025.02-red) 359 | * SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [[Paper]](https://arxiv.org/abs/2502.14739) ![](https://img.shields.io/badge/arXiv-2025.02-red) 360 | * Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [[Paper]](https://arxiv.org/abs/2502.14191) ![](https://img.shields.io/badge/arXiv-2025.02-red) 361 | * MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [[Paper]](https://arxiv.org/abs/2502.06453) ![](https://img.shields.io/badge/arXiv-2025.02-red) 362 | * LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion [[Paper]](https://arxiv.org/abs/2501.15089) ![](https://img.shields.io/badge/arXiv-2025.01-red) 363 | * Humanity's Last Exam [[Paper]](https://arxiv.org/abs/2501.14249) ![](https://img.shields.io/badge/arXiv-2025.01-red) 364 | * RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style [[Paper]](https://openreview.net/forum?id=QEHrmQPBdd)![](https://img.shields.io/badge/ICLR(Oral)-2025.01-blue) 365 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [[Paper]](https://arxiv.org/abs/2501.03124) ![](https://img.shields.io/badge/arXiv-2025.01-red) 366 | * Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models [[Paper]](https://arxiv.org/abs/2502.17387) ![](https://img.shields.io/badge/arXiv-2025.02-red) 367 | * ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models [[paper]](https://arxiv.org/abs/2502.09696) ![](https://img.shields.io/badge/arXiv-2025.02-red) 368 | * MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [[paper]](https://arxiv.org/abs/2502.09621) ![](https://img.shields.io/badge/arXiv-2025.02-red) 369 | * MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [[paper]](https://arxiv.org/abs/2502.00698) ![](https://img.shields.io/badge/arXiv-2025.02-red) 370 | * LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems [[Paper]](https://arxiv.org/abs/2502.17848) ![](https://img.shields.io/badge/arXiv-2025.02-red) 371 | * BIG-Bench Extra Hard [[Paper]](https://arxiv.org/abs/2502.19187) ![](https://img.shields.io/badge/arXiv-2025.02-red) 372 | * MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts [[paper]](https://arxiv.org/abs/2502.20808) ![](https://img.shields.io/badge/arXiv-2025.02-red) 373 | * MastermindEval: A Simple But Scalable Reasoning Benchmark [[paper]](https://arxiv.org/abs/2503.05891) 
![](https://img.shields.io/badge/arXiv-2025.03-red) 374 | * DNA Bench: When Silence is Smarter -- Benchmarking Over-Reasoning in Reasoning LLMs [[paper]](https://arxiv.org/abs/2503.15793) ![](https://img.shields.io/badge/arXiv-2025.03-red) 375 | * V1: Toward Multimodal Reasoning by Designing Auxiliary Tasks [[github]](https://github.com/haonan3/V1) ![](https://img.shields.io/badge/github-2025.03-red) 376 | * ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [[paper]](https://arxiv.org/abs/2503.21248) ![](https://img.shields.io/badge/arXiv-2025.03-red) 377 | * S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models [[paper]](https://arxiv.org/abs/2504.10368) ![](https://img.shields.io/badge/arXiv-2025.04-red) 378 | * When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks [[paper]](https://arxiv.org/abs/2504.02010) ![](https://img.shields.io/badge/arXiv-2025.04-red) 379 | * BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [[paper]](https://openai.com/index/browsecomp/) ![](https://img.shields.io/badge/OpenAI-2025.04-red) 380 | * Mle-bench: Evaluating machine learning agents on machine learning engineering [[paper]](https://arxiv.org/abs/2410.07095) ![](https://img.shields.io/badge/arXiv-2024.10-red) 381 | * How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments [[paper]](https://arxiv.org/abs/2403.11807) ![](https://img.shields.io/badge/arXiv-2024.03-red) 382 | * OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [[paper]](https://openreview.net/forum?id=tN61DTr4Ed) ![](https://img.shields.io/badge/NeurIPS-2024.09-blue) 383 | * ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark [[paper]](https://arxiv.org/abs/2501.01290) ![](https://img.shields.io/badge/arXiv-2025.01-red) 384 | * Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [[paper]](https://arxiv.org/abs/2501.11733) ![](https://img.shields.io/badge/arXiv-2025.01-red) 385 | * PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning [[paper]](https://arxiv.org/abs/2502.12054) ![](https://img.shields.io/badge/arXiv-2025.02-red) 386 | * Text2World: Benchmarking Large Language Models for Symbolic World Model Generation [[paper]](https://arxiv.org/abs/2502.13092) ![](https://img.shields.io/badge/arXiv-2025.02-red) 387 | * WebGames: Challenging General-Purpose Web-Browsing AI Agents [[paper]](https://arxiv.org/abs/2502.18356) ![](https://img.shields.io/badge/arXiv-2025.02-red) 388 | * UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [[paper]](https://arxiv.org/abs/2503.21620) ![](https://img.shields.io/badge/arXiv-2025.03-red) 389 | * Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [[paper]](https://openreview.net/forum?id=HjwK-Tc_Bc) ![](https://img.shields.io/badge/NeurIPS-2022.11-blue) 390 | * Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots [[paper]](https://arxiv.org/abs/2405.07990) ![](https://img.shields.io/badge/arXiv-2024.05-red) 391 | * M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought [[paper]](https://aclanthology.org/2024.acl-long.446/) ![](https://img.shields.io/badge/ACL-2024.08-blue) 392 | * PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models 
with Abstract Visual Patterns [[paper]](https://aclanthology.org/2024.findings-acl.962/) ![](https://img.shields.io/badge/ACL_findings-2024.08-blue) 393 | * Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation [[paper]](https://openreview.net/forum?id=t1mAXb4Cop) ![](https://img.shields.io/badge/NeurIPS-2024.09-blue) 394 | * HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks [[paper]](https://arxiv.org/abs/2410.12381) ![](https://img.shields.io/badge/arXiv-2024.10-red) 395 | * CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models [[paper]](https://arxiv.org/abs/2412.12932) ![](https://img.shields.io/badge/arXiv-2024.12-red) 396 | * ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation [[paper]](https://openreview.net/forum?id=sGpCzsfd1K) ![](https://img.shields.io/badge/ICLR-2025.01-blue) 397 | * Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios [[paper]](https://arxiv.org/abs/2502.19973) ![](https://img.shields.io/badge/arXiv-2025.02-red) 398 | * EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges [[paper]](https://arxiv.org/abs/2502.08859) ![](https://img.shields.io/badge/arXiv-2025.02-red) 399 | * Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities [[paper]](https://arxiv.org/abs/2502.11829) ![](https://img.shields.io/badge/arXiv-2025.02-red) 400 | * Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models [[paper]](https://arxiv.org/abs/2503.04801) ![](https://img.shields.io/badge/arXiv-2025.03-red) 401 | * MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems [[paper]](https://arxiv.org/abs/2503.01891) ![](https://img.shields.io/badge/arXiv-2025.03-red) 402 | * LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? 
[[paper]](https://arxiv.org/abs/2503.19990) ![](https://img.shields.io/badge/arXiv-2025.03-red) 403 | * On the Measure of Intelligence [[paper]](https://arxiv.org/abs/1911.01547) ![](https://img.shields.io/badge/arXiv-2019.11-red) 404 | * Competition-Level Code Generation with AlphaCode [[paper]](https://arxiv.org/abs/2203.07814) ![](https://img.shields.io/badge/arXiv-2022.03-red) 405 | * Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them [[paper]](https://aclanthology.org/2023.findings-acl.824/) ![](https://img.shields.io/badge/ACL_findings-2023.07-blue) 406 | * OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [[paper]](https://openreview.net/forum?id=ayF8bEKYQy) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 407 | * Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning [[paper]](https://openreview.net/forum?id=YXnwlZe0yf) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 408 | * Let's Verify Step by Step [[paper]](https://openreview.net/forum?id=v8L0pN6EOi) ![](https://img.shields.io/badge/ICLR-2024.01-blue) 409 | * MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation [[paper]](https://arxiv.org/abs/2405.11430) ![](https://img.shields.io/badge/arXiv-2024.05-red) 411 | * LiveBench: A Challenging, Contamination-Limited LLM Benchmark [[paper]](https://openreview.net/forum?id=sKYHBTAxVa) ![](https://img.shields.io/badge/ICLR-2025-blue) 412 | * JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [[paper]](https://arxiv.org/abs/2501.14851) ![](https://img.shields.io/badge/arXiv-2025.01-red) 413 | * MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding [[paper]](https://arxiv.org/abs/2501.18362) ![](https://img.shields.io/badge/arXiv-2025.01-red) 414 | * Theoretical Physics Benchmark (TPBench)--a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics [[paper]](https://arxiv.org/abs/2502.15815) ![](https://img.shields.io/badge/arXiv-2025.02-red) 415 | * AIME 2025 [[huggingface]](https://huggingface.co/datasets/opencompass/AIME2025) ![](https://img.shields.io/badge/Huggingface-2025.02-yellow) 416 | * ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning [[paper]](https://arxiv.org/abs/2502.16268) ![](https://img.shields.io/badge/arXiv-2025.02-red) 417 | * ProBench: Benchmarking Large Language Models in Competitive Programming [[paper]](https://arxiv.org/abs/2502.20868) ![](https://img.shields.io/badge/arXiv-2025.02-red) 418 | * ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning [[paper]](https://arxiv.org/abs/2502.01100) ![](https://img.shields.io/badge/arXiv-2025.02-red) 419 | * DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization [[paper]](https://openreview.net/forum?id=2Zan4ATYsh) ![](https://img.shields.io/badge/TMLR-2025.02-blue) 420 | * QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? 
[[paper]](https://arxiv.org/abs/2503.22674) ![](https://img.shields.io/badge/arXiv-2025.03-red) 421 | * Benchmarking Reasoning Robustness in Large Language Models [[paper]](https://arxiv.org/abs/2503.04550) ![](https://img.shields.io/badge/arXiv-2025.03-red) 422 | * Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges [[paper]](https://arxiv.org/abs/2502.08680) ![](https://img.shields.io/badge/arXiv-2025.02-red) 423 | * Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights [[paper]](https://arxiv.org/abs/2502.12521) ![](https://img.shields.io/badge/arXiv-2025.02-red) 424 | * RewardBench: Evaluating Reward Models for Language Modeling [[paper]](https://arxiv.org/abs/2403.13787) ![](https://img.shields.io/badge/arXiv-2024.03-red) 425 | * Evaluating LLMs at Detecting Errors in LLM Responses [[paper]](https://openreview.net/forum?id=dnwRScljXr) ![](https://img.shields.io/badge/COLM-2024.07-blue) 426 | * CriticBench: Benchmarking LLMs for Critique-Correct Reasoning [[paper]](https://aclanthology.org/2024.findings-acl.91/) ![](https://img.shields.io/badge/ACL_findings-2024.08-blue) 427 | * JudgeBench: A Benchmark for Evaluating LLM-based Judges [[paper]](https://arxiv.org/abs/2410.12784) ![](https://img.shields.io/badge/arXiv-2024.10-red) 428 | * ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [[paper]](https://arxiv.org/abs/2410.04509) ![](https://img.shields.io/badge/arXiv-2024.10-red) 429 | * ProcessBench: Identifying Process Errors in Mathematical Reasoning [[paper]](https://arxiv.org/abs/2412.06559) ![](https://img.shields.io/badge/arXiv-2024.12-red) 430 | * MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes [[paper]](https://arxiv.org/abs/2412.19260) ![](https://img.shields.io/badge/arXiv-2024.12-red) 431 | * CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models [[paper]](https://arxiv.org/abs/2502.16614) ![](https://img.shields.io/badge/arXiv-2025.02-red) 432 | * Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? 
[[paper]](https://arxiv.org/abs/2502.19361) ![](https://img.shields.io/badge/arXiv-2025.02-red) 433 | * FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [[paper]](https://arxiv.org/abs/2502.20238) ![](https://img.shields.io/badge/arXiv-2025.02-red) 434 | 435 | 436 | 437 | ## Part 11: Reasoning and Safety 438 | * Measuring Faithfulness in Chain-of-Thought Reasoning [[Blog]](https://www.anthropic.com/research/measuring-faithfulness-in-chain-of-thought-reasoning) ![](https://img.shields.io/badge/blog-2023.07-red) 439 | * Deliberative Alignment: Reasoning Enables Safer Language Models [[Paper]](https://arxiv.org/abs/2412.16339) ![](https://img.shields.io/badge/arXiv-2024.12-red) 440 | * OpenAI trained o1 and o3 to ‘think’ about its safety policy [[Blog]](https://techcrunch.com/2024/12/22/openai-trained-o1-and-o3-to-think-about-its-safety-policy) ![](https://img.shields.io/badge/blog-2024.12-red) 441 | * Why AI Safety Researchers Are Worried About DeepSeek [[Blog]](https://time.com/7210888/deepseeks-hidden-ai-safety-warning/) ![](https://img.shields.io/badge/blog-2025.01-red) 442 | * OverThink: Slowdown Attacks on Reasoning LLMs [[Paper]](https://arxiv.org/abs/2502.02542) ![](https://img.shields.io/badge/arXiv-2025.02-red) 443 | * GuardReasoner: Towards Reasoning-based LLM Safeguards [[Paper]](https://arxiv.org/abs/2501.18492) ![](https://img.shields.io/badge/ICLR_WorkShop-2025-blue) 444 | * SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [[Paper]](https://arxiv.org/abs/2502.12025) ![](https://img.shields.io/badge/arXiv-2025.02-red) 445 | * ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [[Paper]](https://arxiv.org/abs/2502.13458) ![](https://img.shields.io/badge/arXiv-2025.02-red) 447 | * H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking [[Paper]](https://arxiv.org/abs/2502.12893) ![](https://img.shields.io/badge/arXiv-2025.02-red) 448 | * BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack [[Paper]](https://arxiv.org/abs/2502.12202) ![](https://img.shields.io/badge/arXiv-2025.02-red) 449 | * The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [[Paper]](https://arxiv.org/abs/2502.12659) ![](https://img.shields.io/badge/arXiv-2025.02-red) 450 | * Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google [[Blog]](https://far.ai/post/2025-02-r1-redteaming/) ![](https://img.shields.io/badge/blog-2025.02-red) 451 | * Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable [[Paper]](https://arxiv.org/abs/2503.00555) ![](https://img.shields.io/badge/arXiv-2025.03-red) 452 | * DeepSeek-R1 Thoughtology: Let's \<think\> about LLM Reasoning [[Paper]](https://arxiv.org/abs/2504.12659) ![](https://img.shields.io/badge/arXiv-2025.04-red) 453 | * STAR-1: Safer Alignment of Reasoning LLMs with 1K Data [[Paper]](https://arxiv.org/abs/2504.01903) ![](https://img.shields.io/badge/arXiv-2025.04-red) 454 | 455 | ## Part 12: R1 Driven Multimodal Reasoning Enhancement 456 | * Open R1 Video [[github]](https://github.com/Wang-Xiaodong1899/Open-R1-Video) 
![](https://img.shields.io/badge/github-2025.02-red) 457 | * R1-Vision: Let's first take a look at the image [[github]](https://github.com/yuyq96/R1-Vision) ![](https://img.shields.io/badge/github-2025.02-red) 458 | * MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning [[paper]](https://arxiv.org/abs/2502.19634) ![](https://img.shields.io/badge/arXiv-2025.02-red) 459 | * Efficient-R1-VLLM: Efficient RL-Tuned MoE Vision-Language Model For Reasoning [[github]](https://github.com/baibizhe/Efficient-R1-VLLM) ![](https://img.shields.io/badge/github-2025.03-red) 460 | * MMR1: Advancing the Frontiers of Multimodal Reasoning [[github]](https://github.com/LengSicong/MMR1) ![](https://img.shields.io/badge/github-2025.03-red) 461 | * Skywork-R1V: Pioneering Multimodal Reasoning with CoT [[github]](https://github.com/SkyworkAI/Skywork-R1V/tree/main) ![](https://img.shields.io/badge/github-2025.03-red) 462 | * VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [[Blog]](https://om-ai-lab.github.io/index.html) ![](https://img.shields.io/badge/blog-2025.03-red) 463 | * Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [[paper]](https://arxiv.org/abs/2503.24376v1) ![](https://img.shields.io/badge/arXiv-2025.03-red) 464 | * Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [[paper]](https://arxiv.org/abs/2503.20752) ![](https://img.shields.io/badge/arXiv-2025.03-red) 465 | * MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [[paper]](https://arxiv.org/abs/2503.07365) ![](https://img.shields.io/badge/arXiv-2025.03-red) 466 | * R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [[paper]](https://arxiv.org/abs/2503.05379) ![](https://img.shields.io/badge/arXiv-2025.03-red) 467 | * R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [[paper]](https://arxiv.org/abs/2503.10615) ![](https://img.shields.io/badge/arXiv-2025.03-red) 468 | * R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization [[paper]](https://arxiv.org/abs/2503.12937) ![](https://img.shields.io/badge/arXiv-2025.03-red) 469 | * Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering [[paper]](https://arxiv.org/abs/2503.11197) ![](https://img.shields.io/badge/arXiv-2025.03-red) 470 | * Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [[paper]](https://arxiv.org/abs/2503.06520) ![](https://img.shields.io/badge/arXiv-2025.03-red) 471 | * TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM [[paper]](https://arxiv.org/abs/2503.13377) ![](https://img.shields.io/badge/arXiv-2025.03-red) 472 | * Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [[paper]](https://arxiv.org/abs/2503.06749) ![](https://img.shields.io/badge/arXiv-2025.03-red) 473 | * Q-Insight: Understanding Image Quality via Visual Reinforcement Learning [[paper]](http://arxiv.org/abs/2503.22679) ![](https://img.shields.io/badge/arXiv-2025.03-red) 474 | * Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [[paper]](http://arxiv.org/abs/2504.02587) ![](https://img.shields.io/badge/arXiv-2025.04-red) 475 | * VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model 
[[paper]](http://arxiv.org/abs/2504.07615) ![](https://img.shields.io/badge/arXiv-2025.04-red) 476 | * VLAA-Thinking [[github]](https://github.com/UCSC-VLAA/VLAA-Thinking/blob/main/assets/VLAA-Thinker.pdf) ![](https://img.shields.io/badge/github-2025.04-red) 477 | * SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement [[paper]](http://arxiv.org/abs/2504.07934) ![](https://img.shields.io/badge/arXiv-2025.04-red) 478 | * Perception-R1: Pioneering Perception Policy with Reinforcement Learning [[paper]](http://arxiv.org/abs/2504.07954) ![](https://img.shields.io/badge/arXiv-2025.04-red) 479 | * VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning [[paper]](http://arxiv.org/abs/2504.08837) ![](https://img.shields.io/badge/arXiv-2025.04-red) 480 | * Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [[paper]](http://arxiv.org/abs/2504.12680) ![](https://img.shields.io/badge/arXiv-2025.04-red) 481 | * NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation [[paper]](https://arxiv.org/abs/2504.13055v1) ![](https://img.shields.io/badge/arXiv-2025.04-red) 482 | 483 | 484 | 485 | 486 | ## Citation 487 | If you find this work useful, welcome to cite us. 488 | ```bib 489 | @misc{li202512surveyreasoning, 490 | title={From System 1 to System 2: A Survey of Reasoning Large Language Models}, 491 | author={Zhong-Zhi Li and Duzhen Zhang and Ming-Liang Zhang and Jiaxin Zhang and Zengyan Liu and Yuxuan Yao and Haotian Xu and Junhao Zheng and Pei-Jie Wang and Xiuyi Chen and Yingying Zhang and Fei Yin and Jiahua Dong and Zhijiang Guo and Le Song and Cheng-Lin Liu}, 492 | year={2025}, 493 | eprint={2502.17419}, 494 | archivePrefix={arXiv}, 495 | primaryClass={cs.AI}, 496 | url={https://arxiv.org/abs/2502.17419}, 497 | } 498 | ``` 499 | 500 | 501 | ## ⭐ Star History 502 | 503 | 504 | 505 | 506 | 507 | Star History Chart 508 | 509 | 510 | -------------------------------------------------------------------------------- /assets/develope.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/develope.jpg -------------------------------------------------------------------------------- /assets/paper.json: -------------------------------------------------------------------------------- 1 | { 2 | "Part 1: O1 Replication": [ 3 | { 4 | "paper": "O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?", 5 | "link": "https://arxiv.org/abs/2411.16489", 6 | "venue": "arXiv", 7 | "date": "2024-11", 8 | "label": "huang2024o1" 9 | }, 10 | { 11 | "paper": "O1 Replication Journey: A Strategic Progress Report -- Part 1", 12 | "link": "https://arxiv.org/abs/2410.18982", 13 | "venue": "arXiv", 14 | "date": "2024-10", 15 | "label": "qin2024o1" 16 | }, 17 | { 18 | "paper": "Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions", 19 | "link": "https://arxiv.org/abs/2411.14405", 20 | "venue": "arXiv", 21 | "date": "2024-11", 22 | "label": "zhao2024marco" 23 | }, 24 | { 25 | "paper": "o1-Coder: an o1 Replication for Coding", 26 | "link": "https://arxiv.org/abs/2412.00154", 27 | "venue": "arXiv", 28 | "date": "2024-12", 29 | "label": "zhang2024o1" 30 | }, 31 | { 32 | "paper": "Enhancing LLM Reasoning with Reward-guided Tree Search", 33 | "link": 
"https://arxiv.org/abs/2411.11694", 34 | "venue": "arXiv", 35 | "date": "2024-11", 36 | "label": "chen2024enhancing" 37 | }, 38 | { 39 | "paper": "Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems", 40 | "link": "https://arxiv.org/abs/2412.09413", 41 | "venue": "arXiv", 42 | "date": "2024-12", 43 | "label": "min2024imitate" 44 | } 45 | ], 46 | "Part 2: Process Reward Models": [ 47 | { 48 | "paper": "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations", 49 | "link": "https://aclanthology.org/2024.acl-long.510/", 50 | "venue": "ACL", 51 | "date": "2024-08", 52 | "label": "wang2024mathshepherd" 53 | }, 54 | { 55 | "paper": "ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search", 56 | "link": "https://openreview.net/forum?id=8rcFOqEud5", 57 | "venue": "NeurIPS", 58 | "date": "2024-12", 59 | "label": "zhang2024restmcts" 60 | }, 61 | { 62 | "paper": "Let's Verify Step by Step.", 63 | "link": "https://arxiv.org/abs/2305.20050", 64 | "venue": "ICLR", 65 | "date": "2024-05", 66 | "label": "lightman2023letsverify" 67 | }, 68 | { 69 | "paper": "Making Large Language Models Better Reasoners with Step-Aware Verifier", 70 | "link": "https://arxiv.org/abs/2206.02336", 71 | "venue": "arXiv", 72 | "date": "2023-06", 73 | "label": "yuan2023stepaware" 74 | }, 75 | { 76 | "paper": "Improve Mathematical Reasoning in Language Models by Automated Process Supervision", 77 | "link": "https://arxiv.org/abs/2306.05372", 78 | "venue": "arXiv", 79 | "date": "2023-06", 80 | "label": "chen2023automatedprocess" 81 | }, 82 | { 83 | "paper": "OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning", 84 | "link": "https://aclanthology.org/2024.findings-naacl.55/", 85 | "venue": "ACL Findings", 86 | "date": "2024-08", 87 | "label": "liu2023ovm" 88 | }, 89 | { 90 | "paper": "Solving Math Word Problems with Process and Outcome-Based Feedback", 91 | "link": "https://arxiv.org/abs/2211.14275", 92 | "venue": "arXiv", 93 | "date": "2022-11", 94 | "label": "zhang2023processoutcome" 95 | }, 96 | { 97 | "paper": "AutoPSV: Automated Process-Supervised Verifier", 98 | "link": "https://openreview.net/forum?id=eOAPWWOGs9", 99 | "venue": "NeurIPS", 100 | "date": "2024-12", 101 | "label": "lu2024autopsv" 102 | }, 103 | { 104 | "paper": "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs", 105 | "link": "https://arxiv.org/abs/2406.18629", 106 | "venue": "arXiv", 107 | "date": "2024-06", 108 | "label": "chen2023stepdpo" 109 | }, 110 | { 111 | "paper": "ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding", 112 | "link": "https://arxiv.org/abs/2501.07861", 113 | "venue": "arXiv", 114 | "date": "2025-01", 115 | "label": "li2025rearter" 116 | }, 117 | { 118 | "paper": "The Lessons of Developing Process Reward Models in Mathematical Reasoning.", 119 | "link": "https://arxiv.org/abs/2501.07301", 120 | "venue": "arXiv", 121 | "date": "2025-01", 122 | "label": "gao2025lessons" 123 | }, 124 | { 125 | "paper": "Outcome-Refining Process Supervision for Code Generation", 126 | "link": "https://arxiv.org/abs/2412.15118", 127 | "venue": "arXiv", 128 | "date": "2024-12", 129 | "label": "chen2024outcomerefining" 130 | }, 131 | { 132 | "paper": "Free Process Rewards without Process Labels.", 133 | "link": "https://arxiv.org/abs/2412.01981", 134 | "venue": "arXiv", 135 | "date": "2024-12", 136 | "label": "yuan2024freeprocess" 137 | }, 138 | { 139 | "paper": "PRMBench: A Fine-grained and Challenging Benchmark 
for Process-Level Reward Models.", 140 | "link": "https://arxiv.org/abs/2501.03124", 141 | "venue": "arXiv", 142 | "date": "2025-01", 143 | "label": "liu2025prmbench" 144 | }, 145 | { 146 | "paper": "ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark.", 147 | "link": "https://arxiv.org/abs/2501.01290", 148 | "venue": "arXiv", 149 | "date": "2025-01", 150 | "label": "zhang2024toolcomp" 151 | } 152 | ], 153 | "Part 3: Reinforcement Learning": [ 154 | { 155 | "paper": "Offline Reinforcement Learning for LLM Multi-Step Reasoning", 156 | "link": "https://arxiv.org/abs/2412.16145", 157 | "venue": "arXiv", 158 | "date": "2024-12", 159 | "label": "wang2024offline" 160 | }, 161 | { 162 | "paper": "ReFT: Representation Finetuning for Language Models", 163 | "link": "https://aclanthology.org/2024.acl-long.410.pdf", 164 | "venue": "ACL", 165 | "date": "2024-08", 166 | "label": "wu2024reft" 167 | }, 168 | { 169 | "paper": "Deepseekmath: Pushing the limits of mathematical reasoning in open language models", 170 | "link": "https://arxiv.org/abs/2402.03300", 171 | "venue": "arXiv", 172 | "date": "2024-02", 173 | "label": "lee2024deepseekmath" 174 | }, 175 | { 176 | "paper": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", 177 | "link": "https://arxiv.org/abs/2501.12948", 178 | "venue": "arXiv", 179 | "date": "2025-01", 180 | "label": "luong2025deepseekr1" 181 | }, 182 | { 183 | "paper": "Kimi k1.5: Scaling Reinforcement Learning with LLMs", 184 | "link": "https://arxiv.org/abs/2501.12599", 185 | "venue": "arXiv", 186 | "date": "2025-01", 187 | "label": "liu2025kimi" 188 | }, 189 | { 190 | "paper": "Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search", 191 | "link": "https://arxiv.org/abs/2502.02508", 192 | "venue": "arXiv", 193 | "date": "2025-02", 194 | "label": "zhang2025satori" 195 | }, 196 | { 197 | "paper": "Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling", 198 | "link": "https://arxiv.org/abs/2501.11651", 199 | "venue": "arXiv", 200 | "date": "2025-01", 201 | "label": "wang2025advancing" 202 | }, 203 | { 204 | "paper": "Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies", 205 | "link": "https://arxiv.org/abs/2501.17030", 206 | "venue": "arXiv", 207 | "date": "2025-01", 208 | "label": "chen2025aisafety" 209 | }, 210 | { 211 | "paper": "Does RLHF Scale? Exploring the Impacts From Data, Model, and Method", 212 | "link": "https://arxiv.org/abs/2412.06000", 213 | "venue": "arXiv", 214 | "date": "2024-12", 215 | "label": "lee2024rlhf" 216 | } 217 | ], 218 | "Part 4: MCTS/Tree Search": [ 219 | { 220 | "paper": "Reasoning with Language Model is Planning with World Model", 221 | "link": "https://aclanthology.org/2023.emnlp-main.507/", 222 | "venue": "EMNLP", 223 | "date": "2023-12", 224 | "label": "hao2023rap" 225 | }, 226 | { 227 | "paper": "Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning", 228 | "link": "https://arxiv.org/abs/2405.00451", 229 | "venue": "arXiv", 230 | "date": "2024-05", 231 | "label": "zhou2024mctsboost" 232 | }, 233 | { 234 | "paper": "Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training", 235 | "link": "https://openreview.net/forum?id=PJfc4x2jXY", 236 | "venue": "NeurIPS WorkShop", 237 | "date": "2023-12", 238 | "label": "wang2023alphazero" 239 | }, 240 | { 241 | "paper": "Don’t throw away your value model! 
Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding", 242 | "link": "https://openreview.net/forum?id=kh9Zt2Ldmn#discussion", 243 | "venue": "CoLM", 244 | "date": "2024-10", 245 | "label": "chen2024valuemcts" 246 | }, 247 | { 248 | "paper": "Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search", 249 | "link": "https://arxiv.org/abs/2412.18319", 250 | "venue": "arXiv", 251 | "date": "2024-12", 252 | "label": "li2024mulberry" 253 | }, 254 | { 255 | "paper": "Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning", 256 | "link": "https://arxiv.org/abs/2412.17397", 257 | "venue": "arXiv", 258 | "date": "2024-12", 259 | "label": "liu2024intrinsicmcts" 260 | }, 261 | { 262 | "paper": "Proposing and solving olympiad geometry with guided tree search", 263 | "link": "https://arxiv.org/abs/2412.10673", 264 | "venue": "arXiv", 265 | "date": "2024-12", 266 | "label": "zhang2024geometrymcts" 267 | }, 268 | { 269 | "paper": "SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models", 270 | "link": "https://arxiv.org/abs/2412.11605", 271 | "venue": "arXiv", 272 | "date": "2024-12", 273 | "label": "wang2024spar" 274 | }, 275 | { 276 | "paper": "Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning", 277 | "link": "https://arxiv.org/abs/2412.09078", 278 | "venue": "arXiv", 279 | "date": "2024-12", 280 | "label": "xu2024forest" 281 | }, 282 | { 283 | "paper": "SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation", 284 | "link": "https://arxiv.org/abs/2411.11053", 285 | "venue": "arXiv", 286 | "date": "2024-11", 287 | "label": "liu2024sramcts" 288 | }, 289 | { 290 | "paper": "MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree", 291 | "link": "https://arxiv.org/abs/2411.15645", 292 | "venue": "arXiv", 293 | "date": "2024-11", 294 | "label": "wang2024mcnest" 295 | }, 296 | { 297 | "paper": "Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions", 298 | "link": "https://arxiv.org/abs/2411.14405", 299 | "venue": "arXiv", 300 | "date": "2024-11", 301 | "label": "zhang2024marcoo1" 302 | }, 303 | { 304 | "paper": "GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection", 305 | "link": "https://arxiv.org/abs/2411.04459", 306 | "venue": "arXiv", 307 | "date": "2024-11", 308 | "label": "chen2024gptmcts" 309 | }, 310 | { 311 | "paper": "CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models", 312 | "link": "https://arxiv.org/abs/2411.04329", 313 | "venue": "arXiv", 314 | "date": "2024-11", 315 | "label": "liu2024codetree" 316 | }, 317 | { 318 | "paper": "Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination", 319 | "link": "https://arxiv.org/abs/2410.17820", 320 | "venue": "arXiv", 321 | "date": "2024-10", 322 | "label": "wang2024treeofthoughts" 323 | }, 324 | { 325 | "paper": "TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling", 326 | "link": "https://arxiv.org/abs/2410.16033", 327 | "venue": "arXiv", 328 | "date": "2024-10", 329 | "label": "zhou2024treebon" 330 | }, 331 | { 332 | "paper": "Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning", 333 | "link": 
"https://arxiv.org/abs/2410.06508", 334 | "venue": "arXiv", 335 | "date": "2024-10", 336 | "label": "li2024selfimprove" 337 | }, 338 | { 339 | "paper": "LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning", 340 | "link": "https://arxiv.org/abs/2410.02884", 341 | "venue": "arXiv", 342 | "date": "2024-10", 343 | "label": "yang2024llamaberry" 344 | }, 345 | { 346 | "paper": "Interpretable Contrastive Monte Carlo Tree Search Reasoning", 347 | "link": "https://arxiv.org/abs/2410.01707", 348 | "venue": "arXiv", 349 | "date": "2024-10", 350 | "label": "hu2024interpretablemcts" 351 | }, 352 | { 353 | "paper": "MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time", 354 | "link": "https://arxiv.org/abs/2405.16265", 355 | "venue": "arXiv", 356 | "date": "2024-05", 357 | "label": "zhang2024mindstar" 358 | }, 359 | { 360 | "paper": "RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation", 361 | "link": "https://arxiv.org/abs/2409.09584", 362 | "venue": "arXiv", 363 | "date": "2024-09", 364 | "label": "li2024rethinkmcts" 365 | }, 366 | { 367 | "paper": "Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search", 368 | "link": "https://arxiv.org/abs/2408.10635", 369 | "venue": "arXiv", 370 | "date": "2024-08", 371 | "label": "zhou2024strategist" 372 | }, 373 | { 374 | "paper": "Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search", 375 | "link": "https://arxiv.org/abs/2405.15383", 376 | "venue": "arXiv", 377 | "date": "2024-05", 378 | "label": "chen2024codeworldmcts" 379 | }, 380 | { 381 | "paper": "Uncertainty-Guided Optimization on Large Language Model Search Trees", 382 | "link": "https://arxiv.org/abs/2407.03951", 383 | "venue": "arXiv", 384 | "date": "2024-07", 385 | "label": "yang2024uncertaintymcts" 386 | }, 387 | { 388 | "paper": "Tree Search for Language Model Agents", 389 | "link": "https://arxiv.org/abs/2407.01476", 390 | "venue": "arXiv", 391 | "date": "2024-07", 392 | "label": "wu2024treesearchlm" 393 | }, 394 | { 395 | "paper": "LiteSearch: Efficacious Tree Search for LLM", 396 | "link": "https://arxiv.org/abs/2407.00320", 397 | "venue": "arXiv", 398 | "date": "2024-07", 399 | "label": "chen2024litesearch" 400 | }, 401 | { 402 | "paper": "Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B", 403 | "link": "https://arxiv.org/abs/2406.07394", 404 | "venue": "arXiv", 405 | "date": "2024-06", 406 | "label": "liu2024accessing" 407 | }, 408 | { 409 | "paper": "ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search", 410 | "link": "https://arxiv.org/abs/2406.03816", 411 | "venue": "NeurIPS", 412 | "date": "2024-12", 413 | "label": "wang2024restmcts" 414 | }, 415 | { 416 | "paper": "On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes", 417 | "link": "https://ieeexplore.ieee.org/abstract/document/10870057/", 418 | "venue": "IEEE TAC", 419 | "date": "2025-01", 420 | "label": "zhang2025mcts" 421 | }, 422 | { 423 | "paper": "AlphaMath Almost Zero: process Supervision without process", 424 | "link": "https://arxiv.org/abs/2405.03553", 425 | "venue": "arXiv", 426 | "date": "2024-05", 427 | "label": "chen2024alphamath" 428 | }, 429 | { 430 | "paper": "Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning", 431 | "link": "https://arxiv.org/abs/2405.00451", 432 | "venue": "arXiv", 433 | "date": "2024-05", 434 | "label": "liu2024mctsboost" 435 | }, 
436 | { 437 | "paper": "Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping", 438 | "link": "https://openreview.net/forum?id=rviGTsl0oy", 439 | "venue": "ICLR WorkShop", 440 | "date": "2024-05", 441 | "label": "wang2024beyonda" 442 | }, 443 | { 444 | "paper": "Stream of Search (SoS): Learning to Search in Language", 445 | "link": "https://arxiv.org/abs/2404.03683", 446 | "venue": "arXiv", 447 | "date": "2024-04", 448 | "label": "yang2024sos" 449 | }, 450 | { 451 | "paper": "LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models", 452 | "link": "https://openreview.net/forum?id=h1mvwbQiXR", 453 | "venue": "ICLR WorkShop", 454 | "date": "2024-05", 455 | "label": "zhang2024llmreasoners" 456 | }, 457 | { 458 | "paper": "Search-o1: Agentic Search-Enhanced Large Reasoning Models", 459 | "link": "https://arxiv.org/abs/2501.05366", 460 | "venue": "arXiv", 461 | "date": "2025-01", 462 | "label": "li2025searcho1" 463 | }, 464 | { 465 | "paper": "rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking", 466 | "link": "https://arxiv.org/abs/2501.04519", 467 | "venue": "arXiv", 468 | "date": "2025-01", 469 | "label": "chen2025rstar" 470 | }, 471 | { 472 | "paper": "HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs", 473 | "link": "https://arxiv.org/abs/2412.18925", 474 | "venue": "arXiv", 475 | "date": "2024-12", 476 | "label": "zhang2024huatuo" 477 | }, 478 | { 479 | "paper": "AFlow: Automating Agentic Workflow Generation", 480 | "link": "https://arxiv.org/abs/2410.10762", 481 | "venue": "arXiv", 482 | "date": "2024-10", 483 | "label": "wang2024aflow" 484 | }, 485 | { 486 | "paper": "MAKING PPO EVEN BETTER: VALUE-GUIDED MONTE-CARLO TREE SEARCH DECODING", 487 | "link": "https://arxiv.org/abs/2309.15028", 488 | "venue": "arXiv", 489 | "date": "2023-09", 490 | "label": "liu2023ppo" 491 | }, 492 | { 493 | "paper": "Large Language Models as Commonsense Knowledge for Large-Scale Task Planning", 494 | "link": "https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html", 495 | "venue": "NeurIPS", 496 | "date": "2023-12", 497 | "label": "wang2023commonsense" 498 | }, 499 | { 500 | "paper": "ALPHAZERO-LIKE TREE-SEARCH CAN GUIDE LARGE LANGUAGE MODEL DECODING AND TRAINING", 501 | "link": "https://openreview.net/forum?id=PJfc4x2jXY", 502 | "venue": "NeurIPS WorkShop", 503 | "date": "2023-12", 504 | "label": "li2023alphazero" 505 | }, 506 | { 507 | "paper": "Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing", 508 | "link": "https://arxiv.org/abs/2404.12253", 509 | "venue": "arXiv", 510 | "date": "2024-04", 511 | "label": "chen2024selfimprove" 512 | } 513 | ], 514 | "Part 5: Self-Training / Self-Improve": [ 515 | { 516 | "paper": "STaR: Bootstrapping Reasoning With Reasoning", 517 | "link": "https://arxiv.org/abs/2203.14465", 518 | "venue": "NeurIPS2022", 519 | "date": "2022-05", 520 | "label": "zelikman2022star" 521 | }, 522 | { 523 | "paper": "ReST: Reinforced Self-Training for Language Modeling", 524 | "link": "https://arxiv.org/abs/2308.08998", 525 | "venue": "arXiv", 526 | "date": "2023-08", 527 | "label": "gulcehre2023rest" 528 | }, 529 | { 530 | "paper": "ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models", 531 | "link": "https://openreview.net/forum?id=lNAyUngGFK", 532 | "venue": "TMLR", 533 | "date": "2024-09", 534 | "label": "yang2024restem" 535 | }, 536 | { 537 | "paper": 
"Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search", 538 | "link": "https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html", 539 | "venue": "NeurIPS", 540 | "date": "2017-12", 541 | "label": "anthony2017expert" 542 | }, 543 | { 544 | "paper": "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking", 545 | "link": "https://arxiv.org/abs/2403.09629", 546 | "venue": "arXiv", 547 | "date": "2024-03", 548 | "label": "zelikman2024quietstar" 549 | }, 550 | { 551 | "paper": "V-star: Training Verifiers for Self-Taught Reasoners", 552 | "link": "https://arxiv.org/abs/2402.06457", 553 | "venue": "arXiv", 554 | "date": "2024-02", 555 | "label": "liu2024vstar" 556 | }, 557 | { 558 | "paper": "Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models", 559 | "link": "https://arxiv.org/abs/2406.11736", 560 | "venue": "arXiv", 561 | "date": "2024-06", 562 | "label": "wang2024interactive" 563 | }, 564 | { 565 | "paper": "ReFT: Representation Finetuning for Language Models", 566 | "link": "https://aclanthology.org/2024.acl-long.410.pdf", 567 | "venue": "ACL", 568 | "date": "2024-08", 569 | "label": "wu2024reft" 570 | }, 571 | { 572 | "paper": "ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search", 573 | "link": "https://arxiv.org/abs/2406.03816", 574 | "venue": "NeurIPS", 575 | "date": "2024-12", 576 | "label": "chen2024restmcts" 577 | }, 578 | { 579 | "paper": "Recursive Introspection: Teaching Language Model Agents How to Self-Improve", 580 | "link": "https://openreview.net/forum?id=DRC9pZwBwR", 581 | "venue": "NeurIPS", 582 | "date": "2024-12", 583 | "label": "zhang2024recursive" 584 | }, 585 | { 586 | "paper": "B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner", 587 | "link": "https://arxiv.org/abs/2412.17256", 588 | "venue": "arXiv", 589 | "date": "2024-12", 590 | "label": "he2024bstar" 591 | }, 592 | { 593 | "paper": "Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math)", 594 | "link": "https://arxiv.org/abs/2501.04519", 595 | "venue": "arXiv", 596 | "date": "2025-01", 597 | "label": "xu2025rstar" 598 | }, 599 | { 600 | "paper": "Enhancing Large Vision Language Models with Self-Training on Image Comprehension", 601 | "link": "https://arxiv.org/abs/2405.19716", 602 | "venue": "arXiv", 603 | "date": "2024-05", 604 | "label": "li2024enhancing" 605 | }, 606 | { 607 | "paper": "Self-Refine: Iterative Refinement with Self-Feedback", 608 | "link": "https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html", 609 | "venue": "NeurIPS", 610 | "date": "2023-12", 611 | "label": "madaan2023selfrefine" 612 | }, 613 | { 614 | "paper": "CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing", 615 | "link": "https://openreview.net/forum?id=Sx038qxjek", 616 | "venue": "ICLR", 617 | "date": "2024-05", 618 | "label": "zhang2024critic" 619 | } 620 | ], 621 | "Part 6: Reflection": [ 622 | { 623 | "paper": "Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers", 624 | "link": "https://arxiv.org/abs/2408.06195", 625 | "venue": "arXiv", 626 | "date": "2024-08", 627 | "label": "zheng2024mutual" 628 | }, 629 | { 630 | "paper": "Reflection-Tuning: An Approach for Data Recycling", 631 | "link": "https://arxiv.org/abs/2310.11716", 632 | "venue": "arXiv", 633 | "date": "2023-10", 634 | "label": "li2024reflection" 635 | }, 636 | { 637 | "paper": 
"Vision-Language Models Can Self-Improve Reasoning via Reflection", 638 | "link": "https://arxiv.org/abs/2411.00855", 639 | "venue": "arXiv", 640 | "date": "2024-11", 641 | "label": "cheng2024vision" 642 | }, 643 | { 644 | "paper": "HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs", 645 | "link": "https://arxiv.org/abs/2412.18925", 646 | "venue": "arXiv", 647 | "date": "2024-12", 648 | "label": "zhang2024huatuo" 649 | }, 650 | { 651 | "paper": "AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning", 652 | "link": "https://arxiv.org/abs/2411.11930", 653 | "venue": "arXiv", 654 | "date": "2024-11", 655 | "label": "liu2024atomthink" 656 | }, 657 | { 658 | "paper": "LLaVA-o1: Let Vision Language Models Reason Step-by-Step", 659 | "link": "https://arxiv.org/abs/2411.10440", 660 | "venue": "arXiv", 661 | "date": "2024-11", 662 | "label": "xu2024llava" 663 | } 664 | ], 665 | "Part 7: Efficient System2": [ 666 | { 667 | "paper": "Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking", 668 | "link": "https://arxiv.org/abs/2501.01306", 669 | "venue": "arXiv", 670 | "date": "2025-01", 671 | "label": "cheng2025think" 672 | }, 673 | { 674 | "paper": "Token-Budget-Aware LLM Reasoning", 675 | "link": "https://arxiv.org/abs/2412.18547", 676 | "venue": "arXiv", 677 | "date": "2024-12", 678 | "label": "wang2024token" 679 | }, 680 | { 681 | "paper": "B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner", 682 | "link": "https://arxiv.org/abs/2412.17256", 683 | "venue": "arXiv", 684 | "date": "2024-12", 685 | "label": "he2024bstar" 686 | }, 687 | { 688 | "paper": "Guiding Language Model Reasoning with Planning Tokens", 689 | "link": "https://arxiv.org/abs/2310.05707", 690 | "venue": "CoLM", 691 | "date": "2024-10", 692 | "label": "wang2023guiding" 693 | }, 694 | { 695 | "paper": "DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models", 696 | "link": "https://arxiv.org/abs/2407.01009", 697 | "venue": "EMNLP", 698 | "date": "2024-12", 699 | "label": "li2024dynathink" 700 | }, 701 | { 702 | "paper": "Training Large Language Models to Reason in a Continuous Latent Space", 703 | "link": "https://arxiv.org/abs/2412.06769", 704 | "venue": "arXiv", 705 | "date": "2024-12", 706 | "label": "chen2024training" 707 | }, 708 | { 709 | "paper": "O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning", 710 | "link": "https://arxiv.org/abs/2501.12570", 711 | "venue": "arXiv", 712 | "date": "2025-01", 713 | "label": "zhang2025o1pruner" 714 | } 715 | ], 716 | "Part 8: Explainability": [ 717 | { 718 | "paper": "What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective", 719 | "link": "https://arxiv.org/abs/2410.23743", 720 | "venue": "arXiv", 721 | "date": "2024-10", 722 | "label": "li2024whathappened" 723 | }, 724 | { 725 | "paper": "When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? 
An Analysis of OpenAI o1", 726 | "link": "https://arxiv.org/abs/2410.01792", 727 | "venue": "arXiv", 728 | "date": "2024-10", 729 | "label": "zhang2024embers" 730 | }, 731 | { 732 | "paper": "Agents Thinking Fast and Slow: A Talker-Reasoner Architecture", 733 | "link": "https://openreview.net/forum?id=xPhcP6rbI4", 734 | "venue": "NeurIPS WorkShop", 735 | "date": "2024-12", 736 | "label": "wang2024agents" 737 | }, 738 | { 739 | "paper": "System 2 Attention (is something you might need too)", 740 | "link": "https://arxiv.org/abs/2311.11829", 741 | "venue": "arXiv", 742 | "date": "2023-11", 743 | "label": "chen2023system2" 744 | }, 745 | { 746 | "paper": "Distilling System 2 into System 1", 747 | "link": "https://arxiv.org/abs/2407.06023", 748 | "venue": "arXiv", 749 | "date": "2024-07", 750 | "label": "liu2024distilling" 751 | }, 752 | { 753 | "paper": "The Impact of Reasoning Step Length on Large Language Models", 754 | "link": "https://arxiv.org/abs/2401.04925", 755 | "venue": "ACL Findings", 756 | "date": "2024-08", 757 | "label": "sun2024impact" 758 | } 759 | ], 760 | "Part 9: Multimodal Agent related Slow-Fast System": [ 761 | { 762 | "paper": "AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning", 763 | "link": "https://arxiv.org/abs/2411.11930", 764 | "venue": "arXiv", 765 | "date": "2024-11", 766 | "label": "liu2024atomthink" 767 | }, 768 | { 769 | "paper": "LLaVA-o1: Let Vision Language Models Reason Step-by-Step", 770 | "link": "https://arxiv.org/abs/2411.10440", 771 | "venue": "arXiv", 772 | "date": "2024-11", 773 | "label": "xu2024llava" 774 | }, 775 | { 776 | "paper": "Visual Agents as Fast and Slow Thinkers", 777 | "link": "https://openreview.net/forum?id=ncCuiD3KJQ", 778 | "venue": "ICLR", 779 | "date": "2025-01", 780 | "label": "gao2025visualagents" 781 | }, 782 | { 783 | "paper": "Slow Perception: Let's Perceive Geometric Figures Step-by-Step", 784 | "link": "https://arxiv.org/abs/2412.20631", 785 | "venue": "arXiv", 786 | "date": "2024-12", 787 | "label": "wei2024slow" 788 | }, 789 | { 790 | "paper": "Virgo: A Preliminary Exploration on Reproducing o1-like MLLM", 791 | "link": "https://arxiv.org/abs/2501.01904", 792 | "venue": "arXiv", 793 | "date": "2025-01", 794 | "label": "du2025virgo" 795 | }, 796 | { 797 | "paper": "Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension", 798 | "link": "https://arxiv.org/pdf/2412.03704", 799 | "venue": "arXiv", 800 | "date": "2024-12", 801 | "label": "feng2024scaling" 802 | }, 803 | { 804 | "paper": "Vision-Language Models Can Self-Improve Reasoning via Reflection", 805 | "link": "https://arxiv.org/abs/2411.00855", 806 | "venue": "arXiv", 807 | "date": "2024-11", 808 | "label": "cheng2024vision" 809 | }, 810 | { 811 | "paper": "Diving into Self-Evolving Training for Multimodal Reasoning", 812 | "link": "https://arxiv.org/abs/2412.17451", 813 | "venue": "ICLR", 814 | "date": "2025-01", 815 | "label": "zhao2024selfevolving" 816 | } 817 | ], 818 | "Part 10: Benchmark and Datasets": [ 819 | { 820 | "paper": "A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?", 821 | "link": "https://arxiv.org/abs/2409.15277", 822 | "venue": "arXiv", 823 | "date": "2024-09", 824 | "label": "tu2024preliminary" 825 | }, 826 | { 827 | "paper": "MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs", 828 | "link": "https://openreview.net/forum?id=GN2qbxZlni", 829 | "venue": "NeurIPS", 830 | "date": "2024-12", 831 | "label": "li2024mrben" 832 | }, 833 | { 834 | 
"paper": "PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models", 835 | "link": "https://arxiv.org/abs/2501.03124", 836 | "venue": "arXiv", 837 | "date": "2025-01", 838 | "label": "song2025prmbench" 839 | }, 840 | { 841 | "paper": "Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs", 842 | "link": "https://arxiv.org/abs/2412.21187", 843 | "venue": "arXiv", 844 | "date": "2024-12", 845 | "label": "huang2024overthinking" 846 | } 847 | ] 848 | } -------------------------------------------------------------------------------- /assets/timeline.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/timeline.jpg -------------------------------------------------------------------------------- /assets/timeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/timeline.png -------------------------------------------------------------------------------- /assets/timeline_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/timeline_2.png -------------------------------------------------------------------------------- /src/list.md: -------------------------------------------------------------------------------- 1 | ## Part 1: O1 Replication 2 | * Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [[Paper]](https://arxiv.org/abs/2412.09413) ![](https://img.shields.io/badge/arXiv-2024.12-red) 3 | * o1-Coder: an o1 Replication for Coding [[Paper]](https://arxiv.org/abs/2412.00154) ![](https://img.shields.io/badge/arXiv-2024.12-red) 4 | * Enhancing LLM Reasoning with Reward-guided Tree Search [[Paper]](https://arxiv.org/abs/2411.11694) ![](https://img.shields.io/badge/arXiv-2024.11-red) 5 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) ![](https://img.shields.io/badge/arXiv-2024.11-red) 6 | * O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [[Paper]](https://arxiv.org/abs/2411.16489) ![](https://img.shields.io/badge/arXiv-2024.11-red) 7 | * O1 Replication Journey: A Strategic Progress Report -- Part 1 [[Paper]](https://arxiv.org/abs/2410.18982) ![](https://img.shields.io/badge/arXiv-2024.10-red) 8 | ## Part 2: Process Reward Models 9 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. [[Paper]](https://arxiv.org/abs/2501.03124) ![](https://img.shields.io/badge/arXiv-2025.01-red) 10 | * ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [[Paper]](https://arxiv.org/abs/2501.07861) ![](https://img.shields.io/badge/arXiv-2025.01-red) 11 | * The Lessons of Developing Process Reward Models in Mathematical Reasoning. [[Paper]](https://arxiv.org/abs/2501.07301) ![](https://img.shields.io/badge/arXiv-2025.01-red) 12 | * ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark. 
[[Paper]](https://arxiv.org/abs/2501.01290) ![](https://img.shields.io/badge/arXiv-2025.01-red) 13 | * AutoPSV: Automated Process-Supervised Verifier [[Paper]](https://openreview.net/forum?id=eOAPWWOGs9) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 14 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://openreview.net/forum?id=8rcFOqEud5) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 15 | * Free Process Rewards without Process Labels. [[Paper]](https://arxiv.org/abs/2412.01981) ![](https://img.shields.io/badge/arXiv-2024.12-red) 16 | * Outcome-Refining Process Supervision for Code Generation [[Paper]](https://arxiv.org/abs/2412.15118) ![](https://img.shields.io/badge/arXiv-2024.12-red) 17 | * Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [[Paper]](https://aclanthology.org/2024.acl-long.510/) ![](https://img.shields.io/badge/ACL-2024-blue) 18 | * OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning [[Paper]](https://aclanthology.org/2024.findings-naacl.55/) ![](https://img.shields.io/badge/ACL_Findings-2024-blue) 19 | * Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [[Paper]](https://arxiv.org/abs/2406.18629) ![](https://img.shields.io/badge/arXiv-2024.06-red) 20 | * Let's Verify Step by Step. [[Paper]](https://arxiv.org/abs/2305.20050) ![](https://img.shields.io/badge/arXiv-2024.05-red) 21 | * Improve Mathematical Reasoning in Language Models by Automated Process Supervision [[Paper]](https://arxiv.org/abs/2306.05372) ![](https://img.shields.io/badge/arXiv-2023.06-red) 22 | * Making Large Language Models Better Reasoners with Step-Aware Verifier [[Paper]](https://arxiv.org/abs/2206.02336) ![](https://img.shields.io/badge/arXiv-2023.06-red) 23 | * Solving Math Word Problems with Process and Outcome-Based Feedback [[Paper]](https://arxiv.org/abs/2211.14275) ![](https://img.shields.io/badge/arXiv-2022.11-red) 24 | ## Part 3: Reinforcement Learning 25 | * Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [[Paper]](https://arxiv.org/abs/2502.02508) ![](https://img.shields.io/badge/arXiv-2025.02-red) 26 | * Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [[Paper]](https://arxiv.org/abs/2501.11651) ![](https://img.shields.io/badge/arXiv-2025.01-red) 27 | * Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies [[Paper]](https://arxiv.org/abs/2501.17030) ![](https://img.shields.io/badge/arXiv-2025.01-red) 28 | * DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [[Paper]](https://arxiv.org/abs/2501.12948) ![](https://img.shields.io/badge/arXiv-2025.01-red) 29 | * Kimi k1.5: Scaling Reinforcement Learning with LLMs [[Paper]](https://arxiv.org/abs/2501.12599) ![](https://img.shields.io/badge/arXiv-2025.01-red) 30 | * Does RLHF Scale? 
Exploring the Impacts From Data, Model, and Method [[Paper]](https://arxiv.org/abs/2412.06000) ![](https://img.shields.io/badge/arXiv-2024.12-red) 31 | * Offline Reinforcement Learning for LLM Multi-Step Reasoning [[Paper]](https://arxiv.org/abs/2412.16145) ![](https://img.shields.io/badge/arXiv-2024.12-red) 32 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) ![](https://img.shields.io/badge/ACL-2024-blue) 33 | * Deepseekmath: Pushing the limits of mathematical reasoning in open language models [[Paper]](https://arxiv.org/abs/2402.03300) ![](https://img.shields.io/badge/arXiv-2024.02-red) 34 | ## Part 4: MCTS/Tree Search 35 | * On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes [[Paper]](https://ieeexplore.ieee.org/abstract/document/10870057/) ![](https://img.shields.io/badge/IEEE_TAC-2025-blue) 36 | * Search-o1: Agentic Search-Enhanced Large Reasoning Models [[Paper]](https://arxiv.org/abs/2501.05366) ![](https://img.shields.io/badge/arXiv-2025.01-red) 37 | * rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [[Paper]](https://arxiv.org/abs/2501.04519) ![](https://img.shields.io/badge/arXiv-2025.01-red) 38 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816) ![](https://img.shields.io/badge/arXiv-2024.12-red) 39 | * Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.09078) ![](https://img.shields.io/badge/arXiv-2024.12-red) 40 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) ![](https://img.shields.io/badge/arXiv-2024.12-red) 41 | * Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2412.18319) ![](https://img.shields.io/badge/arXiv-2024.12-red) 42 | * Proposing and solving olympiad geometry with guided tree search [[Paper]](https://arxiv.org/abs/2412.10673) ![](https://img.shields.io/badge/arXiv-2024.12-red) 43 | * SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [[Paper]](https://arxiv.org/abs/2412.11605) ![](https://img.shields.io/badge/arXiv-2024.12-red) 44 | * Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2412.17397) ![](https://img.shields.io/badge/arXiv-2024.12-red) 45 | * CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [[Paper]](https://arxiv.org/abs/2411.04329) ![](https://img.shields.io/badge/arXiv-2024.11-red) 46 | * GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection [[Paper]](https://arxiv.org/abs/2411.04459) ![](https://img.shields.io/badge/arXiv-2024.11-red) 47 | * MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [[Paper]](https://arxiv.org/abs/2411.15645) ![](https://img.shields.io/badge/arXiv-2024.11-red) 48 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) ![](https://img.shields.io/badge/arXiv-2024.11-red) 49 | * SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2411.11053) ![](https://img.shields.io/badge/arXiv-2024.11-red) 50 | * Don’t 
throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding [[Paper]](https://openreview.net/forum?id=kh9Zt2Ldmn#discussion) ![](https://img.shields.io/badge/CoLM-2024-blue) 51 | * AFlow: Automating Agentic Workflow Generation [[Paper]](https://arxiv.org/abs/2410.10762) ![](https://img.shields.io/badge/arXiv-2024.10-red) 52 | * Interpretable Contrastive Monte Carlo Tree Search Reasoning [[Paper]](https://arxiv.org/abs/2410.01707) ![](https://img.shields.io/badge/arXiv-2024.10-red) 53 | * LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2410.02884) ![](https://img.shields.io/badge/arXiv-2024.10-red) 54 | * Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning [[Paper]](https://arxiv.org/abs/2410.06508) ![](https://img.shields.io/badge/arXiv-2024.10-red) 55 | * TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [[Paper]](https://arxiv.org/abs/2410.16033) ![](https://img.shields.io/badge/arXiv-2024.10-red) 56 | * Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination [[Paper]](https://arxiv.org/abs/2410.17820) ![](https://img.shields.io/badge/arXiv-2024.10-red) 57 | * RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2409.09584) ![](https://img.shields.io/badge/arXiv-2024.09-red) 58 | * Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [[Paper]](https://arxiv.org/abs/2408.10635) ![](https://img.shields.io/badge/arXiv-2024.08-red) 59 | * LiteSearch: Efficacious Tree Search for LLM [[Paper]](https://arxiv.org/abs/2407.00320) ![](https://img.shields.io/badge/arXiv-2024.07-red) 60 | * Tree Search for Language Model Agents [[Paper]](https://arxiv.org/abs/2407.01476) ![](https://img.shields.io/badge/arXiv-2024.07-red) 61 | * Uncertainty-Guided Optimization on Large Language Model Search Trees [[Paper]](https://arxiv.org/abs/2407.03951) ![](https://img.shields.io/badge/arXiv-2024.07-red) 62 | * Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [[Paper]](https://arxiv.org/abs/2406.07394) ![](https://img.shields.io/badge/arXiv-2024.06-red) 63 | * Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [[Paper]](https://openreview.net/forum?id=rviGTsl0oy) ![](https://img.shields.io/badge/ICLR_WorkShop-2024-blue) 64 | * LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models [[Paper]](https://openreview.net/forum?id=h1mvwbQiXR) ![](https://img.shields.io/badge/ICLR_WorkShop-2024-blue) 65 | * AlphaMath Almost Zero: process Supervision without process [[Paper]](https://arxiv.org/abs/2405.03553) ![](https://img.shields.io/badge/arXiv-2024.05-red) 66 | * Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2405.15383) ![](https://img.shields.io/badge/arXiv-2024.05-red) 67 | * MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [[Paper]](https://arxiv.org/abs/2405.16265) ![](https://img.shields.io/badge/arXiv-2024.05-red) 68 | * Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2405.00451) ![](https://img.shields.io/badge/arXiv-2024.05-red) 69 | * Monte Carlo Tree Search Boosts Reasoning via 
Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2405.00451) ![](https://img.shields.io/badge/arXiv-2024.05-red) 70 | * Stream of Search (SoS): Learning to Search in Language [[Paper]](https://arxiv.org/abs/2404.03683) ![](https://img.shields.io/badge/arXiv-2024.04-red) 71 | * Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [[Paper]](https://arxiv.org/abs/2404.12253) ![](https://img.shields.io/badge/arXiv-2024.04-red) 72 | * Reasoning with Language Model is Planning with World Model [[Paper]](https://aclanthology.org/2023.emnlp-main.507/) ![](https://img.shields.io/badge/EMNLP-2023-blue) 73 | * Large Language Models as Commonsense Knowledge for Large-Scale Task Planning [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html) ![](https://img.shields.io/badge/NeurIPS-2023-blue) 74 | * ALPHAZERO-LIKE TREE-SEARCH CAN GUIDE LARGE LANGUAGE MODEL DECODING AND TRAINING [[Paper]](https://openreview.net/forum?id=PJfc4x2jXY) ![](https://img.shields.io/badge/NeurIPS_WorkShop-2023-blue) 75 | * Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [[Paper]](https://openreview.net/forum?id=PJfc4x2jXY) ![](https://img.shields.io/badge/NeurIPS_WorkShop-2023-blue) 76 | * MAKING PPO EVEN BETTER: VALUE-GUIDED MONTE-CARLO TREE SEARCH DECODING [[Paper]](https://arxiv.org/abs/2309.15028) ![](https://img.shields.io/badge/arXiv-2023.09-red) 77 | ## Part 5: Self-Training / Self-Improve 78 | * Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math) [[Paper]](https://arxiv.org/abs/2501.04519) ![](https://img.shields.io/badge/arXiv-2025.01-red) 79 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816) ![](https://img.shields.io/badge/arXiv-2024.12-red) 80 | * Recursive Introspection: Teaching Language Model Agents How to Self-Improve [[Paper]](https://openreview.net/forum?id=DRC9pZwBwR) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 81 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) ![](https://img.shields.io/badge/arXiv-2024.12-red) 82 | * ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [[Paper]](https://openreview.net/forum?id=lNAyUngGFK) ![](https://img.shields.io/badge/TMLR-2024-blue) 83 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) ![](https://img.shields.io/badge/ACL-2024-blue) 84 | * Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2406.11736) ![](https://img.shields.io/badge/arXiv-2024.06-red) 85 | * CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [[Paper]](https://openreview.net/forum?id=Sx038qxjek) ![](https://img.shields.io/badge/ICLR-2024-blue) 86 | * Enhancing Large Vision Language Models with Self-Training on Image Comprehension [[Paper]](https://arxiv.org/abs/2405.19716) ![](https://img.shields.io/badge/arXiv-2024.05-red) 87 | * Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [[Paper]](https://arxiv.org/abs/2403.09629) ![](https://img.shields.io/badge/arXiv-2024.03-red) 88 | * V-star: Training Verifiers for Self-Taught Reasoners [[Paper]](https://arxiv.org/abs/2402.06457) ![](https://img.shields.io/badge/arXiv-2024.02-red) 89 | * Self-Refine: Iterative 
Refinement with Self-Feedback [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html) ![](https://img.shields.io/badge/NeurIPS-2023-blue) 90 | * ReST: Reinforced Self-Training for Language Modeling [[Paper]](https://arxiv.org/abs/2308.08998) ![](https://img.shields.io/badge/arXiv-2023.08-red) 91 | * STaR: Bootstrapping Reasoning With Reasoning [[Paper]](https://arxiv.org/abs/2203.14465) ![](https://img.shields.io/badge/arXiv-2022.05-red) 92 | * Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search [[Paper]](https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html) ![](https://img.shields.io/badge/NeurIPS-2017-blue) 93 | ## Part 6: Reflection 94 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) ![](https://img.shields.io/badge/arXiv-2024.12-red) 95 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.11930) ![](https://img.shields.io/badge/arXiv-2024.11-red) 96 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) ![](https://img.shields.io/badge/arXiv-2024.11-red) 97 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) ![](https://img.shields.io/badge/arXiv-2024.11-red) 98 | * Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [[Paper]](https://arxiv.org/abs/2408.06195) ![](https://img.shields.io/badge/arXiv-2024.08-red) 99 | * Reflection-Tuning: An Approach for Data Recycling [[Paper]](https://arxiv.org/abs/2310.11716) ![](https://img.shields.io/badge/arXiv-2023.10-red) 100 | ## Part 7: Efficient System2 101 | * O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [[Paper]](https://arxiv.org/abs/2501.12570) ![](https://img.shields.io/badge/arXiv-2025.01-red) 102 | * Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [[Paper]](https://arxiv.org/abs/2501.01306) ![](https://img.shields.io/badge/arXiv-2025.01-red) 103 | * DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2407.01009) ![](https://img.shields.io/badge/arXiv-2024.12-red) 104 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) ![](https://img.shields.io/badge/arXiv-2024.12-red) 105 | * Token-Budget-Aware LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.18547) ![](https://img.shields.io/badge/arXiv-2024.12-red) 106 | * Training Large Language Models to Reason in a Continuous Latent Space [[Paper]](https://arxiv.org/abs/2412.06769) ![](https://img.shields.io/badge/arXiv-2024.12-red) 107 | * Guiding Language Model Reasoning with Planning Tokens [[Paper]](https://arxiv.org/abs/2310.05707) ![](https://img.shields.io/badge/arXiv-2024.10-red) 108 | ## Part 8: Explainability 109 | * Agents Thinking Fast and Slow: A Talker-Reasoner Architecture [[Paper]](https://openreview.net/forum?id=xPhcP6rbI4) ![](https://img.shields.io/badge/NeurIPS_WorkShop-2024-blue) 110 | * What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [[Paper]](https://arxiv.org/abs/2410.23743) ![](https://img.shields.io/badge/arXiv-2024.10-red) 111 | * When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? 
An Analysis of OpenAI o1 [[Paper]](https://arxiv.org/abs/2410.01792) ![](https://img.shields.io/badge/arXiv-2024.10-red) 112 | * The Impact of Reasoning Step Length on Large Language Models [[Paper]](https://arxiv.org/abs/2401.04925) ![](https://img.shields.io/badge/arXiv-2024.08-red) 113 | * Distilling System 2 into System 1 [[Paper]](https://arxiv.org/abs/2407.06023) ![](https://img.shields.io/badge/arXiv-2024.07-red) 114 | * System 2 Attention (is something you might need too) [[Paper]](https://arxiv.org/abs/2311.11829) ![](https://img.shields.io/badge/arXiv-2023.11-red) 115 | ## Part 9: Multimodal Agent related Slow-Fast System 116 | * Diving into Self-Evolving Training for Multimodal Reasoning [[Paper]](https://arxiv.org/abs/2412.17451) ![](https://img.shields.io/badge/arXiv-2025.01-red) 117 | * Visual Agents as Fast and Slow Thinkers [[Paper]](https://openreview.net/forum?id=ncCuiD3KJQ) ![](https://img.shields.io/badge/ICLR-2025-blue) 118 | * Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [[Paper]](https://arxiv.org/abs/2501.01904) ![](https://img.shields.io/badge/arXiv-2025.01-red) 119 | * Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension [[Paper]](https://arxiv.org/pdf/2412.03704) ![](https://img.shields.io/badge/arXiv-2024.12-red) 120 | * Slow Perception: Let's Perceive Geometric Figures Step-by-Step [[Paper]](https://arxiv.org/abs/2412.20631) ![](https://img.shields.io/badge/arXiv-2024.12-red) 121 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.11930) ![](https://img.shields.io/badge/arXiv-2024.11-red) 122 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) ![](https://img.shields.io/badge/arXiv-2024.11-red) 123 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) ![](https://img.shields.io/badge/arXiv-2024.11-red) 124 | ## Part 10: Benchmark and Datasets 125 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [[Paper]](https://arxiv.org/abs/2501.03124) ![](https://img.shields.io/badge/arXiv-2025.01-red) 126 | * MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [[Paper]](https://openreview.net/forum?id=GN2qbxZlni) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 127 | * Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [[Paper]](https://arxiv.org/abs/2412.21187) ![](https://img.shields.io/badge/arXiv-2024.12-red) 128 | * A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? 
[[Paper]](https://arxiv.org/abs/2409.15277) ![](https://img.shields.io/badge/arXiv-2024.09-red) 129 | -------------------------------------------------------------------------------- /src/main.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | from typing import Optional 4 | 5 | 6 | class PaperInformation: 7 | def __init__( 8 | self, paper: str, link: str, venue: str, date: str, label: Optional[str] = None 9 | ): 10 | self.paper = paper 11 | self.link = link 12 | self.venue = venue 13 | self.date = date 14 | self.label = label 15 | 16 | def __hash__(self): 17 | return hash(self.label) 18 | 19 | def __lt__(self, other: "PaperInformation"): 20 | self_year, self_month = map(int, self.date.split("-")) 21 | other_year, other_month = map(int, other.date.split("-")) 22 | 23 | if self_year != other_year: 24 | return self_year > other_year # Reverse logic to sort DESCENDING 25 | if self_month != other_month: 26 | return self_month > other_month # Reverse logic to sort DESCENDING 27 | if self.venue != other.venue: 28 | return self.venue < other.venue # Sort venues in ascending order 29 | return self.paper < other.paper # Sort titles in ascending order 30 | 31 | def __eq__(self, other: "PaperInformation"): 32 | return self.label == other.label 33 | 34 | 35 | class Utility: 36 | @staticmethod 37 | def get_paper_information(raw_paper_dict: dict) -> list[PaperInformation]: 38 | paper_information_list = [] 39 | for key, value in raw_paper_dict.items(): 40 | if isinstance(value, dict): 41 | paper_information_list.extend(Utility.get_paper_information(value)) 42 | elif isinstance(value, list): 43 | for raw_paper_information in value: 44 | if len(raw_paper_information.keys()) == 1: 45 | assert "label" in raw_paper_information.keys() 46 | else: 47 | paper_label = raw_paper_information.get("label", None) 48 | paper_information = PaperInformation( 49 | paper=raw_paper_information["paper"], 50 | link=raw_paper_information["link"], 51 | venue=raw_paper_information["venue"], 52 | date=raw_paper_information["date"], 53 | label=paper_label, 54 | ) 55 | assert ( 56 | paper_label is not None 57 | or paper_information not in paper_information_list 58 | ) 59 | paper_information_list.append(paper_information) 60 | else: 61 | raise TypeError(f"Unexpected type: {type(value)}") 62 | return paper_information_list 63 | 64 | @staticmethod 65 | def fill_paper_dict( 66 | raw_paper_dict: dict, paper_information_list: list[PaperInformation] 67 | ) -> dict: 68 | processed_paper_dict = {} 69 | for key, value in raw_paper_dict.items(): 70 | if isinstance(value, dict): 71 | processed_paper_dict[key] = Utility.fill_paper_dict( 72 | value, paper_information_list 73 | ) 74 | elif isinstance(value, list): 75 | processed_paper_dict[key] = [] 76 | for raw_paper_information in value: 77 | if ( 78 | len(raw_paper_information.keys()) == 1 79 | or "label" in raw_paper_information.keys() 80 | ): 81 | paper_label = raw_paper_information["label"] 82 | for paper_information in paper_information_list: 83 | if paper_information.label == paper_label: 84 | break 85 | else: 86 | raise ValueError(f"Paper label not found: {paper_label}") 87 | processed_paper_dict[key].append(paper_information) 88 | else: 89 | processed_paper_dict[key].append( 90 | PaperInformation( 91 | paper=raw_paper_information["paper"], 92 | link=raw_paper_information["link"], 93 | venue=raw_paper_information["venue"], 94 | date=raw_paper_information["date"], 95 | ) 96 | ) 97 | else: 98 | raise TypeError(f"Unexpected 
type: {type(value)}") 99 | return processed_paper_dict 100 | 101 | @staticmethod 102 | def generate_title_with_level(title: str, title_level: int) -> str: 103 | return f"{'#' * (title_level + 2)} {title}\n" 104 | 105 | @staticmethod 106 | def generate_readme_table_with_title( 107 | title: str, title_level: int, paper_information_list: list[PaperInformation] 108 | ) -> str: 109 | result_str = Utility.generate_title_with_level(title, title_level) 110 | result_str += "|Title|Venue|Date|\n" 111 | result_str += "|:---|:---|:---|\n" 112 | paper_information_list.sort() 113 | for paper_information in paper_information_list: 114 | result_str += ( 115 | f"|[{paper_information.paper}]({paper_information.link})|" 116 | f"{paper_information.venue}|" 117 | f"{paper_information.date}|\n" 118 | ) 119 | return result_str 120 | 121 | @staticmethod 122 | def generate_all_table( 123 | paper_dict: dict, topmost_table_level: int, current_table_str: str 124 | ) -> str: 125 | for key, value in paper_dict.items(): 126 | if isinstance(value, dict): 127 | current_table_str += Utility.generate_title_with_level( 128 | key, topmost_table_level 129 | ) 130 | current_table_str = Utility.generate_all_table( 131 | value, topmost_table_level + 1, current_table_str 132 | ) 133 | elif isinstance(value, list): 134 | current_table_str += Utility.generate_readme_table_with_title( 135 | key, topmost_table_level, value 136 | ) 137 | else: 138 | raise TypeError(f"Unexpected type: {type(value)}") 139 | return current_table_str 140 | 141 | @staticmethod 142 | def generate_list_with_title( 143 | title: str, title_level: int, paper_information_list: list[PaperInformation] 144 | ) -> str: 145 | result_str = Utility.generate_title_with_level(title, title_level) 146 | for paper_information in paper_information_list: 147 | badge_color = "blue" 148 | if "arxiv" in paper_information.link.lower(): 149 | badge_color = "red" 150 | badge_text = f"arXiv-{paper_information.date.replace('-', '.')}" 151 | else: 152 | venue = paper_information.venue.replace(" ", "_") 153 | year = paper_information.date.split("-")[0] 154 | badge_text = f"{venue}-{year}" 155 | result_str += ( 156 | f"* {paper_information.paper} [[Paper]]({paper_information.link}) " 157 | f"![](https://img.shields.io/badge/{badge_text}-{badge_color})\n" 158 | ) 159 | return result_str 160 | 161 | @staticmethod 162 | def generate_all_list( 163 | paper_dict: dict, topmost_list_level: int, current_list_str: str 164 | ) -> str: 165 | for key, value in paper_dict.items(): 166 | if isinstance(value, dict): 167 | current_list_str += Utility.generate_title_with_level( 168 | key, topmost_list_level 169 | ) 170 | current_list_str = Utility.generate_all_list( 171 | value, topmost_list_level + 1, current_list_str 172 | ) 173 | elif isinstance(value, list): 174 | current_list_str += Utility.generate_list_with_title( 175 | key, topmost_list_level, value 176 | ) 177 | else: 178 | raise TypeError(f"Unexpected type: {type(value)}") 179 | return current_list_str 180 | 181 | 182 | def main(): 183 | raw_paper_dict = json.load(open("./assets/paper.json", "r")) 184 | paper_information_list = Utility.get_paper_information(raw_paper_dict) 185 | processed_paper_dict = Utility.fill_paper_dict( 186 | raw_paper_dict, paper_information_list 187 | ) 188 | all_table_str = Utility.generate_all_table(processed_paper_dict, 0, "") 189 | with open("./src/table.md", "w") as f: 190 | f.write(all_table_str) 191 | all_list_str = Utility.generate_all_list(processed_paper_dict, 0, "") 192 | with open("./src/list.md", "w") as f: 
193 | f.write(all_list_str) 194 | 195 | 196 | if __name__ == "__main__": 197 | main() 198 | -------------------------------------------------------------------------------- /src/table.md: -------------------------------------------------------------------------------- 1 | ## Part 1: O1 Replication 2 | |Title|Venue|Date| 3 | |:---|:---|:---| 4 | |[Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems](https://arxiv.org/abs/2412.09413)|arXiv|2024-12| 5 | |[o1-Coder: an o1 Replication for Coding](https://arxiv.org/abs/2412.00154)|arXiv|2024-12| 6 | |[Enhancing LLM Reasoning with Reward-guided Tree Search](https://arxiv.org/abs/2411.11694)|arXiv|2024-11| 7 | |[Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions](https://arxiv.org/abs/2411.14405)|arXiv|2024-11| 8 | |[O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?](https://arxiv.org/abs/2411.16489)|arXiv|2024-11| 9 | |[O1 Replication Journey: A Strategic Progress Report -- Part 1](https://arxiv.org/abs/2410.18982)|arXiv|2024-10| 10 | ## Part 2: Process Reward Models 11 | |Title|Venue|Date| 12 | |:---|:---|:---| 13 | |[PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models.](https://arxiv.org/abs/2501.03124)|arXiv|2025-01| 14 | |[ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding](https://arxiv.org/abs/2501.07861)|arXiv|2025-01| 15 | |[The Lessons of Developing Process Reward Models in Mathematical Reasoning.](https://arxiv.org/abs/2501.07301)|arXiv|2025-01| 16 | |[ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark.](https://arxiv.org/abs/2501.01290)|arXiv|2025-01| 17 | |[AutoPSV: Automated Process-Supervised Verifier](https://openreview.net/forum?id=eOAPWWOGs9)|NeurIPS|2024-12| 18 | |[ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search](https://openreview.net/forum?id=8rcFOqEud5)|NeurIPS|2024-12| 19 | |[Free Process Rewards without Process Labels.](https://arxiv.org/abs/2412.01981)|arXiv|2024-12| 20 | |[Outcome-Refining Process Supervision for Code Generation](https://arxiv.org/abs/2412.15118)|arXiv|2024-12| 21 | |[Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations](https://aclanthology.org/2024.acl-long.510/)|ACL|2024-08| 22 | |[OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning](https://aclanthology.org/2024.findings-naacl.55/)|ACL Findings|2024-08| 23 | |[Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs](https://arxiv.org/abs/2406.18629)|arXiv|2024-06| 24 | |[Let's Verify Step by Step.](https://arxiv.org/abs/2305.20050)|ICLR|2024-05| 25 | |[Improve Mathematical Reasoning in Language Models by Automated Process Supervision](https://arxiv.org/abs/2306.05372)|arXiv|2023-06| 26 | |[Making Large Language Models Better Reasoners with Step-Aware Verifier](https://arxiv.org/abs/2206.02336)|arXiv|2023-06| 27 | |[Solving Math Word Problems with Process and Outcome-Based Feedback](https://arxiv.org/abs/2211.14275)|arXiv|2022-11| 28 | ## Part 3: Reinforcement Learning 29 | |Title|Venue|Date| 30 | |:---|:---|:---| 31 | |[Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search](https://arxiv.org/abs/2502.02508)|arXiv|2025-02| 32 | |[Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling](https://arxiv.org/abs/2501.11651)|arXiv|2025-01| 33 | |[Challenges in Ensuring AI Safety in DeepSeek-R1 
Models: The Shortcomings of Reinforcement Learning Strategies](https://arxiv.org/abs/2501.17030)|arXiv|2025-01| 34 | |[DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)|arXiv|2025-01| 35 | |[Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)|arXiv|2025-01| 36 | |[Does RLHF Scale? Exploring the Impacts From Data, Model, and Method](https://arxiv.org/abs/2412.06000)|arXiv|2024-12| 37 | |[Offline Reinforcement Learning for LLM Multi-Step Reasoning](https://arxiv.org/abs/2412.16145)|arXiv|2024-12| 38 | |[ReFT: Representation Finetuning for Language Models](https://aclanthology.org/2024.acl-long.410.pdf)|ACL|2024-08| 39 | |[Deepseekmath: Pushing the limits of mathematical reasoning in open language models](https://arxiv.org/abs/2402.03300)|arXiv|2024-02| 40 | ## Part 4: MCTS/Tree Search 41 | |Title|Venue|Date| 42 | |:---|:---|:---| 43 | |[On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes](https://ieeexplore.ieee.org/abstract/document/10870057/)|IEEE TAC|2025-01| 44 | |[Search-o1: Agentic Search-Enhanced Large Reasoning Models](https://arxiv.org/abs/2501.05366)|arXiv|2025-01| 45 | |[rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking](https://arxiv.org/abs/2501.04519)|arXiv|2025-01| 46 | |[ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search](https://arxiv.org/abs/2406.03816)|NeurIPS|2024-12| 47 | |[Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning](https://arxiv.org/abs/2412.09078)|arXiv|2024-12| 48 | |[HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs](https://arxiv.org/abs/2412.18925)|arXiv|2024-12| 49 | |[Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search](https://arxiv.org/abs/2412.18319)|arXiv|2024-12| 50 | |[Proposing and solving olympiad geometry with guided tree search](https://arxiv.org/abs/2412.10673)|arXiv|2024-12| 51 | |[SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models](https://arxiv.org/abs/2412.11605)|arXiv|2024-12| 52 | |[Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning](https://arxiv.org/abs/2412.17397)|arXiv|2024-12| 53 | |[CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models](https://arxiv.org/abs/2411.04329)|arXiv|2024-11| 54 | |[GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection](https://arxiv.org/abs/2411.04459)|arXiv|2024-11| 55 | |[MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree](https://arxiv.org/abs/2411.15645)|arXiv|2024-11| 56 | |[Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions](https://arxiv.org/abs/2411.14405)|arXiv|2024-11| 57 | |[SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation](https://arxiv.org/abs/2411.11053)|arXiv|2024-11| 58 | |[Don’t throw away your value model! 
Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding](https://openreview.net/forum?id=kh9Zt2Ldmn#discussion)|CoLM|2024-10| 59 | |[AFlow: Automating Agentic Workflow Generation](https://arxiv.org/abs/2410.10762)|arXiv|2024-10| 60 | |[Interpretable Contrastive Monte Carlo Tree Search Reasoning](https://arxiv.org/abs/2410.01707)|arXiv|2024-10| 61 | |[LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning](https://arxiv.org/abs/2410.02884)|arXiv|2024-10| 62 | |[Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning](https://arxiv.org/abs/2410.06508)|arXiv|2024-10| 63 | |[TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling](https://arxiv.org/abs/2410.16033)|arXiv|2024-10| 64 | |[Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination](https://arxiv.org/abs/2410.17820)|arXiv|2024-10| 65 | |[RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation](https://arxiv.org/abs/2409.09584)|arXiv|2024-09| 66 | |[Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search](https://arxiv.org/abs/2408.10635)|arXiv|2024-08| 67 | |[LiteSearch: Efficacious Tree Search for LLM](https://arxiv.org/abs/2407.00320)|arXiv|2024-07| 68 | |[Tree Search for Language Model Agents](https://arxiv.org/abs/2407.01476)|arXiv|2024-07| 69 | |[Uncertainty-Guided Optimization on Large Language Model Search Trees](https://arxiv.org/abs/2407.03951)|arXiv|2024-07| 70 | |[Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B](https://arxiv.org/abs/2406.07394)|arXiv|2024-06| 71 | |[Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping](https://openreview.net/forum?id=rviGTsl0oy)|ICLR WorkShop|2024-05| 72 | |[LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models](https://openreview.net/forum?id=h1mvwbQiXR)|ICLR WorkShop|2024-05| 73 | |[AlphaMath Almost Zero: process Supervision without process](https://arxiv.org/abs/2405.03553)|arXiv|2024-05| 74 | |[Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search](https://arxiv.org/abs/2405.15383)|arXiv|2024-05| 75 | |[MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time](https://arxiv.org/abs/2405.16265)|arXiv|2024-05| 76 | |[Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning](https://arxiv.org/abs/2405.00451)|arXiv|2024-05| 77 | |[Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning](https://arxiv.org/abs/2405.00451)|arXiv|2024-05| 78 | |[Stream of Search (SoS): Learning to Search in Language](https://arxiv.org/abs/2404.03683)|arXiv|2024-04| 79 | |[Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing](https://arxiv.org/abs/2404.12253)|arXiv|2024-04| 80 | |[Reasoning with Language Model is Planning with World Model](https://aclanthology.org/2023.emnlp-main.507/)|EMNLP|2023-12| 81 | |[Large Language Models as Commonsense Knowledge for Large-Scale Task Planning](https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html)|NeurIPS|2023-12| 82 | |[ALPHAZERO-LIKE TREE-SEARCH CAN GUIDE LARGE LANGUAGE MODEL DECODING AND TRAINING](https://openreview.net/forum?id=PJfc4x2jXY)|NeurIPS WorkShop|2023-12| 83 | |[Alphazero-like Tree-Search can Guide Large Language Model 
Decoding and Training](https://openreview.net/forum?id=PJfc4x2jXY)|NeurIPS WorkShop|2023-12| 84 | |[MAKING PPO EVEN BETTER: VALUE-GUIDED MONTE-CARLO TREE SEARCH DECODING](https://arxiv.org/abs/2309.15028)|arXiv|2023-09| 85 | ## Part 5: Self-Training / Self-Improve 86 | |Title|Venue|Date| 87 | |:---|:---|:---| 88 | |[Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math)](https://arxiv.org/abs/2501.04519)|arXiv|2025-01| 89 | |[ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search](https://arxiv.org/abs/2406.03816)|NeurIPS|2024-12| 90 | |[Recursive Introspection: Teaching Language Model Agents How to Self-Improve](https://openreview.net/forum?id=DRC9pZwBwR)|NeurIPS|2024-12| 91 | |[B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner](https://arxiv.org/abs/2412.17256)|arXiv|2024-12| 92 | |[ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models](https://openreview.net/forum?id=lNAyUngGFK)|TMLR|2024-09| 93 | |[ReFT: Representation Finetuning for Language Models](https://aclanthology.org/2024.acl-long.410.pdf)|ACL|2024-08| 94 | |[Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models](https://arxiv.org/abs/2406.11736)|arXiv|2024-06| 95 | |[CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing](https://openreview.net/forum?id=Sx038qxjek)|ICLR|2024-05| 96 | |[Enhancing Large Vision Language Models with Self-Training on Image Comprehension](https://arxiv.org/abs/2405.19716)|arXiv|2024-05| 97 | |[Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking](https://arxiv.org/abs/2403.09629)|arXiv|2024-03| 98 | |[V-star: Training Verifiers for Self-Taught Reasoners](https://arxiv.org/abs/2402.06457)|arXiv|2024-02| 99 | |[Self-Refine: Iterative Refinement with Self-Feedback](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html)|NeurIPS|2023-12| 100 | |[ReST: Reinforced Self-Training for Language Modeling](https://arxiv.org/abs/2308.08998)|arXiv|2023-08| 101 | |[STaR: Bootstrapping Reasoning With Reasoning](https://arxiv.org/abs/2203.14465)|NeurIPS2022|2022-05| 102 | |[Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search](https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html)|NeurIPS|2017-12| 103 | ## Part 6: Reflection 104 | |Title|Venue|Date| 105 | |:---|:---|:---| 106 | |[HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs](https://arxiv.org/abs/2412.18925)|arXiv|2024-12| 107 | |[AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning](https://arxiv.org/abs/2411.11930)|arXiv|2024-11| 108 | |[LLaVA-o1: Let Vision Language Models Reason Step-by-Step](https://arxiv.org/abs/2411.10440)|arXiv|2024-11| 109 | |[Vision-Language Models Can Self-Improve Reasoning via Reflection](https://arxiv.org/abs/2411.00855)|arXiv|2024-11| 110 | |[Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers](https://arxiv.org/abs/2408.06195)|arXiv|2024-08| 111 | |[Reflection-Tuning: An Approach for Data Recycling](https://arxiv.org/abs/2310.11716)|arXiv|2023-10| 112 | ## Part 7: Efficient System2 113 | |Title|Venue|Date| 114 | |:---|:---|:---| 115 | |[O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning](https://arxiv.org/abs/2501.12570)|arXiv|2025-01| 116 | |[Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow 
Thinking](https://arxiv.org/abs/2501.01306)|arXiv|2025-01| 117 | |[DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models](https://arxiv.org/abs/2407.01009)|EMNLP|2024-12| 118 | |[B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner](https://arxiv.org/abs/2412.17256)|arXiv|2024-12| 119 | |[Token-Budget-Aware LLM Reasoning](https://arxiv.org/abs/2412.18547)|arXiv|2024-12| 120 | |[Training Large Language Models to Reason in a Continuous Latent Space](https://arxiv.org/abs/2412.06769)|arXiv|2024-12| 121 | |[Guiding Language Model Reasoning with Planning Tokens](https://arxiv.org/abs/2310.05707)|CoLM|2024-10| 122 | ## Part 8: Explainability 123 | |Title|Venue|Date| 124 | |:---|:---|:---| 125 | |[Agents Thinking Fast and Slow: A Talker-Reasoner Architecture](https://openreview.net/forum?id=xPhcP6rbI4)|NeurIPS WorkShop|2024-12| 126 | |[What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective](https://arxiv.org/abs/2410.23743)|arXiv|2024-10| 127 | |[When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1](https://arxiv.org/abs/2410.01792)|arXiv|2024-10| 128 | |[The Impact of Reasoning Step Length on Large Language Models](https://arxiv.org/abs/2401.04925)|ACL Findings|2024-08| 129 | |[Distilling System 2 into System 1](https://arxiv.org/abs/2407.06023)|arXiv|2024-07| 130 | |[System 2 Attention (is something you might need too)](https://arxiv.org/abs/2311.11829)|arXiv|2023-11| 131 | ## Part 9: Multimodal Agent related Slow-Fast System 132 | |Title|Venue|Date| 133 | |:---|:---|:---| 134 | |[Diving into Self-Evolving Training for Multimodal Reasoning](https://arxiv.org/abs/2412.17451)|ICLR|2025-01| 135 | |[Visual Agents as Fast and Slow Thinkers](https://openreview.net/forum?id=ncCuiD3KJQ)|ICLR|2025-01| 136 | |[Virgo: A Preliminary Exploration on Reproducing o1-like MLLM](https://arxiv.org/abs/2501.01904)|arXiv|2025-01| 137 | |[Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension](https://arxiv.org/pdf/2412.03704)|arXiv|2024-12| 138 | |[Slow Perception: Let's Perceive Geometric Figures Step-by-Step](https://arxiv.org/abs/2412.20631)|arXiv|2024-12| 139 | |[AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning](https://arxiv.org/abs/2411.11930)|arXiv|2024-11| 140 | |[LLaVA-o1: Let Vision Language Models Reason Step-by-Step](https://arxiv.org/abs/2411.10440)|arXiv|2024-11| 141 | |[Vision-Language Models Can Self-Improve Reasoning via Reflection](https://arxiv.org/abs/2411.00855)|arXiv|2024-11| 142 | ## Part 10: Benchmark and Datasets 143 | |Title|Venue|Date| 144 | |:---|:---|:---| 145 | |[PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models](https://arxiv.org/abs/2501.03124)|arXiv|2025-01| 146 | |[MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs](https://openreview.net/forum?id=GN2qbxZlni)|NeurIPS|2024-12| 147 | |[Do NOT Think That Much for 2+3=? 
On the Overthinking of o1-like LLMs](https://arxiv.org/abs/2412.21187)|arXiv|2024-12| 148 | |[A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?](https://arxiv.org/abs/2409.15277)|arXiv|2024-09| 149 | -------------------------------------------------------------------------------- /src/timeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/src/timeline.png --------------------------------------------------------------------------------
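A note on regenerating the lists: `src/main.py` reads `./assets/paper.json` (whose contents are not included in this dump) and rewrites `src/table.md` and `src/list.md` from it. The sketch below is a minimal, hypothetical example of the nested layout that `Utility.get_paper_information` and `Utility.fill_paper_dict` appear to expect: section names mapping to lists (or sub-dicts) of entries with `paper`, `link`, `venue`, and `date` fields, plus an optional `label` that lets a label-only entry in another section reuse the same paper. The section names and the `marco-o1` label here are illustrative assumptions, not the contents of the real file.

```python
import json

# Hypothetical sketch of the ./assets/paper.json layout, inferred from the
# parsing logic in src/main.py; the keys below and the "marco-o1" label are
# illustrative only.
example_paper_dict = {
    "Part 4: MCTS/Tree Search": [
        {
            "paper": "Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions",
            "link": "https://arxiv.org/abs/2411.14405",
            "venue": "arXiv",
            "date": "2024-11",
            "label": "marco-o1",  # optional; allows cross-references from other sections
        },
    ],
    "Part 1: O1 Replication": [
        {"label": "marco-o1"},  # label-only entry: reuses the paper defined above
    ],
}

if __name__ == "__main__":
    # Dump the sketch in the same shape that main.py loads with json.load(...).
    with open("example_paper.json", "w") as f:
        json.dump(example_paper_dict, f, indent=2)
    # With a real ./assets/paper.json in place, running `python src/main.py`
    # from the repository root (the relative paths in main() assume that
    # working directory) regenerates src/table.md and src/list.md.
```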