├── README.md
├── assets
│   ├── develope.jpg
│   ├── paper.json
│   ├── timeline.jpg
│   ├── timeline.png
│   └── timeline_2.png
└── src
    ├── list.md
    ├── main.py
    ├── table.md
    └── timeline.png

/README.md:
--------------------------------------------------------------------------------
1 | # Awesome-System2-Reasoning-LLM
2 | 
3 | [![arXiv](https://img.shields.io/badge/arXiv-Slow_Reason_System-b31b1b.svg)](http://arxiv.org/abs/2502.17419)
4 | [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/zzli2022/System2-Reasoning-LLM)
5 | [![Last Commit](https://img.shields.io/github/last-commit/zzli2022/Awesome-System2-Reasoning-LLM)](https://github.com/zzli2022/System2-Reasoning-LLM)
6 | [![Contribution Welcome](https://img.shields.io/badge/Contributions-welcome-blue)]()
7 | 
8 | 
9 | ## 📢 Updates
10 | 
11 | - **2025.02**: We released a survey paper "[From System 1 to System 2: A Survey of Reasoning Large Language Models](http://arxiv.org/abs/2502.17419)". Feel free to cite or open pull requests.
12 | 
13 | 
14 | ## 👀 Introduction
15 | 
16 | Welcome to the repository for our survey paper, "From System 1 to System 2: A Survey of Reasoning Large Language Models". This repository provides resources and updates related to our research. For a detailed introduction, please refer to [our survey paper](http://arxiv.org/abs/2502.17419).
17 | 
18 | Achieving human-level intelligence requires enhancing the transition from System 1 (fast, intuitive) to System 2 (slow, deliberate) reasoning. While foundational Large Language Models (LLMs) have made significant strides, they still fall short of human-like reasoning in complex tasks. Recent reasoning LLMs, like OpenAI’s o1, have demonstrated expert-level performance in domains such as mathematics and coding, resembling System 2 thinking. This survey explores the development of reasoning LLMs, their foundational technologies, benchmarks, and future directions. We maintain an up-to-date GitHub repository to track the latest developments in this rapidly evolving field.
19 | 
20 | 
21 | ![image](./assets/develope.jpg)
22 | 
23 | This image highlights the progression of AI systems, emphasizing the shift from rapid, intuitive approaches to deliberate, reasoning-driven models. It shows how AI has evolved to handle a broader range of real-world challenges.
24 | 
25 | ![image](./assets/timeline_2.png)
26 | The recent timeline of reasoning LLMs, covering core methods and the release of open-source and closed-source reproduction projects.
27 | 28 | 29 | ## 📒 Table of Contents 30 | 31 | - [Awesome-System-2-AI](#awesome-system-2-ai) 32 | - [Part 1: O1 Replication](#part-1-o1-replication) 33 | - [Part 2: Process Reward Models](#part-2-process-reward-models) 34 | - [Part 3: Reinforcement Learning](#part-3-reinforcement-learning) 35 | - [Part 4: MCTS/Tree Search](#part-4-mctstree-search) 36 | - [Part 5: Self-Training / Self-Improve](#part-5-self-training--self-improve) 37 | - [Part 6: Reflection](#part-6-reflection) 38 | - [Part 7: Efficient System2](#part-7-efficient-system2) 39 | - [Part 8: Explainability](#part-8-explainability) 40 | - [Part 9: Multimodal Agent related Slow-Fast System](#part-9-multimodal-agent-related-slow-fast-system) 41 | - [Part 10: Benchmark and Datasets](#part-10-benchmark-and-datasets) 42 | - [Part 11: Reasoning and Safety](#part-11-reasoning-and-safety) 43 | - [Part 12: R1 Driven Multimodal Reasoning Enhancement](#part-12-r1-driven-multimodal-reasoning-enhancement) 44 | 45 | ## Part 1: O1 Replication 46 | 47 | * O1 Replication Journey: A Strategic Progress Report -- Part 1 [[Paper]](https://arxiv.org/abs/2410.18982) ![](https://img.shields.io/badge/arXiv-2024.10-red) 48 | * Enhancing LLM Reasoning with Reward-guided Tree Search [[Paper]](https://arxiv.org/abs/2411.11694) ![](https://img.shields.io/badge/arXiv-2024.11-red) 49 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) ![](https://img.shields.io/badge/arXiv-2024.11-red) 50 | * O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [[Paper]](https://arxiv.org/abs/2411.16489) ![](https://img.shields.io/badge/arXiv-2024.11-red) 51 | * Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [[Paper]](https://arxiv.org/abs/2412.09413) ![](https://img.shields.io/badge/arXiv-2024.12-red) 52 | * o1-Coder: an o1 Replication for Coding [[Paper]](https://arxiv.org/abs/2412.00154) ![](https://img.shields.io/badge/arXiv-2024.12-red) 53 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) ![](https://img.shields.io/badge/arXiv-2024.12-red) 54 | * DRT: Deep Reasoning Translation via Long Chain-of-Thought [[Paper]](https://arxiv.org/abs/2412.17498) ![](https://img.shields.io/badge/arXiv-2024.12-red) 55 | * mini-deepseek-r1 [[Blog]](https://www.philschmid.de/mini-deepseek-r1) ![](https://img.shields.io/badge/blog-2025.01-red) 56 | * Run DeepSeek R1 Dynamic 1.58-bit [[Blog]](https://unsloth.ai/blog/deepseekr1-dynamic) ![](https://img.shields.io/badge/blog-2025.01-red) 57 | * Simple Reinforcement Learning for Reasoning [[Notion]](https://hkust-nlp.notion.site/simplerl-reason) ![](https://img.shields.io/badge/Notion-2025.01-red) 58 | * TinyZero [[github]](https://github.com/Jiayi-Pan/TinyZero) ![](https://img.shields.io/badge/github-2025.01-red) 59 | * Open R1 [[github]](https://github.com/huggingface/open-r1) ![](https://img.shields.io/badge/github-2025.01-red) 60 | * Search-o1: Agentic Search-Enhanced Large Reasoning Models [[Paper]](https://arxiv.org/abs/2501.05366) ![](https://img.shields.io/badge/arXiv-2025.01-red) 61 | * Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [[Paper]](https://arxiv.org/abs/2501.01904) ![](https://img.shields.io/badge/arXiv-2025.01-red) 62 | * The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer [[Paper]](https://arxiv.org/abs/2502.15631) 
![](https://img.shields.io/badge/arXiv-2025.02-red) 63 | * Open-Reasoner-Zero [[Paper]](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf) ![](https://img.shields.io/badge/pdf-2025.02-red) 64 | * X-R1 [[github]](https://github.com/dhcode-cpp/X-R1) ![](https://img.shields.io/badge/github-2025.02-red) 65 | * Unlock-Deepseek [[Blog]](https://mp.weixin.qq.com/s/Z7P61IV3n4XYeC0Et_fvwg) ![](https://img.shields.io/badge/blog-2025.02-red) 66 | * Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [[Paper]](https://arxiv.org/abs/2502.14768) ![](https://img.shields.io/badge/arXiv-2025.02-red) 67 | * LLM-R1 [[github]](https://github.com/TideDra/lmm-r1) ![](https://img.shields.io/badge/github-2025.02-red) 68 | ## Part 2: Process Reward Models 69 | 70 | * Solving Math Word Problems with Process and Outcome-Based Feedback [[Paper]](https://arxiv.org/abs/2211.14275) ![](https://img.shields.io/badge/arXiv-2022.11-red) 71 | * Improve Mathematical Reasoning in Language Models by Automated Process Supervision [[Paper]](https://arxiv.org/abs/2306.05372) ![](https://img.shields.io/badge/arXiv-2023.06-red) 72 | * Making Large Language Models Better Reasoners with Step-Aware Verifier [[Paper]](https://arxiv.org/abs/2206.02336) ![](https://img.shields.io/badge/arXiv-2023.06-red) 73 | * Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [[Paper]](https://aclanthology.org/2024.acl-long.510/) ![](https://img.shields.io/badge/ACL-2024-blue) 74 | * OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning [[Paper]](https://aclanthology.org/2024.findings-naacl.55/) ![](https://img.shields.io/badge/ACL_Findings-2024-blue) 75 | * Let's Verify Step by Step. [[Paper]](https://arxiv.org/abs/2305.20050) ![](https://img.shields.io/badge/arXiv-2024.05-red) 76 | * Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [[Paper]](https://arxiv.org/abs/2406.18629) ![](https://img.shields.io/badge/arXiv-2024.06-red) 77 | * AutoPSV: Automated Process-Supervised Verifier [[Paper]](https://openreview.net/forum?id=eOAPWWOGs9) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 78 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://openreview.net/forum?id=8rcFOqEud5) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 79 | * Free Process Rewards without Process Labels. [[Paper]](https://arxiv.org/abs/2412.01981) ![](https://img.shields.io/badge/arXiv-2024.12-red) 80 | * Outcome-Refining Process Supervision for Code Generation [[Paper]](https://arxiv.org/abs/2412.15118) ![](https://img.shields.io/badge/arXiv-2024.12-red) 81 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. [[Paper]](https://arxiv.org/abs/2501.03124) ![](https://img.shields.io/badge/arXiv-2025.01-red) 82 | * ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [[Paper]](https://arxiv.org/abs/2501.07861) ![](https://img.shields.io/badge/arXiv-2025.01-red) 83 | * The Lessons of Developing Process Reward Models in Mathematical Reasoning. [[Paper]](https://arxiv.org/abs/2501.07301) ![](https://img.shields.io/badge/arXiv-2025.01-red) 84 | * ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark. 
[[Paper]](https://arxiv.org/abs/2501.01290) ![](https://img.shields.io/badge/arXiv-2025.01-red) 85 | * ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [[Paper]](https://arxiv.org/abs/2502.12130) ![](https://img.shields.io/badge/ICLR-2025-blue) 86 | * Uncertainty-Aware Step-wise Verification with Generative Reward Models [[Paper]](https://arxiv.org/abs/2502.11250) ![](https://img.shields.io/badge/arXiv-2025.02-red) 87 | * AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [[Paper]](https://www.arxiv.org/abs/2502.13943) ![](https://img.shields.io/badge/arXiv-2025.02-red) 88 | * Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models [[Paper]](https://www.arxiv.org/abs/2502.08922) ![](https://img.shields.io/badge/arXiv-2025.02-red) 89 | * Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [[Paper]](https://arxiv.org/abs/2502.06703) ![](https://img.shields.io/badge/arXiv-2025.02-red) 90 | * Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [[Paper]](https://arxiv.org/abs/2502.19328) ![](https://img.shields.io/badge/arXiv-2025.02-red) 91 | * Unified Reward Model for Multimodal Understanding and Generation [[Paper]](https://arxiv.org/abs/2503.05236) ![](https://img.shields.io/badge/arXiv-2025.02-red) 92 | * Reward Shaping to Mitigate Reward Hacking in RLHF [[Paper]](https://arxiv.org/abs/2502.18770) ![](https://img.shields.io/badge/arXiv-2025.02-red) 93 | * Multi-head Reward Aggregation Guided by Entropy [[Paper]](https://arxiv.org/abs/2503.20995) ![](https://img.shields.io/badge/arXiv-2025.03-red) 94 | * [[Paper]](https://arxiv.org/abs/2503.21295) ![](https://img.shields.io/badge/arXiv-2025.03-red) 95 | * Better Process Supervision with Bi-directional Rewarding Signals [[Paper]](https://arxiv.org/abs/2503.04618) ![](https://img.shields.io/badge/arXiv-2025.03-red) 96 | * Inference-Time Scaling for Generalist Reward Modeling [[Paper]](https://arxiv.org/abs/2504.02495) ![](https://img.shields.io/badge/arXiv-2025.04-red) 97 | 98 | ## Part 3: Reinforcement Learning 99 | 100 | * Improve Vision Language Model Chain-of-thought Reasoning [[Paper]](https://arxiv.org/abs/2410.16198) ![](https://img.shields.io/badge/arXiv-2024.10-red) 101 | * Does RLHF Scale? 
Exploring the Impacts From Data, Model, and Method [[Paper]](https://arxiv.org/abs/2412.06000) ![](https://img.shields.io/badge/arXiv-2024.12-red) 102 | * Offline Reinforcement Learning for LLM Multi-Step Reasoning [[Paper]](https://arxiv.org/abs/2412.16145) ![](https://img.shields.io/badge/arXiv-2024.12-red) 103 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) ![](https://img.shields.io/badge/ACL-2024-blue) 104 | * InfAlign: Inference-aware language model alignment [[Paper]](https://arxiv.org/abs/2412.19792) ![](https://img.shields.io/badge/arXiv-2024.12-red) 105 | * Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [[Paper]](https://arxiv.org/abs/2501.11651) ![](https://img.shields.io/badge/arXiv-2025.01-red) 106 | * Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies [[Paper]](https://arxiv.org/abs/2501.17030) ![](https://img.shields.io/badge/arXiv-2025.01-red) 107 | * DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [[Paper]](https://arxiv.org/abs/2501.12948) ![](https://img.shields.io/badge/arXiv-2025.01-red) 108 | * Kimi k1.5: Scaling Reinforcement Learning with LLMs [[Paper]](https://arxiv.org/abs/2501.12599) ![](https://img.shields.io/badge/arXiv-2025.01-red) 109 | * Deepseekmath: Pushing the limits of mathematical reasoning in open language models [[Paper]](https://arxiv.org/abs/2402.03300) ![](https://img.shields.io/badge/arXiv-2024.02-red) 110 | * Reasoning with Reinforced Functional Token Tuning [[Paper]](https://arxiv.org/abs/2502.13389) ![](https://img.shields.io/badge/arXiv-2025.02-red) 111 | * Value-Based Deep RL Scales Predictably [[Paper]](https://arxiv.org/abs/2502.04327) ![](https://img.shields.io/badge/arXiv-2025.02-red) 112 | * MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [[Paper]](https://arxiv.org/abs/2502.10391) ![](https://img.shields.io/badge/arXiv-2025.02-red) 113 | * Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [[Paper]](https://arxiv.org/abs/2502.02508) ![](https://img.shields.io/badge/arXiv-2025.02-red) 114 | * DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [[Paper]](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) ![](https://img.shields.io/badge/Notion-2025.02-red) 115 | * LIMR: Less is More for RL Scaling [[Paper]](https://arxiv.org/abs/2502.11886) ![](https://img.shields.io/badge/arXiv-2025.02-red) 116 | * A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics [[Paper]](https://arxiv.org/abs/2502.143) ![](https://img.shields.io/badge/arXiv-2025.02-red) 117 | * Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning [[Paper]](https://arxiv.org/abs/2502.19655) ![](https://img.shields.io/badge/arXiv-2025.02-red) 118 | * QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [[Paper]](https://arxiv.org/abs/2502.02584) ![](https://img.shields.io/badge/arXiv-2025.02-red) 119 | * Process Reinforcement through Implicit Rewards [[Paper]](https://arxiv.org/abs/2502.01456) ![](https://img.shields.io/badge/arXiv-2025.02-red) 120 | * UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [[Paper]](https://arxiv.org/abs/2503.21620) ![](https://img.shields.io/badge/arXiv-2025.03-red) 121 | * All Roads 
Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning [[Paper]](https://arxiv.org/abs/2503.01067) ![](https://img.shields.io/badge/arXiv-2025.03-red) 122 | * R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [[Paper]](https://arxiv.org/abs/2503.05132) ![](https://img.shields.io/badge/arXiv-2025.03-red) 123 | * Visual-RFT: Visual Reinforcement Fine-Tuning [[Paper]](https://arxiv.org/abs/2503.01785) ![](https://img.shields.io/badge/arXiv-2025.03-red) 124 | * GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [[Paper]](https://arxiv.org/abs/2503.08525) ![](https://img.shields.io/badge/arXiv-2025.03-red) 125 | * L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning [[Paper]](https://arxiv.org/abs/2503.04697) ![](https://img.shields.io/badge/arXiv-2025.03-red) 126 | * Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [[Paper]](https://arxiv.org/abs/2503.16219) ![](https://img.shields.io/badge/arXiv-2025.03-red) 127 | * Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement [[Paper]](https://arxiv.org/abs/2503.07065) ![](https://img.shields.io/badge/arXiv-2025.03-red) 128 | * VLAA-Thinker [[github]](https://github.com/UCSC-VLAA/VLAA-Thinking/) ![](https://img.shields.io/badge/github-2025.04-red) 129 | * Concise Reasoning via Reinforcement Learning [[Paper]](https://arxiv.org/abs/2504.05185) ![](https://img.shields.io/badge/arXiv-2025.04-red) 130 | * d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning [[github]](https://dllm-reasoning.github.io/media/preprint.pdf) ![](https://img.shields.io/badge/github-2025.04-red) 131 | * Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning [[Paper]](https://arxiv.org/abs/2504.05108) ![](https://img.shields.io/badge/arXiv-2025.04-red) 132 | * Efficient Reinforcement Finetuning via Adaptive Curriculum Learning [[Paper]](https://arxiv.org/pdf/2504.05520) ![](https://img.shields.io/badge/arXiv-2025.04-red) 133 | * VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning [[Paper]](https://arxiv.org/abs/2504.06958) ![](https://img.shields.io/badge/arXiv-2025.04-red) 134 | * SFT or RL? 
An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [[Paper]](https://arxiv.org/abs/2504.11468) ![](https://img.shields.io/badge/arXiv-2025.04-red)
135 | * RAISE: Reinforced Adaptive Instruction Selection For Large Language Models [[Paper]](https://arxiv.org/abs/2504.07282) ![](https://img.shields.io/badge/arXiv-2025.04-red)
136 | * MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning [[Paper]](https://arxiv.org/abs/2504.10160) ![](https://img.shields.io/badge/arXiv-2025.04-red)
137 | * VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [[Paper]](https://arxiv.org/abs/2503.07523) ![](https://img.shields.io/badge/arXiv-2025.04-red)
138 | 
139 | ## Part 4: MCTS/Tree Search
140 | 
141 | * Reasoning with Language Model is Planning with World Model [[Paper]](https://aclanthology.org/2023.emnlp-main.507/) ![](https://img.shields.io/badge/EMNLP-2023-blue)
142 | * Fine-grained Conversational Decoding via Isotropic and Proximal Search [[Paper]](https://aclanthology.org/2023.emnlp-main.5/) ![](https://img.shields.io/badge/EMNLP-2023-blue)
143 | * Large Language Models as Commonsense Knowledge for Large-Scale Task Planning [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html) ![](https://img.shields.io/badge/NeurIPS-2023-blue)
144 | 
145 | * Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [[Paper]](https://openreview.net/forum?id=PJfc4x2jXY) ![](https://img.shields.io/badge/NeurIPS_WorkShop-2023-blue)
146 | * Making PPO Even Better: Value-Guided Monte-Carlo Tree Search Decoding [[Paper]](https://arxiv.org/abs/2309.15028) ![](https://img.shields.io/badge/arXiv-2023.09-red)
147 | * Look-back Decoding for Open-Ended Text Generation [[Paper]](https://aclanthology.org/2023.emnlp-main.66/) ![](https://img.shields.io/badge/EMNLP-2023-blue)
148 | * Stream of Search (SoS): Learning to Search in Language [[Paper]](https://arxiv.org/abs/2404.03683) ![](https://img.shields.io/badge/arXiv-2024.04-red)
149 | * Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [[Paper]](https://arxiv.org/abs/2404.12253) ![](https://img.shields.io/badge/arXiv-2024.04-red)
150 | * Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models [[Paper]](https://openreview.net/forum?id=CVpuVe1N22&noteId=aTI8PGpO47) ![](https://img.shields.io/badge/NeurIPS-2024-blue)
151 | * AlphaMath Almost Zero: process Supervision without process [[Paper]](https://arxiv.org/abs/2405.03553) ![](https://img.shields.io/badge/arXiv-2024.05-red)
152 | * Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2405.15383) ![](https://img.shields.io/badge/arXiv-2024.05-red)
153 | * MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [[Paper]](https://arxiv.org/abs/2405.16265) ![](https://img.shields.io/badge/arXiv-2024.05-red)
154 | * Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2405.00451) ![](https://img.shields.io/badge/arXiv-2024.05-red)
155 | * Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
[[Paper]](https://arxiv.org/abs/2406.07394) ![](https://img.shields.io/badge/arXiv-2024.06-red) 156 | * Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [[Paper]](https://openreview.net/forum?id=rviGTsl0oy) ![](https://img.shields.io/badge/ICLR_WorkShop-2024-blue) 157 | * LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models [[Paper]](https://openreview.net/forum?id=h1mvwbQiXR) ![](https://img.shields.io/badge/ICLR_WorkShop-2024-blue) 158 | * LiteSearch: Efficacious Tree Search for LLM [[Paper]](https://arxiv.org/abs/2407.00320) ![](https://img.shields.io/badge/arXiv-2024.07-red) 159 | * Tree Search for Language Model Agents [[Paper]](https://arxiv.org/abs/2407.01476) ![](https://img.shields.io/badge/arXiv-2024.07-red) 160 | * Uncertainty-Guided Optimization on Large Language Model Search Trees [[Paper]](https://arxiv.org/abs/2407.03951) ![](https://img.shields.io/badge/arXiv-2024.07-red) 161 | * Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [[Paper]](https://arxiv.org/abs/2408.10635) ![](https://img.shields.io/badge/arXiv-2024.08-red) 162 | * RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2409.09584) ![](https://img.shields.io/badge/arXiv-2024.09-red) 163 | * AFlow: Automating Agentic Workflow Generation [[Paper]](https://arxiv.org/abs/2410.10762) ![](https://img.shields.io/badge/arXiv-2024.10-red) 164 | * Interpretable Contrastive Monte Carlo Tree Search Reasoning [[Paper]](https://arxiv.org/abs/2410.01707) ![](https://img.shields.io/badge/arXiv-2024.10-red) 165 | * LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2410.02884) ![](https://img.shields.io/badge/arXiv-2024.10-red) 166 | * Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning [[Paper]](https://arxiv.org/abs/2410.06508) ![](https://img.shields.io/badge/arXiv-2024.10-red) 167 | * TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [[Paper]](https://arxiv.org/abs/2410.16033) ![](https://img.shields.io/badge/arXiv-2024.10-red) 168 | * Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination [[Paper]](https://arxiv.org/abs/2410.17820) ![](https://img.shields.io/badge/arXiv-2024.10-red) 169 | * CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [[Paper]](https://arxiv.org/abs/2411.04329) ![](https://img.shields.io/badge/arXiv-2024.11-red) 170 | * GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection [[Paper]](https://arxiv.org/abs/2411.04459) ![](https://img.shields.io/badge/arXiv-2024.11-red) 171 | * MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [[Paper]](https://arxiv.org/abs/2411.15645) ![](https://img.shields.io/badge/arXiv-2024.11-red) 172 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) ![](https://img.shields.io/badge/arXiv-2024.11-red) 173 | * SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2411.11053) ![](https://img.shields.io/badge/arXiv-2024.11-red) 174 | * Don’t throw away your value model! 
Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding [[Paper]](https://openreview.net/forum?id=kh9Zt2Ldmn#discussion) ![](https://img.shields.io/badge/CoLM-2024-blue) 175 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816) ![](https://img.shields.io/badge/arXiv-2024.12-red) 176 | * Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.09078) ![](https://img.shields.io/badge/arXiv-2024.12-red) 177 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) ![](https://img.shields.io/badge/arXiv-2024.12-red) 178 | * Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2412.18319) ![](https://img.shields.io/badge/arXiv-2024.12-red) 179 | * Proposing and solving olympiad geometry with guided tree search [[Paper]](https://arxiv.org/abs/2412.10673) ![](https://img.shields.io/badge/arXiv-2024.12-red) 180 | * SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [[Paper]](https://arxiv.org/abs/2412.11605) ![](https://img.shields.io/badge/arXiv-2024.12-red) 181 | * Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2412.17397) ![](https://img.shields.io/badge/arXiv-2024.12-red) 182 | * Control-DAG: Constrained Decoding for Non-Autoregressive Directed Acyclic T5 using Weighted Finite State Automata [[Paper]](https://aclanthology.org/2024.naacl-short.42/) ![](https://img.shields.io/badge/NAACL-2024-blue) 183 | * Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2502.11169) ![](https://img.shields.io/badge/arXiv-2025.02-red) 184 | * PairJudge RM: Perform Best-of-N Sampling with Knockout Tournament [[Paper]](https://arxiv.org/abs/2501.13007) ![](https://img.shields.io/badge/arXiv-2025.01-red) 185 | * ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [[Paper]](https://arxiv.org/abs/2502.12130) ![](https://img.shields.io/badge/ICLR-2025-blue) 186 | * On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes [[Paper]](https://ieeexplore.ieee.org/abstract/document/10870057/) ![](https://img.shields.io/badge/IEEE_TAC-2025-blue) 187 | * Search-o1: Agentic Search-Enhanced Large Reasoning Models [[Paper]](https://arxiv.org/abs/2501.05366) ![](https://img.shields.io/badge/arXiv-2025.01-red) 188 | * rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [[Paper]](https://arxiv.org/abs/2501.04519) ![](https://img.shields.io/badge/arXiv-2025.01-red) 189 | * LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [[Paper]](https://arxiv.org/abs/2502.17925) ![](https://img.shields.io/badge/arXiv-2025.02-red) 190 | * Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking [[Paper]](https://arxiv.org/abs/2502.02339) ![](https://img.shields.io/badge/arXiv-2025.02-red) 191 | * DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking [[Paper]](https://arxiv.org/abs/2502.20730) ![](https://img.shields.io/badge/arXiv-2025.02-red) 192 | * Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models 
[[Paper]](https://arxiv.org/abs/2502.11881) ![](https://img.shields.io/badge/arXiv-2025.02-red)
193 | * VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search [[Paper]](https://arxiv.org/abs/2504.09130) ![](https://img.shields.io/badge/arXiv-2025.04-red)
194 | ## Part 5: Self-Training / Self-Improve
195 | 
196 | * Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search [[Paper]](https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html) ![](https://img.shields.io/badge/NeurIPS-2017-blue)
197 | * STaR: Bootstrapping Reasoning With Reasoning [[Paper]](https://arxiv.org/abs/2203.14465) ![](https://img.shields.io/badge/arXiv-2022.05-red)
198 | * Large Language Models are Better Reasoners with Self-Verification [[Paper]](https://aclanthology.org/2023.findings-emnlp.167/) ![](https://img.shields.io/badge/EMNLP_Findings-2023-blue)
199 | * Self-Evaluation Guided Beam Search for Reasoning [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/81fde95c4dc79188a69ce5b24d63010b-Abstract-Conference.html) ![](https://img.shields.io/badge/NeurIPS-2023-blue)
200 | * Self-Refine: Iterative Refinement with Self-Feedback [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html) ![](https://img.shields.io/badge/NeurIPS-2023-blue)
201 | * ReST: Reinforced Self-Training for Language Modeling [[Paper]](https://arxiv.org/abs/2308.08998) ![](https://img.shields.io/badge/arXiv-2023.08-red)
202 | 
203 | * V-STaR: Training Verifiers for Self-Taught Reasoners [[Paper]](https://arxiv.org/abs/2402.06457) ![](https://img.shields.io/badge/arXiv-2024.02-red)
204 | * Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [[Paper]](https://arxiv.org/abs/2403.09629) ![](https://img.shields.io/badge/arXiv-2024.03-red)
205 | * CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [[Paper]](https://openreview.net/forum?id=Sx038qxjek) ![](https://img.shields.io/badge/ICLR-2024-blue)
206 | * Enhancing Large Vision Language Models with Self-Training on Image Comprehension [[Paper]](https://arxiv.org/abs/2405.19716) ![](https://img.shields.io/badge/arXiv-2024.05-red)
207 | * Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2406.11736) ![](https://img.shields.io/badge/arXiv-2024.06-red)
208 | * SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [[Paper]](https://openreview.net/forum?id=pTHfApDakA) ![](https://img.shields.io/badge/ICLR_WorkShop-2024-blue)
209 | 
210 | * Learning From Correctness Without Prompting Makes LLM Efficient Reasoner [[Paper]](https://openreview.net/forum?id=dcbNzhVVQj#discussion) ![](https://img.shields.io/badge/CoLM-2024-blue)
211 | * Self-Improvement in Language Models: The Sharpening Mechanism [[Paper]](https://arxiv.org/abs/2412.01951) ![](https://img.shields.io/badge/arXiv-2024.12-red)
212 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816)
![](https://img.shields.io/badge/arXiv-2024.12-red)
213 | * Recursive Introspection: Teaching Language Model Agents How to Self-Improve [[Paper]](https://openreview.net/forum?id=DRC9pZwBwR) ![](https://img.shields.io/badge/NeurIPS-2024-blue)
214 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) ![](https://img.shields.io/badge/arXiv-2024.12-red)
215 | * ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [[Paper]](https://openreview.net/forum?id=lNAyUngGFK) ![](https://img.shields.io/badge/TMLR-2024-blue)
216 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) ![](https://img.shields.io/badge/ACL-2024-blue)
217 | * Enabling Scalable Oversight via Self-Evolving Critic [[Paper]](https://arxiv.org/abs/2501.05727) ![](https://img.shields.io/badge/arXiv-2025.01-red)
218 | * S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [[Paper]](https://www.arxiv.org/abs/2502.12853) ![](https://img.shields.io/badge/arXiv-2025.02-red)
219 | * ProgCo: Program Helps Self-Correction of Large Language Models [[Paper]](https://arxiv.org/abs/2501.01264) ![](https://img.shields.io/badge/arXiv-2025.01-red)
220 | * Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math) [[Paper]](https://arxiv.org/abs/2501.04519) ![](https://img.shields.io/badge/arXiv-2025.01-red)
221 | * Self-Training Elicits Concise Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2502.20122) ![](https://img.shields.io/badge/arXiv-2025.02-red)
222 | * Language Models can Self-Improve at State-Value Estimation for Better Search [[Paper]](https://arxiv.org/abs/2503.02878) ![](https://img.shields.io/badge/arXiv-2025.03-red)
223 | * Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning [[Paper]](https://arxiv.org/abs/2504.08672) ![](https://img.shields.io/badge/arXiv-2025.04-red)
224 | * START: Self-taught Reasoner with Tools [[Paper]](https://arxiv.org/abs/2503.04625) ![](https://img.shields.io/badge/arXiv-2025.04-red)
225 | ## Part 6: Reflection
226 | * SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [[Paper]](https://arxiv.org/abs/2308.00436) ![](https://img.shields.io/badge/arXiv-2023.08-red)
227 | * Reflection-Tuning: An Approach for Data Recycling [[Paper]](https://arxiv.org/abs/2310.11716) ![](https://img.shields.io/badge/arXiv-2023.10-red)
228 | * Learning From Mistakes Makes LLM Better Reasoner [[Paper]](https://arxiv.org/abs/2310.20689) ![](https://img.shields.io/badge/arXiv-2023.10-red)
229 | * Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [[Paper]](https://arxiv.org/abs/2408.06195) ![](https://img.shields.io/badge/arXiv-2024.08-red)
230 | * LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2410.02884) ![](https://img.shields.io/badge/arXiv-2024.10-red)
231 | * Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2412.18319) ![](https://img.shields.io/badge/arXiv-2024.12-red)
232 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.18478) ![](https://img.shields.io/badge/arXiv-2024.11-red)
233 | * Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS
[[Paper]](https://arxiv.org/abs/2411.11930) ![](https://img.shields.io/badge/arXiv-2024.11-red) 234 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) ![](https://img.shields.io/badge/arXiv-2024.11-red) 235 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) ![](https://img.shields.io/badge/arXiv-2024.11-red) 236 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) ![](https://img.shields.io/badge/arXiv-2024.11-red) 237 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) ![](https://img.shields.io/badge/arXiv-2024.12-red) 238 | * Refiner: Restructure Retrieved Content Efficiently to Advance Question-Answering Capabilities [[Paper]](https://aclanthology.org/2024.findings-emnlp.500/) ![](https://img.shields.io/badge/EMNLP-2024-blue) 239 | * rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [[Paper]](https://arxiv.org/abs/2501.04519) ![](https://img.shields.io/badge/arXiv-2025.01-red) 240 | * RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? [[Paper]](https://arxiv.org/abs/2501.11284) ![](https://img.shields.io/badge/arXiv-2025.01-red) 241 | * Perception in Reflection [[Paper]](https://arxiv.org/abs/2504.07165) ![](https://img.shields.io/badge/arXiv-2025.04-red) 242 | ## Part 7: Efficient System2 243 | 244 | * Guiding Language Model Reasoning with Planning Tokens [[Paper]](https://arxiv.org/abs/2310.05707) ![](https://img.shields.io/badge/arXiv-2024.10-red) 245 | * AutoReason: Automatic Few-Shot Reasoning Decomposition [[Paper]](https://arxiv.org/abs/2412.06975) ![](https://img.shields.io/badge/arXiv-2024.12-red) 246 | * DynaThink: Fast or Slow? 
A Dynamic Decision-Making Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2407.01009) ![](https://img.shields.io/badge/arXiv-2024.12-red) 247 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) ![](https://img.shields.io/badge/arXiv-2024.12-red) 248 | * Token-Budget-Aware LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.18547) ![](https://img.shields.io/badge/arXiv-2024.12-red) 249 | * Training Large Language Models to Reason in a Continuous Latent Space [[Paper]](https://arxiv.org/abs/2412.06769) ![](https://img.shields.io/badge/arXiv-2024.12-red) 250 | * From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs [[Paper]](https://arxiv.org/abs/2501.16207) ![](https://img.shields.io/badge/arXiv-2025.01-red) 251 | * MALT: Improving Reasoning with Multi-Agent LLM Training [[Paper]](https://arxiv.org/abs/2412.01928) ![](https://img.shields.io/badge/arXiv-2024.12-red) 252 | * Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [[Paper]](https://arxiv.org/abs/2501.18585) ![](https://img.shields.io/badge/arXiv-2025.01-red) 253 | * Efficient Reasoning with Hidden Thinking [[Paper]](https://arxiv.org/abs/2501.19201) ![](https://img.shields.io/badge/arXiv-2025.01-red) 254 | * O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [[Paper]](https://arxiv.org/abs/2501.12570) ![](https://img.shields.io/badge/arXiv-2025.01-red) 255 | * Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [[Paper]](https://arxiv.org/abs/2501.01306) ![](https://img.shields.io/badge/arXiv-2025.01-red) 256 | * Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models [[Paper]](https://www.arxiv.org/abs/2502.13260) ![](https://img.shields.io/badge/arXiv-2025.02-red) 257 | * Titans: Learning to Memorize at Test Time [[Paper]](https://arxiv.org/abs/2501.00663) ![](https://img.shields.io/badge/arXiv-2025.01-red) 258 | * MoBA: Mixture of Block Attention for Long-Context LLMs [[Paper]](https://arxiv.org/abs/2502.13189) ![](https://img.shields.io/badge/arXiv-2025.02-red) 259 | * One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs [[Paper]](https://arxiv.org/abs/2502.10454) ![](https://img.shields.io/badge/arXiv-2025.02-red) 260 | * Small Models Struggle to Learn from Strong Reasoners [[Paper]](https://arxiv.org/abs/2502.12143) ![](https://img.shields.io/badge/arXiv-2025.02-red) 261 | * TokenSkip: Controllable Chain-of-Thought Compression in LLMs [[Paper]](https://arxiv.org/abs/2502.12067) ![](https://img.shields.io/badge/arXiv-2025.02-red) 262 | * SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2502.12134) ![](https://img.shields.io/badge/arXiv-2025.02-red) 263 | * Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning [[Paper]](https://arxiv.org/abs/2502.10428) ![](https://img.shields.io/badge/arXiv-2025.02-red) 264 | * Thinking Preference Optimization [[Paper]](https://arxiv.org/abs/2502.13173) ![](https://img.shields.io/badge/arXiv-2025.02-red) 265 | * Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? 
[[Paper]](https://arxiv.org/abs/2502.12215) ![](https://img.shields.io/badge/arXiv-2025.02-red) 266 | * Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options [[Paper]](https://arxiv.org/abs/2502.12929) ![](https://img.shields.io/badge/arXiv-2025.02-red) 267 | * CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction [[Paper]](https://arxiv.org/abs/2502.07316) ![](https://img.shields.io/badge/arXiv-2025.02-red) 268 | * OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning [[Paper]](https://arxiv.org/abs/2502.11271) ![](https://img.shields.io/badge/arXiv-2025.02-red) 269 | * LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning [[Paper]](https://arxiv.org/abs/2502.11176) ![](https://img.shields.io/badge/arXiv-2025.02-red) 270 | * Atom of Thoughts for Markov LLM Test-Time Scaling [[Paper]](https://arxiv.org/abs/2502.12018) ![](https://img.shields.io/badge/arXiv-2025.02-red) 271 | * Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity [[Paper]](https://arxiv.org/abs/2502.11147) ![](https://img.shields.io/badge/arXiv-2025.02-red) 272 | * Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models [[Paper]](https://arxiv.org/abs/2502.12855) ![](https://img.shields.io/badge/arXiv-2025.02-red) 273 | * Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning [[Paper]](https://arxiv.org/abs/2502.08482) ![](https://img.shields.io/badge/arXiv-2025.02-red) 274 | * Scalable Language Models with Posterior Inference of Latent Thought Vectors [[Paper]](https://arxiv.org/abs/2502.01567) ![](https://img.shields.io/badge/arXiv-2025.02-red) 275 | * Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning [[Paper]](https://arxiv.org/abs/2502.08482) ![](https://img.shields.io/badge/arXiv-2025.02-red) 276 | * Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [[Paper]](https://arxiv.org/abs/2502.03275) ![](https://img.shields.io/badge/arXiv-2025.02-red) 277 | * LightThinker: Thinking Step-by-Step Compression [[Paper]](https://arxiv.org/abs/2502.15589) ![](https://img.shields.io/badge/arXiv-2025.02-red) 278 | * The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities [[Paper]](https://arxiv.org/pdf/2502.17416) ![](https://img.shields.io/badge/ICLR-2025-blue) 279 | * Reasoning with Latent Thoughts: On the Power of Looped Transformers [[Paper]](https://arxiv.org/pdf/2502.17416) ![](https://img.shields.io/badge/ICLR-2025-blue) 280 | * Efficient Reasoning with Hidden Thinking [[Paper]](https://arxiv.org/pdf/2501.19201) ![](https://img.shields.io/badge/arXiv-2025.01-red) 281 | * Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2502.20332) ![](https://img.shields.io/badge/arXiv-2025.02-red) 282 | * Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study [[Paper]](https://arxiv.org/abs/2502.11514) ![](https://img.shields.io/badge/arXiv-2025.02-red) 283 | * Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2502.19918) ![](https://img.shields.io/badge/arXiv-2025.02-red) 284 | * FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [[Paper]](https://arxiv.org/abs/2502.20238) ![](https://img.shields.io/badge/arXiv-2025.02-red) 285 | * MixLLM: 
Dynamic Routing in Mixed Large Language Models [[Paper]](https://arxiv.org/abs/2502.18482) ![](https://img.shields.io/badge/arXiv-2025.02-red) 286 | * PEARL: Towards Permutation-Resilient LLMs [[Paper]](https://arxiv.org/abs/2502.14628) ![](https://img.shields.io/badge/arXiv-2025.02-red) 287 | * Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment [[Paper]](https://www.arxiv.org/abs/2502.07803) ![](https://img.shields.io/badge/arXiv-2025.03-red) 288 | * Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? [[Paper]](https://arxiv.org/abs/2502.19361) ![](https://img.shields.io/badge/arXiv-2025.02-red) 289 | * Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs [[Paper]](https://arxiv.org/abs/2502.19411) ![](https://img.shields.io/badge/arXiv-2025.02-red) 290 | * Training Large Language Models to be Better Rule Followers [[Paper]](https://arxiv.org/abs/2502.11525) ![](https://img.shields.io/badge/arXiv-2025.02-red) 291 | * Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research [[Paper]](https://arxiv.org/abs/2502.04644) ![](https://img.shields.io/badge/arXiv-2025.02-red) 292 | * CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation [[Paper]](https://arxiv.org/abs/2502.21074) ![](https://img.shields.io/badge/arXiv-2025.02-red) 293 | * SIFT: Grounding LLM Reasoning in Contexts via Stickers [[Paper]](https://arxiv.org/abs/2502.14922) ![](https://img.shields.io/badge/arXiv-2025.02-red) 294 | * AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [[Paper]](https://arxiv.org/abs/2502.13943) ![](https://img.shields.io/badge/arXiv-2025.02-red) 295 | * How Well do LLMs Compress Their Own Chain-of-Thought? 
A Token Complexity Approach [[Paper]](https://arxiv.org/abs/2503.01141) ![](https://img.shields.io/badge/arXiv-2025.03-red) 296 | * PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2503.02324) ![](https://img.shields.io/badge/arXiv-2025.03-red) 297 | * DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [[Paper]](https://arxiv.org/abs/2503.04472) ![](https://img.shields.io/badge/arXiv-2025.03-red) 298 | * Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [[Paper]](https://arxiv.org/abs/2503.04691) ![](https://img.shields.io/badge/arXiv-2025.03-red) 299 | * Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models [[Paper]](https://arxiv.org/abs/2503.09567) ![](https://img.shields.io/badge/arXiv-2025.03-red) 300 | * TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation [[Paper]](https://arxiv.org/abs/2503.04872) ![](https://img.shields.io/badge/arXiv-2025.03-red) 301 | * Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning [[Paper]](https://arxiv.org/abs/2503.05641) ![](https://img.shields.io/badge/arXiv-2025.03-red) 302 | * Entropy-based Exploration Conduction for Multi-step Reasoning [[Paper]](https://arxiv.org/abs/2503.15848) ![](https://img.shields.io/badge/arXiv-2025.03-red) 303 | * MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion [[Paper]](https://arxiv.org/abs/2503.16212) ![](https://img.shields.io/badge/arXiv-2025.03-red) 304 | * Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [[Paper]](https://arxiv.org/abs/2503.16419) ![](https://img.shields.io/badge/arXiv-2025.03-red) 305 | * ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs [[Paper]](https://arxiv.org/abs/2503.12918) ![](https://img.shields.io/badge/arXiv-2025.03-red) 306 | * Agent models: Internalizing Chain-of-Action Generation into Reasoning models [[Paper]](https://arxiv.org/abs/2503.06580) ![](https://img.shields.io/badge/arXiv-2025.03-red) 307 | * StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error [[Paper]](https://arxiv.org/abs/2503.10105) ![](https://img.shields.io/badge/arXiv-2025.03-red) 308 | * Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding [[Paper]](https://arxiv.org/abs/2503.10183) ![](https://img.shields.io/badge/arXiv-2025.03-red) 309 | * Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators [[Paper]](https://arxiv.org/abs/2503.19877) ![](https://img.shields.io/badge/arXiv-2025.03-red) 310 | * Shared Global and Local Geometry of Language Model Embeddings [[Paper]](https://arxiv.org/abs/2503.21073) ![](https://img.shields.io/badge/arXiv-2025.03-red) 311 | * Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [[Paper]](https://arxiv.org/abs/2503.13360) ![](https://img.shields.io/badge/arXiv-2025.03-red) 312 | * Effectively Controlling Reasoning Models through Thinking Intervention [[Paper]](https://arxiv.org/abs/2503.24370) ![](https://img.shields.io/badge/arXiv-2025.03-red) 313 | * Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [[Paper]](https://arxiv.org/abs/2503.02318) ![](https://img.shields.io/badge/arXiv-2025.03-red) 314 | * TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning 
[[Paper]](https://arxiv.org/abs/2504.09641) ![](https://img.shields.io/badge/arXiv-2025.04-red) 315 | * Lemmanaid: Neuro-Symbolic Lemma Conjecturing [[Paper]](https://arxiv.org/abs/2504.04942) ![](https://img.shields.io/badge/arXiv-2025.04-red) 316 | * ThoughtProbe: Classifier-Guided Thought Space Exploration Leveraging LLM Intrinsic Reasoning [[Paper]](https://arxiv.org/abs/2504.06650) ![](https://img.shields.io/badge/arXiv-2025.04-red) 317 | * Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought [[Paper]](https://arxiv.org/abs/2504.05599) ![](https://img.shields.io/badge/arXiv-2025.04-red) 318 | * Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification [[Paper]](https://arxiv.org/abs/2504.05419) ![](https://img.shields.io/badge/arXiv-2025.04-red) 319 | * Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill? [[Paper]](https://arxiv.org/abs/2504.06514) ![](https://img.shields.io/badge/arXiv-2025.04-red) 320 | * Decentralizing AI Memory: SHIMI, a Semantic Hierarchical Memory Index for Scalable Agent Reasoning [[Paper]](https://arxiv.org/abs/2504.06135) ![](https://img.shields.io/badge/arXiv-2025.04-red) 321 | * Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning [[Paper]](https://arxiv.org/pdf/2504.05632) ![](https://img.shields.io/badge/arXiv-2025.04-red) 322 | * Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead [[Paper]](https://arxiv.org/abs/2504.00294) ![](https://img.shields.io/badge/arXiv-2025.04-red) 323 | * RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability [[Paper]](https://arxiv.org/abs/2504.10081) ![](https://img.shields.io/badge/arXiv-2025.04-red) 324 | * Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [[Paper]](https://arxiv.org/abs/2504.11741) ![](https://img.shields.io/badge/arXiv-2025.04-red) 325 | 326 | 327 | ## Part 8: Explainability 328 | * Agents Thinking Fast and Slow: A Talker-Reasoner Architecture [[Paper]](https://openreview.net/forum?id=xPhcP6rbI4) ![](https://img.shields.io/badge/NeurIPS_WorkShop-2024-blue) 329 | * Distilling System 2 into System 1 [[Paper]](https://arxiv.org/abs/2407.06023) ![](https://img.shields.io/badge/arXiv-2024.07-red) 330 | * The Impact of Reasoning Step Length on Large Language Models [[Paper]](https://arxiv.org/abs/2401.04925) ![](https://img.shields.io/badge/arXiv-2024.08-red) 331 | * What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [[Paper]](https://arxiv.org/abs/2410.23743) ![](https://img.shields.io/badge/arXiv-2024.10-red) 332 | * When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? 
An Analysis of OpenAI o1 [[Paper]](https://arxiv.org/abs/2410.01792) ![](https://img.shields.io/badge/arXiv-2024.10-red) 333 | * System 2 Attention (is something you might need too) [[Paper]](https://arxiv.org/abs/2311.11829) ![](https://img.shields.io/badge/arXiv-2023.11-red) 334 | * Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought [[Paper]](https://arxiv.org/abs/2501.04682) ![](https://img.shields.io/badge/arXiv-2025.01-red) 335 | * LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [[Paper]](https://arxiv.org/abs/2501.06186) ![](https://img.shields.io/badge/arXiv-2025.01-red) 336 | * Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time [[Paper]](https://arxiv.org/abs/2502.19230) ![](https://img.shields.io/badge/arXiv-2025.02-red) 337 | * Large Reasoning Models in Agent Scenarios: Exploring the Necessity of Reasoning Capabilities [[Paper]](https://arxiv.org/abs/2503.11074) ![](https://img.shields.io/badge/arXiv-2025.03-red) 338 | * Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [[Paper]](https://arxiv.org/abs/2503.15558) ![](https://img.shields.io/badge/arXiv-2025.03-red) 339 | ## Part 9: Multimodal Agent related Slow-Fast System 340 | 341 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.11930) ![](https://img.shields.io/badge/arXiv-2024.11-red) 342 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) ![](https://img.shields.io/badge/arXiv-2024.11-red) 343 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) ![](https://img.shields.io/badge/arXiv-2024.11-red) 344 | * Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension [[Paper]](https://arxiv.org/pdf/2412.03704) ![](https://img.shields.io/badge/arXiv-2024.12-red) 345 | * Slow Perception: Let's Perceive Geometric Figures Step-by-Step [[Paper]](https://arxiv.org/abs/2412.20631) ![](https://img.shields.io/badge/arXiv-2024.12-red) 346 | * Diving into Self-Evolving Training for Multimodal Reasoning [[Paper]](https://arxiv.org/abs/2412.17451) ![](https://img.shields.io/badge/arXiv-2025.01-red) 347 | * Visual Agents as Fast and Slow Thinkers [[Paper]](https://openreview.net/forum?id=ncCuiD3KJQ) ![](https://img.shields.io/badge/ICLR-2025-blue) 348 | * Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [[Paper]](https://arxiv.org/abs/2501.01904) ![](https://img.shields.io/badge/arXiv-2025.01-red) 349 | * I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models [[Paper]](https://arxiv.org/abs/2502.10458) ![](https://img.shields.io/badge/arXiv-2025.02-red) 350 | * RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision [[Paper]](https://arxiv.org/abs/2502.13957) ![](https://img.shields.io/badge/arXiv-2025.02-red) 351 | ## Part 10: Benchmark and Datasets 352 | 353 | * Evaluation of OpenAI o1: Opportunities and Challenges of AGI [[Paper]](https://arxiv.org/abs/2409.18486) ![](https://img.shields.io/badge/arXiv-2024.09-red) 354 | * A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? 
[[Paper]](https://arxiv.org/abs/2409.15277) ![](https://img.shields.io/badge/arXiv-2024.09-red) 355 | * FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI [[Paper]](https://arxiv.org/abs/2411.04872) ![](https://img.shields.io/badge/arXiv-2024.11-red) 356 | * MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [[Paper]](https://openreview.net/forum?id=GN2qbxZlni) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 357 | * Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [[Paper]](https://arxiv.org/abs/2412.21187) ![](https://img.shields.io/badge/arXiv-2024.12-red) 358 | * EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [[Paper]](https://arxiv.org/abs/2502.12466) ![](https://img.shields.io/badge/arXiv-2025.02-red) 359 | * SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [[Paper]](https://arxiv.org/abs/2502.14739) ![](https://img.shields.io/badge/arXiv-2025.02-red) 360 | * Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [[Paper]](https://arxiv.org/abs/2502.14191) ![](https://img.shields.io/badge/arXiv-2025.02-red) 361 | * MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [[Paper]](https://arxiv.org/abs/2502.06453) ![](https://img.shields.io/badge/arXiv-2025.02-red) 362 | * LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion [[Paper]](https://arxiv.org/abs/2501.15089) ![](https://img.shields.io/badge/arXiv-2025.01-red) 363 | * Humanity's Last Exam [[Paper]](https://arxiv.org/abs/2501.14249) ![](https://img.shields.io/badge/arXiv-2025.01-red) 364 | * RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style [[Paper]](https://openreview.net/forum?id=QEHrmQPBdd)![](https://img.shields.io/badge/ICLR(Oral)-2025.01-blue) 365 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [[Paper]](https://arxiv.org/abs/2501.03124) ![](https://img.shields.io/badge/arXiv-2025.01-red) 366 | * Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models [[Paper]](https://arxiv.org/abs/2502.17387) ![](https://img.shields.io/badge/arXiv-2025.02-red) 367 | * ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models [[paper]](https://arxiv.org/abs/2502.09696) ![](https://img.shields.io/badge/arXiv-2025.02-red) 368 | * MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [[paper]](https://arxiv.org/abs/2502.09621) ![](https://img.shields.io/badge/arXiv-2025.02-red) 369 | * MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [[paper]](https://arxiv.org/abs/2502.00698) ![](https://img.shields.io/badge/arXiv-2025.02-red) 370 | * LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems [[Paper]](https://arxiv.org/abs/2502.17848) ![](https://img.shields.io/badge/arXiv-2025.02-red) 371 | * BIG-Bench Extra Hard [[Paper]](https://arxiv.org/abs/2502.19187) ![](https://img.shields.io/badge/arXiv-2025.02-red) 372 | * MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts [[paper]](https://arxiv.org/abs/2502.20808) ![](https://img.shields.io/badge/arXiv-2025.02-red) 373 | * MastermindEval: A Simple But Scalable Reasoning Benchmark [[paper]](https://arxiv.org/abs/2503.05891) 
![](https://img.shields.io/badge/arXiv-2025.03-red) 374 | * DNA Bench: When Silence is Smarter -- Benchmarking Over-Reasoning in Reasoning LLMs [[paper]](https://arxiv.org/abs/2503.15793) ![](https://img.shields.io/badge/arXiv-2025.03-red) 375 | * V1: Toward Multimodal Reasoning by Designing Auxiliary Tasks [[github]](https://github.com/haonan3/V1) ![](https://img.shields.io/badge/github-2025.03-red) 376 | * ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [[paper]](https://arxiv.org/abs/2503.21248) ![](https://img.shields.io/badge/arXiv-2025.03-red) 377 | * S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models [[paper]](https://arxiv.org/abs/2504.10368) ![](https://img.shields.io/badge/arXiv-2025.04-red) 378 | * When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks [[paper]](https://arxiv.org/abs/2504.02010) ![](https://img.shields.io/badge/arXiv-2025.04-red) 379 | * BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [[paper]](https://openai.com/index/browsecomp/) ![](https://img.shields.io/badge/OpenAI-2025.04-red) 380 | * Mle-bench: Evaluating machine learning agents on machine learning engineering [[paper]](https://arxiv.org/abs/2410.07095) ![](https://img.shields.io/badge/arXiv-2024.10-red) 381 | * How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments [[paper]](https://arxiv.org/abs/2403.11807) ![](https://img.shields.io/badge/arXiv-2024.03-red) 382 | * OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [[paper]](https://openreview.net/forum?id=tN61DTr4Ed) ![](https://img.shields.io/badge/NeurIPS-2024.09-blue) 383 | * ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark [[paper]](https://arxiv.org/abs/2501.01290) ![](https://img.shields.io/badge/arXiv-2025.01-red) 384 | * Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [[paper]](https://arxiv.org/abs/2501.11733) ![](https://img.shields.io/badge/arXiv-2025.01-red) 385 | * PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning [[paper]](https://arxiv.org/abs/2502.12054) ![](https://img.shields.io/badge/arXiv-2025.02-red) 386 | * Text2World: Benchmarking Large Language Models for Symbolic World Model Generation [[paper]](https://arxiv.org/abs/2502.13092) ![](https://img.shields.io/badge/arXiv-2025.02-red) 387 | * WebGames: Challenging General-Purpose Web-Browsing AI Agents [[paper]](https://arxiv.org/abs/2502.18356) ![](https://img.shields.io/badge/arXiv-2025.02-red) 388 | * UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [[paper]](https://arxiv.org/abs/2503.21620) ![](https://img.shields.io/badge/arXiv-2025.03-red) 389 | * Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [[paper]](https://openreview.net/forum?id=HjwK-Tc_Bc) ![](https://img.shields.io/badge/NeurIPS-2022.11-blue) 390 | * Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots [[paper]](https://arxiv.org/abs/2405.07990) ![](https://img.shields.io/badge/arXiv-2024.05-red) 391 | * M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought [[paper]](https://aclanthology.org/2024.acl-long.446/) ![](https://img.shields.io/badge/ACL-2024.08-blue) 392 | * PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models 
with Abstract Visual Patterns [[paper]](https://aclanthology.org/2024.findings-acl.962/) ![](https://img.shields.io/badge/ACL_findings-2024.08-blue) 393 | * Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation [[paper]](https://openreview.net/forum?id=t1mAXb4Cop) ![](https://img.shields.io/badge/NeurIPS-2024.09-blue) 394 | * HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks [[paper]](https://arxiv.org/abs/2410.12381) ![](https://img.shields.io/badge/arXiv-2024.10-red) 395 | * CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models [[paper]](https://arxiv.org/abs/2412.12932) ![](https://img.shields.io/badge/arXiv-2024.12-red) 396 | * ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation [[paper]](https://openreview.net/forum?id=sGpCzsfd1K) ![](https://img.shields.io/badge/ICLR-2025.01-blue) 397 | * Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios [[paper]](https://arxiv.org/abs/2502.19973) ![](https://img.shields.io/badge/arXiv-2025.02-red) 398 | * EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges [[paper]](https://arxiv.org/abs/2502.08859) ![](https://img.shields.io/badge/arXiv-2025.02-red) 399 | * Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities [[paper]](https://arxiv.org/abs/2502.11829) ![](https://img.shields.io/badge/arXiv-2025.02-red) 400 | * Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models [[paper]](https://arxiv.org/abs/2503.04801) ![](https://img.shields.io/badge/arXiv-2025.03-red) 401 | * MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems [[paper]](https://arxiv.org/abs/2503.01891) ![](https://img.shields.io/badge/arXiv-2025.03-red) 402 | * LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? 
[[paper]](https://arxiv.org/abs/2503.19990) ![](https://img.shields.io/badge/arXiv-2025.03-red) 403 | * On the Measure of Intelligence [[paper]](https://arxiv.org/abs/1911.01547) ![](https://img.shields.io/badge/arXiv-2019.11-red) 404 | * Competition-Level Code Generation with AlphaCode [[paper]](https://arxiv.org/abs/2203.07814) ![](https://img.shields.io/badge/arXiv-2022.03-red) 405 | * Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them [[paper]](https://aclanthology.org/2023.findings-acl.824/) ![](https://img.shields.io/badge/ACL_findings-2023.07-blue) 406 | * OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [[paper]](https://openreview.net/forum?id=ayF8bEKYQy) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 407 | * Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning [[paper]](https://openreview.net/forum?id=YXnwlZe0yf) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 408 | * Let's Verify Step by Step [[paper]](https://openreview.net/forum?id=v8L0pN6EOi) ![](https://img.shields.io/badge/ICLR-2024.01-blue) 409 | * MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation [[paper]](https://arxiv.org/abs/2405.11430) ![](https://img.shields.io/badge/arXiv-2024.05-red) 411 | * LiveBench: A Challenging, Contamination-Limited LLM Benchmark [[paper]](https://openreview.net/forum?id=sKYHBTAxVa) ![](https://img.shields.io/badge/ICLR-2025-blue) 412 | * JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [[paper]](https://arxiv.org/abs/2501.14851) ![](https://img.shields.io/badge/arXiv-2025.01-red) 413 | * MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding [[paper]](https://arxiv.org/abs/2501.18362) ![](https://img.shields.io/badge/arXiv-2025.01-red) 414 | * Theoretical Physics Benchmark (TPBench)--a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics [[paper]](https://arxiv.org/abs/2502.15815) ![](https://img.shields.io/badge/arXiv-2025.02-red) 415 | * AIME 2025 [[huggingface]](https://huggingface.co/datasets/opencompass/AIME2025) ![](https://img.shields.io/badge/Huggingface-2025.02-yellow) 416 | * ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning [[paper]](https://arxiv.org/abs/2502.16268) ![](https://img.shields.io/badge/arXiv-2025.02-red) 417 | * ProBench: Benchmarking Large Language Models in Competitive Programming [[paper]](https://arxiv.org/abs/2502.20868) ![](https://img.shields.io/badge/arXiv-2025.02-red) 418 | * ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning [[paper]](https://arxiv.org/abs/2502.01100) ![](https://img.shields.io/badge/arXiv-2025.02-red) 419 | * DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization [[paper]](https://openreview.net/forum?id=2Zan4ATYsh) ![](https://img.shields.io/badge/TMLR-2025.02-blue) 420 | * QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? 
[[paper]](https://arxiv.org/abs/2503.22674) ![](https://img.shields.io/badge/arXiv-2025.03-red) 421 | * Benchmarking Reasoning Robustness in Large Language Models [[paper]](https://arxiv.org/abs/2503.04550) ![](https://img.shields.io/badge/arXiv-2025.03-red) 422 | * Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges [[paper]](https://arxiv.org/abs/2502.08680) ![](https://img.shields.io/badge/arXiv-2025.02-red) 423 | * Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights [[paper]](https://arxiv.org/abs/2502.12521) ![](https://img.shields.io/badge/arXiv-2025.02-red) 424 | * RewardBench: Evaluating Reward Models for Language Modeling [[paper]](https://arxiv.org/abs/2403.13787) ![](https://img.shields.io/badge/arXiv-2024.03-red) 425 | * Evaluating LLMs at Detecting Errors in LLM Responses [[paper]](https://openreview.net/forum?id=dnwRScljXr) ![](https://img.shields.io/badge/COLM-2024.07-blue) 426 | * CriticBench: Benchmarking LLMs for Critique-Correct Reasoning [[paper]](https://aclanthology.org/2024.findings-acl.91/) ![](https://img.shields.io/badge/ACL_findings-2024.08-blue) 427 | * JudgeBench: A Benchmark for Evaluating LLM-based Judges [[paper]](https://arxiv.org/abs/2410.12784) ![](https://img.shields.io/badge/arXiv-2024.10-red) 428 | * ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [[paper]](https://arxiv.org/abs/2410.04509) ![](https://img.shields.io/badge/arXiv-2024.10-red) 429 | * ProcessBench: Identifying Process Errors in Mathematical Reasoning [[paper]](https://arxiv.org/abs/2412.06559) ![](https://img.shields.io/badge/arXiv-2024.12-red) 430 | * MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes [[paper]](https://arxiv.org/abs/2412.19260) ![](https://img.shields.io/badge/arXiv-2024.12-red) 431 | * CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models [[paper]](https://arxiv.org/abs/2502.16614) ![](https://img.shields.io/badge/arXiv-2025.02-red) 432 | * Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? 
[[paper]](https://arxiv.org/abs/2502.19361) ![](https://img.shields.io/badge/arXiv-2025.02-red) 433 | * FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [[paper]](https://arxiv.org/abs/2502.20238) ![](https://img.shields.io/badge/arXiv-2025.02-red) 434 | 435 | 436 | 437 | ## Part 11: Reasoning and Safety 438 | * Measuring Faithfulness in Chain-of-Thought Reasoning [[Blog]](https://www.anthropic.com/research/measuring-faithfulness-in-chain-of-thought-reasoning) ![](https://img.shields.io/badge/blog-2023.07-red) 439 | * Deliberative Alignment: Reasoning Enables Safer Language Models [[Paper]](https://arxiv.org/abs/2412.16339) ![](https://img.shields.io/badge/arXiv-2024.12-red) 440 | * OpenAI trained o1 and o3 to ‘think’ about its safety policy [[Blog]](https://techcrunch.com/2024/12/22/openai-trained-o1-and-o3-to-think-about-its-safety-policy) ![](https://img.shields.io/badge/blog-2024.12-red) 441 | * Why AI Safety Researchers Are Worried About DeepSeek [[Blog]](https://time.com/7210888/deepseeks-hidden-ai-safety-warning/) ![](https://img.shields.io/badge/blog-2025.01-red) 442 | * OverThink: Slowdown Attacks on Reasoning LLMs [[Paper]](https://arxiv.org/abs/2502.02542) ![](https://img.shields.io/badge/arXiv-2025.02-red) 443 | * GuardReasoner: Towards Reasoning-based LLM Safeguards [[Paper]](https://arxiv.org/abs/2501.18492) ![](https://img.shields.io/badge/ICLR_WorkShop-2025-blue) 444 | * SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [[Paper]](https://arxiv.org/abs/2502.12025) ![](https://img.shields.io/badge/arXiv-2025.02-red) 445 | * ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [[Paper]](https://arxiv.org/abs/2502.13458) ![](https://img.shields.io/badge/arXiv-2025.02-red) 447 | * H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking [[Paper]](https://arxiv.org/abs/2502.12893) ![](https://img.shields.io/badge/arXiv-2025.02-red) 448 | * BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack [[Paper]](https://arxiv.org/abs/2502.12202) ![](https://img.shields.io/badge/arXiv-2025.02-red) 449 | * The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [[Paper]](https://arxiv.org/abs/2502.12659) ![](https://img.shields.io/badge/arXiv-2025.02-red) 450 | * Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google [[Blog]](https://far.ai/post/2025-02-r1-redteaming/) ![](https://img.shields.io/badge/blog-2025.02-red) 451 | * Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable [[Paper]](https://arxiv.org/abs/2503.00555) ![](https://img.shields.io/badge/arXiv-2025.03-red) 452 | * DeepSeek-R1 Thoughtology: Let's \<think\> about LLM Reasoning [[Paper]](https://arxiv.org/abs/2504.12659) ![](https://img.shields.io/badge/arXiv-2025.04-red) 453 | * STAR-1: Safer Alignment of Reasoning LLMs with 1K Data [[Paper]](https://arxiv.org/abs/2504.01903) ![](https://img.shields.io/badge/arXiv-2025.04-red) 454 | 455 | ## Part 12: R1 Driven Multimodal Reasoning Enhancement 456 | * Open R1 Video [[github]](https://github.com/Wang-Xiaodong1899/Open-R1-Video) 
![](https://img.shields.io/badge/github-2025.02-red) 457 | * R1-Vision: Let's first take a look at the image [[github]](https://github.com/yuyq96/R1-Vision) ![](https://img.shields.io/badge/github-2025.02-red) 458 | * MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning [[paper]](https://arxiv.org/abs/2502.19634) ![](https://img.shields.io/badge/arXiv-2025.02-red) 459 | * Efficient-R1-VLLM: Efficient RL-Tuned MoE Vision-Language Model For Reasoning [[github]](https://github.com/baibizhe/Efficient-R1-VLLM) ![](https://img.shields.io/badge/github-2025.03-red) 460 | * MMR1: Advancing the Frontiers of Multimodal Reasoning [[github]](https://github.com/LengSicong/MMR1) ![](https://img.shields.io/badge/github-2025.03-red) 461 | * Skywork-R1V: Pioneering Multimodal Reasoning with CoT [[github]](https://github.com/SkyworkAI/Skywork-R1V/tree/main) ![](https://img.shields.io/badge/github-2025.03-red) 462 | * VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [[Blog]](https://om-ai-lab.github.io/index.html) ![](https://img.shields.io/badge/blog-2025.03-red) 463 | * Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [[paper]](https://arxiv.org/abs/2503.24376v1) ![](https://img.shields.io/badge/arXiv-2025.03-red) 464 | * Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [[paper]](https://arxiv.org/abs/2503.20752) ![](https://img.shields.io/badge/arXiv-2025.03-red) 465 | * MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [[paper]](https://arxiv.org/abs/2503.07365) ![](https://img.shields.io/badge/arXiv-2025.03-red) 466 | * R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [[paper]](https://arxiv.org/abs/2503.05379) ![](https://img.shields.io/badge/arXiv-2025.03-red) 467 | * R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [[paper]](https://arxiv.org/abs/2503.10615) ![](https://img.shields.io/badge/arXiv-2025.03-red) 468 | * R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization [[paper]](https://arxiv.org/abs/2503.12937) ![](https://img.shields.io/badge/arXiv-2025.03-red) 469 | * Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering [[paper]](https://arxiv.org/abs/2503.11197) ![](https://img.shields.io/badge/arXiv-2025.03-red) 470 | * Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [[paper]](https://arxiv.org/abs/2503.06520) ![](https://img.shields.io/badge/arXiv-2025.03-red) 471 | * TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM [[paper]](https://arxiv.org/abs/2503.13377) ![](https://img.shields.io/badge/arXiv-2025.03-red) 472 | * Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [[paper]](https://arxiv.org/abs/2503.06749) ![](https://img.shields.io/badge/arXiv-2025.03-red) 473 | * Q-Insight: Understanding Image Quality via Visual Reinforcement Learning [[paper]](http://arxiv.org/abs/2503.22679) ![](https://img.shields.io/badge/arXiv-2025.03-red) 474 | * Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [[paper]](http://arxiv.org/abs/2504.02587) ![](https://img.shields.io/badge/arXiv-2025.04-red) 475 | * VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model 
[[paper]](http://arxiv.org/abs/2504.07615) ![](https://img.shields.io/badge/arXiv-2025.04-red) 476 | * VLAA-Thinking [[github]](https://github.com/UCSC-VLAA/VLAA-Thinking/blob/main/assets/VLAA-Thinker.pdf) ![](https://img.shields.io/badge/github-2025.04-red) 477 | * SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement [[paper]](http://arxiv.org/abs/2504.07934) ![](https://img.shields.io/badge/arXiv-2025.04-red) 478 | * Perception-R1: Pioneering Perception Policy with Reinforcement Learning [[paper]](http://arxiv.org/abs/2504.07954) ![](https://img.shields.io/badge/arXiv-2025.04-red) 479 | * VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning [[paper]](http://arxiv.org/abs/2504.08837) ![](https://img.shields.io/badge/arXiv-2025.04-red) 480 | * Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [[paper]](http://arxiv.org/abs/2504.12680) ![](https://img.shields.io/badge/arXiv-2025.04-red) 481 | * NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation [[paper]](https://arxiv.org/abs/2504.13055v1) ![](https://img.shields.io/badge/arXiv-2025.04-red) 482 | 483 | 484 | 485 | 486 | ## Citation 487 | If you find this work useful, welcome to cite us. 488 | ```bib 489 | @misc{li202512surveyreasoning, 490 | title={From System 1 to System 2: A Survey of Reasoning Large Language Models}, 491 | author={Zhong-Zhi Li and Duzhen Zhang and Ming-Liang Zhang and Jiaxin Zhang and Zengyan Liu and Yuxuan Yao and Haotian Xu and Junhao Zheng and Pei-Jie Wang and Xiuyi Chen and Yingying Zhang and Fei Yin and Jiahua Dong and Zhijiang Guo and Le Song and Cheng-Lin Liu}, 492 | year={2025}, 493 | eprint={2502.17419}, 494 | archivePrefix={arXiv}, 495 | primaryClass={cs.AI}, 496 | url={https://arxiv.org/abs/2502.17419}, 497 | } 498 | ``` 499 | 500 | 501 | ## ⭐ Star History 502 | 503 | 504 | 505 | 506 | 507 | Star History Chart 508 | 509 | 510 | -------------------------------------------------------------------------------- /assets/develope.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/develope.jpg -------------------------------------------------------------------------------- /assets/paper.json: -------------------------------------------------------------------------------- 1 | { 2 | "Part 1: O1 Replication": [ 3 | { 4 | "paper": "O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?", 5 | "link": "https://arxiv.org/abs/2411.16489", 6 | "venue": "arXiv", 7 | "date": "2024-11", 8 | "label": "huang2024o1" 9 | }, 10 | { 11 | "paper": "O1 Replication Journey: A Strategic Progress Report -- Part 1", 12 | "link": "https://arxiv.org/abs/2410.18982", 13 | "venue": "arXiv", 14 | "date": "2024-10", 15 | "label": "qin2024o1" 16 | }, 17 | { 18 | "paper": "Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions", 19 | "link": "https://arxiv.org/abs/2411.14405", 20 | "venue": "arXiv", 21 | "date": "2024-11", 22 | "label": "zhao2024marco" 23 | }, 24 | { 25 | "paper": "o1-Coder: an o1 Replication for Coding", 26 | "link": "https://arxiv.org/abs/2412.00154", 27 | "venue": "arXiv", 28 | "date": "2024-12", 29 | "label": "zhang2024o1" 30 | }, 31 | { 32 | "paper": "Enhancing LLM Reasoning with Reward-guided Tree Search", 33 | "link": 
"https://arxiv.org/abs/2411.11694", 34 | "venue": "arXiv", 35 | "date": "2024-11", 36 | "label": "chen2024enhancing" 37 | }, 38 | { 39 | "paper": "Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems", 40 | "link": "https://arxiv.org/abs/2412.09413", 41 | "venue": "arXiv", 42 | "date": "2024-12", 43 | "label": "min2024imitate" 44 | } 45 | ], 46 | "Part 2: Process Reward Models": [ 47 | { 48 | "paper": "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations", 49 | "link": "https://aclanthology.org/2024.acl-long.510/", 50 | "venue": "ACL", 51 | "date": "2024-08", 52 | "label": "wang2024mathshepherd" 53 | }, 54 | { 55 | "paper": "ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search", 56 | "link": "https://openreview.net/forum?id=8rcFOqEud5", 57 | "venue": "NeurIPS", 58 | "date": "2024-12", 59 | "label": "zhang2024restmcts" 60 | }, 61 | { 62 | "paper": "Let's Verify Step by Step.", 63 | "link": "https://arxiv.org/abs/2305.20050", 64 | "venue": "ICLR", 65 | "date": "2024-05", 66 | "label": "lightman2023letsverify" 67 | }, 68 | { 69 | "paper": "Making Large Language Models Better Reasoners with Step-Aware Verifier", 70 | "link": "https://arxiv.org/abs/2206.02336", 71 | "venue": "arXiv", 72 | "date": "2023-06", 73 | "label": "yuan2023stepaware" 74 | }, 75 | { 76 | "paper": "Improve Mathematical Reasoning in Language Models by Automated Process Supervision", 77 | "link": "https://arxiv.org/abs/2306.05372", 78 | "venue": "arXiv", 79 | "date": "2023-06", 80 | "label": "chen2023automatedprocess" 81 | }, 82 | { 83 | "paper": "OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning", 84 | "link": "https://aclanthology.org/2024.findings-naacl.55/", 85 | "venue": "ACL Findings", 86 | "date": "2024-08", 87 | "label": "liu2023ovm" 88 | }, 89 | { 90 | "paper": "Solving Math Word Problems with Process and Outcome-Based Feedback", 91 | "link": "https://arxiv.org/abs/2211.14275", 92 | "venue": "arXiv", 93 | "date": "2022-11", 94 | "label": "zhang2023processoutcome" 95 | }, 96 | { 97 | "paper": "AutoPSV: Automated Process-Supervised Verifier", 98 | "link": "https://openreview.net/forum?id=eOAPWWOGs9", 99 | "venue": "NeurIPS", 100 | "date": "2024-12", 101 | "label": "lu2024autopsv" 102 | }, 103 | { 104 | "paper": "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs", 105 | "link": "https://arxiv.org/abs/2406.18629", 106 | "venue": "arXiv", 107 | "date": "2024-06", 108 | "label": "chen2023stepdpo" 109 | }, 110 | { 111 | "paper": "ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding", 112 | "link": "https://arxiv.org/abs/2501.07861", 113 | "venue": "arXiv", 114 | "date": "2025-01", 115 | "label": "li2025rearter" 116 | }, 117 | { 118 | "paper": "The Lessons of Developing Process Reward Models in Mathematical Reasoning.", 119 | "link": "https://arxiv.org/abs/2501.07301", 120 | "venue": "arXiv", 121 | "date": "2025-01", 122 | "label": "gao2025lessons" 123 | }, 124 | { 125 | "paper": "Outcome-Refining Process Supervision for Code Generation", 126 | "link": "https://arxiv.org/abs/2412.15118", 127 | "venue": "arXiv", 128 | "date": "2024-12", 129 | "label": "chen2024outcomerefining" 130 | }, 131 | { 132 | "paper": "Free Process Rewards without Process Labels.", 133 | "link": "https://arxiv.org/abs/2412.01981", 134 | "venue": "arXiv", 135 | "date": "2024-12", 136 | "label": "yuan2024freeprocess" 137 | }, 138 | { 139 | "paper": "PRMBench: A Fine-grained and Challenging Benchmark 
for Process-Level Reward Models.", 140 | "link": "https://arxiv.org/abs/2501.03124", 141 | "venue": "arXiv", 142 | "date": "2025-01", 143 | "label": "liu2025prmbench" 144 | }, 145 | { 146 | "paper": "ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark.", 147 | "link": "https://arxiv.org/abs/2501.01290", 148 | "venue": "arXiv", 149 | "date": "2025-01", 150 | "label": "zhang2024toolcomp" 151 | } 152 | ], 153 | "Part 3: Reinforcement Learning": [ 154 | { 155 | "paper": "Offline Reinforcement Learning for LLM Multi-Step Reasoning", 156 | "link": "https://arxiv.org/abs/2412.16145", 157 | "venue": "arXiv", 158 | "date": "2024-12", 159 | "label": "wang2024offline" 160 | }, 161 | { 162 | "paper": "ReFT: Representation Finetuning for Language Models", 163 | "link": "https://aclanthology.org/2024.acl-long.410.pdf", 164 | "venue": "ACL", 165 | "date": "2024-08", 166 | "label": "wu2024reft" 167 | }, 168 | { 169 | "paper": "Deepseekmath: Pushing the limits of mathematical reasoning in open language models", 170 | "link": "https://arxiv.org/abs/2402.03300", 171 | "venue": "arXiv", 172 | "date": "2024-02", 173 | "label": "lee2024deepseekmath" 174 | }, 175 | { 176 | "paper": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", 177 | "link": "https://arxiv.org/abs/2501.12948", 178 | "venue": "arXiv", 179 | "date": "2025-01", 180 | "label": "luong2025deepseekr1" 181 | }, 182 | { 183 | "paper": "Kimi k1.5: Scaling Reinforcement Learning with LLMs", 184 | "link": "https://arxiv.org/abs/2501.12599", 185 | "venue": "arXiv", 186 | "date": "2025-01", 187 | "label": "liu2025kimi" 188 | }, 189 | { 190 | "paper": "Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search", 191 | "link": "https://arxiv.org/abs/2502.02508", 192 | "venue": "arXiv", 193 | "date": "2025-02", 194 | "label": "zhang2025satori" 195 | }, 196 | { 197 | "paper": "Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling", 198 | "link": "https://arxiv.org/abs/2501.11651", 199 | "venue": "arXiv", 200 | "date": "2025-01", 201 | "label": "wang2025advancing" 202 | }, 203 | { 204 | "paper": "Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies", 205 | "link": "https://arxiv.org/abs/2501.17030", 206 | "venue": "arXiv", 207 | "date": "2025-01", 208 | "label": "chen2025aisafety" 209 | }, 210 | { 211 | "paper": "Does RLHF Scale? Exploring the Impacts From Data, Model, and Method", 212 | "link": "https://arxiv.org/abs/2412.06000", 213 | "venue": "arXiv", 214 | "date": "2024-12", 215 | "label": "lee2024rlhf" 216 | } 217 | ], 218 | "Part 4: MCTS/Tree Search": [ 219 | { 220 | "paper": "Reasoning with Language Model is Planning with World Model", 221 | "link": "https://aclanthology.org/2023.emnlp-main.507/", 222 | "venue": "EMNLP", 223 | "date": "2023-12", 224 | "label": "hao2023rap" 225 | }, 226 | { 227 | "paper": "Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning", 228 | "link": "https://arxiv.org/abs/2405.00451", 229 | "venue": "arXiv", 230 | "date": "2024-05", 231 | "label": "zhou2024mctsboost" 232 | }, 233 | { 234 | "paper": "Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training", 235 | "link": "https://openreview.net/forum?id=PJfc4x2jXY", 236 | "venue": "NeurIPS WorkShop", 237 | "date": "2023-12", 238 | "label": "wang2023alphazero" 239 | }, 240 | { 241 | "paper": "Don’t throw away your value model! 
Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding", 242 | "link": "https://openreview.net/forum?id=kh9Zt2Ldmn#discussion", 243 | "venue": "CoLM", 244 | "date": "2024-10", 245 | "label": "chen2024valuemcts" 246 | }, 247 | { 248 | "paper": "Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search", 249 | "link": "https://arxiv.org/abs/2412.18319", 250 | "venue": "arXiv", 251 | "date": "2024-12", 252 | "label": "li2024mulberry" 253 | }, 254 | { 255 | "paper": "Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning", 256 | "link": "https://arxiv.org/abs/2412.17397", 257 | "venue": "arXiv", 258 | "date": "2024-12", 259 | "label": "liu2024intrinsicmcts" 260 | }, 261 | { 262 | "paper": "Proposing and solving olympiad geometry with guided tree search", 263 | "link": "https://arxiv.org/abs/2412.10673", 264 | "venue": "arXiv", 265 | "date": "2024-12", 266 | "label": "zhang2024geometrymcts" 267 | }, 268 | { 269 | "paper": "SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models", 270 | "link": "https://arxiv.org/abs/2412.11605", 271 | "venue": "arXiv", 272 | "date": "2024-12", 273 | "label": "wang2024spar" 274 | }, 275 | { 276 | "paper": "Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning", 277 | "link": "https://arxiv.org/abs/2412.09078", 278 | "venue": "arXiv", 279 | "date": "2024-12", 280 | "label": "xu2024forest" 281 | }, 282 | { 283 | "paper": "SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation", 284 | "link": "https://arxiv.org/abs/2411.11053", 285 | "venue": "arXiv", 286 | "date": "2024-11", 287 | "label": "liu2024sramcts" 288 | }, 289 | { 290 | "paper": "MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree", 291 | "link": "https://arxiv.org/abs/2411.15645", 292 | "venue": "arXiv", 293 | "date": "2024-11", 294 | "label": "wang2024mcnest" 295 | }, 296 | { 297 | "paper": "Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions", 298 | "link": "https://arxiv.org/abs/2411.14405", 299 | "venue": "arXiv", 300 | "date": "2024-11", 301 | "label": "zhang2024marcoo1" 302 | }, 303 | { 304 | "paper": "GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection", 305 | "link": "https://arxiv.org/abs/2411.04459", 306 | "venue": "arXiv", 307 | "date": "2024-11", 308 | "label": "chen2024gptmcts" 309 | }, 310 | { 311 | "paper": "CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models", 312 | "link": "https://arxiv.org/abs/2411.04329", 313 | "venue": "arXiv", 314 | "date": "2024-11", 315 | "label": "liu2024codetree" 316 | }, 317 | { 318 | "paper": "Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination", 319 | "link": "https://arxiv.org/abs/2410.17820", 320 | "venue": "arXiv", 321 | "date": "2024-10", 322 | "label": "wang2024treeofthoughts" 323 | }, 324 | { 325 | "paper": "TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling", 326 | "link": "https://arxiv.org/abs/2410.16033", 327 | "venue": "arXiv", 328 | "date": "2024-10", 329 | "label": "zhou2024treebon" 330 | }, 331 | { 332 | "paper": "Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning", 333 | "link": 
"https://arxiv.org/abs/2410.06508", 334 | "venue": "arXiv", 335 | "date": "2024-10", 336 | "label": "li2024selfimprove" 337 | }, 338 | { 339 | "paper": "LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning", 340 | "link": "https://arxiv.org/abs/2410.02884", 341 | "venue": "arXiv", 342 | "date": "2024-10", 343 | "label": "yang2024llamaberry" 344 | }, 345 | { 346 | "paper": "Interpretable Contrastive Monte Carlo Tree Search Reasoning", 347 | "link": "https://arxiv.org/abs/2410.01707", 348 | "venue": "arXiv", 349 | "date": "2024-10", 350 | "label": "hu2024interpretablemcts" 351 | }, 352 | { 353 | "paper": "MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time", 354 | "link": "https://arxiv.org/abs/2405.16265", 355 | "venue": "arXiv", 356 | "date": "2024-05", 357 | "label": "zhang2024mindstar" 358 | }, 359 | { 360 | "paper": "RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation", 361 | "link": "https://arxiv.org/abs/2409.09584", 362 | "venue": "arXiv", 363 | "date": "2024-09", 364 | "label": "li2024rethinkmcts" 365 | }, 366 | { 367 | "paper": "Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search", 368 | "link": "https://arxiv.org/abs/2408.10635", 369 | "venue": "arXiv", 370 | "date": "2024-08", 371 | "label": "zhou2024strategist" 372 | }, 373 | { 374 | "paper": "Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search", 375 | "link": "https://arxiv.org/abs/2405.15383", 376 | "venue": "arXiv", 377 | "date": "2024-05", 378 | "label": "chen2024codeworldmcts" 379 | }, 380 | { 381 | "paper": "Uncertainty-Guided Optimization on Large Language Model Search Trees", 382 | "link": "https://arxiv.org/abs/2407.03951", 383 | "venue": "arXiv", 384 | "date": "2024-07", 385 | "label": "yang2024uncertaintymcts" 386 | }, 387 | { 388 | "paper": "Tree Search for Language Model Agents", 389 | "link": "https://arxiv.org/abs/2407.01476", 390 | "venue": "arXiv", 391 | "date": "2024-07", 392 | "label": "wu2024treesearchlm" 393 | }, 394 | { 395 | "paper": "LiteSearch: Efficacious Tree Search for LLM", 396 | "link": "https://arxiv.org/abs/2407.00320", 397 | "venue": "arXiv", 398 | "date": "2024-07", 399 | "label": "chen2024litesearch" 400 | }, 401 | { 402 | "paper": "Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B", 403 | "link": "https://arxiv.org/abs/2406.07394", 404 | "venue": "arXiv", 405 | "date": "2024-06", 406 | "label": "liu2024accessing" 407 | }, 408 | { 409 | "paper": "ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search", 410 | "link": "https://arxiv.org/abs/2406.03816", 411 | "venue": "NeurIPS", 412 | "date": "2024-12", 413 | "label": "wang2024restmcts" 414 | }, 415 | { 416 | "paper": "On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes", 417 | "link": "https://ieeexplore.ieee.org/abstract/document/10870057/", 418 | "venue": "IEEE TAC", 419 | "date": "2025-01", 420 | "label": "zhang2025mcts" 421 | }, 422 | { 423 | "paper": "AlphaMath Almost Zero: process Supervision without process", 424 | "link": "https://arxiv.org/abs/2405.03553", 425 | "venue": "arXiv", 426 | "date": "2024-05", 427 | "label": "chen2024alphamath" 428 | }, 429 | { 430 | "paper": "Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning", 431 | "link": "https://arxiv.org/abs/2405.00451", 432 | "venue": "arXiv", 433 | "date": "2024-05", 434 | "label": "liu2024mctsboost" 435 | }, 
436 | { 437 | "paper": "Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping", 438 | "link": "https://openreview.net/forum?id=rviGTsl0oy", 439 | "venue": "ICLR WorkShop", 440 | "date": "2024-05", 441 | "label": "wang2024beyonda" 442 | }, 443 | { 444 | "paper": "Stream of Search (SoS): Learning to Search in Language", 445 | "link": "https://arxiv.org/abs/2404.03683", 446 | "venue": "arXiv", 447 | "date": "2024-04", 448 | "label": "yang2024sos" 449 | }, 450 | { 451 | "paper": "LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models", 452 | "link": "https://openreview.net/forum?id=h1mvwbQiXR", 453 | "venue": "ICLR WorkShop", 454 | "date": "2024-05", 455 | "label": "zhang2024llmreasoners" 456 | }, 457 | { 458 | "paper": "Search-o1: Agentic Search-Enhanced Large Reasoning Models", 459 | "link": "https://arxiv.org/abs/2501.05366", 460 | "venue": "arXiv", 461 | "date": "2025-01", 462 | "label": "li2025searcho1" 463 | }, 464 | { 465 | "paper": "rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking", 466 | "link": "https://arxiv.org/abs/2501.04519", 467 | "venue": "arXiv", 468 | "date": "2025-01", 469 | "label": "chen2025rstar" 470 | }, 471 | { 472 | "paper": "HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs", 473 | "link": "https://arxiv.org/abs/2412.18925", 474 | "venue": "arXiv", 475 | "date": "2024-12", 476 | "label": "zhang2024huatuo" 477 | }, 478 | { 479 | "paper": "AFlow: Automating Agentic Workflow Generation", 480 | "link": "https://arxiv.org/abs/2410.10762", 481 | "venue": "arXiv", 482 | "date": "2024-10", 483 | "label": "wang2024aflow" 484 | }, 485 | { 486 | "paper": "MAKING PPO EVEN BETTER: VALUE-GUIDED MONTE-CARLO TREE SEARCH DECODING", 487 | "link": "https://arxiv.org/abs/2309.15028", 488 | "venue": "arXiv", 489 | "date": "2023-09", 490 | "label": "liu2023ppo" 491 | }, 492 | { 493 | "paper": "Large Language Models as Commonsense Knowledge for Large-Scale Task Planning", 494 | "link": "https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html", 495 | "venue": "NeurIPS", 496 | "date": "2023-12", 497 | "label": "wang2023commonsense" 498 | }, 499 | { 500 | "paper": "ALPHAZERO-LIKE TREE-SEARCH CAN GUIDE LARGE LANGUAGE MODEL DECODING AND TRAINING", 501 | "link": "https://openreview.net/forum?id=PJfc4x2jXY", 502 | "venue": "NeurIPS WorkShop", 503 | "date": "2023-12", 504 | "label": "li2023alphazero" 505 | }, 506 | { 507 | "paper": "Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing", 508 | "link": "https://arxiv.org/abs/2404.12253", 509 | "venue": "arXiv", 510 | "date": "2024-04", 511 | "label": "chen2024selfimprove" 512 | } 513 | ], 514 | "Part 5: Self-Training / Self-Improve": [ 515 | { 516 | "paper": "STaR: Bootstrapping Reasoning With Reasoning", 517 | "link": "https://arxiv.org/abs/2203.14465", 518 | "venue": "NeurIPS2022", 519 | "date": "2022-05", 520 | "label": "zelikman2022star" 521 | }, 522 | { 523 | "paper": "ReST: Reinforced Self-Training for Language Modeling", 524 | "link": "https://arxiv.org/abs/2308.08998", 525 | "venue": "arXiv", 526 | "date": "2023-08", 527 | "label": "gulcehre2023rest" 528 | }, 529 | { 530 | "paper": "ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models", 531 | "link": "https://openreview.net/forum?id=lNAyUngGFK", 532 | "venue": "TMLR", 533 | "date": "2024-09", 534 | "label": "yang2024restem" 535 | }, 536 | { 537 | "paper": 
"Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search", 538 | "link": "https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html", 539 | "venue": "NeurIPS", 540 | "date": "2017-12", 541 | "label": "anthony2017expert" 542 | }, 543 | { 544 | "paper": "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking", 545 | "link": "https://arxiv.org/abs/2403.09629", 546 | "venue": "arXiv", 547 | "date": "2024-03", 548 | "label": "zelikman2024quietstar" 549 | }, 550 | { 551 | "paper": "V-star: Training Verifiers for Self-Taught Reasoners", 552 | "link": "https://arxiv.org/abs/2402.06457", 553 | "venue": "arXiv", 554 | "date": "2024-02", 555 | "label": "liu2024vstar" 556 | }, 557 | { 558 | "paper": "Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models", 559 | "link": "https://arxiv.org/abs/2406.11736", 560 | "venue": "arXiv", 561 | "date": "2024-06", 562 | "label": "wang2024interactive" 563 | }, 564 | { 565 | "paper": "ReFT: Representation Finetuning for Language Models", 566 | "link": "https://aclanthology.org/2024.acl-long.410.pdf", 567 | "venue": "ACL", 568 | "date": "2024-08", 569 | "label": "wu2024reft" 570 | }, 571 | { 572 | "paper": "ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search", 573 | "link": "https://arxiv.org/abs/2406.03816", 574 | "venue": "NeurIPS", 575 | "date": "2024-12", 576 | "label": "chen2024restmcts" 577 | }, 578 | { 579 | "paper": "Recursive Introspection: Teaching Language Model Agents How to Self-Improve", 580 | "link": "https://openreview.net/forum?id=DRC9pZwBwR", 581 | "venue": "NeurIPS", 582 | "date": "2024-12", 583 | "label": "zhang2024recursive" 584 | }, 585 | { 586 | "paper": "B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner", 587 | "link": "https://arxiv.org/abs/2412.17256", 588 | "venue": "arXiv", 589 | "date": "2024-12", 590 | "label": "he2024bstar" 591 | }, 592 | { 593 | "paper": "Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math)", 594 | "link": "https://arxiv.org/abs/2501.04519", 595 | "venue": "arXiv", 596 | "date": "2025-01", 597 | "label": "xu2025rstar" 598 | }, 599 | { 600 | "paper": "Enhancing Large Vision Language Models with Self-Training on Image Comprehension", 601 | "link": "https://arxiv.org/abs/2405.19716", 602 | "venue": "arXiv", 603 | "date": "2024-05", 604 | "label": "li2024enhancing" 605 | }, 606 | { 607 | "paper": "Self-Refine: Iterative Refinement with Self-Feedback", 608 | "link": "https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html", 609 | "venue": "NeurIPS", 610 | "date": "2023-12", 611 | "label": "madaan2023selfrefine" 612 | }, 613 | { 614 | "paper": "CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing", 615 | "link": "https://openreview.net/forum?id=Sx038qxjek", 616 | "venue": "ICLR", 617 | "date": "2024-05", 618 | "label": "zhang2024critic" 619 | } 620 | ], 621 | "Part 6: Reflection": [ 622 | { 623 | "paper": "Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers", 624 | "link": "https://arxiv.org/abs/2408.06195", 625 | "venue": "arXiv", 626 | "date": "2024-08", 627 | "label": "zheng2024mutual" 628 | }, 629 | { 630 | "paper": "Reflection-Tuning: An Approach for Data Recycling", 631 | "link": "https://arxiv.org/abs/2310.11716", 632 | "venue": "arXiv", 633 | "date": "2023-10", 634 | "label": "li2024reflection" 635 | }, 636 | { 637 | "paper": 
"Vision-Language Models Can Self-Improve Reasoning via Reflection", 638 | "link": "https://arxiv.org/abs/2411.00855", 639 | "venue": "arXiv", 640 | "date": "2024-11", 641 | "label": "cheng2024vision" 642 | }, 643 | { 644 | "paper": "HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs", 645 | "link": "https://arxiv.org/abs/2412.18925", 646 | "venue": "arXiv", 647 | "date": "2024-12", 648 | "label": "zhang2024huatuo" 649 | }, 650 | { 651 | "paper": "AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning", 652 | "link": "https://arxiv.org/abs/2411.11930", 653 | "venue": "arXiv", 654 | "date": "2024-11", 655 | "label": "liu2024atomthink" 656 | }, 657 | { 658 | "paper": "LLaVA-o1: Let Vision Language Models Reason Step-by-Step", 659 | "link": "https://arxiv.org/abs/2411.10440", 660 | "venue": "arXiv", 661 | "date": "2024-11", 662 | "label": "xu2024llava" 663 | } 664 | ], 665 | "Part 7: Efficient System2": [ 666 | { 667 | "paper": "Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking", 668 | "link": "https://arxiv.org/abs/2501.01306", 669 | "venue": "arXiv", 670 | "date": "2025-01", 671 | "label": "cheng2025think" 672 | }, 673 | { 674 | "paper": "Token-Budget-Aware LLM Reasoning", 675 | "link": "https://arxiv.org/abs/2412.18547", 676 | "venue": "arXiv", 677 | "date": "2024-12", 678 | "label": "wang2024token" 679 | }, 680 | { 681 | "paper": "B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner", 682 | "link": "https://arxiv.org/abs/2412.17256", 683 | "venue": "arXiv", 684 | "date": "2024-12", 685 | "label": "he2024bstar" 686 | }, 687 | { 688 | "paper": "Guiding Language Model Reasoning with Planning Tokens", 689 | "link": "https://arxiv.org/abs/2310.05707", 690 | "venue": "CoLM", 691 | "date": "2024-10", 692 | "label": "wang2023guiding" 693 | }, 694 | { 695 | "paper": "DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models", 696 | "link": "https://arxiv.org/abs/2407.01009", 697 | "venue": "EMNLP", 698 | "date": "2024-12", 699 | "label": "li2024dynathink" 700 | }, 701 | { 702 | "paper": "Training Large Language Models to Reason in a Continuous Latent Space", 703 | "link": "https://arxiv.org/abs/2412.06769", 704 | "venue": "arXiv", 705 | "date": "2024-12", 706 | "label": "chen2024training" 707 | }, 708 | { 709 | "paper": "O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning", 710 | "link": "https://arxiv.org/abs/2501.12570", 711 | "venue": "arXiv", 712 | "date": "2025-01", 713 | "label": "zhang2025o1pruner" 714 | } 715 | ], 716 | "Part 8: Explainability": [ 717 | { 718 | "paper": "What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective", 719 | "link": "https://arxiv.org/abs/2410.23743", 720 | "venue": "arXiv", 721 | "date": "2024-10", 722 | "label": "li2024whathappened" 723 | }, 724 | { 725 | "paper": "When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? 
An Analysis of OpenAI o1", 726 | "link": "https://arxiv.org/abs/2410.01792", 727 | "venue": "arXiv", 728 | "date": "2024-10", 729 | "label": "zhang2024embers" 730 | }, 731 | { 732 | "paper": "Agents Thinking Fast and Slow: A Talker-Reasoner Architecture", 733 | "link": "https://openreview.net/forum?id=xPhcP6rbI4", 734 | "venue": "NeurIPS WorkShop", 735 | "date": "2024-12", 736 | "label": "wang2024agents" 737 | }, 738 | { 739 | "paper": "System 2 Attention (is something you might need too)", 740 | "link": "https://arxiv.org/abs/2311.11829", 741 | "venue": "arXiv", 742 | "date": "2023-11", 743 | "label": "chen2023system2" 744 | }, 745 | { 746 | "paper": "Distilling System 2 into System 1", 747 | "link": "https://arxiv.org/abs/2407.06023", 748 | "venue": "arXiv", 749 | "date": "2024-07", 750 | "label": "liu2024distilling" 751 | }, 752 | { 753 | "paper": "The Impact of Reasoning Step Length on Large Language Models", 754 | "link": "https://arxiv.org/abs/2401.04925", 755 | "venue": "ACL Findings", 756 | "date": "2024-08", 757 | "label": "sun2024impact" 758 | } 759 | ], 760 | "Part 9: Multimodal Agent related Slow-Fast System": [ 761 | { 762 | "paper": "AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning", 763 | "link": "https://arxiv.org/abs/2411.11930", 764 | "venue": "arXiv", 765 | "date": "2024-11", 766 | "label": "liu2024atomthink" 767 | }, 768 | { 769 | "paper": "LLaVA-o1: Let Vision Language Models Reason Step-by-Step", 770 | "link": "https://arxiv.org/abs/2411.10440", 771 | "venue": "arXiv", 772 | "date": "2024-11", 773 | "label": "xu2024llava" 774 | }, 775 | { 776 | "paper": "Visual Agents as Fast and Slow Thinkers", 777 | "link": "https://openreview.net/forum?id=ncCuiD3KJQ", 778 | "venue": "ICLR", 779 | "date": "2025-01", 780 | "label": "gao2025visualagents" 781 | }, 782 | { 783 | "paper": "Slow Perception: Let's Perceive Geometric Figures Step-by-Step", 784 | "link": "https://arxiv.org/abs/2412.20631", 785 | "venue": "arXiv", 786 | "date": "2024-12", 787 | "label": "wei2024slow" 788 | }, 789 | { 790 | "paper": "Virgo: A Preliminary Exploration on Reproducing o1-like MLLM", 791 | "link": "https://arxiv.org/abs/2501.01904", 792 | "venue": "arXiv", 793 | "date": "2025-01", 794 | "label": "du2025virgo" 795 | }, 796 | { 797 | "paper": "Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension", 798 | "link": "https://arxiv.org/pdf/2412.03704", 799 | "venue": "arXiv", 800 | "date": "2024-12", 801 | "label": "feng2024scaling" 802 | }, 803 | { 804 | "paper": "Vision-Language Models Can Self-Improve Reasoning via Reflection", 805 | "link": "https://arxiv.org/abs/2411.00855", 806 | "venue": "arXiv", 807 | "date": "2024-11", 808 | "label": "cheng2024vision" 809 | }, 810 | { 811 | "paper": "Diving into Self-Evolving Training for Multimodal Reasoning", 812 | "link": "https://arxiv.org/abs/2412.17451", 813 | "venue": "ICLR", 814 | "date": "2025-01", 815 | "label": "zhao2024selfevolving" 816 | } 817 | ], 818 | "Part 10: Benchmark and Datasets": [ 819 | { 820 | "paper": "A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?", 821 | "link": "https://arxiv.org/abs/2409.15277", 822 | "venue": "arXiv", 823 | "date": "2024-09", 824 | "label": "tu2024preliminary" 825 | }, 826 | { 827 | "paper": "MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs", 828 | "link": "https://openreview.net/forum?id=GN2qbxZlni", 829 | "venue": "NeurIPS", 830 | "date": "2024-12", 831 | "label": "li2024mrben" 832 | }, 833 | { 834 | 
"paper": "PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models", 835 | "link": "https://arxiv.org/abs/2501.03124", 836 | "venue": "arXiv", 837 | "date": "2025-01", 838 | "label": "song2025prmbench" 839 | }, 840 | { 841 | "paper": "Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs", 842 | "link": "https://arxiv.org/abs/2412.21187", 843 | "venue": "arXiv", 844 | "date": "2024-12", 845 | "label": "huang2024overthinking" 846 | } 847 | ] 848 | } -------------------------------------------------------------------------------- /assets/timeline.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/timeline.jpg -------------------------------------------------------------------------------- /assets/timeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/timeline.png -------------------------------------------------------------------------------- /assets/timeline_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/timeline_2.png -------------------------------------------------------------------------------- /src/list.md: -------------------------------------------------------------------------------- 1 | ## Part 1: O1 Replication 2 | * Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [[Paper]](https://arxiv.org/abs/2412.09413) ![](https://img.shields.io/badge/arXiv-2024.12-red) 3 | * o1-Coder: an o1 Replication for Coding [[Paper]](https://arxiv.org/abs/2412.00154) ![](https://img.shields.io/badge/arXiv-2024.12-red) 4 | * Enhancing LLM Reasoning with Reward-guided Tree Search [[Paper]](https://arxiv.org/abs/2411.11694) ![](https://img.shields.io/badge/arXiv-2024.11-red) 5 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) ![](https://img.shields.io/badge/arXiv-2024.11-red) 6 | * O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [[Paper]](https://arxiv.org/abs/2411.16489) ![](https://img.shields.io/badge/arXiv-2024.11-red) 7 | * O1 Replication Journey: A Strategic Progress Report -- Part 1 [[Paper]](https://arxiv.org/abs/2410.18982) ![](https://img.shields.io/badge/arXiv-2024.10-red) 8 | ## Part 2: Process Reward Models 9 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. [[Paper]](https://arxiv.org/abs/2501.03124) ![](https://img.shields.io/badge/arXiv-2025.01-red) 10 | * ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [[Paper]](https://arxiv.org/abs/2501.07861) ![](https://img.shields.io/badge/arXiv-2025.01-red) 11 | * The Lessons of Developing Process Reward Models in Mathematical Reasoning. [[Paper]](https://arxiv.org/abs/2501.07301) ![](https://img.shields.io/badge/arXiv-2025.01-red) 12 | * ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark. 
[[Paper]](https://arxiv.org/abs/2501.01290) ![](https://img.shields.io/badge/arXiv-2025.01-red) 13 | * AutoPSV: Automated Process-Supervised Verifier [[Paper]](https://openreview.net/forum?id=eOAPWWOGs9) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 14 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://openreview.net/forum?id=8rcFOqEud5) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 15 | * Free Process Rewards without Process Labels. [[Paper]](https://arxiv.org/abs/2412.01981) ![](https://img.shields.io/badge/arXiv-2024.12-red) 16 | * Outcome-Refining Process Supervision for Code Generation [[Paper]](https://arxiv.org/abs/2412.15118) ![](https://img.shields.io/badge/arXiv-2024.12-red) 17 | * Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [[Paper]](https://aclanthology.org/2024.acl-long.510/) ![](https://img.shields.io/badge/ACL-2024-blue) 18 | * OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning [[Paper]](https://aclanthology.org/2024.findings-naacl.55/) ![](https://img.shields.io/badge/ACL_Findings-2024-blue) 19 | * Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [[Paper]](https://arxiv.org/abs/2406.18629) ![](https://img.shields.io/badge/arXiv-2024.06-red) 20 | * Let's Verify Step by Step. [[Paper]](https://arxiv.org/abs/2305.20050) ![](https://img.shields.io/badge/arXiv-2024.05-red) 21 | * Improve Mathematical Reasoning in Language Models by Automated Process Supervision [[Paper]](https://arxiv.org/abs/2306.05372) ![](https://img.shields.io/badge/arXiv-2023.06-red) 22 | * Making Large Language Models Better Reasoners with Step-Aware Verifier [[Paper]](https://arxiv.org/abs/2206.02336) ![](https://img.shields.io/badge/arXiv-2023.06-red) 23 | * Solving Math Word Problems with Process and Outcome-Based Feedback [[Paper]](https://arxiv.org/abs/2211.14275) ![](https://img.shields.io/badge/arXiv-2022.11-red) 24 | ## Part 3: Reinforcement Learning 25 | * Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [[Paper]](https://arxiv.org/abs/2502.02508) ![](https://img.shields.io/badge/arXiv-2025.02-red) 26 | * Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [[Paper]](https://arxiv.org/abs/2501.11651) ![](https://img.shields.io/badge/arXiv-2025.01-red) 27 | * Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies [[Paper]](https://arxiv.org/abs/2501.17030) ![](https://img.shields.io/badge/arXiv-2025.01-red) 28 | * DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [[Paper]](https://arxiv.org/abs/2501.12948) ![](https://img.shields.io/badge/arXiv-2025.01-red) 29 | * Kimi k1.5: Scaling Reinforcement Learning with LLMs [[Paper]](https://arxiv.org/abs/2501.12599) ![](https://img.shields.io/badge/arXiv-2025.01-red) 30 | * Does RLHF Scale? 
Exploring the Impacts From Data, Model, and Method [[Paper]](https://arxiv.org/abs/2412.06000) ![](https://img.shields.io/badge/arXiv-2024.12-red) 31 | * Offline Reinforcement Learning for LLM Multi-Step Reasoning [[Paper]](https://arxiv.org/abs/2412.16145) ![](https://img.shields.io/badge/arXiv-2024.12-red) 32 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) ![](https://img.shields.io/badge/ACL-2024-blue) 33 | * Deepseekmath: Pushing the limits of mathematical reasoning in open language models [[Paper]](https://arxiv.org/abs/2402.03300) ![](https://img.shields.io/badge/arXiv-2024.02-red) 34 | ## Part 4: MCTS/Tree Search 35 | * On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes [[Paper]](https://ieeexplore.ieee.org/abstract/document/10870057/) ![](https://img.shields.io/badge/IEEE_TAC-2025-blue) 36 | * Search-o1: Agentic Search-Enhanced Large Reasoning Models [[Paper]](https://arxiv.org/abs/2501.05366) ![](https://img.shields.io/badge/arXiv-2025.01-red) 37 | * rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [[Paper]](https://arxiv.org/abs/2501.04519) ![](https://img.shields.io/badge/arXiv-2025.01-red) 38 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816) ![](https://img.shields.io/badge/arXiv-2024.12-red) 39 | * Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.09078) ![](https://img.shields.io/badge/arXiv-2024.12-red) 40 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) ![](https://img.shields.io/badge/arXiv-2024.12-red) 41 | * Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2412.18319) ![](https://img.shields.io/badge/arXiv-2024.12-red) 42 | * Proposing and solving olympiad geometry with guided tree search [[Paper]](https://arxiv.org/abs/2412.10673) ![](https://img.shields.io/badge/arXiv-2024.12-red) 43 | * SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [[Paper]](https://arxiv.org/abs/2412.11605) ![](https://img.shields.io/badge/arXiv-2024.12-red) 44 | * Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2412.17397) ![](https://img.shields.io/badge/arXiv-2024.12-red) 45 | * CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [[Paper]](https://arxiv.org/abs/2411.04329) ![](https://img.shields.io/badge/arXiv-2024.11-red) 46 | * GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection [[Paper]](https://arxiv.org/abs/2411.04459) ![](https://img.shields.io/badge/arXiv-2024.11-red) 47 | * MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [[Paper]](https://arxiv.org/abs/2411.15645) ![](https://img.shields.io/badge/arXiv-2024.11-red) 48 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) ![](https://img.shields.io/badge/arXiv-2024.11-red) 49 | * SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2411.11053) ![](https://img.shields.io/badge/arXiv-2024.11-red) 50 | * Don’t 
throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding [[Paper]](https://openreview.net/forum?id=kh9Zt2Ldmn#discussion) ![](https://img.shields.io/badge/CoLM-2024-blue) 51 | * AFlow: Automating Agentic Workflow Generation [[Paper]](https://arxiv.org/abs/2410.10762) ![](https://img.shields.io/badge/arXiv-2024.10-red) 52 | * Interpretable Contrastive Monte Carlo Tree Search Reasoning [[Paper]](https://arxiv.org/abs/2410.01707) ![](https://img.shields.io/badge/arXiv-2024.10-red) 53 | * LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2410.02884) ![](https://img.shields.io/badge/arXiv-2024.10-red) 54 | * Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning [[Paper]](https://arxiv.org/abs/2410.06508) ![](https://img.shields.io/badge/arXiv-2024.10-red) 55 | * TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [[Paper]](https://arxiv.org/abs/2410.16033) ![](https://img.shields.io/badge/arXiv-2024.10-red) 56 | * Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination [[Paper]](https://arxiv.org/abs/2410.17820) ![](https://img.shields.io/badge/arXiv-2024.10-red) 57 | * RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2409.09584) ![](https://img.shields.io/badge/arXiv-2024.09-red) 58 | * Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [[Paper]](https://arxiv.org/abs/2408.10635) ![](https://img.shields.io/badge/arXiv-2024.08-red) 59 | * LiteSearch: Efficacious Tree Search for LLM [[Paper]](https://arxiv.org/abs/2407.00320) ![](https://img.shields.io/badge/arXiv-2024.07-red) 60 | * Tree Search for Language Model Agents [[Paper]](https://arxiv.org/abs/2407.01476) ![](https://img.shields.io/badge/arXiv-2024.07-red) 61 | * Uncertainty-Guided Optimization on Large Language Model Search Trees [[Paper]](https://arxiv.org/abs/2407.03951) ![](https://img.shields.io/badge/arXiv-2024.07-red) 62 | * Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [[Paper]](https://arxiv.org/abs/2406.07394) ![](https://img.shields.io/badge/arXiv-2024.06-red) 63 | * Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [[Paper]](https://openreview.net/forum?id=rviGTsl0oy) ![](https://img.shields.io/badge/ICLR_WorkShop-2024-blue) 64 | * LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models [[Paper]](https://openreview.net/forum?id=h1mvwbQiXR) ![](https://img.shields.io/badge/ICLR_WorkShop-2024-blue) 65 | * AlphaMath Almost Zero: process Supervision without process [[Paper]](https://arxiv.org/abs/2405.03553) ![](https://img.shields.io/badge/arXiv-2024.05-red) 66 | * Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2405.15383) ![](https://img.shields.io/badge/arXiv-2024.05-red) 67 | * MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [[Paper]](https://arxiv.org/abs/2405.16265) ![](https://img.shields.io/badge/arXiv-2024.05-red) 68 | * Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2405.00451) ![](https://img.shields.io/badge/arXiv-2024.05-red) 69 | * Monte Carlo Tree Search Boosts Reasoning via 
Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2405.00451) ![](https://img.shields.io/badge/arXiv-2024.05-red) 70 | * Stream of Search (SoS): Learning to Search in Language [[Paper]](https://arxiv.org/abs/2404.03683) ![](https://img.shields.io/badge/arXiv-2024.04-red) 71 | * Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [[Paper]](https://arxiv.org/abs/2404.12253) ![](https://img.shields.io/badge/arXiv-2024.04-red) 72 | * Reasoning with Language Model is Planning with World Model [[Paper]](https://aclanthology.org/2023.emnlp-main.507/) ![](https://img.shields.io/badge/EMNLP-2023-blue) 73 | * Large Language Models as Commonsense Knowledge for Large-Scale Task Planning [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html) ![](https://img.shields.io/badge/NeurIPS-2023-blue) 74 | * ALPHAZERO-LIKE TREE-SEARCH CAN GUIDE LARGE LANGUAGE MODEL DECODING AND TRAINING [[Paper]](https://openreview.net/forum?id=PJfc4x2jXY) ![](https://img.shields.io/badge/NeurIPS_WorkShop-2023-blue) 75 | * Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [[Paper]](https://openreview.net/forum?id=PJfc4x2jXY) ![](https://img.shields.io/badge/NeurIPS_WorkShop-2023-blue) 76 | * MAKING PPO EVEN BETTER: VALUE-GUIDED MONTE-CARLO TREE SEARCH DECODING [[Paper]](https://arxiv.org/abs/2309.15028) ![](https://img.shields.io/badge/arXiv-2023.09-red) 77 | ## Part 5: Self-Training / Self-Improve 78 | * Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math) [[Paper]](https://arxiv.org/abs/2501.04519) ![](https://img.shields.io/badge/arXiv-2025.01-red) 79 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816) ![](https://img.shields.io/badge/arXiv-2024.12-red) 80 | * Recursive Introspection: Teaching Language Model Agents How to Self-Improve [[Paper]](https://openreview.net/forum?id=DRC9pZwBwR) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 81 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) ![](https://img.shields.io/badge/arXiv-2024.12-red) 82 | * ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [[Paper]](https://openreview.net/forum?id=lNAyUngGFK) ![](https://img.shields.io/badge/TMLR-2024-blue) 83 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) ![](https://img.shields.io/badge/ACL-2024-blue) 84 | * Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2406.11736) ![](https://img.shields.io/badge/arXiv-2024.06-red) 85 | * CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [[Paper]](https://openreview.net/forum?id=Sx038qxjek) ![](https://img.shields.io/badge/ICLR-2024-blue) 86 | * Enhancing Large Vision Language Models with Self-Training on Image Comprehension [[Paper]](https://arxiv.org/abs/2405.19716) ![](https://img.shields.io/badge/arXiv-2024.05-red) 87 | * Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [[Paper]](https://arxiv.org/abs/2403.09629) ![](https://img.shields.io/badge/arXiv-2024.03-red) 88 | * V-star: Training Verifiers for Self-Taught Reasoners [[Paper]](https://arxiv.org/abs/2402.06457) ![](https://img.shields.io/badge/arXiv-2024.02-red) 89 | * Self-Refine: Iterative 
Refinement with Self-Feedback [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html) ![](https://img.shields.io/badge/NeurIPS-2023-blue) 90 | * ReST: Reinforced Self-Training for Language Modeling [[Paper]](https://arxiv.org/abs/2308.08998) ![](https://img.shields.io/badge/arXiv-2023.08-red) 91 | * STaR: Bootstrapping Reasoning With Reasoning [[Paper]](https://arxiv.org/abs/2203.14465) ![](https://img.shields.io/badge/arXiv-2022.05-red) 92 | * Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search [[Paper]](https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html) ![](https://img.shields.io/badge/NeurIPS-2017-blue) 93 | ## Part 6: Reflection 94 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) ![](https://img.shields.io/badge/arXiv-2024.12-red) 95 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.11930) ![](https://img.shields.io/badge/arXiv-2024.11-red) 96 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) ![](https://img.shields.io/badge/arXiv-2024.11-red) 97 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) ![](https://img.shields.io/badge/arXiv-2024.11-red) 98 | * Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [[Paper]](https://arxiv.org/abs/2408.06195) ![](https://img.shields.io/badge/arXiv-2024.08-red) 99 | * Reflection-Tuning: An Approach for Data Recycling [[Paper]](https://arxiv.org/abs/2310.11716) ![](https://img.shields.io/badge/arXiv-2023.10-red) 100 | ## Part 7: Efficient System2 101 | * O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [[Paper]](https://arxiv.org/abs/2501.12570) ![](https://img.shields.io/badge/arXiv-2025.01-red) 102 | * Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [[Paper]](https://arxiv.org/abs/2501.01306) ![](https://img.shields.io/badge/arXiv-2025.01-red) 103 | * DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2407.01009) ![](https://img.shields.io/badge/arXiv-2024.12-red) 104 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) ![](https://img.shields.io/badge/arXiv-2024.12-red) 105 | * Token-Budget-Aware LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.18547) ![](https://img.shields.io/badge/arXiv-2024.12-red) 106 | * Training Large Language Models to Reason in a Continuous Latent Space [[Paper]](https://arxiv.org/abs/2412.06769) ![](https://img.shields.io/badge/arXiv-2024.12-red) 107 | * Guiding Language Model Reasoning with Planning Tokens [[Paper]](https://arxiv.org/abs/2310.05707) ![](https://img.shields.io/badge/arXiv-2024.10-red) 108 | ## Part 8: Explainability 109 | * Agents Thinking Fast and Slow: A Talker-Reasoner Architecture [[Paper]](https://openreview.net/forum?id=xPhcP6rbI4) ![](https://img.shields.io/badge/NeurIPS_WorkShop-2024-blue) 110 | * What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [[Paper]](https://arxiv.org/abs/2410.23743) ![](https://img.shields.io/badge/arXiv-2024.10-red) 111 | * When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? 
An Analysis of OpenAI o1 [[Paper]](https://arxiv.org/abs/2410.01792) ![](https://img.shields.io/badge/arXiv-2024.10-red) 112 | * The Impact of Reasoning Step Length on Large Language Models [[Paper]](https://arxiv.org/abs/2401.04925) ![](https://img.shields.io/badge/arXiv-2024.08-red) 113 | * Distilling System 2 into System 1 [[Paper]](https://arxiv.org/abs/2407.06023) ![](https://img.shields.io/badge/arXiv-2024.07-red) 114 | * System 2 Attention (is something you might need too) [[Paper]](https://arxiv.org/abs/2311.11829) ![](https://img.shields.io/badge/arXiv-2023.11-red) 115 | ## Part 9: Multimodal Agent related Slow-Fast System 116 | * Diving into Self-Evolving Training for Multimodal Reasoning [[Paper]](https://arxiv.org/abs/2412.17451) ![](https://img.shields.io/badge/arXiv-2025.01-red) 117 | * Visual Agents as Fast and Slow Thinkers [[Paper]](https://openreview.net/forum?id=ncCuiD3KJQ) ![](https://img.shields.io/badge/ICLR-2025-blue) 118 | * Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [[Paper]](https://arxiv.org/abs/2501.01904) ![](https://img.shields.io/badge/arXiv-2025.01-red) 119 | * Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension [[Paper]](https://arxiv.org/pdf/2412.03704) ![](https://img.shields.io/badge/arXiv-2024.12-red) 120 | * Slow Perception: Let's Perceive Geometric Figures Step-by-Step [[Paper]](https://arxiv.org/abs/2412.20631) ![](https://img.shields.io/badge/arXiv-2024.12-red) 121 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.11930) ![](https://img.shields.io/badge/arXiv-2024.11-red) 122 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) ![](https://img.shields.io/badge/arXiv-2024.11-red) 123 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) ![](https://img.shields.io/badge/arXiv-2024.11-red) 124 | ## Part 10: Benchmark and Datasets 125 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [[Paper]](https://arxiv.org/abs/2501.03124) ![](https://img.shields.io/badge/arXiv-2025.01-red) 126 | * MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [[Paper]](https://openreview.net/forum?id=GN2qbxZlni) ![](https://img.shields.io/badge/NeurIPS-2024-blue) 127 | * Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [[Paper]](https://arxiv.org/abs/2412.21187) ![](https://img.shields.io/badge/arXiv-2024.12-red) 128 | * A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? 
[[Paper]](https://arxiv.org/abs/2409.15277) ![](https://img.shields.io/badge/arXiv-2024.09-red) 129 | -------------------------------------------------------------------------------- /src/main.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | from typing import Optional 4 | 5 | 6 | class PaperInformation: 7 | def __init__( 8 | self, paper: str, link: str, venue: str, date: str, label: Optional[str] = None 9 | ): 10 | self.paper = paper 11 | self.link = link 12 | self.venue = venue 13 | self.date = date 14 | self.label = label 15 | 16 | def __hash__(self): 17 | return hash(self.label) 18 | 19 | def __lt__(self, other: "PaperInformation"): 20 | self_year, self_month = map(int, self.date.split("-")) 21 | other_year, other_month = map(int, other.date.split("-")) 22 | 23 | if self_year != other_year: 24 | return self_year > other_year # Reverse logic to sort DESCENDING 25 | if self_month != other_month: 26 | return self_month > other_month # Reverse logic to sort DESCENDING 27 | if self.venue != other.venue: 28 | return self.venue < other.venue # Sort venues in ascending order 29 | return self.paper < other.paper # Sort titles in ascending order 30 | 31 | def __eq__(self, other: "PaperInformation"): 32 | return self.label == other.label 33 | 34 | 35 | class Utility: 36 | @staticmethod 37 | def get_paper_information(raw_paper_dict: dict) -> list[PaperInformation]: 38 | paper_information_list = [] 39 | for key, value in raw_paper_dict.items(): 40 | if isinstance(value, dict): 41 | paper_information_list.extend(Utility.get_paper_information(value)) 42 | elif isinstance(value, list): 43 | for raw_paper_information in value: 44 | if len(raw_paper_information.keys()) == 1: 45 | assert "label" in raw_paper_information.keys() 46 | else: 47 | paper_label = raw_paper_information.get("label", None) 48 | paper_information = PaperInformation( 49 | paper=raw_paper_information["paper"], 50 | link=raw_paper_information["link"], 51 | venue=raw_paper_information["venue"], 52 | date=raw_paper_information["date"], 53 | label=paper_label, 54 | ) 55 | assert ( 56 | paper_label is not None 57 | or paper_information not in paper_information_list 58 | ) 59 | paper_information_list.append(paper_information) 60 | else: 61 | raise TypeError(f"Unexpected type: {type(value)}") 62 | return paper_information_list 63 | 64 | @staticmethod 65 | def fill_paper_dict( 66 | raw_paper_dict: dict, paper_information_list: list[PaperInformation] 67 | ) -> dict: 68 | processed_paper_dict = {} 69 | for key, value in raw_paper_dict.items(): 70 | if isinstance(value, dict): 71 | processed_paper_dict[key] = Utility.fill_paper_dict( 72 | value, paper_information_list 73 | ) 74 | elif isinstance(value, list): 75 | processed_paper_dict[key] = [] 76 | for raw_paper_information in value: 77 | if ( 78 | len(raw_paper_information.keys()) == 1 79 | or "label" in raw_paper_information.keys() 80 | ): 81 | paper_label = raw_paper_information["label"] 82 | for paper_information in paper_information_list: 83 | if paper_information.label == paper_label: 84 | break 85 | else: 86 | raise ValueError(f"Paper label not found: {paper_label}") 87 | processed_paper_dict[key].append(paper_information) 88 | else: 89 | processed_paper_dict[key].append( 90 | PaperInformation( 91 | paper=raw_paper_information["paper"], 92 | link=raw_paper_information["link"], 93 | venue=raw_paper_information["venue"], 94 | date=raw_paper_information["date"], 95 | ) 96 | ) 97 | else: 98 | raise TypeError(f"Unexpected 
type: {type(value)}") 99 | return processed_paper_dict 100 | 101 | @staticmethod 102 | def generate_title_with_level(title: str, title_level: int) -> str: 103 | return f"{'#' * (title_level + 2)} {title}\n" 104 | 105 | @staticmethod 106 | def generate_readme_table_with_title( 107 | title: str, title_level: int, paper_information_list: list[PaperInformation] 108 | ) -> str: 109 | result_str = Utility.generate_title_with_level(title, title_level) 110 | result_str += "|Title|Venue|Date|\n" 111 | result_str += "|:---|:---|:---|\n" 112 | paper_information_list.sort() 113 | for paper_information in paper_information_list: 114 | result_str += ( 115 | f"|[{paper_information.paper}]({paper_information.link})|" 116 | f"{paper_information.venue}|" 117 | f"{paper_information.date}|\n" 118 | ) 119 | return result_str 120 | 121 | @staticmethod 122 | def generate_all_table( 123 | paper_dict: dict, topmost_table_level: int, current_table_str: str 124 | ) -> str: 125 | for key, value in paper_dict.items(): 126 | if isinstance(value, dict): 127 | current_table_str += Utility.generate_title_with_level( 128 | key, topmost_table_level 129 | ) 130 | current_table_str = Utility.generate_all_table( 131 | value, topmost_table_level + 1, current_table_str 132 | ) 133 | elif isinstance(value, list): 134 | current_table_str += Utility.generate_readme_table_with_title( 135 | key, topmost_table_level, value 136 | ) 137 | else: 138 | raise TypeError(f"Unexpected type: {type(value)}") 139 | return current_table_str 140 | 141 | @staticmethod 142 | def generate_list_with_title( 143 | title: str, title_level: int, paper_information_list: list[PaperInformation] 144 | ) -> str: 145 | result_str = Utility.generate_title_with_level(title, title_level) 146 | for paper_information in paper_information_list: 147 | badge_color = "blue" 148 | if "arxiv" in paper_information.link.lower(): 149 | badge_color = "red" 150 | badge_text = f"arXiv-{paper_information.date.replace('-', '.')}" 151 | else: 152 | venue = paper_information.venue.replace(" ", "_") 153 | year = paper_information.date.split("-")[0] 154 | badge_text = f"{venue}-{year}" 155 | result_str += ( 156 | f"* {paper_information.paper} [[Paper]]({paper_information.link}) " 157 | f"![](https://img.shields.io/badge/{badge_text}-{badge_color})\n" 158 | ) 159 | return result_str 160 | 161 | @staticmethod 162 | def generate_all_list( 163 | paper_dict: dict, topmost_list_level: int, current_list_str: str 164 | ) -> str: 165 | for key, value in paper_dict.items(): 166 | if isinstance(value, dict): 167 | current_list_str += Utility.generate_title_with_level( 168 | key, topmost_list_level 169 | ) 170 | current_list_str = Utility.generate_all_list( 171 | value, topmost_list_level + 1, current_list_str 172 | ) 173 | elif isinstance(value, list): 174 | current_list_str += Utility.generate_list_with_title( 175 | key, topmost_list_level, value 176 | ) 177 | else: 178 | raise TypeError(f"Unexpected type: {type(value)}") 179 | return current_list_str 180 | 181 | 182 | def main(): 183 | raw_paper_dict = json.load(open("./assets/paper.json", "r")) 184 | paper_information_list = Utility.get_paper_information(raw_paper_dict) 185 | processed_paper_dict = Utility.fill_paper_dict( 186 | raw_paper_dict, paper_information_list 187 | ) 188 | all_table_str = Utility.generate_all_table(processed_paper_dict, 0, "") 189 | with open("./src/table.md", "w") as f: 190 | f.write(all_table_str) 191 | all_list_str = Utility.generate_all_list(processed_paper_dict, 0, "") 192 | with open("./src/list.md", "w") as f: 
193 | f.write(all_list_str) 194 | 195 | 196 | if __name__ == "__main__": 197 | main() 198 | -------------------------------------------------------------------------------- /src/table.md: -------------------------------------------------------------------------------- 1 | ## Part 1: O1 Replication 2 | |Title|Venue|Date| 3 | |:---|:---|:---| 4 | |[Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems](https://arxiv.org/abs/2412.09413)|arXiv|2024-12| 5 | |[o1-Coder: an o1 Replication for Coding](https://arxiv.org/abs/2412.00154)|arXiv|2024-12| 6 | |[Enhancing LLM Reasoning with Reward-guided Tree Search](https://arxiv.org/abs/2411.11694)|arXiv|2024-11| 7 | |[Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions](https://arxiv.org/abs/2411.14405)|arXiv|2024-11| 8 | |[O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?](https://arxiv.org/abs/2411.16489)|arXiv|2024-11| 9 | |[O1 Replication Journey: A Strategic Progress Report -- Part 1](https://arxiv.org/abs/2410.18982)|arXiv|2024-10| 10 | ## Part 2: Process Reward Models 11 | |Title|Venue|Date| 12 | |:---|:---|:---| 13 | |[PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models.](https://arxiv.org/abs/2501.03124)|arXiv|2025-01| 14 | |[ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding](https://arxiv.org/abs/2501.07861)|arXiv|2025-01| 15 | |[The Lessons of Developing Process Reward Models in Mathematical Reasoning.](https://arxiv.org/abs/2501.07301)|arXiv|2025-01| 16 | |[ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark.](https://arxiv.org/abs/2501.01290)|arXiv|2025-01| 17 | |[AutoPSV: Automated Process-Supervised Verifier](https://openreview.net/forum?id=eOAPWWOGs9)|NeurIPS|2024-12| 18 | |[ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search](https://openreview.net/forum?id=8rcFOqEud5)|NeurIPS|2024-12| 19 | |[Free Process Rewards without Process Labels.](https://arxiv.org/abs/2412.01981)|arXiv|2024-12| 20 | |[Outcome-Refining Process Supervision for Code Generation](https://arxiv.org/abs/2412.15118)|arXiv|2024-12| 21 | |[Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations](https://aclanthology.org/2024.acl-long.510/)|ACL|2024-08| 22 | |[OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning](https://aclanthology.org/2024.findings-naacl.55/)|ACL Findings|2024-08| 23 | |[Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs](https://arxiv.org/abs/2406.18629)|arXiv|2024-06| 24 | |[Let's Verify Step by Step.](https://arxiv.org/abs/2305.20050)|ICLR|2024-05| 25 | |[Improve Mathematical Reasoning in Language Models by Automated Process Supervision](https://arxiv.org/abs/2306.05372)|arXiv|2023-06| 26 | |[Making Large Language Models Better Reasoners with Step-Aware Verifier](https://arxiv.org/abs/2206.02336)|arXiv|2023-06| 27 | |[Solving Math Word Problems with Process and Outcome-Based Feedback](https://arxiv.org/abs/2211.14275)|arXiv|2022-11| 28 | ## Part 3: Reinforcement Learning 29 | |Title|Venue|Date| 30 | |:---|:---|:---| 31 | |[Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search](https://arxiv.org/abs/2502.02508)|arXiv|2025-02| 32 | |[Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling](https://arxiv.org/abs/2501.11651)|arXiv|2025-01| 33 | |[Challenges in Ensuring AI Safety in DeepSeek-R1 
Models: The Shortcomings of Reinforcement Learning Strategies](https://arxiv.org/abs/2501.17030)|arXiv|2025-01| 34 | |[DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)|arXiv|2025-01| 35 | |[Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)|arXiv|2025-01| 36 | |[Does RLHF Scale? Exploring the Impacts From Data, Model, and Method](https://arxiv.org/abs/2412.06000)|arXiv|2024-12| 37 | |[Offline Reinforcement Learning for LLM Multi-Step Reasoning](https://arxiv.org/abs/2412.16145)|arXiv|2024-12| 38 | |[ReFT: Representation Finetuning for Language Models](https://aclanthology.org/2024.acl-long.410.pdf)|ACL|2024-08| 39 | |[Deepseekmath: Pushing the limits of mathematical reasoning in open language models](https://arxiv.org/abs/2402.03300)|arXiv|2024-02| 40 | ## Part 4: MCTS/Tree Search 41 | |Title|Venue|Date| 42 | |:---|:---|:---| 43 | |[On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes](https://ieeexplore.ieee.org/abstract/document/10870057/)|IEEE TAC|2025-01| 44 | |[Search-o1: Agentic Search-Enhanced Large Reasoning Models](https://arxiv.org/abs/2501.05366)|arXiv|2025-01| 45 | |[rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking](https://arxiv.org/abs/2501.04519)|arXiv|2025-01| 46 | |[ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search](https://arxiv.org/abs/2406.03816)|NeurIPS|2024-12| 47 | |[Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning](https://arxiv.org/abs/2412.09078)|arXiv|2024-12| 48 | |[HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs](https://arxiv.org/abs/2412.18925)|arXiv|2024-12| 49 | |[Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search](https://arxiv.org/abs/2412.18319)|arXiv|2024-12| 50 | |[Proposing and solving olympiad geometry with guided tree search](https://arxiv.org/abs/2412.10673)|arXiv|2024-12| 51 | |[SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models](https://arxiv.org/abs/2412.11605)|arXiv|2024-12| 52 | |[Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning](https://arxiv.org/abs/2412.17397)|arXiv|2024-12| 53 | |[CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models](https://arxiv.org/abs/2411.04329)|arXiv|2024-11| 54 | |[GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection](https://arxiv.org/abs/2411.04459)|arXiv|2024-11| 55 | |[MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree](https://arxiv.org/abs/2411.15645)|arXiv|2024-11| 56 | |[Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions](https://arxiv.org/abs/2411.14405)|arXiv|2024-11| 57 | |[SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation](https://arxiv.org/abs/2411.11053)|arXiv|2024-11| 58 | |[Don’t throw away your value model! 
Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding](https://openreview.net/forum?id=kh9Zt2Ldmn#discussion)|CoLM|2024-10| 59 | |[AFlow: Automating Agentic Workflow Generation](https://arxiv.org/abs/2410.10762)|arXiv|2024-10| 60 | |[Interpretable Contrastive Monte Carlo Tree Search Reasoning](https://arxiv.org/abs/2410.01707)|arXiv|2024-10| 61 | |[LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning](https://arxiv.org/abs/2410.02884)|arXiv|2024-10| 62 | |[Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning](https://arxiv.org/abs/2410.06508)|arXiv|2024-10| 63 | |[TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling](https://arxiv.org/abs/2410.16033)|arXiv|2024-10| 64 | |[Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination](https://arxiv.org/abs/2410.17820)|arXiv|2024-10| 65 | |[RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation](https://arxiv.org/abs/2409.09584)|arXiv|2024-09| 66 | |[Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search](https://arxiv.org/abs/2408.10635)|arXiv|2024-08| 67 | |[LiteSearch: Efficacious Tree Search for LLM](https://arxiv.org/abs/2407.00320)|arXiv|2024-07| 68 | |[Tree Search for Language Model Agents](https://arxiv.org/abs/2407.01476)|arXiv|2024-07| 69 | |[Uncertainty-Guided Optimization on Large Language Model Search Trees](https://arxiv.org/abs/2407.03951)|arXiv|2024-07| 70 | |[Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B](https://arxiv.org/abs/2406.07394)|arXiv|2024-06| 71 | |[Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping](https://openreview.net/forum?id=rviGTsl0oy)|ICLR WorkShop|2024-05| 72 | |[LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models](https://openreview.net/forum?id=h1mvwbQiXR)|ICLR WorkShop|2024-05| 73 | |[AlphaMath Almost Zero: process Supervision without process](https://arxiv.org/abs/2405.03553)|arXiv|2024-05| 74 | |[Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search](https://arxiv.org/abs/2405.15383)|arXiv|2024-05| 75 | |[MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time](https://arxiv.org/abs/2405.16265)|arXiv|2024-05| 76 | |[Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning](https://arxiv.org/abs/2405.00451)|arXiv|2024-05| 77 | |[Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning](https://arxiv.org/abs/2405.00451)|arXiv|2024-05| 78 | |[Stream of Search (SoS): Learning to Search in Language](https://arxiv.org/abs/2404.03683)|arXiv|2024-04| 79 | |[Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing](https://arxiv.org/abs/2404.12253)|arXiv|2024-04| 80 | |[Reasoning with Language Model is Planning with World Model](https://aclanthology.org/2023.emnlp-main.507/)|EMNLP|2023-12| 81 | |[Large Language Models as Commonsense Knowledge for Large-Scale Task Planning](https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html)|NeurIPS|2023-12| 82 | |[ALPHAZERO-LIKE TREE-SEARCH CAN GUIDE LARGE LANGUAGE MODEL DECODING AND TRAINING](https://openreview.net/forum?id=PJfc4x2jXY)|NeurIPS WorkShop|2023-12| 83 | |[Alphazero-like Tree-Search can Guide Large Language Model 
Decoding and Training](https://openreview.net/forum?id=PJfc4x2jXY)|NeurIPS WorkShop|2023-12| 84 | |[MAKING PPO EVEN BETTER: VALUE-GUIDED MONTE-CARLO TREE SEARCH DECODING](https://arxiv.org/abs/2309.15028)|arXiv|2023-09| 85 | ## Part 5: Self-Training / Self-Improve 86 | |Title|Venue|Date| 87 | |:---|:---|:---| 88 | |[Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math)](https://arxiv.org/abs/2501.04519)|arXiv|2025-01| 89 | |[ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search](https://arxiv.org/abs/2406.03816)|NeurIPS|2024-12| 90 | |[Recursive Introspection: Teaching Language Model Agents How to Self-Improve](https://openreview.net/forum?id=DRC9pZwBwR)|NeurIPS|2024-12| 91 | |[B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner](https://arxiv.org/abs/2412.17256)|arXiv|2024-12| 92 | |[ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models](https://openreview.net/forum?id=lNAyUngGFK)|TMLR|2024-09| 93 | |[ReFT: Representation Finetuning for Language Models](https://aclanthology.org/2024.acl-long.410.pdf)|ACL|2024-08| 94 | |[Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models](https://arxiv.org/abs/2406.11736)|arXiv|2024-06| 95 | |[CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing](https://openreview.net/forum?id=Sx038qxjek)|ICLR|2024-05| 96 | |[Enhancing Large Vision Language Models with Self-Training on Image Comprehension](https://arxiv.org/abs/2405.19716)|arXiv|2024-05| 97 | |[Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking](https://arxiv.org/abs/2403.09629)|arXiv|2024-03| 98 | |[V-star: Training Verifiers for Self-Taught Reasoners](https://arxiv.org/abs/2402.06457)|arXiv|2024-02| 99 | |[Self-Refine: Iterative Refinement with Self-Feedback](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html)|NeurIPS|2023-12| 100 | |[ReST: Reinforced Self-Training for Language Modeling](https://arxiv.org/abs/2308.08998)|arXiv|2023-08| 101 | |[STaR: Bootstrapping Reasoning With Reasoning](https://arxiv.org/abs/2203.14465)|NeurIPS2022|2022-05| 102 | |[Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search](https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html)|NeurIPS|2017-12| 103 | ## Part 6: Reflection 104 | |Title|Venue|Date| 105 | |:---|:---|:---| 106 | |[HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs](https://arxiv.org/abs/2412.18925)|arXiv|2024-12| 107 | |[AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning](https://arxiv.org/abs/2411.11930)|arXiv|2024-11| 108 | |[LLaVA-o1: Let Vision Language Models Reason Step-by-Step](https://arxiv.org/abs/2411.10440)|arXiv|2024-11| 109 | |[Vision-Language Models Can Self-Improve Reasoning via Reflection](https://arxiv.org/abs/2411.00855)|arXiv|2024-11| 110 | |[Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers](https://arxiv.org/abs/2408.06195)|arXiv|2024-08| 111 | |[Reflection-Tuning: An Approach for Data Recycling](https://arxiv.org/abs/2310.11716)|arXiv|2023-10| 112 | ## Part 7: Efficient System2 113 | |Title|Venue|Date| 114 | |:---|:---|:---| 115 | |[O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning](https://arxiv.org/abs/2501.12570)|arXiv|2025-01| 116 | |[Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow 
Thinking](https://arxiv.org/abs/2501.01306)|arXiv|2025-01| 117 | |[DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models](https://arxiv.org/abs/2407.01009)|EMNLP|2024-12| 118 | |[B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner](https://arxiv.org/abs/2412.17256)|arXiv|2024-12| 119 | |[Token-Budget-Aware LLM Reasoning](https://arxiv.org/abs/2412.18547)|arXiv|2024-12| 120 | |[Training Large Language Models to Reason in a Continuous Latent Space](https://arxiv.org/abs/2412.06769)|arXiv|2024-12| 121 | |[Guiding Language Model Reasoning with Planning Tokens](https://arxiv.org/abs/2310.05707)|CoLM|2024-10| 122 | ## Part 8: Explainability 123 | |Title|Venue|Date| 124 | |:---|:---|:---| 125 | |[Agents Thinking Fast and Slow: A Talker-Reasoner Architecture](https://openreview.net/forum?id=xPhcP6rbI4)|NeurIPS WorkShop|2024-12| 126 | |[What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective](https://arxiv.org/abs/2410.23743)|arXiv|2024-10| 127 | |[When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1](https://arxiv.org/abs/2410.01792)|arXiv|2024-10| 128 | |[The Impact of Reasoning Step Length on Large Language Models](https://arxiv.org/abs/2401.04925)|ACL Findings|2024-08| 129 | |[Distilling System 2 into System 1](https://arxiv.org/abs/2407.06023)|arXiv|2024-07| 130 | |[System 2 Attention (is something you might need too)](https://arxiv.org/abs/2311.11829)|arXiv|2023-11| 131 | ## Part 9: Multimodal Agent related Slow-Fast System 132 | |Title|Venue|Date| 133 | |:---|:---|:---| 134 | |[Diving into Self-Evolving Training for Multimodal Reasoning](https://arxiv.org/abs/2412.17451)|ICLR|2025-01| 135 | |[Visual Agents as Fast and Slow Thinkers](https://openreview.net/forum?id=ncCuiD3KJQ)|ICLR|2025-01| 136 | |[Virgo: A Preliminary Exploration on Reproducing o1-like MLLM](https://arxiv.org/abs/2501.01904)|arXiv|2025-01| 137 | |[Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension](https://arxiv.org/pdf/2412.03704)|arXiv|2024-12| 138 | |[Slow Perception: Let's Perceive Geometric Figures Step-by-Step](https://arxiv.org/abs/2412.20631)|arXiv|2024-12| 139 | |[AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning](https://arxiv.org/abs/2411.11930)|arXiv|2024-11| 140 | |[LLaVA-o1: Let Vision Language Models Reason Step-by-Step](https://arxiv.org/abs/2411.10440)|arXiv|2024-11| 141 | |[Vision-Language Models Can Self-Improve Reasoning via Reflection](https://arxiv.org/abs/2411.00855)|arXiv|2024-11| 142 | ## Part 10: Benchmark and Datasets 143 | |Title|Venue|Date| 144 | |:---|:---|:---| 145 | |[PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models](https://arxiv.org/abs/2501.03124)|arXiv|2025-01| 146 | |[MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs](https://openreview.net/forum?id=GN2qbxZlni)|NeurIPS|2024-12| 147 | |[Do NOT Think That Much for 2+3=? 
On the Overthinking of o1-like LLMs](https://arxiv.org/abs/2412.21187)|arXiv|2024-12| 148 | |[A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?](https://arxiv.org/abs/2409.15277)|arXiv|2024-09| 149 | -------------------------------------------------------------------------------- /src/timeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/src/timeline.png --------------------------------------------------------------------------------
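A note on regenerating the lists: `src/main.py` reads `./assets/paper.json` (whose contents are not included in this dump) and rewrites `src/table.md` and `src/list.md` from it. The sketch below is a minimal, hypothetical example of the nested layout that `Utility.get_paper_information` and `Utility.fill_paper_dict` appear to expect: section names mapping to lists (or sub-dicts) of entries with `paper`, `link`, `venue`, and `date` fields, plus an optional `label` that lets a label-only entry in another section reuse the same paper. The section names and the `marco-o1` label here are illustrative assumptions, not the contents of the real file.

```python
import json

# Hypothetical sketch of the ./assets/paper.json layout, inferred from the
# parsing logic in src/main.py; the keys below and the "marco-o1" label are
# illustrative only.
example_paper_dict = {
    "Part 4: MCTS/Tree Search": [
        {
            "paper": "Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions",
            "link": "https://arxiv.org/abs/2411.14405",
            "venue": "arXiv",
            "date": "2024-11",
            "label": "marco-o1",  # optional; allows cross-references from other sections
        },
    ],
    "Part 1: O1 Replication": [
        {"label": "marco-o1"},  # label-only entry: reuses the paper defined above
    ],
}

if __name__ == "__main__":
    # Dump the sketch in the same shape that main.py loads with json.load(...).
    with open("example_paper.json", "w") as f:
        json.dump(example_paper_dict, f, indent=2)
    # With a real ./assets/paper.json in place, running `python src/main.py`
    # from the repository root (the relative paths in main() assume that
    # working directory) regenerates src/table.md and src/list.md.
```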