├── README.md
├── README_en.md
├── assets
│   ├── AM-DeepSeek-R1-Distilled.jpeg
│   ├── DeepDistill.png
│   ├── Exploring-the-Potential-of-Offline-RL-for-Reasoning-in-LLMs-A-Preliminary-Study.png
│   ├── Leveraging-Reasoning-Model-Answers-to-Enhance-Non-Reasoning-Model-Capability.png
│   ├── Not-All-Correct-Answers-Are-Equal-Why-Your-Distillation-Source-Matters.png
│   ├── Think-Twice-DeepSeek-R1.png
│   ├── Think-Twice-QwQ.png
│   ├── am-thinking-v1-benchmark.png
│   ├── am-thinking-v1-results_with_params.jpg
│   ├── am_logo.png
│   ├── staged-RL-2stage.png
│   ├── staged-RL-data-difficulty.png
│   └── staged-RL-math-code.png
└── docs
    ├── AM-DeepSeek-R1-Distilled-Dataset.pdf
    ├── AM-Thinking-v1.pdf
    ├── DeepDistill.pdf
    ├── Exploring-the-Potential-of-Offline-RL-for-Reasoning-in-LLMs-A-Preliminary-Study.pdf
    ├── How-Difficulty-Aware-Staged-Reinforcement-Learning-Enhances-LLMs-Reasoning-Capabilities-A-Preliminary-Experimental-Study.pdf
    ├── Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability.pdf
    ├── Not All Correct Answers Are Equal- Why Your Distillation Source Matters.pdf
    └── Think-Twice.pdf
/README.md:
--------------------------------------------------------------------------------
1 | # a-m-models [](https://huggingface.co/a-m-team)
2 |
3 | *Read this in [English](README_en.md).*
4 |
5 | a-m-models is an open-source project initiated by the a-m-team, dedicated to in-depth exploration and hands-on practice of cutting-edge technologies in large language models (LLMs) and artificial general intelligence (AGI). Our team consists of passionate researchers and developers who focus on theoretical innovation, architecture design, and practical application of large models, with the goal of progressively approaching AGI. Through this project we openly share our latest research results and practical experience in the large-model field, hoping to foster deeper community exchange and joint progress on AGI technology.
6 |
7 | ## 🔄 Recent Updates
8 |
9 | * [2025-05-20] Released the technical report [Not All Correct Answers Are Equal: Why Your Distillation Source Matters](https://arxiv.org/abs/2505.14464), comparing distillation from three teacher models: AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1. Training on data distilled from AM-Thinking-v1 gives the best results, and the analysis shows that output length can be adapted to question difficulty. The AM-Thinking-v1 and Qwen3-235B-A22B distillation datasets have been open-sourced.
10 |
11 | * [2025-05-14] Released the technical report [AM-Thinking-v1: Advancing the Frontier of
12 | Reasoning at 32B Scale](https://arxiv.org/abs/2505.08311), which combines supervised fine-tuning with reinforcement learning to significantly improve reasoning, surpassing DeepSeek-R1 on math and coding tasks, approaching mainstream MoE models, and achieving the best open-source results at the dense 32B scale.
13 |
14 | * [2025-05-05] Released the technical report [Exploring the Potential of Offline RL for Reasoning in
15 | LLMs: A Preliminary Study](https://arxiv.org/abs/2505.02142), which explores Offline RL methods for strengthening model reasoning; experiments show consistent gains across evaluation metrics.
16 |
17 | * [2025-04-24] Released the technical report [DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training](https://arxiv.org/abs/2504.17565) and open-sourced a dataset of roughly 40 million responses distilled from models of varying capability, significantly improving the reasoning ability of base models.
18 |
19 | * [2025-04-13] Updated the technical report [Leveraging Reasoning Model Answers to Enhance Non-Reasoning
20 | Model Capability](https://arxiv.org/abs/2504.09639), which explores how reasoning models can be used to improve the performance of non-reasoning models.
21 |
22 | * [2025-04-01] Updated the technical report [How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study](https://arxiv.org/abs/2504.00829), which introduces a staged training approach that gradually exposes the model to more challenging tasks, improving its reasoning ability.
23 |
24 | * [2025-03-25] Updated the technical report [1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training](https://arxiv.org/abs/2503.19633) and open-sourced 1.4 million distilled reasoning samples, reproducing the performance of the DeepSeek-R1 distilled models.
25 |
26 | * [2025-03-25] Updated the technical report [Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking](https://arxiv.org/abs/2503.19855), which introduces Multi-round Thinking, a simple yet effective test-time scaling method that pushes state-of-the-art models to further gains.
27 |
28 | ## 📑 Research Reports
29 |
30 | ### [Not All Correct Answers Are Equal: Why Your Distillation Source Matters](https://arxiv.org/abs/2505.14464) [](https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) [](https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled)
31 |
32 | We distilled three sets of reasoning data from AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1. Experiments show that distillation from AM-Thinking-v1 performs best, reaching **AIME2024 84.3, AIME2025 72.2, MATH500 98.4, LiveCodeBench 65.9**.
33 |
34 |
35 |
36 | We also find that, compared with the model distilled from Qwen3-235B-A22B, the model distilled from AM-Thinking-v1 produces shorter reasoning on easier tasks (e.g., MATH500) and longer reasoning on harder tasks (e.g., AIME2024 & 2025, LiveCodeBench). The distillation data from AM-Thinking-v1 and Qwen3-235B-A22B has been open-sourced.
37 |
38 | #### Table: Average generation length (tokens per sample) across reasoning benchmarks
39 |
40 | | Benchmark | AM-Thinking-v1 Distilled | Qwen3-235B-A22B Distilled | DeepSeek-R1 Distilled |
41 | |------------------|--------------------------|---------------------------|-----------------------|
42 | | AIME2024 | 15,273.8 | 13,516.4 | 11,853.5 |
43 | | AIME2025 | 18,199.2 | 16,975.7 | 13,495.9 |
44 | | MATH500 | 3,495.7 | 6,429.4 | 3,613.0 |
45 | | LiveCodeBench | 23,426.9 | 13,576.7 | 30,731 |
46 |
47 |
48 |
49 | ### [AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale](https://arxiv.org/abs/2505.08311)[](https://huggingface.co/a-m-team/AM-Thinking-v1)
50 |
51 | Most open-source language models that excel at reasoning adopt a Mixture-of-Experts (MoE) architecture, such as Qwen3-235B-A22B and Seed1.5-Thinking. Although these models perform well, their deployment and fine-tuning costs are high, which makes them hard to use in resource-constrained settings. In contrast, medium-scale dense models (e.g., 32B) offer a better balance between performance and practicality, yet work in this regime remains relatively scarce.
52 |
53 | Motivated by this, we built **AM-Thinking-v1**. The model is trained on publicly available data and optimizes reasoning and coding ability through a post-training pipeline that combines supervised fine-tuning with reinforcement learning.
54 |
55 |
56 |
57 | Experimental results show that AM-Thinking-v1 performs strongly across benchmarks: **85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench**, surpassing DeepSeek-R1 and approaching the strongest MoE models, making it the best dense 32B model currently available. The results demonstrate that, thanks to a carefully designed training pipeline, a 32B open-source dense model can achieve competitive performance on challenging reasoning tasks.
58 |
59 |
60 |
61 |
62 | ### [Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study](https://arxiv.org/abs/2505.02142)
63 |
64 | As large language models (LLMs) keep improving on long-context reasoning tasks, mainstream approaches rely primarily on online reinforcement learning (Online RL), which usually carries high computational cost and complexity. Offline reinforcement learning (Offline RL), by contrast, is simpler and more efficient and therefore potentially attractive, but it remains under-explored for long-context reasoning.
65 |
66 | To fill this gap, the paper studies the effectiveness of Offline RL methods, in particular Direct Preference Optimization (DPO) and its length-desensitized variant LD-DPO, for improving LLM reasoning. Extensive experiments on multiple reasoning benchmarks show that these simpler Offline RL methods bring significant gains, with an average improvement of **3.3%** and a **10.1%** improvement on the arena-hard benchmark.
67 |
68 | The study also analyzes the sensitivity of DPO to output length, emphasizing that when lengthening reasoning text one should preserve semantic richness rather than blindly increase length, which can otherwise hurt performance.
69 |
70 |
71 |
72 |
73 | ### [DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training](https://arxiv.org/abs/2504.17565)[](https://huggingface.co/datasets/a-m-team/AM-DeepSeek-Distilled-40M)
74 |
75 | Although large language models (LLMs) have recently made significant progress on complex reasoning tasks, the training process and data quality behind base models remain poorly understood. To address this, we built a large-scale reasoning dataset containing roughly **3.34 million** unique questions and **40 million** answers distilled multiple times from models of varying capability. By introducing the pass rate and the coefficient of variation, we precisely select the training data with the highest learning potential to improve reasoning ability. The dataset has been open-sourced on Hugging Face.
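As a rough illustration of the selection criteria above, here is a minimal sketch that computes a per-question pass rate and coefficient of variation over several distilled answers and keeps questions in a mid-difficulty band. The record layout, the quantity the CV is taken over (answer length), and the thresholds are illustrative assumptions, not the released pipeline.

```python
from statistics import mean, stdev

def select_by_learning_potential(records, pass_lo=0.2, pass_hi=0.8, min_cv=0.1):
    """records: {question_id: [(is_correct, answer_len), ...]} with several
    distilled answers per question. Keep questions whose pass rate falls in a
    mid-difficulty band and whose answer lengths vary enough (high CV)."""
    selected = []
    for qid, answers in records.items():
        correctness = [1.0 if ok else 0.0 for ok, _ in answers]
        lengths = [length for _, length in answers]
        pass_rate = mean(correctness)          # fraction of verified-correct answers
        cv = stdev(lengths) / mean(lengths) if len(lengths) > 1 else 0.0
        if pass_lo <= pass_rate <= pass_hi and cv >= min_cv:
            selected.append(qid)
    return selected
```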
76 |
77 | On AIME2024, our 72B model reaches 79.2 **with SFT alone**; the 32B model reaches 75.8 and improves to 77.9 with further annealing training, approaching the best open-source results.
78 |
79 |
80 |
81 |
82 | ### [Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability](https://arxiv.org/abs/2504.09639)
83 |
84 | Recent advances in large language models (LLMs), such as DeepSeek-R1 and OpenAI-o1, have demonstrated the effectiveness of test-time scaling, delivering substantial gains across benchmarks. These models use deliberate "thinking" steps to systematically improve answer quality. In this work, we propose using the high-quality outputs of such reasoning models to improve less computationally demanding, non-reasoning models. We explore and compare ways of using reasoning-model answers to train and improve non-reasoning models. Through supervised fine-tuning (SFT) experiments on established benchmarks, we obtain consistent improvements, underscoring the potential of this approach for strengthening a non-reasoning model's ability to answer questions directly.
85 |
86 | 1. **Methods**: We compare three ways of using reasoning-model content:
87 |     - **Method 1**: use the responses produced by the original non-reasoning model;
88 |     - **Method 2**: use only the "answer" part of the reasoning model's output;
89 |     - **Method 3 (think summarization)**: summarize the reasoning model's "think" part and concatenate the summary with the answer part.
90 |
91 | 2. **Conclusion**: Used properly, the reasoning model's responses can enhance the non-reasoning model's capability, as shown in the figure. Applied improperly, however, they can cause some metrics to drop, so the strategy should be tailored to the scenario when using reasoning-model outputs to improve a non-reasoning model.
92 |
93 |
94 |
95 |
96 | ### [How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study](https://arxiv.org/abs/2504.00829)[](https://huggingface.co/datasets/a-m-team/AM-Math-Difficulty-RL)
97 |
98 | Improving the reasoning ability of large language models (LLMs) efficiently and at scale is a key challenge in AI research. This paper studies how a difficulty-aware, staged reinforcement learning (RL) strategy improves LLM performance. We show that selecting training data by difficulty level helps RL optimization. We further propose a staged training approach that gradually exposes the model to more challenging tasks, improving its reasoning ability. Our results also highlight clear benefits from training on mathematical reasoning and code generation tasks.
99 |
100 | #### 1. Data Difficulty Selection
101 |
102 | Carefully selecting RL training data according to an appropriate difficulty metric is critical. A moderate difficulty level improves learning efficiency, balancing sufficient challenge against the risk of overwhelming the learning process with overly difficult examples.
103 |
104 |
105 |
106 | #### 2. Staged Training
107 |
108 | By selecting appropriately challenging data and training in stages, we can significantly improve LLM performance on reasoning tasks. (Because no code-related training data was included, performance on LiveCodeBench stays essentially the same as the base model's.)
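A minimal sketch of what such a difficulty-staged schedule could look like, assuming each prompt carries a pre-computed base-model pass rate as its difficulty label and a generic `rl_update` step; both are hypothetical placeholders rather than the paper's actual training code.

```python
def staged_rl_training(policy, prompts, rl_update, epochs_per_stage=1):
    """Train in stages of increasing difficulty: each prompt dict carries a
    'pass_rate' measured on the base model, and lower pass rate means harder."""
    stages = [
        ("easy",   lambda p: p["pass_rate"] >= 0.6),          # illustrative thresholds
        ("medium", lambda p: 0.3 <= p["pass_rate"] < 0.6),
        ("hard",   lambda p: p["pass_rate"] < 0.3),
    ]
    for stage_name, keep in stages:
        stage_prompts = [p for p in prompts if keep(p)]
        for _ in range(epochs_per_stage):
            for prompt in stage_prompts:
                policy = rl_update(policy, prompt)             # one RL step on this prompt
    return policy
```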
109 |
110 |
111 |
112 | #### 3. Joint Training on Math and Code
113 |
114 | Mixing mathematical reasoning and code generation tasks during training yields cross-domain improvements, providing strong evidence for the benefits of multi-domain training.
115 |
116 |
117 |
118 | ### [Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking](https://arxiv.org/abs/2503.19855)
119 |
120 | In recent years, large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1 have made remarkable progress, showing that scaling the reasoning process at test time (test-time scaling) can substantially improve performance. However, current models remain limited by their ability to handle long text and by the efficiency of reinforcement learning (RL) training. To address this, we propose a simple yet effective test-time scaling method, Multi-round Thinking, which iteratively refines the model's reasoning by feeding the previous answer back as a prompt for the next round. Extensive experiments on models including QwQ-32B and DeepSeek-R1 show that Multi-round Thinking consistently improves performance on benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. For example, on AIME 2024 the accuracy of QwQ-32B rises from 80.3% in round 1 to 82.1% in round 2, and DeepSeek-R1 shows a similar improvement from 79.7% to 82.0%. These results demonstrate that Multi-round Thinking is a broadly applicable, easy-to-implement, and effective way to improve model performance, highlighting its potential for future test-time scaling techniques.
121 |
122 | The key prompt:
123 | ```
124 | Original question prompt
125 | The assistant’s previous answer is: last round answer, and please re-answer.
126 | ```
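To make the iteration concrete, below is a minimal sketch of a Multi-round Thinking driver built around the prompt template above. The `generate` callable stands in for any chat-completion call and is a hypothetical placeholder.

```python
def multi_round_thinking(generate, question, num_rounds=2):
    """Multi-round test-time thinking: each round re-asks the original question
    together with the previous round's answer and asks the model to re-answer.
    `generate(prompt) -> str` is any LLM completion call."""
    answer = generate(question)  # round 1: plain single-pass answer
    for _ in range(num_rounds - 1):
        prompt = (
            f"{question}\n"
            f"The assistant's previous answer is: {answer}, and please re-answer."
        )
        answer = generate(prompt)  # later rounds refine the previous answer
    return answer
```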
127 |
128 |
129 |
130 |
131 |
132 | ### Model performance (pass@1) with single-round thinking (round 1) vs. multi-round thinking (rounds 2-4) across benchmarks
133 |
134 | | **Model** | **Round** | **AIME 2024 pass@1** | **MATH500 pass@1** | **GPQA-Diamond pass@1** | **LiveCodeBench pass@1** | **Average** |
135 | |----------------------------------------|-----------|----------------------|--------------------|-------------------------|--------------------------|-------------|
136 | | **DeepSeek-R1** | 1 | 79.7 | 97.6 | 74.0 | 65.3 | 79.2 |
137 | | | **2** | **82.0** | **97.6** | **74.8** | **67.1** | **80.4** |
138 | | **QwQ-32B** | 1 | 80.3 | 97.2 | 65.9 | 63.0 | 76.6 |
139 | | | 2 | 82.1 | 97.8 | 67.2 | 64.7 | 78.0 |
140 | | | 3 | 82.8 | 97.8 | 67.5 | 65.2 | 78.3 |
141 | | | **4** | **83.1** | **97.7** | **68.1** | **66.0** | **78.7** |
142 | | **DeepSeek-R1-Distill-Qwen-32B** | 1 | 72.0 | 96.0 | 60.1 | 57.0 | 71.3 |
143 | | | **2** | **75.1** | **96.3** | **61.3** | **57.6** | **72.6** |
144 | | **DeepSeek-R1-Distill-Qwen-7B** | 1 | 56.9 | 93.4 | 49.2 | 35.0 | 58.6 |
145 | | | **2** | **58.4** | **93.9** | **49.4** | **36.7** | **59.6** |
146 | | **AM-Distill-Qwen-32B** | 1 | 72.8 | 96.2 | 62.3 | 58.3 | 72.4 |
147 | | | **2** | **76.7** | **97.2** | **62.8** | **60.2** | **74.2** |
148 |
149 | ---
151 |
152 | ### [1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training](https://arxiv.org/abs/2503.19633) [](https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M)
153 |
154 | AM-DeepSeek-R1-Distilled is a large-scale general reasoning dataset with thinking traces, built from a large number of high-quality, challenging reasoning problems. The problems were collected from many open-source datasets and went through semantic deduplication and careful cleaning to eliminate possible test-set contamination. All answers were distilled from reasoning models (mainly DeepSeek-R1) and passed a strict verification process: math problems are checked against reference answers, code problems are verified with test cases, and other tasks are judged with a reward model. AM-Distill-Qwen-32B, trained on this data with simple supervised fine-tuning (SFT) only, outperforms DeepSeek-R1-Distill-Qwen-32B on all four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. To foster the development of stronger reasoning-oriented large language models (LLMs), we have open-sourced these 1.4 million problems and their answers. The dataset has been released on Hugging Face.
155 |
156 |
157 |
158 | ## Citation
159 |
160 | If you find our work helpful to your research, please give us a star :star: and cite our work :pencil:
161 |
162 | ```BibTeX
163 | @misc{tian2025correctanswersequaldistillation,
164 | title={Not All Correct Answers Are Equal: Why Your Distillation Source Matters},
165 | author={Xiaoyu Tian and Yunjie Ji and Haotian Wang and Shuaiting Chen and Sitong Zhao and Yiping Peng and Han Zhao and Xiangang Li},
166 | year={2025},
167 | eprint={2505.14464},
168 | archivePrefix={arXiv},
169 | primaryClass={cs.CL},
170 | url={https://arxiv.org/abs/2505.14464},
171 | }
172 |
173 | @misc{ji2025amthinkingv1advancingfrontierreasoning,
174 | title={AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale},
175 | author={Yunjie Ji and Xiaoyu Tian and Sitong Zhao and Haotian Wang and Shuaiting Chen and Yiping Peng and Han Zhao and Xiangang Li},
176 | year={2025},
177 | eprint={2505.08311},
178 | archivePrefix={arXiv},
179 | primaryClass={cs.CL},
180 | url={https://arxiv.org/abs/2505.08311},
181 | }
182 |
183 | @misc{tian2025exploringpotentialofflinerl,
184 | title={Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study},
185 | author={Xiaoyu Tian and Sitong Zhao and Haotian Wang and Shuaiting Chen and Yiping Peng and Yunjie Ji and Han Zhao and Xiangang Li},
186 | year={2025},
187 | eprint={2505.02142},
188 | archivePrefix={arXiv},
189 | primaryClass={cs.CL},
190 | url={https://arxiv.org/abs/2505.02142},
191 | }
192 |
193 | @misc{tian2025deepdistillenhancingllmreasoning,
194 | title={DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training},
195 | author={Xiaoyu Tian and Sitong Zhao and Haotian Wang and Shuaiting Chen and Yiping Peng and Yunjie Ji and Han Zhao and Xiangang Li},
196 | year={2025},
197 | eprint={2504.17565},
198 | archivePrefix={arXiv},
199 | primaryClass={cs.CL},
200 | url={https://arxiv.org/abs/2504.17565},
201 | }
202 |
203 | @misc{wang2025leveragingreasoningmodelanswers,
204 | title={Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability},
205 | author={Haotian Wang and Han Zhao and Shuaiting Chen and Xiaoyu Tian and Sitong Zhao and Yunjie Ji and Yiping Peng and Xiangang Li},
206 | year={2025},
207 | eprint={2504.09639},
208 | archivePrefix={arXiv},
209 | primaryClass={cs.CL},
210 | url={https://arxiv.org/abs/2504.09639},
211 | }
212 |
213 | @misc{ji2025difficultyawarestagedreinforcementlearning,
214 | title={How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study},
215 | author={Yunjie Ji and Sitong Zhao and Xiaoyu Tian and Haotian Wang and Shuaiting Chen and Yiping Peng and Han Zhao and Xiangang Li},
216 | year={2025},
217 | eprint={2504.00829},
218 | archivePrefix={arXiv},
219 | primaryClass={cs.CL},
220 | url={https://arxiv.org/abs/2504.00829},
221 | }
222 |
223 | @misc{tian2025thinktwiceenhancingllm,
224 | title={Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking},
225 | author={Xiaoyu Tian and Sitong Zhao and Haotian Wang and Shuaiting Chen and Yunjie Ji and Yiping Peng and Han Zhao and Xiangang Li},
226 | year={2025},
227 | eprint={2503.19855},
228 | archivePrefix={arXiv},
229 | primaryClass={cs.CL},
230 | url={https://arxiv.org/abs/2503.19855},
231 | }
232 |
233 | @misc{zhao202514millionopensourcedistilled,
234 | title={1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training},
235 | author={Han Zhao and Haotian Wang and Yiping Peng and Sitong Zhao and Xiaoyu Tian and Shuaiting Chen and Yunjie Ji and Xiangang Li},
236 | year={2025},
237 | eprint={2503.19633},
238 | archivePrefix={arXiv},
239 | primaryClass={cs.CL},
240 | url={https://arxiv.org/abs/2503.19633},
241 | }
242 |
243 |
244 | ```
245 |
--------------------------------------------------------------------------------
/README_en.md:
--------------------------------------------------------------------------------
1 | # a-m-models [](https://huggingface.co/a-m-team)
2 |
3 | *Read this in [Chinese](README.md).*
4 |
5 | a-m-models is an open-source initiative led by the a-m-team, dedicated to in-depth exploration and practical application of cutting-edge technologies in Large Language Models (LLMs) and Artificial General Intelligence (AGI). Our team, composed of passionate researchers and developers, focuses on theoretical innovation, architectural design, and practical deployment of large models, aiming to gradually approach the realization of AGI. This project aims to openly share our latest research results and practical experiences in the domain of large models, fostering deeper community exchanges and mutual advancement in AGI technology.
6 |
7 | ## 🔄 Recent Updates
8 |
9 | * [2025-05-20] Published the technical report [Not All Correct Answers Are Equal: Why Your Distillation Source Matters](https://arxiv.org/abs/2505.14464), comparing the distillation effectiveness of three models: AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1. The results show that distillation based on AM-Thinking-v1 yields the best performance. The analysis also reveals that output length can be adjusted according to question difficulty. Distillation datasets for AM-Thinking-v1 and Qwen3-235B-A22B have been open-sourced.
10 |
11 | * [2025-05-14] Released technical report [AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale](https://arxiv.org/abs/2505.08311), which significantly improves reasoning capabilities by combining supervised fine-tuning with reinforcement learning. It surpasses DeepSeek-R1 in math and coding tasks and approaches the performance of mainstream MoE models, achieving state-of-the-art results among dense 32B open-source models.
12 |
13 | * [2025-05-05] Released technical report [Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study](https://arxiv.org/abs/2505.02142), investigating methods for enhancing model reasoning capabilities using Offline RL. Experimental results demonstrate consistent improvements across various evaluation metrics.
14 |
15 | * [2025-04-24] Released the technical report [DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training](https://arxiv.org/abs/2504.17565) and open-sourced a distilled dataset of approximately 40 million samples generated by models of varying capabilities, significantly enhancing foundational model reasoning performance.
16 |
17 | * [2025-04-13] Updated technical report [Leveraging Reasoning Model Answers to Enhance Non-Reasoning
18 | Model Capability](https://arxiv.org/abs/2504.09639), in which we explore and compare methodologies for utilizing the answers produced by reasoning models to train and improve non-reasoning models.
19 |
20 | * [2025-04-01] Updated technical report [How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study](https://arxiv.org/abs/2504.00829), introducing a staged training approach that gradually exposes models to more challenging tasks, improving their reasoning capabilities.
21 |
22 | * [2025-03-25] Updated technical report [1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training](https://arxiv.org/abs/2503.19633), open-sourced 1.4 million distilled reasoning data entries, reproducing the performance of DeepSeek-R1 distilled models.
23 |
24 | * [2025-03-25] Updated the technical report [Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking](https://arxiv.org/abs/2503.19855), introducing a simple yet effective test-time scaling approach—Multi-round Thinking—which further advances the state-of-the-art model performance
25 |
26 | ## 📑 Research Reports
27 |
28 | ### [Not All Correct Answers Are Equal: Why Your Distillation Source Matters](https://arxiv.org/abs/2505.14464) [](https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) [](https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled)
29 |
30 | Three sets of reasoning data were distilled from AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1. Experiments show that distillation based on AM-Thinking-v1 performs the best, with scores of **AIME2024: 84.3, AIME2025: 72.2, MATH500: 98.4, LiveCodeBench: 65.9**.
31 |
32 |
33 |
34 | The results indicate that models trained via distillation from AM-Thinking-v1 produce shorter reasoning outputs on simpler tasks (e.g., MATH500) and longer outputs on more difficult tasks (e.g., AIME2024 & 2025, LiveCodeBench), compared to those distilled from Qwen3-235B-A22B. The distillation datasets for both AM-Thinking-v1 and Qwen3-235B-A22B have been open-sourced.
35 |
36 | #### Table: Average generation length (tokens per sample) across reasoning benchmarks
37 |
38 | | Benchmark | AM-Thinking-v1 Distilled | Qwen3-235B-A22B Distilled | DeepSeek-R1 Distilled |
39 | |------------------|-------------------------------------|--------------------------------------|----------------------------------|
40 | | AIME2024 | 15,273.8 | 13,516.4 | 11,853.5 |
41 | | AIME2025 | 18,199.2 | 16,975.7 | 13,495.9 |
42 | | MATH500 | 3,495.7 | 6,429.4 | 3,613.0 |
43 | | LiveCodeBench | 23,426.9 | 13,576.7 | 30,731 |
44 |
45 | ### [AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale](https://arxiv.org/abs/2505.08311)[](https://huggingface.co/a-m-team/AM-Thinking-v1)
46 |
47 | Most open-source language models that excel in reasoning tasks adopt a Mixture-of-Experts (MoE) architecture, such as Qwen3-235B-A22B and Seed1.5-Thinking. While these models offer strong performance, their deployment and fine-tuning costs are high, making them less suitable for resource-constrained environments. In contrast, medium-sized dense models (e.g., 32B) offer a better balance between performance and practicality, though such efforts remain relatively scarce.
48 |
49 | Motivated by this, we developed **AM-Thinking-v1**, a model trained on publicly available data using a post-training pipeline that combines supervised fine-tuning and reinforcement learning to enhance reasoning and coding abilities.
50 |
51 |
52 |
53 | Experimental results show that AM-Thinking-v1 performs strongly across multiple benchmarks: **AIME 2024 score of 85.3, AIME 2025 score of 74.4, and LiveCodeBench score of 70.3** — surpassing DeepSeek-R1 and approaching the top-performing MoE models. This makes it the best dense 32B model currently available. The results demonstrate that with a carefully designed training pipeline, open-source dense models at the 32B scale can also achieve competitive performance on challenging reasoning tasks.
54 |
55 |
56 |
57 | ### [Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study](https://arxiv.org/abs/2505.02142)
58 |
59 | With continuous improvements in the performance of large language models (LLMs) on long-context reasoning tasks, current mainstream approaches primarily rely on online reinforcement learning (Online RL). However, these methods typically entail high computational costs and complexity. In contrast, offline reinforcement learning (Offline RL) methods offer potential advantages due to their simplicity and efficiency but have remained underexplored in the context of long-context reasoning.
60 |
61 | Addressing this research gap, our paper explores the effectiveness of Offline RL methods, particularly Direct Preference Optimization (DPO) and its length-desensitized variant LD-DPO, for enhancing the reasoning capabilities of LLMs. Through extensive experiments across multiple reasoning benchmarks, we demonstrate that these simpler Offline RL methods significantly improve model performance, achieving an average enhancement of **3.3%**, with a notable improvement of **10.1%** on the arena-hard benchmark.
62 |
63 | Additionally, our study analyzes the sensitivity of the DPO method to output length, emphasizing the necessity of maintaining semantic richness when extending reasoning text, as indiscriminate lengthening may negatively impact model performance.
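For reference, the core DPO objective that these experiments build on can be sketched in a few lines. This is the generic DPO formulation over summed response log-probabilities, not the paper's LD-DPO implementation; argument names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over batches of (chosen, rejected) response log-probs.
    LD-DPO additionally down-weights the length-driven part of these log-probs
    so that longer responses are not preferred merely for being longer."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```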
64 |
65 |
66 |
67 |
68 | ### [DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training](https://arxiv.org/abs/2504.17565)[](https://huggingface.co/datasets/a-m-team/AM-DeepSeek-Distilled-40M)
69 |
70 | Despite recent significant advances in large language models (LLMs) on complex reasoning tasks, the training process and data quality of foundational models remain poorly understood. To address this, we constructed a large-scale reasoning dataset containing approximately **3.34 million** unique questions and **40 million** distilled responses generated multiple times by models with varying capabilities. By introducing metrics such as Pass Rate and Coefficient of Variation, we accurately selected training data with the highest learning potential to enhance reasoning abilities. The dataset has been open-sourced on Hugging Face.
71 |
72 | On AIME 2024, our 72B model achieved a score of 79.2 **using only supervised fine-tuning (SFT)**. The 32B model reached 75.8 and improved further to 77.9 through annealing training, approaching state-of-the-art open-source performance.
73 |
74 |
75 |
76 | ### [Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability](https://arxiv.org/abs/2504.09639)
77 |
78 | Recent advancements in large language models (LLMs), such as DeepSeek-R1 and OpenAI-o1, have demonstrated the significant effectiveness of test-time scaling, achieving substantial performance gains across various benchmarks. These advanced models utilize deliberate "thinking" steps to systematically enhance answer quality. In this paper, we propose leveraging these high-quality outputs generated by reasoning-intensive models to improve less computationally demanding, non-reasoning models. We explore and compare methodologies for utilizing the answers produced by reasoning models to train and improve non-reasoning models. Through straightforward Supervised Fine-Tuning (SFT) experiments on established benchmarks, we demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.
79 |
80 | 1. **Approach**: To evaluate different strategies for leveraging reasoning models to generate informative responses, we explored three distinct methods (a data-construction sketch follows the conclusion below):
81 |     - **1. Original Response**: This method uses the raw response directly from the community dataset, serving as our baseline for comparison.
82 |     - **2. Direct Reasoning Model Output (Answer Component)**: This approach uses only the answer component generated directly by the reasoning model.
83 |     - **3. Think Summarization**: This method first summarizes the thinking component with a summarization model; the summary, which captures the essential problem-solving steps, is then prepended to the reasoning model's original answer component.
84 |
85 | 2. **Conclusion**: The results presented in this paper affirm that supervised fine-tuning (SFT) using response data derived from reasoning models can significantly enhance the performance of target language models.
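As a rough illustration of the three target-construction strategies, here is a minimal sketch. The `<think>` tag format, the `summarize` helper, and the field names are assumptions made for the example, not the released data pipeline.

```python
import re

def build_sft_target(example, method, summarize=None):
    """Build an SFT training target from a reasoning model's output.
    `example` holds 'original_response' (the community-dataset answer) and
    'reasoning_output', assumed to be a <think>...</think> block followed by the answer."""
    if method == "original":             # 1. baseline: keep the original response
        return example["original_response"]

    output = example["reasoning_output"]
    match = re.search(r"<think>(.*?)</think>(.*)", output, flags=re.S)
    think, answer = (match.group(1).strip(), match.group(2).strip()) if match else ("", output.strip())

    if method == "answer_only":          # 2. keep only the answer component
        return answer
    if method == "think_summarization":  # 3. prepend a summary of the thinking steps
        return f"{summarize(think)}\n\n{answer}"
    raise ValueError(f"unknown method: {method}")
```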
86 |
87 |
88 |
89 | ### [How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study](https://arxiv.org/abs/2504.00829)[](https://huggingface.co/datasets/a-m-team/AM-Math-Difficulty-RL)
90 |
91 |
92 | Improving the reasoning abilities of Large Language Models (LLMs) efficiently and at scale is a key challenge in AI research. This paper investigates how difficulty-aware staged reinforcement learning (RL) strategies can boost LLM performance. We show that selecting training data based on difficulty levels enhances RL optimization. Additionally, we propose a staged training approach that gradually exposes models to more challenging tasks, improving their reasoning capabilities. Besides, our results highlight significant benefits when training models on both mathematical reasoning and code generation tasks.
93 |
94 | #### 1. Data Difficulty Selection
95 |
96 | Carefully selecting RL training data based on appropriate difficulty metrics is critical. A moderate difficulty level enhances learning efficiency, balancing the need for adequate challenge against the risk of overwhelming the learning process with overly difficult scenarios.
97 |
98 |
99 |
100 | #### 2. Staged Training
101 |
102 | By selecting appropriately challenging data and incorporating staged training, we can significantly improve the performance of LLMs on reasoning tasks. (Due to the absence of code-related training data, its performance on LiveCodeBench is essentially the same as that of the base model.)
103 |
104 |
105 |
106 | #### 3. Simultaneous Training on Mathematics and Code
107 |
108 | Mixing mathematical reasoning and code generation tasks during training results in cross-domain improvements, providing strong evidence for the benefits of multi-domain training.
109 |
110 |
111 |
112 | ### [Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking](https://arxiv.org/abs/2503.19855)
113 |
114 | Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Despite this, current models are constrained by limitations in handling long texts and reinforcement learning (RL) training efficiency. To address these issues, we propose a simple yet effective test-time scaling approach: **Multi-round Thinking**. This method iteratively refines model reasoning by leveraging previous answers as prompts for subsequent rounds. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements on various benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. For instance, the accuracy of QwQ-32B improved from 80.3% (Round 1) to 82.1% (Round 2) on the AIME 2024 dataset, while DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. These results confirm that **Multi-round Thinking** is a broadly applicable, straightforward approach to achieving stable enhancements in model performance, underscoring its potential for future developments in test-time scaling techniques.
115 |
116 | The key prompt:
117 | ```
118 | Original question prompt
119 | The assistant’s previous answer is: last round answer, and please re-answer.
120 | ```
121 |
122 |
123 |
124 |
125 | ### Model Performance Comparison (pass@1 accuracy) Between Single-round (Round 1) and Multi-round Thinking (Round 2-4) Across Different Benchmarks
126 |
127 | | **Model** | **Round** | **AIME 2024 pass@1** | **MATH500 pass@1** | **GPQA-Diamond pass@1** | **LiveCodeBench pass@1** | **Average** |
128 | |----------------------------------------|-----------|----------------------|--------------------|-------------------------|--------------------------|-------------|
129 | | **DeepSeek-R1** | 1 | 79.7 | 97.6 | 74.0 | 65.3 | 79.2 |
130 | | | **2** | **82.0** | **97.6** | **74.8** | **67.1** | **80.4** |
131 | | **QwQ-32B** | 1 | 80.3 | 97.2 | 65.9 | 63.0 | 76.6 |
132 | | | 2 | 82.1 | 97.8 | 67.2 | 64.7 | 78.0 |
133 | | | 3 | 82.8 | 97.8 | 67.5 | 65.2 | 78.3 |
134 | | | **4** | **83.1** | **97.7** | **68.1** | **66.0** | **78.7** |
135 | | **DeepSeek-R1-Distill-Qwen-32B** | 1 | 72.0 | 96.0 | 60.1 | 57.0 | 71.3 |
136 | | | **2** | **75.1** | **96.3** | **61.3** | **57.6** | **72.6** |
137 | | **DeepSeek-R1-Distill-Qwen-7B** | 1 | 56.9 | 93.4 | 49.2 | 35.0 | 58.6 |
138 | | | **2** | **58.4** | **93.9** | **49.4** | **36.7** | **59.6** |
139 | | **AM-Distill-Qwen-32B** | 1 | 72.8 | 96.2 | 62.3 | 58.3 | 72.4 |
140 | | | **2** | **76.7** | **97.2** | **62.8** | **60.2** | **74.2** |
141 |
142 |
143 | ### [1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training](https://arxiv.org/abs/2503.19633) [](https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M)
144 |
145 | AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality and challenging reasoning problems. These problems are collected from a multitude of open-source datasets, subjected to semantic deduplication and meticulous cleaning to eliminate test set contamination. All responses within the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have undergone rigorous verification procedures: mathematical problems are validated by checking against reference answers, code problems are verified using test cases, and other tasks are evaluated with the aid of a reward model. The AM-Distill-Qwen-32B model, which was trained through only simple Supervised Fine-Tuning (SFT) using this batch of data, outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. We are releasing these 1.4 million problems and their corresponding responses to the research community with the objective of fostering the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset has been open-sourced on Hugging Face.
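A minimal sketch of the verification routing described above; the three checker callables are hypothetical stand-ins, since the actual math checker, execution sandbox, and reward model are not specified here.

```python
def verify_response(sample, check_math_answer, run_test_cases, reward_model, threshold=0.5):
    """Route a distilled response to the appropriate verifier by task type:
    math  -> compare the response against the reference answer,
    code  -> execute the provided test cases against the response,
    other -> score with a reward model and accept above a threshold."""
    task, response = sample["task_type"], sample["response"]
    if task == "math":
        return check_math_answer(response, sample["reference_answer"])
    if task == "code":
        return run_test_cases(response, sample["test_cases"])
    return reward_model(sample["prompt"], response) >= threshold
```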
146 |
147 |
148 |
149 |
150 | ## Citation
151 |
152 | If you find our work helpful to your research, please star our repository :star: and cite our work :pencil:
153 |
154 | ```BibTeX
155 | @misc{tian2025correctanswersequaldistillation,
156 | title={Not All Correct Answers Are Equal: Why Your Distillation Source Matters},
157 | author={Xiaoyu Tian and Yunjie Ji and Haotian Wang and Shuaiting Chen and Sitong Zhao and Yiping Peng and Han Zhao and Xiangang Li},
158 | year={2025},
159 | eprint={2505.14464},
160 | archivePrefix={arXiv},
161 | primaryClass={cs.CL},
162 | url={https://arxiv.org/abs/2505.14464},
163 | }
164 |
165 | @misc{ji2025amthinkingv1advancingfrontierreasoning,
166 | title={AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale},
167 | author={Yunjie Ji and Xiaoyu Tian and Sitong Zhao and Haotian Wang and Shuaiting Chen and Yiping Peng and Han Zhao and Xiangang Li},
168 | year={2025},
169 | eprint={2505.08311},
170 | archivePrefix={arXiv},
171 | primaryClass={cs.CL},
172 | url={https://arxiv.org/abs/2505.08311},
173 | }
174 |
175 | @misc{tian2025exploringpotentialofflinerl,
176 | title={Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study},
177 | author={Xiaoyu Tian and Sitong Zhao and Haotian Wang and Shuaiting Chen and Yiping Peng and Yunjie Ji and Han Zhao and Xiangang Li},
178 | year={2025},
179 | eprint={2505.02142},
180 | archivePrefix={arXiv},
181 | primaryClass={cs.CL},
182 | url={https://arxiv.org/abs/2505.02142},
183 | }
184 |
185 | @misc{tian2025deepdistillenhancingllmreasoning,
186 | title={DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training},
187 | author={Xiaoyu Tian and Sitong Zhao and Haotian Wang and Shuaiting Chen and Yiping Peng and Yunjie Ji and Han Zhao and Xiangang Li},
188 | year={2025},
189 | eprint={2504.17565},
190 | archivePrefix={arXiv},
191 | primaryClass={cs.CL},
192 | url={https://arxiv.org/abs/2504.17565},
193 | }
194 |
195 | @misc{wang2025leveragingreasoningmodelanswers,
196 | title={Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability},
197 | author={Haotian Wang and Han Zhao and Shuaiting Chen and Xiaoyu Tian and Sitong Zhao and Yunjie Ji and Yiping Peng and Xiangang Li},
198 | year={2025},
199 | eprint={2504.09639},
200 | archivePrefix={arXiv},
201 | primaryClass={cs.CL},
202 | url={https://arxiv.org/abs/2504.09639},
203 | }
204 |
205 | @misc{ji2025difficultyawarestagedreinforcementlearning,
206 | title={How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study},
207 | author={Yunjie Ji and Sitong Zhao and Xiaoyu Tian and Haotian Wang and Shuaiting Chen and Yiping Peng and Han Zhao and Xiangang Li},
208 | year={2025},
209 | eprint={2504.00829},
210 | archivePrefix={arXiv},
211 | primaryClass={cs.CL},
212 | url={https://arxiv.org/abs/2504.00829},
213 | }
214 |
215 | @misc{tian2025thinktwiceenhancingllm,
216 | title={Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking},
217 | author={Xiaoyu Tian and Sitong Zhao and Haotian Wang and Shuaiting Chen and Yunjie Ji and Yiping Peng and Han Zhao and Xiangang Li},
218 | year={2025},
219 | eprint={2503.19855},
220 | archivePrefix={arXiv},
221 | primaryClass={cs.CL},
222 | url={https://arxiv.org/abs/2503.19855},
223 | }
224 |
225 | @misc{zhao202514millionopensourcedistilled,
226 | title={1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training},
227 | author={Han Zhao and Haotian Wang and Yiping Peng and Sitong Zhao and Xiaoyu Tian and Shuaiting Chen and Yunjie Ji and Xiangang Li},
228 | year={2025},
229 | eprint={2503.19633},
230 | archivePrefix={arXiv},
231 | primaryClass={cs.CL},
232 | url={https://arxiv.org/abs/2503.19633},
233 | }
234 |
235 | ```
236 |
--------------------------------------------------------------------------------
/assets/AM-DeepSeek-R1-Distilled.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/AM-DeepSeek-R1-Distilled.jpeg
--------------------------------------------------------------------------------
/assets/DeepDistill.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/DeepDistill.png
--------------------------------------------------------------------------------
/assets/Exploring-the-Potential-of-Offline-RL-for-Reasoning-in-LLMs-A-Preliminary-Study.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/Exploring-the-Potential-of-Offline-RL-for-Reasoning-in-LLMs-A-Preliminary-Study.png
--------------------------------------------------------------------------------
/assets/Leveraging-Reasoning-Model-Answers-to-Enhance-Non-Reasoning-Model-Capability.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/Leveraging-Reasoning-Model-Answers-to-Enhance-Non-Reasoning-Model-Capability.png
--------------------------------------------------------------------------------
/assets/Not-All-Correct-Answers-Are-Equal-Why-Your-Distillation-Source-Matters.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/Not-All-Correct-Answers-Are-Equal-Why-Your-Distillation-Source-Matters.png
--------------------------------------------------------------------------------
/assets/Think-Twice-DeepSeek-R1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/Think-Twice-DeepSeek-R1.png
--------------------------------------------------------------------------------
/assets/Think-Twice-QwQ.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/Think-Twice-QwQ.png
--------------------------------------------------------------------------------
/assets/am-thinking-v1-benchmark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/am-thinking-v1-benchmark.png
--------------------------------------------------------------------------------
/assets/am-thinking-v1-results_with_params.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/am-thinking-v1-results_with_params.jpg
--------------------------------------------------------------------------------
/assets/am_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/am_logo.png
--------------------------------------------------------------------------------
/assets/staged-RL-2stage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/staged-RL-2stage.png
--------------------------------------------------------------------------------
/assets/staged-RL-data-difficulty.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/staged-RL-data-difficulty.png
--------------------------------------------------------------------------------
/assets/staged-RL-math-code.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/assets/staged-RL-math-code.png
--------------------------------------------------------------------------------
/docs/AM-DeepSeek-R1-Distilled-Dataset.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/docs/AM-DeepSeek-R1-Distilled-Dataset.pdf
--------------------------------------------------------------------------------
/docs/AM-Thinking-v1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/docs/AM-Thinking-v1.pdf
--------------------------------------------------------------------------------
/docs/DeepDistill.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/docs/DeepDistill.pdf
--------------------------------------------------------------------------------
/docs/Exploring-the-Potential-of-Offline-RL-for-Reasoning-in-LLMs-A-Preliminary-Study.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/docs/Exploring-the-Potential-of-Offline-RL-for-Reasoning-in-LLMs-A-Preliminary-Study.pdf
--------------------------------------------------------------------------------
/docs/How-Difficulty-Aware-Staged-Reinforcement-Learning-Enhances-LLMs-Reasoning-Capabilities-A-Preliminary-Experimental-Study.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/docs/How-Difficulty-Aware-Staged-Reinforcement-Learning-Enhances-LLMs-Reasoning-Capabilities-A-Preliminary-Experimental-Study.pdf
--------------------------------------------------------------------------------
/docs/Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/docs/Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability.pdf
--------------------------------------------------------------------------------
/docs/Not All Correct Answers Are Equal- Why Your Distillation Source Matters.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/docs/Not All Correct Answers Are Equal- Why Your Distillation Source Matters.pdf
--------------------------------------------------------------------------------
/docs/Think-Twice.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-m-team/a-m-models/3beecba4d14590df269442ae2a2cd44fe09fa657/docs/Think-Twice.pdf
--------------------------------------------------------------------------------