├── proposal.md
├── .gitignore
├── planning.md
├── README.md
├── datasets.md
├── papers.md
├── books.md
├── blogs.md
├── questions.md
└── implementations.md

/proposal.md:
--------------------------------------------------------------------------------
# Proposal
--------------------------------------------------------------------------------

/.gitignore:
--------------------------------------------------------------------------------
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
--------------------------------------------------------------------------------

/planning.md:
--------------------------------------------------------------------------------
# TODOs

## nano-R1/resources Repo
- Continue adding links to existing resources
- Flesh out list of open questions

## Crowdsourcing Research
- Decide on scope and format for v1 "leaderboard"
- Standard set of benchmarks
- Reference implementations with core training libraries
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# nano-R1 resources

## Books, blogs, and papers

See:
- [books.md](books.md) (books + tutorials)
- [blogs.md](blogs.md)
- [papers.md](papers.md)

## Datasets and benchmarks

See [datasets.md](datasets.md)

## Implementations

See [implementations.md](implementations.md)

## Open questions

See [questions.md](questions.md)

## Planning

See [planning.md](planning.md)

## Proposal

See [proposal.md](proposal.md)
--------------------------------------------------------------------------------

/datasets.md:
--------------------------------------------------------------------------------
# Reasoning Datasets and Benchmarks

## Canonical Benchmarks

### Math

### Coding

### Logic and Reasoning

**ARC-AGI**
- [GitHub repo](https://github.com/fchollet/ARC-AGI)
- [arcprize.org](https://arcprize.org/arc)

### Knowledge

### Agentic Tasks

### Instruction-Following

## Reasoning Model Traces (R1, QwQ, etc.)

## Programmatically-Generated

- Tic Tac Toe
- reasoning-gym
- text-arena
- Sudoku
- temporal-clue
--------------------------------------------------------------------------------

/papers.md:
--------------------------------------------------------------------------------
# Papers

**WIP**

Lots here: [awesome-RLHF](https://github.com/opendilab/awesome-RLHF)

## GRPO

- DeepSeekMath
- DeepSeek-V2
- DeepSeek-V3
- DeepSeek-R1
- SWE-RL
- AlphaMaze
- [R1-Searcher](https://arxiv.org/pdf/2503.05592)
- What is the Alignment Objective of GRPO?
- GRPO for Image Captioning
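
For quick reference while reading the papers above, this is a paraphrase of the GRPO objective as introduced in DeepSeekMath (notation lightly adapted; see the paper for the exact statement and the low-variance per-token KL estimator it uses). For each prompt $q$, a group of $G$ completions $\{o_i\}$ is sampled from the old policy, each completion gets an outcome reward $r_i$, and

$$
\hat{A}_{i,t} \;=\; \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})},
\qquad
\rho_{i,t} \;=\; \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}
$$

$$
\mathcal{J}_{\text{GRPO}}(\theta) \;=\;
\mathbb{E}\!\left[
\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\Big(
\min\!\big(\rho_{i,t}\,\hat{A}_{i,t},\;
\operatorname{clip}(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_{i,t}\big)
\;-\;\beta\,\mathbb{D}_{\text{KL}}\!\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]
\Big)
\right]
$$

With outcome-only rewards, every token of completion $o_i$ shares the same group-normalized advantage; the process-supervision variant replaces $r_i$ with per-step rewards.
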
## PPO implementation details (tabula rasa)

- [37 Implementation Details of PPO](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/)

- [Implementation Matters in PPO and TRPO](https://arxiv.org/abs/2005.12729)

- Tulu 3
--------------------------------------------------------------------------------

/books.md:
--------------------------------------------------------------------------------
# Books + Tutorials

## LLM Reinforcement Learning

Alignment Handbook - Hugging Face
- Repo: [huggingface/alignment-handbook](https://github.com/huggingface/alignment-handbook)

RLHF Book - Nathan Lambert
- Page: [rlhfbook.com](https://rlhfbook.com/)
- PDF: [A Little Bit of RL from Human Feedback](https://rlhfbook.com/book.pdf)
- Repo: [natolambert/rlhf-book](https://github.com/natolambert/rlhf-book)

## LLMs

Physics of Language Models - Zeyuan Allen-Zhu
- Page: [physics.allen-zhu.com](https://physics.allen-zhu.com/)

Generative AI Handbook
- Page: [genai-handbook.github.io](https://genai-handbook.github.io/)

## RL

Reinforcement Learning - Sutton + Barto
- PDF: [RL](http://incompleteideas.net/book/RLbook2020.pdf)
- YouTube series: [RL By The Book - Mutual Information](https://www.youtube.com/watch?v=NFo9v_yKQXA&list=PLzvYlJMoZ02Dxtwe-MmH4nOB5jYlMGBjr)

Spinning Up in Deep RL - OpenAI
- Page: [spinningup.openai.com](https://spinningup.openai.com/en/latest/)
--------------------------------------------------------------------------------

/blogs.md:
--------------------------------------------------------------------------------
# Blogs

## GRPO Applications

Teaching Language Models to Solve Sudoku Through Reinforcement Learning
- Author: [Hrishbh Dalal](https://x.com/HrishbhDalal)
- Date: 3/10/2025
- Post: [hrishbh.com](https://hrishbh.com/teaching-language-models-to-solve-sudoku-through-reinforcement-learning/)

GRPO Judge Experiments: Findings & Empirical Observations
- Author: [kalomaze](https://x.com/kalomaze)
- Date: 3/3/2025
- Post: [kalomaze.bearblog.dev](https://kalomaze.bearblog.dev/grpo-judge-experiments-findings-and-empirical-observations/)

Train your own R1 reasoning model
- Authors: Daniel + Michael (Unsloth)
- Date: 2/6/2025
- Post: [unsloth.ai](https://unsloth.ai/blog/r1-reasoning)

## GRPO Intuitions

Why does GRPO work?
- Author: [kalomaze](https://x.com/kalomaze)
- Date: 2/27/2025
- Post: [kalomaze.bearblog.dev](https://kalomaze.bearblog.dev/why-does-grpo-work/)

## PPO vs. DPO

RLHF roundup: Getting good at PPO, sketching RLHF’s impact, RewardBench retrospective, and a reward model competition
- Author: [Nathan Lambert](https://x.com/natolambert)
- Date: 6/26/2024
- Post: [interconnects.ai](https://www.interconnects.ai/p/rlhf-roundup-2024)
--------------------------------------------------------------------------------

/questions.md:
--------------------------------------------------------------------------------
# Open Questions

Questions to resolve in pursuit of an optimal recipe for reasoning-oriented RL with LLMs

## Physics of LLM RL

- At what **base** model scale (in parameters) does RL for reasoning become minimally viable (e.g. >50% on GSM8K)?
- At what **instruct** model scale (in parameters) does RL for reasoning become minimally viable (e.g. >50% on GSM8K)?
- At what model scale does **outcome-only** RL for reasoning become more computationally efficient than process-supervised RL?
- Does the presence of more LLM-generated chain-of-thought traces on the web from 2022-2024 explain (in part) the emergence of increased CoT during verifiable RL (as in R1)?
- Under what conditions does RL become "better" than distillation from reasoning models like R1?
- Do RL-trained reasoning models generalize OOD better than models distilled from reasoning traces?
- What factors influence the OOD generalization capabilities of RL-trained reasoning models?

## GRPO Implementation Details

- How do off-policy steps affect training stability?
  - Note: if training is stable one step off-policy for all rollouts, then rollout inference + model updates can be fully overlapped in theory
- What are the scaling laws for group size + batch size? (see the sketch below for where group size enters the advantage computation)
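
To make the group-size and off-policy questions above concrete, here is a minimal PyTorch sketch of the two pieces involved. It is illustrative only, not taken from any implementation listed in this repo, and the function names (`group_advantages`, `clipped_objective`) are made up for this example.

```python
import torch


def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages: normalize each rollout's reward by the
    mean/std of its group (one group = G completions of the same prompt).

    rewards: shape (num_prompts, group_size)
    returns: shape (num_prompts, group_size)
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


def clipped_objective(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate on per-token log-probs.

    logp_new / logp_old: log-probs under the current policy and the policy
    that generated the rollouts. Fully on-policy, the ratio is 1 and clipping
    is inactive; with stale (off-policy) rollouts the ratio drifts, which is
    where the stability question above comes in.
    advantages: per-token advantages (the sequence-level group advantage
    broadcast to every token of that completion).
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # loss = negative objective


# Toy example: 2 prompts, group size 4, binary outcome rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_advantages(rewards))
```

Note that a larger group size mainly buys a lower-variance baseline (the group mean), at the cost of more rollouts per prompt; how that trades off against batch size is exactly the open scaling question above.
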
## PPO Implementation Details

## PPO vs. GRPO vs. RLOO vs. PRIME vs. KTO vs. DPO vs. ...

## Multi-Turn + Agentic RL

## Multi-Agent RL
--------------------------------------------------------------------------------

/implementations.md:
--------------------------------------------------------------------------------
# RL Implementations

## LLM RL Repos

**TRL-based:**
- [TRL](https://github.com/huggingface/trl/tree/main/trl)
  - Algorithms: PPO, GRPO, KTO, RLOO, (O)DPO, XPO, others
  - Backends: DDP, accelerate (FSDP, DeepSpeed ZeRO 1/2/3)
  - Inference: transformers, vLLM
- [axolotl](https://github.com/axolotl-ai-cloud/axolotl)
  - Algorithms: same as TRL
  - Backends: accelerate, FSDP + QLoRA, Ray
- [Unsloth](https://github.com/unslothai/unsloth)
  - Algorithms: same as TRL
  - Backends: custom (single-GPU PEFT, memory-optimized)
- [rStar](https://github.com/microsoft/rStar)
  - Algorithms: rStar
  - Inference: vLLM
- [LlamaGym](https://github.com/KhoomeiK/LlamaGym)
  - Algorithms: PPO (multi-turn)
  - Inference: transformers
- [verifiers](https://github.com/willccbb/verifiers)
  - Algorithms: GRPO (multi-turn)
  - Inference: vLLM
- [groundlight/r1-vlm](https://github.com/groundlight/r1_vlm)
  - Algorithms: GRPO (multimodal)
  - Inference: vLLM
- [VLM-R1](https://github.com/om-ai-lab/VLM-R1)
  - Algorithms: GRPO (multimodal)
  - Inference: transformers, vLLM

**veRL-based:**
- [veRL](https://github.com/volcengine/verl)
  - Algorithms: PPO, GRPO, PRIME
  - Backends: FSDP, Megatron-LM
  - Inference: transformers, vLLM, SGLang ([open PR](https://github.com/volcengine/verl/pull/490))
- [RAGEN](https://github.com/ZihanWang314/RAGEN)
  - Algorithms: PPO + RICO (multi-turn)
- [PRIME](https://github.com/PRIME-RL/PRIME)
  - Algorithms: PRIME

**OpenRLHF-based:**
- [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)
  - Algorithms: PPO, GRPO, DPO, KTO, RLOO, REINFORCE++
  - Backends: Ray
  - Inference: vLLM
- [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero)
  - Algorithms: PPO
- [R1-Searcher](https://github.com/RUCAIBox/R1-Searcher)
  - Algorithms: REINFORCE++

**torchtune-based:**
- [torchtune](https://github.com/pytorch/torchtune)
  - Algorithms: DPO, PPO, GRPO
- [OpenPipe/deductive-reasoning](https://github.com/OpenPipe/deductive-reasoning)
  - Algorithms: GRPO (KL-free)

**torch (standalone):**
- [oat](https://github.com/sail-sg/oat/tree/main)
  - Algorithms: PPO, (O)DPO, XPO
  - Backends: accelerate (DeepSpeed)
  - Inference: vLLM + Mosec
- [allenai/open-instruct](https://github.com/allenai/open-instruct)
  - Algorithms: PPO, DPO, GRPO
  - Backends: accelerate
  - Inference: vLLM
- [VinePPO/treetune](https://github.com/McGill-NLP/VinePPO)
  - Algorithms: PPO, DPO, VinePPO, RestEM
- [Lamorel](https://github.com/flowersteam/lamorel/tree/main)
  - Algorithms: PPO
  - Backends: accelerate
  - Inference: transformers
- [ReMax](https://github.com/liziniu/ReMax)
  - Algorithms: ReMax

**MLX:**
- [mlx-lm](https://github.com/ml-explore/mlx-examples/blob/main/llms/README.md)
  - Algorithms: DPO ([open PR](https://github.com/ml-explore/mlx-examples/pull/1279))

**Jax/Flax:**
- [EasyDeL](https://github.com/erfanzar/EasyDeL)
  - Algorithms: GRPO, DPO, ORPO
  - Backends: Triton, XLA, Pallas
  - Inference: vInference
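
For orientation, here is a rough sketch of what a minimal GRPO run looks like with the TRL-based stack listed above. This is not an official example: argument names and defaults can shift between TRL versions, and the dataset, reward function, and model checkpoint below are toy placeholders.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset: GRPOTrainer expects a "prompt" column.
train_dataset = Dataset.from_dict(
    {"prompt": ["What is 13 * 7? Show your reasoning, then give the answer."]}
)

# Placeholder verifiable reward: 1.0 if the completion contains the right answer.
def exact_answer_reward(completions, **kwargs):
    return [1.0 if "91" in completion else 0.0 for completion in completions]

training_args = GRPOConfig(
    output_dir="grpo-demo",
    num_generations=8,            # group size G: completions sampled per prompt
    max_completion_length=256,
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any causal LM checkpoint works here
    reward_funcs=exact_answer_reward,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```

The main knobs are broadly the same across the GRPO implementations above, just under different names: group size, clip range, KL coefficient against the reference model, and the reward function(s).
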
## Deep RL Repos (non-LLM)

**Jax/Flax:**
- [JAX-PPO](https://github.com/zombie-einstein/JAX-PPO)

**MLX:**
- [clean-rl-mlx](https://github.com/andrew-silva/clean-rl-mlx)
--------------------------------------------------------------------------------