├── proposal.md
├── .gitignore
├── planning.md
├── README.md
├── datasets.md
├── papers.md
├── books.md
├── blogs.md
├── questions.md
└── implementations.md

/proposal.md:
--------------------------------------------------------------------------------
# Proposal
--------------------------------------------------------------------------------

/.gitignore:
--------------------------------------------------------------------------------
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
--------------------------------------------------------------------------------

/planning.md:
--------------------------------------------------------------------------------
# TODOs

## nano-R1/resources Repo
- Continue adding links to existing resources
- Flesh out list of open questions

## Crowdsourcing Research
- Decide on scope and format for v1 "leaderboard"
- Standard set of benchmarks
- Reference implementations with core training libraries
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# nano-R1 resources

## Books, blogs, and papers

See:
- [books.md](books.md) (books + tutorials)
- [blogs.md](blogs.md)
- [papers.md](papers.md)

## Datasets and benchmarks

See [datasets.md](datasets.md)

## Implementations

See [implementations.md](implementations.md)

## Open questions

See [questions.md](questions.md)

## Planning

See [planning.md](planning.md)

## Proposal

See [proposal.md](proposal.md)
--------------------------------------------------------------------------------

/datasets.md:
--------------------------------------------------------------------------------
# Reasoning Datasets and Benchmarks

## Canonical Benchmarks

### Math

### Coding

### Logic and Reasoning

**ARC-AGI**
- [GitHub repo](https://github.com/fchollet/ARC-AGI)
- [arcprize.org](https://arcprize.org/arc)

### Knowledge

### Agentic Tasks

### Instruction-Following

## Reasoning Model Traces (R1, QwQ, etc.)

## Programmatically-Generated

- Tic Tac Toe
- reasoning-gym
- text-arena
- Sudoku
- temporal-clue
--------------------------------------------------------------------------------

/papers.md:
--------------------------------------------------------------------------------
# Papers

**WIP**

Lots here: [awesome-RLHF](https://github.com/opendilab/awesome-RLHF)

## GRPO

- DeepSeekMath
- DeepSeek-V2
- DeepSeek-V3
- DeepSeek-R1
- SWE-RL
- AlphaMaze
- [R1-Searcher](https://arxiv.org/pdf/2503.05592)
- What is the Alignment Objective of GRPO?
- GRPO for Image Captioning
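
For quick reference while reading the papers above, this is a paraphrase of the GRPO objective as introduced in DeepSeekMath (notation lightly adapted; see the paper for the exact statement and the low-variance per-token KL estimator it uses). For each prompt $q$, a group of $G$ completions $\{o_i\}$ is sampled from the old policy, each completion gets an outcome reward $r_i$, and

$$
\hat{A}_{i,t} \;=\; \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})},
\qquad
\rho_{i,t} \;=\; \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}
$$

$$
\mathcal{J}_{\text{GRPO}}(\theta) \;=\;
\mathbb{E}\!\left[
\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\Big(
\min\!\big(\rho_{i,t}\,\hat{A}_{i,t},\;
\operatorname{clip}(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_{i,t}\big)
\;-\;\beta\,\mathbb{D}_{\text{KL}}\!\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]
\Big)
\right]
$$

With outcome-only rewards, every token of completion $o_i$ shares the same group-normalized advantage; the process-supervision variant replaces $r_i$ with per-step rewards.
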
## PPO implementation details (tabula rasa)

- [37 Implementation Details of PPO](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/)

- [Implementation Matters in PPO and TRPO](https://arxiv.org/abs/2005.12729)

- Tulu 3
--------------------------------------------------------------------------------

/books.md:
--------------------------------------------------------------------------------
# Books + Tutorials

## LLM Reinforcement Learning

Alignment Handbook - Hugging Face
- Repo: [huggingface/alignment-handbook](https://github.com/huggingface/alignment-handbook)

RLHF Book - Nathan Lambert
- Page: [rlhfbook.com](https://rlhfbook.com/)
- PDF: [A Little Bit of RL from Human Feedback](https://rlhfbook.com/book.pdf)
- Repo: [natolambert/rlhf-book](https://github.com/natolambert/rlhf-book)

## LLMs

Physics of Language Models - Zeyuan Allen-Zhu
- Page: [physics.allen-zhu.com](https://physics.allen-zhu.com/)

Generative AI Handbook
- Page: [genai-handbook.github.io](https://genai-handbook.github.io/)

## RL

Reinforcement Learning - Sutton + Barto
- PDF: [RL](http://incompleteideas.net/book/RLbook2020.pdf)
- YouTube series: [RL By The Book - Mutual Information](https://www.youtube.com/watch?v=NFo9v_yKQXA&list=PLzvYlJMoZ02Dxtwe-MmH4nOB5jYlMGBjr)

Spinning Up in Deep RL - OpenAI
- Page: [spinningup.openai.com](https://spinningup.openai.com/en/latest/)
--------------------------------------------------------------------------------

/blogs.md:
--------------------------------------------------------------------------------
# Blogs

## GRPO Applications

Teaching Language Models to Solve Sudoku Through Reinforcement Learning
- Author: [Hrishbh Dalal](https://x.com/HrishbhDalal)
- Date: 3/10/2025
- Post: [hrishbh.com](https://hrishbh.com/teaching-language-models-to-solve-sudoku-through-reinforcement-learning/)

GRPO Judge Experiments: Findings & Empirical Observations
- Author: [kalomaze](https://x.com/kalomaze)
- Date: 3/3/2025
- Post: [kalomaze.bearblog.dev](https://kalomaze.bearblog.dev/grpo-judge-experiments-findings-and-empirical-observations/)

Train your own R1 reasoning model
- Authors: Daniel + Michael (Unsloth)
- Date: 2/6/2025
- Post: [unsloth.ai](https://unsloth.ai/blog/r1-reasoning)

## GRPO Intuitions

Why does GRPO work?
- Author: [kalomaze](https://x.com/kalomaze)
- Date: 2/27/2025
- Post: [kalomaze.bearblog.dev](https://kalomaze.bearblog.dev/why-does-grpo-work/)

## PPO vs. DPO

RLHF roundup: Getting good at PPO, sketching RLHF’s impact, RewardBench retrospective, and a reward model competition
- Author: [Nathan Lambert](https://x.com/natolambert)
- Date: 6/26/2024
- Post: [interconnects.ai](https://www.interconnects.ai/p/rlhf-roundup-2024)
--------------------------------------------------------------------------------

/questions.md:
--------------------------------------------------------------------------------
# Open Questions

Questions to resolve in pursuit of an optimal recipe for reasoning-oriented RL with LLMs

## Physics of LLM RL

- At what **base** model scale (in parameters) does RL for reasoning become minimally viable (e.g. >50% on GSM8K)?
- At what **instruct** model scale (in parameters) does RL for reasoning become minimally viable (e.g. >50% on GSM8K)?
- At what model scale does **outcome-only** RL for reasoning become more computationally efficient than process-supervised RL?
- Does the presence of more LLM-generated chain-of-thought traces on the web from 2022-2024 explain (in part) the emergence of increased CoT during verifiable RL (as in R1)?
- Under what conditions does RL become "better" than distillation from reasoning models like R1?
- Do RL-trained reasoning models generalize OOD better than models distilled from reasoning traces?
- What factors influence the OOD generalization capabilities of RL-trained reasoning models?

## GRPO Implementation Details

- How do off-policy steps affect training stability?
  - Note: if training is stable one step off-policy for all rollouts, then rollout inference + model updates can be fully overlapped in theory
- What are the scaling laws for group size + batch size? (see the sketch below for where group size enters the advantage computation)
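
To make the group-size and off-policy questions above concrete, here is a minimal PyTorch sketch of the two pieces involved. It is illustrative only, not taken from any implementation listed in this repo, and the function names (`group_advantages`, `clipped_objective`) are made up for this example.

```python
import torch


def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages: normalize each rollout's reward by the
    mean/std of its group (one group = G completions of the same prompt).

    rewards: shape (num_prompts, group_size)
    returns: shape (num_prompts, group_size)
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


def clipped_objective(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate on per-token log-probs.

    logp_new / logp_old: log-probs under the current policy and the policy
    that generated the rollouts. Fully on-policy, the ratio is 1 and clipping
    is inactive; with stale (off-policy) rollouts the ratio drifts, which is
    where the stability question above comes in.
    advantages: per-token advantages (the sequence-level group advantage
    broadcast to every token of that completion).
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # loss = negative objective


# Toy example: 2 prompts, group size 4, binary outcome rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_advantages(rewards))
```

Note that a larger group size mainly buys a lower-variance baseline (the group mean), at the cost of more rollouts per prompt; how that trades off against batch size is exactly the open scaling question above.
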
## PPO Implementation Details

## PPO vs. GRPO vs. RLOO vs. PRIME vs. KTO vs. DPO vs. ...

## Multi-Turn + Agentic RL

## Multi-Agent RL
--------------------------------------------------------------------------------

/implementations.md:
--------------------------------------------------------------------------------
# RL Implementations

## LLM RL Repos

**TRL-based:**
- [TRL](https://github.com/huggingface/trl/tree/main/trl)
  - Algorithms: PPO, GRPO, KTO, RLOO, (O)DPO, XPO, others
  - Backends: DDP, accelerate (FSDP, DeepSpeed ZeRO 1/2/3)
  - Inference: transformers, vLLM
- [axolotl](https://github.com/axolotl-ai-cloud/axolotl)
  - Algorithms: same as TRL
  - Backends: accelerate, FSDP + QLoRA, Ray
- [Unsloth](https://github.com/unslothai/unsloth)
  - Algorithms: same as TRL
  - Backends: custom (single-GPU PEFT, memory-optimized)
- [rStar](https://github.com/microsoft/rStar)
  - Algorithms: rStar
  - Inference: vLLM
- [LlamaGym](https://github.com/KhoomeiK/LlamaGym)
  - Algorithms: PPO (multi-turn)
  - Inference: transformers
- [verifiers](https://github.com/willccbb/verifiers)
  - Algorithms: GRPO (multi-turn)
  - Inference: vLLM
- [groundlight/r1-vlm](https://github.com/groundlight/r1_vlm)
  - Algorithms: GRPO (multimodal)
  - Inference: vLLM
- [VLM-R1](https://github.com/om-ai-lab/VLM-R1)
  - Algorithms: GRPO (multimodal)
  - Inference: transformers, vLLM

**veRL-based:**
- [veRL](https://github.com/volcengine/verl)
  - Algorithms: PPO, GRPO, PRIME
  - Backends: FSDP, Megatron-LM
  - Inference: transformers, vLLM, SGLang ([open PR](https://github.com/volcengine/verl/pull/490))
- [RAGEN](https://github.com/ZihanWang314/RAGEN)
  - Algorithms: PPO + RICO (multi-turn)
- [PRIME](https://github.com/PRIME-RL/PRIME)
  - Algorithms: PRIME

**OpenRLHF-based:**
- [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)
  - Algorithms: PPO, GRPO, DPO, KTO, RLOO, REINFORCE++
  - Backends: Ray
  - Inference: vLLM
- [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero)
  - Algorithms: PPO
- [R1-Searcher](https://github.com/RUCAIBox/R1-Searcher)
  - Algorithms: REINFORCE++

**torchtune-based:**
- [torchtune](https://github.com/pytorch/torchtune)
  - Algorithms: DPO, PPO, GRPO
- [OpenPipe/deductive-reasoning](https://github.com/OpenPipe/deductive-reasoning)
  - Algorithms: GRPO (KL-free)

**torch (standalone):**
- [oat](https://github.com/sail-sg/oat/tree/main)
  - Algorithms: PPO, (O)DPO, XPO
  - Backends: accelerate (DeepSpeed)
  - Inference: vLLM + Mosec
- [allenai/open-instruct](https://github.com/allenai/open-instruct)
  - Algorithms: PPO, DPO, GRPO
  - Backends: accelerate
  - Inference: vLLM
- [VinePPO/treetune](https://github.com/McGill-NLP/VinePPO)
  - Algorithms: PPO, DPO, VinePPO, RestEM
- [Lamorel](https://github.com/flowersteam/lamorel/tree/main)
  - Algorithms: PPO
  - Backends: accelerate
  - Inference: transformers
- [ReMax](https://github.com/liziniu/ReMax)
  - Algorithms: ReMax

**MLX:**
- [mlx-lm](https://github.com/ml-explore/mlx-examples/blob/main/llms/README.md)
  - Algorithms: DPO ([open PR](https://github.com/ml-explore/mlx-examples/pull/1279))

**Jax/Flax:**
- [EasyDeL](https://github.com/erfanzar/EasyDeL)
  - Algorithms: GRPO, DPO, ORPO
  - Backends: Triton, XLA, Pallas
  - Inference: vInference
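
For orientation, here is a rough sketch of what a minimal GRPO run looks like with the TRL-based stack listed above. This is not an official example: argument names and defaults can shift between TRL versions, and the dataset, reward function, and model checkpoint below are toy placeholders.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset: GRPOTrainer expects a "prompt" column.
train_dataset = Dataset.from_dict(
    {"prompt": ["What is 13 * 7? Show your reasoning, then give the answer."]}
)

# Placeholder verifiable reward: 1.0 if the completion contains the right answer.
def exact_answer_reward(completions, **kwargs):
    return [1.0 if "91" in completion else 0.0 for completion in completions]

training_args = GRPOConfig(
    output_dir="grpo-demo",
    num_generations=8,            # group size G: completions sampled per prompt
    max_completion_length=256,
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any causal LM checkpoint works here
    reward_funcs=exact_answer_reward,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```

The main knobs are broadly the same across the GRPO implementations above, just under different names: group size, clip range, KL coefficient against the reference model, and the reward function(s).
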
## Deep RL Repos (non-LLM)

**Jax/Flax:**
- [JAX-PPO](https://github.com/zombie-einstein/JAX-PPO)

**MLX:**
- [clean-rl-mlx](https://github.com/andrew-silva/clean-rl-mlx)
--------------------------------------------------------------------------------