1 | # Elicit Machine Learning Reading List
2 |
3 | ## Purpose
4 |
5 | The purpose of this curriculum is to help new [Elicit](https://elicit.com/) employees learn the machine learning background they’ll need, with a focus on language models. I’ve tried to strike a balance between papers that are relevant for deploying ML in production and techniques that matter for longer-term scalability.
6 |
7 | If you don’t work at Elicit yet, we’re [hiring ML and software engineers](https://elicit.com/careers).
8 |
9 | ## How to read
10 |
11 | Recommended reading order:
12 |
13 | 1. Read “Tier 1” for all topics
14 | 2. Read “Tier 2” for all topics
15 | 3. Continue with “Tier 3” and “Tier 4+” for all topics
16 |
17 | Items marked ✨ were added after 2024/4/1.
18 |
19 | ## Table of contents
20 |
21 | - [Fundamentals](#fundamentals)
22 | * [Introduction to machine learning](#introduction-to-machine-learning)
23 | * [Transformers](#transformers)
24 | * [Key foundation model architectures](#key-foundation-model-architectures)
25 | * [Training and finetuning](#training-and-finetuning)
26 | - [Reasoning and runtime strategies](#reasoning-and-runtime-strategies)
27 | * [In-context reasoning](#in-context-reasoning)
28 | * [Task decomposition](#task-decomposition)
29 | * [Debate](#debate)
30 | * [Tool use and scaffolding](#tool-use-and-scaffolding)
31 | * [Honesty, factuality, and epistemics](#honesty-factuality-and-epistemics)
32 | - [Applications](#applications)
33 | * [Science](#science)
34 | * [Forecasting](#forecasting)
35 | * [Search and ranking](#search-and-ranking)
36 | - [ML in practice](#ml-in-practice)
37 | * [Production deployment](#production-deployment)
38 | * [Benchmarks](#benchmarks)
39 | * [Datasets](#datasets)
40 | - [Advanced topics](#advanced-topics)
41 | * [World models and causality](#world-models-and-causality)
42 | * [Planning](#planning)
43 | * [Uncertainty, calibration, and active learning](#uncertainty-calibration-and-active-learning)
44 | * [Interpretability and model editing](#interpretability-and-model-editing)
45 | * [Reinforcement learning](#reinforcement-learning)
46 | - [The big picture](#the-big-picture)
47 | * [AI scaling](#ai-scaling)
48 | * [AI safety](#ai-safety)
49 | * [Economic and social impacts](#economic-and-social-impacts)
50 | * [Philosophy](#philosophy)
51 | - [Maintainer](#maintainer)
52 |
53 | ## Fundamentals
54 |
55 | ### Introduction to machine learning
56 |
57 | **Tier 1**
58 |
59 | - [A short introduction to machine learning](https://www.alignmentforum.org/posts/qE73pqxAZmeACsAdF/a-short-introduction-to-machine-learning)
60 | - [But what is a neural network?](https://www.youtube.com/watch?v=aircAruvnKk&t=0s)
61 | - [Gradient descent, how neural networks learn](https://www.youtube.com/watch?v=IHZwWFHWa-w)
62 |
63 | **Tier 2**
64 |
65 | - ✨ [An intuitive understanding of backpropagation](https://cs231n.github.io/optimization-2/)
66 | - [What is backpropagation really doing?](https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4)
67 | - [An introduction to deep reinforcement learning](https://thomassimonini.medium.com/an-introduction-to-deep-reinforcement-learning-17a565999c0c)
68 |
69 | **Tier 3**
70 |
71 | - [The spelled-out intro to neural networks and backpropagation: building micrograd](https://www.youtube.com/watch?v=VMj-3S1tku0)
72 | - [Backpropagation calculus](https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=5)
73 |
74 | ### Transformers
75 |
76 | **Tier 1**
77 |
78 | - ✨ [But what is a GPT? Visual intro to transformers](https://www.youtube.com/watch?v=wjZofJX0v4M)
79 | - ✨ [Attention in transformers, visually explained](https://www.youtube.com/watch?v=eMlx5fFNoYc)
80 | - ✨ [Attention? Attention!](https://lilianweng.github.io/posts/2018-06-24-attention/)
81 | - [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
82 | - [The Illustrated GPT-2 (Visualizing Transformer Language Models)](https://jalammar.github.io/illustrated-gpt2/)
83 |
84 | **Tier 2**
85 |
86 | - ✨ [Let's build the GPT Tokenizer](https://www.youtube.com/watch?v=zduSFxRajkE)
87 | - ✨ [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473)
88 | - [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html)
89 | - [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
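
If it helps to see the core mechanism behind these papers in code, here is a minimal sketch of the scaled dot-product attention from “Attention Is All You Need” (single head, no masking, no learned projections; an illustration rather than a production implementation):

```python
# Minimal single-head scaled dot-product attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # query-key similarities
    weights = softmax(scores, axis=-1)              # each query attends over all keys
    return weights @ V                              # weighted sum of values

# Toy example: 4 tokens with 8-dimensional representations
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```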
90 |
91 | **Tier 3**
92 |
93 | - [A Practical Survey on Faster and Lighter Transformers](https://arxiv.org/abs/2103.14636)
94 | - [TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second](https://arxiv.org/abs/2207.01848)
95 | - [Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets](https://arxiv.org/abs/2201.02177)
96 | - [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html)
97 |
98 | **Tier 4+**
99 |
100 | - ✨ [Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks](https://arxiv.org/abs/2311.12997)
101 | - [Memorizing Transformers](https://arxiv.org/abs/2203.08913)
102 | - [Transformer Feed-Forward Layers Are Key-Value Memories](https://arxiv.org/abs/2012.14913)
103 |
104 |
105 |
106 | ### Key foundation model architectures
107 |
108 | **Tier 1**
109 |
110 | - [Language Models are Unsupervised Multitask Learners](https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe) (GPT-2)
111 | - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) (GPT-3)
112 |
113 | **Tier 2**
114 |
115 | - ✨ [LLaMA: Open and Efficient Foundation Language Models](http://arxiv.org/abs/2302.13971) (LLaMA)
116 | - ✨ [Efficiently Modeling Long Sequences with Structured State Spaces](https://arxiv.org/abs/2111.00396) ([video](https://www.youtube.com/watch?v=EvQ3ncuriCM)) (S4)
117 | - [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) (T5)
118 | - [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374) (OpenAI Codex)
119 | - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155) (OpenAI Instruct)
120 |
121 | **Tier 3**
122 |
123 | - ✨ [Mistral 7B](http://arxiv.org/abs/2310.06825) (Mistral)
124 | - ✨ [Mixtral of Experts](http://arxiv.org/abs/2401.04088) (Mixtral)
125 | - ✨ [Gemini: A Family of Highly Capable Multimodal Models](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) (Gemini)
126 | - ✨ [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752v1) (Mamba)
127 | - [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416) (Flan)
128 |
129 | **Tier 4+**
130 |
131 | - ✨ [Consistency Models](http://arxiv.org/abs/2303.01469)
132 | - ✨ [Model Card and Evaluations for Claude Models](https://www-cdn.anthropic.com/bd2a28d2535bfb0494cc8e2a3bf135d2e7523226/Model-Card-Claude-2.pdf) (Claude 2)
133 | - ✨ [OLMo: Accelerating the Science of Language Models](http://arxiv.org/abs/2402.00838)
134 | - ✨ [PaLM 2 Technical Report](https://arxiv.org/abs/2305.10403) (PaLM 2)
135 | - ✨ [Textbooks Are All You Need II: phi-1.5 technical report](http://arxiv.org/abs/2309.05463) (phi 1.5)
136 | - ✨ [Visual Instruction Tuning](http://arxiv.org/abs/2304.08485) (LLaVA)
137 | - [A General Language Assistant as a Laboratory for Alignment](https://arxiv.org/abs/2112.00861)
138 | - [Finetuned Language Models Are Zero-Shot Learners](https://arxiv.org/abs/2109.01652) (Google Instruct)
139 | - [Galactica: A Large Language Model for Science](https://arxiv.org/abs/2211.09085)
140 | - [LaMDA: Language Models for Dialog Applications](https://arxiv.org/abs/2201.08239) (Google Dialog)
141 | - [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) (Meta GPT-3)
142 | - [PaLM: Scaling Language Modeling with Pathways](https://arxiv.org/abs/2204.02311) (PaLM)
143 | - [Program Synthesis with Large Language Models](https://arxiv.org/abs/2108.07732) (Google Codex)
144 | - [Scaling Language Models: Methods, Analysis & Insights from Training Gopher](https://arxiv.org/abs/2112.11446) (Gopher)
145 | - [Solving Quantitative Reasoning Problems with Language Models](https://arxiv.org/abs/2206.14858) (Minerva)
146 | - [UL2: Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131) (UL2)
147 |
148 |
149 |
150 | ### Training and finetuning
151 |
152 | **Tier 2**
153 |
154 | - ✨ [Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer](https://arxiv.org/abs/2203.03466)
155 | - [Learning to summarize from human feedback](https://arxiv.org/abs/2009.01325)
156 | - [Training Verifiers to Solve Math Word Problems](https://arxiv.org/abs/2110.14168)
157 |
158 | **Tier 3**
159 |
160 | - ✨ [Pretraining Language Models with Human Preferences](http://arxiv.org/abs/2302.08582)
161 | - ✨ [Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision](http://arxiv.org/abs/2312.09390)
162 | - [Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning](https://arxiv.org/abs/2205.05638v1)
163 | - [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
164 | - [Unsupervised Neural Machine Translation with Generative Language Models Only](https://arxiv.org/abs/2110.05448)
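
As a toy illustration of the LoRA idea listed above (keep the pretrained weight frozen and learn a low-rank additive update), here is a shape-level sketch; it is not the paper’s training code, just the parameterization:

```python
# LoRA-style low-rank update, shapes only (NumPy). Only A and B would be trained.
import numpy as np

d_out, d_in, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # small random init
B = np.zeros((d_out, r))                # zero init, so the update starts as a no-op

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))  # frozen path plus low-rank correction

print(lora_forward(rng.normal(size=(d_in,))).shape)  # (512,)
```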
165 |
166 | **Tier 4+**
167 |
168 | - ✨ [Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models](http://arxiv.org/abs/2312.06585)
169 | - ✨ [Improving Code Generation by Training with Natural Language Feedback](http://arxiv.org/abs/2303.16749)
170 | - ✨ [Language Modeling Is Compression](https://arxiv.org/abs/2309.10668v1)
171 | - ✨ [LIMA: Less Is More for Alignment](http://arxiv.org/abs/2305.11206)
172 | - ✨ [Learning to Compress Prompts with Gist Tokens](http://arxiv.org/abs/2304.08467)
173 | - ✨ [Lost in the Middle: How Language Models Use Long Contexts](http://arxiv.org/abs/2307.03172)
174 | - ✨ [QLoRA: Efficient Finetuning of Quantized LLMs](http://arxiv.org/abs/2305.14314)
175 | - ✨ [Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking](http://arxiv.org/abs/2403.09629)
176 | - ✨ [Reinforced Self-Training (ReST) for Language Modeling](http://arxiv.org/abs/2308.08998)
177 | - ✨ [Solving olympiad geometry without human demonstrations](https://www.nature.com/articles/s41586-023-06747-5)
178 | - ✨ [Tell, don't show: Declarative facts influence how LLMs generalize](http://arxiv.org/abs/2312.07779)
179 | - ✨ [Textbooks Are All You Need](http://arxiv.org/abs/2306.11644)
180 | - ✨ [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](http://arxiv.org/abs/2305.07759)
181 | - ✨ [Training Language Models with Language Feedback at Scale](http://arxiv.org/abs/2303.16755)
182 | - ✨ [Turing Complete Transformers: Two Transformers Are More Powerful Than One](https://openreview.net/forum?id=MGWsPGogLH)
183 | - [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626)
184 | - [Data Distributional Properties Drive Emergent In-Context Learning in Transformers](https://arxiv.org/abs/2205.05055)
185 | - [Diffusion-LM Improves Controllable Text Generation](https://arxiv.org/abs/2205.14217)
186 | - [ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation](https://arxiv.org/abs/2107.02137)
187 | - [Efficient Training of Language Models to Fill in the Middle](https://arxiv.org/abs/2207.14255)
188 | - [ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning](https://arxiv.org/abs/2111.10952)
189 | - [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/abs/2101.00190)
190 | - [Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning](https://arxiv.org/abs/2106.02584)
191 | - [True Few-Shot Learning with Prompts -- A Real-World Perspective](https://arxiv.org/abs/2111.13440)
192 |
193 |
194 |
195 | ## Reasoning and runtime strategies
196 |
197 | ### In-context reasoning
198 |
199 | **Tier 2**
200 |
201 | - [Chain of Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
202 | - [Large Language Models are Zero-Shot Reasoners](https://arxiv.org/abs/2205.11916) (Let's think step by step)
203 | - [Self-Consistency Improves Chain of Thought Reasoning in Language Models](https://arxiv.org/abs/2203.11171)
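
To make the chain-of-thought and self-consistency ideas above concrete, here is a small sketch: sample several step-by-step completions and take a majority vote over the extracted answers. `generate` is a placeholder for whatever LLM client you use, not a real API.

```python
# Self-consistency over chain-of-thought samples (sketch; `generate` is a stub).
from collections import Counter

def generate(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError("Call your LLM of choice here.")

def extract_answer(completion: str) -> str:
    # Assumes the completion ends with a line like "Answer: 42".
    for line in reversed(completion.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return completion.strip().splitlines()[-1]

def self_consistent_answer(question: str, n_samples: int = 10) -> str:
    prompt = f"{question}\nLet's think step by step, then end with 'Answer: <result>'."
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # majority vote
```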
204 |
205 | **Tier 3**
206 |
207 | - ✨ [Chain-of-Thought Reasoning Without Prompting](http://arxiv.org/abs/2402.10200)
208 | - ✨ [Why think step-by-step? Reasoning emerges from the locality of experience](http://arxiv.org/abs/2304.03843)
209 |
210 | **Tier 4+**
211 |
212 | - ✨ [Baldur: Whole-Proof Generation and Repair with Large Language Models](https://arxiv.org/abs/2303.04910v1)
213 | - ✨ [Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought](http://arxiv.org/abs/2403.05518)
214 | - ✨ [Certified Reasoning with Language Models](http://arxiv.org/abs/2306.04031)
215 | - ✨ [Hypothesis Search: Inductive Reasoning with Language Models](http://arxiv.org/abs/2309.05660)
216 | - ✨ [LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations](http://arxiv.org/abs/2305.18354)
217 | - ✨ [Large Language Models Cannot Self-Correct Reasoning Yet](https://arxiv.org/abs/2310.01798v1)
218 | - ✨ [Stream of Search (SoS): Learning to Search in Language](http://arxiv.org/abs/2404.03683)
219 | - ✨ [Training Chain-of-Thought via Latent-Variable Inference](http://arxiv.org/abs/2312.02179)
220 | - [Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?](https://arxiv.org/abs/2202.12837)
221 | - [Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right](https://arxiv.org/abs/2104.08315)
222 |
223 |
224 |
225 | ### Task decomposition
226 |
227 | **Tier 1**
228 |
229 | - [Supervise Process, not Outcomes](https://ought.org/updates/2022-04-06-process)
230 | - [Supervising strong learners by amplifying weak experts](https://arxiv.org/abs/1810.08575)
231 |
232 | **Tier 2**
233 |
234 | - ✨ [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](http://arxiv.org/abs/2305.10601)
235 | - [Factored cognition](https://ought.org/research/factored-cognition)
236 | - [Iterated Distillation and Amplification](https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616)
237 | - [Recursively Summarizing Books with Human Feedback](https://arxiv.org/abs/2109.10862)
238 | - [Solving math word problems with process-based and outcome-based feedback](https://arxiv.org/abs/2211.14275)
239 |
240 | **Tier 3**
241 |
242 | - ✨ [Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers](https://arxiv.org/abs/2310.10627)
243 | - [Faithful Reasoning Using Large Language Models](https://arxiv.org/abs/2208.14271)
244 | - [Humans consulting HCH](https://ai-alignment.com/humans-consulting-hch-f893f6051455)
245 | - [Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes](https://arxiv.org/abs/2301.01751)
246 | - [Language Model Cascades](https://arxiv.org/abs/2207.10342)
247 |
248 | **Tier 4+**
249 |
250 | - ✨ [Decontextualization: Making Sentences Stand-Alone](https://doi.org/10.1162/tacl_a_00377)
251 | - ✨ [Factored Cognition Primer](https://primer.ought.org)
252 | - ✨ [Graph of Thoughts: Solving Elaborate Problems with Large Language Models](http://arxiv.org/abs/2308.09687)
253 | - ✨ [Parsel: A Unified Natural Language Framework for Algorithmic Reasoning](http://arxiv.org/abs/2212.10561)
254 | - [AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts](https://arxiv.org/abs/2110.01691)
255 | - [Challenging BIG-Bench tasks and whether chain-of-thought can solve them](https://arxiv.org/abs/2210.09261)
256 | - [Evaluating Arguments One Step at a Time](https://ought.org/updates/2020-01-11-arguments)
257 | - [Least-to-Most Prompting Enables Complex Reasoning in Large Language Models](https://arxiv.org/abs/2205.10625)
258 | - [Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations](https://arxiv.org/abs/2205.11822)
259 | - [Measuring and narrowing the compositionality gap in language models](https://arxiv.org/abs/2210.03350)
260 | - [PAL: Program-aided Language Models](https://arxiv.org/abs/2211.10435)
261 | - [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/abs/2210.03629)
262 | - [Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning](https://arxiv.org/abs/2205.09712)
263 | - [Show Your Work: Scratchpads for Intermediate Computation with Language Models](https://arxiv.org/abs/2112.00114)
264 | - [Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents](https://arxiv.org/abs/2110.10150)
265 | - [Thinksum: probabilistic reasoning over sets using large language models](https://arxiv.org/abs/2210.01293)
266 |
267 |
268 |
269 | ### Debate
270 |
271 | **Tier 2**
272 |
273 | - [AI safety via debate](https://openai.com/blog/debate/)
274 |
275 | **Tier 3**
276 |
277 | - ✨ [Debate Helps Supervise Unreliable Experts](https://twitter.com/joshua_clymer/status/1724851456967417872)
278 | - [Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions](https://arxiv.org/abs/2210.10860)
279 |
280 | **Tier 4+**
281 |
282 | - ✨ [Scalable AI Safety via Doubly-Efficient Debate](http://arxiv.org/abs/2311.14125)
283 | - ✨ [Improving Factuality and Reasoning in Language Models through Multiagent Debate](http://arxiv.org/abs/2305.14325)
284 |
285 |
286 |
287 | ### Tool use and scaffolding
288 |
289 | **Tier 2**
290 |
291 | - ✨ [Measuring the impact of post-training enhancements](https://metr.github.io/autonomy-evals-guide/elicitation-gap/)
292 | - [WebGPT: Browser-assisted question-answering with human feedback](https://arxiv.org/abs/2112.09332)
293 |
294 | **Tier 3**
295 |
296 | - ✨ [AI capabilities can be significantly improved without expensive retraining](http://arxiv.org/abs/2312.07413)
297 | - ✨ [Automated Statistical Model Discovery with Language Models](http://arxiv.org/abs/2402.17879)
298 |
299 | **Tier 4+**
300 |
301 | - ✨ [DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines](http://arxiv.org/abs/2310.03714)
302 | - ✨ [Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution](http://arxiv.org/abs/2309.16797)
303 | - ✨ [Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation](https://arxiv.org/abs/2310.02304v1)
304 | - ✨ [Voyager: An Open-Ended Embodied Agent with Large Language Models](http://arxiv.org/abs/2305.16291)
305 | - [ReGAL: Refactoring Programs to Discover Generalizable Abstractions](http://arxiv.org/abs/2401.16467)
306 |
307 |
308 |
309 | ### Honesty, factuality, and epistemics
310 |
311 | **Tier 2**
312 |
313 | - ✨ [Self-critiquing models for assisting human evaluators](https://arxiv.org/abs/2206.05802v2)
314 |
315 | **Tier 3**
316 |
317 | - ✨ [What Evidence Do Language Models Find Convincing?](http://arxiv.org/abs/2402.11782)
318 | - ✨ [How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions](https://arxiv.org/abs/2309.15840)
319 |
320 | **Tier 4+**
321 |
322 | - ✨ [Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting](http://arxiv.org/abs/2305.04388)
323 | - ✨ [Long-form factuality in large language models](http://arxiv.org/abs/2403.18802)
324 |
325 |
326 |
327 | ## Applications
328 |
329 | ### Science
330 |
331 | **Tier 3**
332 |
333 | - ✨ [Can large language models provide useful feedback on research papers? A large-scale empirical analysis](http://arxiv.org/abs/2310.01783)
334 | - ✨ [Large Language Models Encode Clinical Knowledge](http://arxiv.org/abs/2212.13138)
335 | - ✨ [The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4](http://arxiv.org/abs/2311.07361)
336 | - [A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers](https://arxiv.org/abs/2105.03011)
337 |
338 | **Tier 4+**
339 |
340 | - ✨ [Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine](http://arxiv.org/abs/2311.16452)
341 | - ✨ [Nougat: Neural Optical Understanding for Academic Documents](http://arxiv.org/abs/2308.13418)
342 | - ✨ [Scim: Intelligent Skimming Support for Scientific Papers](http://arxiv.org/abs/2205.04561)
343 | - ✨ [SynerGPT: In-Context Learning for Personalized Drug Synergy Prediction and Drug Design](https://www.biorxiv.org/content/10.1101/2023.07.06.547759v1)
344 | - ✨ [Towards Accurate Differential Diagnosis with Large Language Models](http://arxiv.org/abs/2312.00164)
345 | - ✨ [Towards a Benchmark for Scientific Understanding in Humans and Machines](http://arxiv.org/abs/2304.10327)
346 | - [A Search Engine for Discovery of Scientific Challenges and Directions](https://arxiv.org/abs/2108.13751)
347 | - [A full systematic review was completed in 2 weeks using automation tools: a case study](https://pubmed.ncbi.nlm.nih.gov/32004673/)
348 | - [Fact or Fiction: Verifying Scientific Claims](https://arxiv.org/abs/2004.14974)
349 | - [Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles](https://arxiv.org/abs/2010.14235)
350 | - [PEER: A Collaborative Language Model](https://arxiv.org/abs/2208.11663)
351 | - [PubMedQA: A Dataset for Biomedical Research Question Answering](https://arxiv.org/abs/1909.06146)
352 | - [SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts](https://arxiv.org/abs/2104.08809)
353 | - [SciTail: A Textual Entailment Dataset from Science Question Answering](http://ai2-website.s3.amazonaws.com/team/ashishs/scitail-aaai2018.pdf)
354 |
355 |
356 |
357 | ### Forecasting
358 |
359 | **Tier 3**
360 |
361 | - ✨ [AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy](https://arxiv.org/abs/2402.07862v1)
362 | - ✨ [Approaching Human-Level Forecasting with Language Models](http://arxiv.org/abs/2402.18563)
363 | - ✨ [Are Transformers Effective for Time Series Forecasting?](https://arxiv.org/abs/2205.13504)
364 | - [Forecasting Future World Events with Neural Networks](https://arxiv.org/abs/2206.15474)
365 |
366 | ### Search and ranking
367 |
368 | **Tier 2**
369 |
370 | - [Learning Dense Representations of Phrases at Scale](https://arxiv.org/abs/2012.12624)
371 | - [Text and Code Embeddings by Contrastive Pre-Training](https://arxiv.org/abs/2201.10005) (OpenAI embeddings)
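
The retrieval papers in this section mostly build on the same dense-retrieval pattern: embed documents and queries into one vector space and rank by similarity. A toy sketch, with `embed` as a placeholder for any text-embedding model rather than a specific API:

```python
# Toy dense retrieval by cosine similarity; `embed` is a stand-in for an embedding model.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError("Replace with your embedding model of choice.")

def top_k(query: str, docs: list[str], k: int = 5) -> list[tuple[str, float]]:
    doc_vecs = embed(docs)
    q_vec = embed([query])[0]
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # unit-normalize
    q_vec /= np.linalg.norm(q_vec)
    scores = doc_vecs @ q_vec                                    # cosine similarities
    order = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in order]
```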
372 |
373 | **Tier 3**
374 |
375 | - ✨ [Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting](http://arxiv.org/abs/2306.17563)
376 | - [Not All Vector Databases Are Made Equal](https://dmitry-kan.medium.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696)
377 | - [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909)
378 | - [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)
379 | - [Task-aware Retrieval with Instructions](https://arxiv.org/abs/2211.09260)
380 |
381 | **Tier 4+**
382 |
383 | - ✨ [RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!](http://arxiv.org/abs/2312.02724)
384 | - ✨ [Some Common Mistakes In IR Evaluation, And How They Can Be Avoided](https://dl.acm.org/doi/10.1145/3190580.3190586)
385 | - [Boosting Search Engines with Interactive Agents](https://arxiv.org/abs/2109.00527)
386 | - [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/abs/2004.12832)
387 | - [Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking](https://arxiv.org/abs/2212.01340)
388 | - [UnifiedQA: Crossing Format Boundaries With a Single QA System](https://arxiv.org/abs/2005.00700)
389 |
390 |
391 |
392 |
393 | ## ML in practice
394 |
395 | ### Production deployment
396 |
397 | **Tier 1**
398 |
399 | - [Machine Learning in Python: Main developments and technology trends in data science, machine learning, and AI](https://arxiv.org/abs/2002.04803v2)
400 | - [Machine Learning: The High Interest Credit Card of Technical Debt](https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf)
401 |
402 | **Tier 2**
403 |
404 | - ✨ [Designing Data-Intensive Applications](https://dataintensive.net/)
405 | - [A Recipe for Training Neural Networks](http://karpathy.github.io/2019/04/25/recipe/)
406 |
407 | ### Benchmarks
408 |
409 | **Tier 2**
410 |
411 | - ✨ [GPQA: A Graduate-Level Google-Proof Q&A Benchmark](http://arxiv.org/abs/2311.12022)
412 | - ✨ [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770v1)
413 | - [TruthfulQA: Measuring How Models Mimic Human Falsehoods](https://arxiv.org/abs/2109.07958)
414 |
415 | **Tier 3**
416 |
417 | - [FLEX: Unifying Evaluation for Few-Shot NLP](https://arxiv.org/abs/2107.07170)
418 | - [Holistic Evaluation of Language Models](https://arxiv.org/abs/2211.09110) (HELM)
419 | - [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300)
420 | - [RAFT: A Real-World Few-Shot Text Classification Benchmark](https://arxiv.org/abs/2109.14076)
421 | - [True Few-Shot Learning with Language Models](https://arxiv.org/abs/2105.11447)
422 |
423 | **Tier 4+**
424 |
425 | - ✨ [GAIA: a benchmark for General AI Assistants](http://arxiv.org/abs/2311.12983)
426 | - [ConditionalQA: A Complex Reading Comprehension Dataset with Conditional Answers](https://arxiv.org/abs/2110.06884)
427 | - [Measuring Mathematical Problem Solving With the MATH Dataset](https://arxiv.org/abs/2103.03874)
428 | - [QuALITY: Question Answering with Long Input Texts, Yes!](https://arxiv.org/abs/2112.08608)
429 | - [SCROLLS: Standardized CompaRison Over Long Language Sequences](https://arxiv.org/abs/2201.03533)
430 | - [What Will it Take to Fix Benchmarking in Natural Language Understanding?](https://arxiv.org/abs/2104.02145)
431 |
432 |
433 |
434 | ### Datasets
435 |
436 | **Tier 2**
437 |
438 | - [Common Crawl](https://arxiv.org/abs/2105.02732)
439 | - [The Pile: An 800GB Dataset of Diverse Text for Language Modeling](https://arxiv.org/abs/2101.00027)
440 |
441 | **Tier 3**
442 |
443 | - [Dialog Inpainting: Turning Documents into Dialogs](https://arxiv.org/abs/2205.09073)
444 | - [MS MARCO: A Human Generated MAchine Reading COmprehension Dataset](https://arxiv.org/abs/1611.09268)
445 | - [Microsoft Academic Graph](https://internal-journal.frontiersin.org/articles/10.3389/fdata.2019.00045/full)
446 | - [TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts](https://arxiv.org/abs/2110.01159)
447 |
448 | ## Advanced topics
449 |
450 | ### World models and causality
451 |
452 | **Tier 3**
453 |
454 | - ✨ [Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task](http://arxiv.org/abs/2210.13382)
455 | - ✨ [From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought](http://arxiv.org/abs/2306.12672)
456 | - [Language Models Represent Space and Time](http://arxiv.org/abs/2310.02207)
457 |
458 | **Tier 4+**
459 |
460 | - ✨ [Amortizing intractable inference in large language models](http://arxiv.org/abs/2310.04363)
461 | - ✨ [CLADDER: Assessing Causal Reasoning in Language Models](http://zhijing-jin.com/files/papers/CLadder_2023.pdf)
462 | - ✨ [Causal Bayesian Optimization](https://proceedings.mlr.press/v108/aglietti20a.html)
463 | - ✨ [Causal Reasoning and Large Language Models: Opening a New Frontier for Causality](http://arxiv.org/abs/2305.00050)
464 | - ✨ [Generative Agents: Interactive Simulacra of Human Behavior](http://arxiv.org/abs/2304.03442)
465 | - ✨ [Passive learning of active causal strategies in agents and language models](http://arxiv.org/abs/2305.16183)
466 |
467 |
468 |
469 | ### Planning
470 |
471 | **Tier 4+**
472 |
473 | - ✨ [Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping](http://arxiv.org/abs/2402.14083)
474 | - ✨ [Cognitive Architectures for Language Agents](http://arxiv.org/abs/2309.02427)
475 |
476 |
477 |
478 | ### Uncertainty, calibration, and active learning
479 |
480 | **Tier 2**
481 |
482 | - ✨ [Experts Don't Cheat: Learning What You Don't Know By Predicting Pairs](http://arxiv.org/abs/2402.08733)
483 | - [A Simple Baseline for Bayesian Uncertainty in Deep Learning](https://arxiv.org/abs/1902.02476)
484 | - [Plex: Towards Reliability using Pretrained Large Model Extensions](https://arxiv.org/abs/2207.07411)
485 |
486 | **Tier 3**
487 |
488 | - ✨ [Active Preference Inference using Language Models and Probabilistic Reasoning](http://arxiv.org/abs/2312.12009)
489 | - ✨ [Eliciting Human Preferences with Language Models](http://arxiv.org/abs/2310.11589)
490 | - [Active Learning by Acquiring Contrastive Examples](https://arxiv.org/abs/2109.03764)
491 | - [Describing Differences between Text Distributions with Natural Language](https://arxiv.org/abs/2201.12323)
492 | - [Teaching Models to Express Their Uncertainty in Words](https://arxiv.org/abs/2205.14334)
493 |
494 | **Tier 4+**
495 |
496 | - ✨ [Doing Experiments and Revising Rules with Natural Language and Probabilistic Reasoning](http://arxiv.org/abs/2402.06025)
497 | - ✨ [STaR-GATE: Teaching Language Models to Ask Clarifying Questions](http://arxiv.org/abs/2403.19154)
498 | - [Active Testing: Sample-Efficient Model Evaluation](https://arxiv.org/abs/2103.05331)
499 | - [Uncertainty Estimation for Language Reward Models](https://arxiv.org/abs/2203.07472)
500 |
501 |
502 |
503 | ### Interpretability and model editing
504 |
505 | **Tier 2**
506 |
507 | - [Discovering Latent Knowledge in Language Models Without Supervision](https://arxiv.org/abs/2212.03827v1)
508 |
509 | **Tier 3**
510 |
511 | - ✨ [Interpretability at Scale: Identifying Causal Mechanisms in Alpaca](http://arxiv.org/abs/2305.08809)
512 | - ✨ [Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks](http://arxiv.org/abs/2311.12786)
513 | - ✨ [Representation Engineering: A Top-Down Approach to AI Transparency](http://arxiv.org/abs/2310.01405)
514 | - ✨ [Studying Large Language Model Generalization with Influence Functions](http://arxiv.org/abs/2308.03296)
515 | - [Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small](https://arxiv.org/abs/2211.00593)
516 |
517 | **Tier 4+**
518 |
519 | - ✨ [Codebook Features: Sparse and Discrete Interpretability for Neural Networks](http://arxiv.org/abs/2310.17230)
520 | - ✨ [Eliciting Latent Predictions from Transformers with the Tuned Lens](http://arxiv.org/abs/2303.08112)
521 | - ✨ [How do Language Models Bind Entities in Context?](http://arxiv.org/abs/2310.17191)
522 | - ✨ [Opening the AI black box: program synthesis via mechanistic interpretability](https://arxiv.org/abs/2402.05110v1)
523 | - ✨ [Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models](http://arxiv.org/abs/2403.19647)
524 | - ✨ [Uncovering mesa-optimization algorithms in Transformers](http://arxiv.org/abs/2309.05858)
525 | - [Fast Model Editing at Scale](https://arxiv.org/abs/2110.11309)
526 | - [Git Re-Basin: Merging Models modulo Permutation Symmetries](https://arxiv.org/abs/2209.04836)
527 | - [Locating and Editing Factual Associations in GPT](https://arxiv.org/abs/2202.05262)
528 | - [Mass-Editing Memory in a Transformer](https://arxiv.org/abs/2210.07229)
529 |
530 |
531 |
532 | ### Reinforcement learning
533 |
534 | **Tier 2**
535 |
536 | - ✨ [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](http://arxiv.org/abs/2305.18290)
537 | - ✨ [Reflexion: Language Agents with Verbal Reinforcement Learning](http://arxiv.org/abs/2303.11366)
538 | - [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm](https://arxiv.org/abs/1712.01815) (AlphaZero)
539 | - [MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://arxiv.org/abs/1911.08265)
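
The DPO objective from the paper above is compact enough to write out. A per-example sketch, assuming you have already computed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model:

```python
# Per-example DPO loss (sketch; real implementations batch this in a DL framework).
import math

def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    chosen_logratio = logp_policy_chosen - logp_ref_chosen
    rejected_logratio = logp_policy_rejected - logp_ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```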
540 |
541 | **Tier 3**
542 |
543 | - ✨ [Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback](http://arxiv.org/abs/2307.15217)
544 | - [AlphaStar: mastering the real-time strategy game StarCraft II](https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii)
545 | - [Decision Transformer](https://arxiv.org/abs/2106.01345)
546 | - [Mastering Atari Games with Limited Data](https://arxiv.org/abs/2111.00210) (EfficientZero)
547 | - [Mastering Stratego, the classic game of imperfect information](https://www.science.org/doi/10.1126/science.add4679) (DeepNash)
548 |
549 | **Tier 4+**
550 |
551 | - ✨ [AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning](http://arxiv.org/abs/2308.03526)
552 | - ✨ [Bayesian Reinforcement Learning with Limited Cognitive Load](http://arxiv.org/abs/2305.03263)
553 | - ✨ [Contrastive Preference Learning: Learning from Human Feedback without RL](http://arxiv.org/abs/2310.13639)
554 | - ✨ [Grandmaster-Level Chess Without Search](http://arxiv.org/abs/2402.04494)
555 | - [A data-driven approach for learning to control computers](https://arxiv.org/abs/2202.08137)
556 | - [Acquisition of Chess Knowledge in AlphaZero](https://arxiv.org/abs/2111.09259)
557 | - [Player of Games](https://arxiv.org/abs/2112.03178)
558 | - [Retrieval-Augmented Reinforcement Learning](https://arxiv.org/abs/2202.08417)
559 |
560 |
561 |
562 | ## The big picture
563 |
564 | ### AI scaling
565 |
566 | **Tier 1**
567 |
568 | - [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
569 | - [Takeoff speeds](https://sideways-view.com/2018/02/24/takeoff-speeds/)
570 | - [The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)
571 |
572 | **Tier 2**
573 |
574 | - [AI and compute](https://openai.com/blog/ai-and-compute/)
575 | - [Scaling Laws for Transfer](https://arxiv.org/abs/2102.01293)
576 | - [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556) (Chinchilla)
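
A back-of-the-envelope version of the compute-optimal recipe from the Chinchilla paper, using the common approximations of roughly 20 training tokens per parameter and training compute of about 6·N·D FLOPs (rules of thumb, not the paper’s exact fitted coefficients):

```python
# Chinchilla-style rule of thumb: tokens ≈ 20 × params, compute ≈ 6 × params × tokens.
def compute_optimal_sketch(n_params: float) -> dict:
    tokens = 20 * n_params
    flops = 6 * n_params * tokens
    return {"params": n_params, "tokens": tokens, "train_flops": flops}

print(compute_optimal_sketch(70e9))  # ~1.4e12 tokens, ~6e23 FLOPs for a 70B model
```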
577 |
578 | **Tier 3**
579 |
580 | - [Emergent Abilities of Large Language Models](https://arxiv.org/abs/2206.07682)
581 | - [Transcending Scaling Laws with 0.1% Extra Compute](https://arxiv.org/abs/2210.11399) (U-PaLM)
582 |
583 | **Tier 4+**
584 |
585 | - ✨ [Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws](http://arxiv.org/abs/2404.05405)
586 | - ✨ [Power Law Trends in Speedrunning and Machine Learning](http://arxiv.org/abs/2304.10004)
587 | - ✨ [Scaling laws for single-agent reinforcement learning](http://arxiv.org/abs/2301.13442)
588 | - [Beyond neural scaling laws: beating power law scaling via data pruning](https://arxiv.org/abs/2206.14486)
590 | - [Scaling Scaling Laws with Board Games](https://arxiv.org/abs/2104.03113)
591 |
592 |
593 |
594 | ### AI safety
595 |
596 | **Tier 1**
597 |
598 | - [Three impacts of machine intelligence](https://www.effectivealtruism.org/articles/three-impacts-of-machine-intelligence-paul-christiano/)
599 | - [What failure looks like](https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like)
600 | - [Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover](https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to)
601 |
602 | **Tier 2**
603 |
604 | - ✨ [An Overview of Catastrophic AI Risks](http://arxiv.org/abs/2306.12001)
605 | - [Clarifying “What failure looks like” (part 1)](https://www.lesswrong.com/posts/v6Q7T335KCMxujhZu/clarifying-what-failure-looks-like-part-1)
606 | - [Deep RL from human preferences](https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/)
607 | - [The alignment problem from a deep learning perspective](https://arxiv.org/abs/2209.00626)
608 |
609 | **Tier 3**
610 |
611 | - ✨ [Scheming AIs: Will AIs fake alignment during training in order to get power?](http://arxiv.org/abs/2311.08379)
612 | - [Measuring Progress on Scalable Oversight for Large Language Models](https://arxiv.org/abs/2211.03540)
613 | - [Risks from Learned Optimization in Advanced Machine Learning Systems](https://arxiv.org/abs/1906.01820)
614 | - [Scalable agent alignment via reward modelling](https://arxiv.org/abs/1811.07871)
615 |
616 | **Tier 4+**
617 |
618 | - ✨ [AI Deception: A Survey of Examples, Risks, and Potential Solutions](http://arxiv.org/abs/2308.14752)
619 | - ✨ [Benchmarks for Detecting Measurement Tampering](http://arxiv.org/abs/2308.15605)
620 | - ✨ [Chess as a Testing Grounds for the Oracle Approach to AI Safety](http://arxiv.org/abs/2010.02911)
621 | - ✨ [Close the Gates to an Inhuman Future: How and why we should choose to not develop superhuman general-purpose artificial intelligence](https://papers.ssrn.com/abstract=4608505)
622 | - ✨ [Model evaluation for extreme risks](http://arxiv.org/abs/2305.15324)
623 | - ✨ [Responsible Reporting for Frontier AI Development](http://arxiv.org/abs/2404.02675)
624 | - ✨ [Safety Cases: How to Justify the Safety of Advanced AI Systems](http://arxiv.org/abs/2403.10462)
625 | - ✨ [Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training](http://arxiv.org/abs/2401.05566)
626 | - ✨ [Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure](http://arxiv.org/abs/2311.07590)
627 | - ✨ [Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game](http://arxiv.org/abs/2311.01011)
628 | - ✨ [Tools for Verifying Neural Models' Training Data](http://arxiv.org/abs/2307.00682)
629 | - ✨ [Towards a Cautious Scientist AI with Convergent Safety Bounds](https://yoshuabengio.org/2024/02/26/towards-a-cautious-scientist-ai-with-convergent-safety-bounds/)
630 | - [Alignment of Language Agents](https://arxiv.org/abs/2103.14659)
631 | - [Eliciting Latent Knowledge](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?usp=sharing)
632 | - [Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://arxiv.org/abs/2209.07858)
633 | - [Red Teaming Language Models with Language Models](https://storage.googleapis.com/deepmind-media/Red%20Teaming/Red%20Teaming.pdf)
634 | - [Unsolved Problems in ML Safety](https://arxiv.org/abs/2109.13916)
635 |
636 |
637 |
638 | ### Economic and social impacts
639 |
640 | **Tier 3**
641 |
642 | - ✨ [Explosive growth from AI automation: A review of the arguments](http://arxiv.org/abs/2309.11690)
643 | - ✨ [Language Models Can Reduce Asymmetry in Information Markets](http://arxiv.org/abs/2403.14443)
644 |
645 | **Tier 4+**
646 |
647 | - ✨ [Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero](http://arxiv.org/abs/2310.16410)
648 | - ✨ [Foundation Models and Fair Use](https://arxiv.org/abs/2303.15715v1)
649 | - ✨ [GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models](http://arxiv.org/abs/2303.10130)
650 | - ✨ [Levels of AGI: Operationalizing Progress on the Path to AGI](http://arxiv.org/abs/2311.02462)
651 | - ✨ [Opportunities and Risks of LLMs for Scalable Deliberation with Polis](http://arxiv.org/abs/2306.11932)
652 | - [On the Opportunities and Risks of Foundation Models](https://arxiv.org/abs/2108.07258)
653 |
654 |
655 |
656 | ### Philosophy
657 |
658 | **Tier 2**
659 |
660 | - [Meaning without reference in large language models](https://arxiv.org/abs/2208.02957)
661 |
662 | **Tier 4+**
663 |
664 | - ✨ [Consciousness in Artificial Intelligence: Insights from the Science of Consciousness](http://arxiv.org/abs/2308.08708)
665 | - ✨ [Philosophers Ought to Develop, Theorize About, and Use Philosophically Relevant AI](https://philarchive.org/archive/CLAPOT-16)
666 | - ✨ [Towards Evaluating AI Systems for Moral Status Using Self-Reports](http://arxiv.org/abs/2311.08576)
667 |
668 |
669 |
670 | ## Maintainer
671 |
672 | [andreas@elicit.com](mailto:andreas@elicit.com)
673 |
674 |