# Elicit Machine Learning Reading List

## Purpose

The purpose of this curriculum is to help new [Elicit](https://elicit.com/) employees build background in machine learning, with a focus on language models. I’ve tried to strike a balance between papers that are relevant for deploying ML in production and techniques that matter for longer-term scalability.

If you don’t work at Elicit yet, we’re [hiring ML and software engineers](https://elicit.com/careers).

## How to read

Recommended reading order:

1. Read “Tier 1” for all topics
2. Read “Tier 2” for all topics
3. Etc.

✨ marks items added after 2024/4/1.

## Table of contents

- [Fundamentals](#fundamentals)
  * [Introduction to machine learning](#introduction-to-machine-learning)
  * [Transformers](#transformers)
  * [Key foundation model architectures](#key-foundation-model-architectures)
  * [Training and finetuning](#training-and-finetuning)
- [Reasoning and runtime strategies](#reasoning-and-runtime-strategies)
  * [In-context reasoning](#in-context-reasoning)
  * [Task decomposition](#task-decomposition)
  * [Debate](#debate)
  * [Tool use and scaffolding](#tool-use-and-scaffolding)
  * [Honesty, factuality, and epistemics](#honesty-factuality-and-epistemics)
- [Applications](#applications)
  * [Science](#science)
  * [Forecasting](#forecasting)
  * [Search and ranking](#search-and-ranking)
- [ML in practice](#ml-in-practice)
  * [Production deployment](#production-deployment)
  * [Benchmarks](#benchmarks)
  * [Datasets](#datasets)
- [Advanced topics](#advanced-topics)
  * [World models and causality](#world-models-and-causality)
  * [Planning](#planning)
  * [Uncertainty, calibration, and active learning](#uncertainty-calibration-and-active-learning)
  * [Interpretability and model editing](#interpretability-and-model-editing)
  * [Reinforcement learning](#reinforcement-learning)
- [The big picture](#the-big-picture)
  * [AI scaling](#ai-scaling)
  * [AI safety](#ai-safety)
  * [Economic and social impacts](#economic-and-social-impacts)
  * [Philosophy](#philosophy)
- [Maintainer](#maintainer)

## Fundamentals

### Introduction to machine learning

**Tier 1**

- [A short introduction to machine learning](https://www.alignmentforum.org/posts/qE73pqxAZmeACsAdF/a-short-introduction-to-machine-learning)
- [But what is a neural network?](https://www.youtube.com/watch?v=aircAruvnKk&t=0s)
- [Gradient descent, how neural networks learn](https://www.youtube.com/watch?v=IHZwWFHWa-w)

**Tier 2**

- ✨ [An intuitive understanding of backpropagation](https://cs231n.github.io/optimization-2/)
- [What is backpropagation really doing?](https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4)
- [An introduction to deep reinforcement learning](https://thomassimonini.medium.com/an-introduction-to-deep-reinforcement-learning-17a565999c0c)

**Tier 3**

- [The spelled-out intro to neural networks and backpropagation: building micrograd](https://www.youtube.com/watch?v=VMj-3S1tku0)
- [Backpropagation calculus](https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=5)
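If you want something runnable alongside the videos above, here is a minimal sketch of the gradient-descent loop they walk through (the toy data, learning rate, and step count are made up for illustration, not from any linked resource):

```python
# Minimal gradient-descent sketch: fit y = w * x + b to toy data by mean squared error.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # points that lie exactly on y = 2x + 1
w, b, lr = 0.0, 0.0, 0.1

for step in range(1000):
    # Gradients of mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # Step against the gradient to reduce the loss.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # converges to 2.0 and 1.0, the true slope and intercept
```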
### Transformers

**Tier 1**

- ✨ [But what is a GPT? Visual intro to transformers](https://www.youtube.com/watch?v=wjZofJX0v4M)
- ✨ [Attention in transformers, visually explained](https://www.youtube.com/watch?v=eMlx5fFNoYc)
- ✨ [Attention? Attention!](https://lilianweng.github.io/posts/2018-06-24-attention/)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [The Illustrated GPT-2 (Visualizing Transformer Language Models)](https://jalammar.github.io/illustrated-gpt2/)

**Tier 2**

- ✨ [Let's build the GPT Tokenizer](https://www.youtube.com/watch?v=zduSFxRajkE)
- ✨ [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473)
- [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html)
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

**Tier 3**

- [A Practical Survey on Faster and Lighter Transformers](https://arxiv.org/abs/2103.14636)
- [TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second](https://arxiv.org/abs/2207.01848)
- [Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets](https://arxiv.org/abs/2201.02177)
- [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html)
**Tier 4+**

- ✨ [Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks](https://arxiv.org/abs/2311.12997)
- [Memorizing Transformers](https://arxiv.org/abs/2203.08913)
- [Transformer Feed-Forward Layers Are Key-Value Memories](https://arxiv.org/abs/2012.14913)
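To make the attention readings above concrete, here is a minimal sketch of scaled dot-product attention in the spirit of "Attention Is All You Need" (shapes and inputs are illustrative, not taken from any paper's code):

```python
# Minimal sketch of scaled dot-product attention over a single head.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K: (seq, d_k); V: (seq, d_v). Returns (seq, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity of each query to each key
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```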
### Key foundation model architectures

**Tier 1**

- [Language Models are Unsupervised Multitask Learners](https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe) (GPT-2)
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) (GPT-3)

**Tier 2**

- ✨ [LLaMA: Open and Efficient Foundation Language Models](http://arxiv.org/abs/2302.13971) (LLaMA)
- ✨ [Efficiently Modeling Long Sequences with Structured State Spaces](https://arxiv.org/abs/2111.00396) ([video](https://www.youtube.com/watch?v=EvQ3ncuriCM)) (S4)
- [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) (T5)
- [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374) (OpenAI Codex)
- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155) (OpenAI Instruct)

**Tier 3**

- ✨ [Mistral 7B](http://arxiv.org/abs/2310.06825) (Mistral)
- ✨ [Mixtral of Experts](http://arxiv.org/abs/2401.04088) (Mixtral)
- ✨ [Gemini: A Family of Highly Capable Multimodal Models](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) (Gemini)
- ✨ [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752v1) (Mamba)
- [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416) (Flan)
**Tier 4+**

- ✨ [Consistency Models](http://arxiv.org/abs/2303.01469)
- ✨ [Model Card and Evaluations for Claude Models](https://www-cdn.anthropic.com/bd2a28d2535bfb0494cc8e2a3bf135d2e7523226/Model-Card-Claude-2.pdf) (Claude 2)
- ✨ [OLMo: Accelerating the Science of Language Models](http://arxiv.org/abs/2402.00838)
- ✨ [PaLM 2 Technical Report](https://arxiv.org/abs/2305.10403) (PaLM 2)
- ✨ [Textbooks Are All You Need II: phi-1.5 technical report](http://arxiv.org/abs/2309.05463) (phi-1.5)
- ✨ [Visual Instruction Tuning](http://arxiv.org/abs/2304.08485) (LLaVA)
- [A General Language Assistant as a Laboratory for Alignment](https://arxiv.org/abs/2112.00861)
- [Finetuned Language Models Are Zero-Shot Learners](https://arxiv.org/abs/2109.01652) (Google Instruct)
- [Galactica: A Large Language Model for Science](https://arxiv.org/abs/2211.09085)
- [LaMDA: Language Models for Dialog Applications](https://arxiv.org/abs/2201.08239) (Google Dialog)
- [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) (Meta GPT-3)
- [PaLM: Scaling Language Modeling with Pathways](https://arxiv.org/abs/2204.02311) (PaLM)
- [Program Synthesis with Large Language Models](https://arxiv.org/abs/2108.07732) (Google Codex)
- [Scaling Language Models: Methods, Analysis & Insights from Training Gopher](https://arxiv.org/abs/2112.11446) (Gopher)
- [Solving Quantitative Reasoning Problems with Language Models](https://arxiv.org/abs/2206.14858) (Minerva)
- [UL2: Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131) (UL2)
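The headline result of GPT-3 (Tier 1 above) is few-shot in-context learning: the task specification lives entirely in the prompt and no weights are updated. A schematic of what that looks like; `complete` is a hypothetical stand-in for whatever completion API or local model you use:

```python
# Schematic of GPT-3-style few-shot prompting. `complete` is not a real client;
# replace it with a call to your model of choice.
def complete(prompt: str) -> str:
    raise NotImplementedError("call your language model here")

examples = [
    ("The movie was a joy to watch.", "positive"),
    ("I want my two hours back.", "negative"),
]

def classify(review: str) -> str:
    shots = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in examples)
    prompt = f"{shots}\nReview: {review}\nSentiment:"
    return complete(prompt).strip()
```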
### Training and finetuning

**Tier 2**

- ✨ [Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer](https://arxiv.org/abs/2203.03466)
- [Learning to summarize from human feedback](https://arxiv.org/abs/2009.01325)
- [Training Verifiers to Solve Math Word Problems](https://arxiv.org/abs/2110.14168)

**Tier 3**

- ✨ [Pretraining Language Models with Human Preferences](http://arxiv.org/abs/2302.08582)
- ✨ [Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision](http://arxiv.org/abs/2312.09390)
- [Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning](https://arxiv.org/abs/2205.05638v1)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
- [Unsupervised Neural Machine Translation with Generative Language Models Only](https://arxiv.org/abs/2110.05448)
**Tier 4+**

- ✨ [Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models](http://arxiv.org/abs/2312.06585)
- ✨ [Improving Code Generation by Training with Natural Language Feedback](http://arxiv.org/abs/2303.16749)
- ✨ [Language Modeling Is Compression](https://arxiv.org/abs/2309.10668v1)
- ✨ [LIMA: Less Is More for Alignment](http://arxiv.org/abs/2305.11206)
- ✨ [Learning to Compress Prompts with Gist Tokens](http://arxiv.org/abs/2304.08467)
- ✨ [Lost in the Middle: How Language Models Use Long Contexts](http://arxiv.org/abs/2307.03172)
- ✨ [QLoRA: Efficient Finetuning of Quantized LLMs](http://arxiv.org/abs/2305.14314)
- ✨ [Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking](http://arxiv.org/abs/2403.09629)
- ✨ [Reinforced Self-Training (ReST) for Language Modeling](http://arxiv.org/abs/2308.08998)
- ✨ [Solving olympiad geometry without human demonstrations](https://www.nature.com/articles/s41586-023-06747-5)
- ✨ [Tell, don't show: Declarative facts influence how LLMs generalize](http://arxiv.org/abs/2312.07779)
- ✨ [Textbooks Are All You Need](http://arxiv.org/abs/2306.11644)
- ✨ [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](http://arxiv.org/abs/2305.07759)
- ✨ [Training Language Models with Language Feedback at Scale](http://arxiv.org/abs/2303.16755)
- ✨ [Turing Complete Transformers: Two Transformers Are More Powerful Than One](https://openreview.net/forum?id=MGWsPGogLH)
- [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626)
- [Data Distributional Properties Drive Emergent In-Context Learning in Transformers](https://arxiv.org/abs/2205.05055)
- [Diffusion-LM Improves Controllable Text Generation](https://arxiv.org/abs/2205.14217)
- [ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation](https://arxiv.org/abs/2107.02137)
- [Efficient Training of Language Models to Fill in the Middle](https://arxiv.org/abs/2207.14255)
- [ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning](https://arxiv.org/abs/2111.10952)
- [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/abs/2101.00190)
- [Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning](https://arxiv.org/abs/2106.02584)
- [True Few-Shot Learning with Prompts -- A Real-World Perspective](https://arxiv.org/abs/2111.13440)
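The LoRA paper (Tier 3 above) is easier to read if you keep its one core move in mind: freeze the pretrained weight and learn only a low-rank update, so the effective weight is W + (alpha/r)·BA. A minimal sketch with made-up dimensions:

```python
# Minimal sketch of a LoRA-style linear layer: W stays frozen; only the low-rank
# pair (A, B) would be trained. Dimensions, init scales, and alpha are illustrative.
import numpy as np

d_in, d_out, rank, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection, zero-init
                                                # so the adapter starts as a no-op

def lora_forward(x):
    """x: (batch, d_in) -> (batch, d_out); only A and B change during finetuning."""
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

x = rng.standard_normal((4, d_in))
print(lora_forward(x).shape)  # (4, 512)
```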
## Reasoning and runtime strategies

### In-context reasoning

**Tier 2**

- [Chain of Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
- [Large Language Models are Zero-Shot Reasoners](https://arxiv.org/abs/2205.11916) (Let's think step by step)
- [Self-Consistency Improves Chain of Thought Reasoning in Language Models](https://arxiv.org/abs/2203.11171)

**Tier 3**

- ✨ [Chain-of-Thought Reasoning Without Prompting](http://arxiv.org/abs/2402.10200)
- ✨ [Why think step-by-step? Reasoning emerges from the locality of experience](http://arxiv.org/abs/2304.03843)
**Tier 4+**

- ✨ [Baldur: Whole-Proof Generation and Repair with Large Language Models](https://arxiv.org/abs/2303.04910v1)
- ✨ [Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought](http://arxiv.org/abs/2403.05518)
- ✨ [Certified Reasoning with Language Models](http://arxiv.org/abs/2306.04031)
- ✨ [Hypothesis Search: Inductive Reasoning with Language Models](http://arxiv.org/abs/2309.05660)
- ✨ [LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations](http://arxiv.org/abs/2305.18354)
- ✨ [Large Language Models Cannot Self-Correct Reasoning Yet](https://arxiv.org/abs/2310.01798v1)
- ✨ [Stream of Search (SoS): Learning to Search in Language](http://arxiv.org/abs/2404.03683)
- ✨ [Training Chain-of-Thought via Latent-Variable Inference](http://arxiv.org/abs/2312.02179)
- [Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?](https://arxiv.org/abs/2202.12837)
- [Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right](https://arxiv.org/abs/2104.08315)
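Self-consistency (Tier 2 above) is the simplest runtime strategy to try in practice: sample several chain-of-thought completions and take a majority vote over the final answers. A schematic, with `generate` as a hypothetical stand-in for a sampled (temperature > 0) model call:

```python
# Schematic of self-consistency over chain-of-thought. `generate` is hypothetical
# and is assumed to return reasoning that ends with a line like "Answer: 42".
from collections import Counter

def generate(prompt: str) -> str:
    raise NotImplementedError("sample a chain-of-thought completion here")

def extract_answer(completion: str) -> str:
    # Assumes the completion ends with "Answer: <value>".
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistent_answer(question: str, num_samples: int = 10) -> str:
    prompt = f"{question}\nLet's think step by step."
    answers = [extract_answer(generate(prompt)) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]  # majority vote over final answers
```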
### Task decomposition

**Tier 1**

- [Supervise Process, not Outcomes](https://ought.org/updates/2022-04-06-process)
- [Supervising strong learners by amplifying weak experts](https://arxiv.org/abs/1810.08575)

**Tier 2**

- ✨ [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](http://arxiv.org/abs/2305.10601)
- [Factored cognition](https://ought.org/research/factored-cognition)
- [Iterated Distillation and Amplification](https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616)
- [Recursively Summarizing Books with Human Feedback](https://arxiv.org/abs/2109.10862)
- [Solving math word problems with process-based and outcome-based feedback](https://arxiv.org/abs/2211.14275)

**Tier 3**

- ✨ [Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers](https://arxiv.org/abs/2310.10627)
- [Faithful Reasoning Using Large Language Models](https://arxiv.org/abs/2208.14271)
- [Humans consulting HCH](https://ai-alignment.com/humans-consulting-hch-f893f6051455)
- [Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes](https://arxiv.org/abs/2301.01751)
- [Language Model Cascades](https://arxiv.org/abs/2207.10342)
**Tier 4+**

- ✨ [Decontextualization: Making Sentences Stand-Alone](https://doi.org/10.1162/tacl_a_00377)
- ✨ [Factored Cognition Primer](https://primer.ought.org)
- ✨ [Graph of Thoughts: Solving Elaborate Problems with Large Language Models](http://arxiv.org/abs/2308.09687)
- ✨ [Parsel: A Unified Natural Language Framework for Algorithmic Reasoning](http://arxiv.org/abs/2212.10561)
- [AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts](https://arxiv.org/abs/2110.01691)
- [Challenging BIG-Bench tasks and whether chain-of-thought can solve them](https://arxiv.org/abs/2210.09261)
- [Evaluating Arguments One Step at a Time](https://ought.org/updates/2020-01-11-arguments)
- [Least-to-Most Prompting Enables Complex Reasoning in Large Language Models](https://arxiv.org/abs/2205.10625)
- [Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations](https://arxiv.org/abs/2205.11822)
- [Measuring and narrowing the compositionality gap in language models](https://arxiv.org/abs/2210.03350)
- [PAL: Program-aided Language Models](https://arxiv.org/abs/2211.10435)
- [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/abs/2210.03629)
- [Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning](https://arxiv.org/abs/2205.09712)
- [Show Your Work: Scratchpads for Intermediate Computation with Language Models](https://arxiv.org/abs/2112.00114)
- [Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents](https://arxiv.org/abs/2110.10150)
- [Thinksum: probabilistic reasoning over sets using large language models](https://arxiv.org/abs/2210.01293)
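Most of the decomposition work above follows the same recipe: split a question into sub-questions, answer each one (possibly recursively), then compose the answers. A schematic in the spirit of factored cognition and least-to-most prompting; `llm` is a hypothetical stand-in for a single model call:

```python
# Schematic of recursive question decomposition. `llm`, the prompts, and the depth
# cap are illustrative; error handling is omitted.
def llm(prompt: str) -> str:
    raise NotImplementedError("call your language model here")

def decompose(question: str) -> list[str]:
    reply = llm(f"List the sub-questions needed to answer:\n{question}\nOne per line.")
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]

def answer(question: str, depth: int = 0) -> str:
    if depth >= 2:                                   # cap recursion for the sketch
        return llm(f"Answer concisely: {question}")
    findings = [f"{q} -> {answer(q, depth + 1)}" for q in decompose(question)]
    context = "\n".join(findings)
    return llm(f"Using these findings:\n{context}\nAnswer: {question}")
```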
### Debate

**Tier 2**

- [AI safety via debate](https://openai.com/blog/debate/)

**Tier 3**

- ✨ [Debate Helps Supervise Unreliable Experts](https://twitter.com/joshua_clymer/status/1724851456967417872)
- [Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions](https://arxiv.org/abs/2210.10860)
**Tier 4+**

- ✨ [Scalable AI Safety via Doubly-Efficient Debate](http://arxiv.org/abs/2311.14125)
- ✨ [Improving Factuality and Reasoning in Language Models through Multiagent Debate](http://arxiv.org/abs/2305.14325)
### Tool use and scaffolding

**Tier 2**

- ✨ [Measuring the impact of post-training enhancements](https://metr.github.io/autonomy-evals-guide/elicitation-gap/)
- [WebGPT: Browser-assisted question-answering with human feedback](https://arxiv.org/abs/2112.09332)

**Tier 3**

- ✨ [AI capabilities can be significantly improved without expensive retraining](http://arxiv.org/abs/2312.07413)
- ✨ [Automated Statistical Model Discovery with Language Models](http://arxiv.org/abs/2402.17879)
**Tier 4+**

- ✨ [DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines](http://arxiv.org/abs/2310.03714)
- ✨ [Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution](http://arxiv.org/abs/2309.16797)
- ✨ [Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation](https://arxiv.org/abs/2310.02304v1)
- ✨ [Voyager: An Open-Ended Embodied Agent with Large Language Models](http://arxiv.org/abs/2305.16291)
- [ReGAL: Refactoring Programs to Discover Generalizable Abstractions](http://arxiv.org/abs/2401.16467)
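WebGPT and most of the scaffolding work above wrap a model in a loop: the model either emits a tool call, which the scaffold executes and feeds back as an observation, or a final answer. A schematic with a hypothetical `llm` call; the single toy tool and the TOOL:/ANSWER: protocol are made up for illustration:

```python
# Schematic of a tool-use loop in the spirit of WebGPT / ReAct-style scaffolds.
# `llm` is a hypothetical model call; the toy calculator is for illustration only.
def llm(transcript: str) -> str:
    raise NotImplementedError("call your language model here")

TOOLS = {"calculate": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def run_agent(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript).strip()   # expected: "TOOL: calculate: 2+2" or "ANSWER: ..."
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        _, name, arg = step.split(":", 2)
        observation = TOOLS[name.strip()](arg.strip())
        transcript += f"{step}\nObservation: {observation}\n"
    return "No answer within step budget."
```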
### Honesty, factuality, and epistemics

**Tier 2**

- ✨ [Self-critiquing models for assisting human evaluators](https://arxiv.org/abs/2206.05802v2)

**Tier 3**

- ✨ [What Evidence Do Language Models Find Convincing?](http://arxiv.org/abs/2402.11782)
- ✨ [How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions](https://arxiv.org/abs/2309.15840)
**Tier 4+**

- ✨ [Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting](http://arxiv.org/abs/2305.04388)
- ✨ [Long-form factuality in large language models](http://arxiv.org/abs/2403.18802)
## Applications

### Science

**Tier 3**

- ✨ [Can large language models provide useful feedback on research papers? A large-scale empirical analysis](http://arxiv.org/abs/2310.01783)
- ✨ [Large Language Models Encode Clinical Knowledge](http://arxiv.org/abs/2212.13138)
- ✨ [The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4](http://arxiv.org/abs/2311.07361)
- [A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers](https://arxiv.org/abs/2105.03011)
**Tier 4+**

- ✨ [Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine](http://arxiv.org/abs/2311.16452)
- ✨ [Nougat: Neural Optical Understanding for Academic Documents](http://arxiv.org/abs/2308.13418)
- ✨ [Scim: Intelligent Skimming Support for Scientific Papers](http://arxiv.org/abs/2205.04561)
- ✨ [SynerGPT: In-Context Learning for Personalized Drug Synergy Prediction and Drug Design](https://www.biorxiv.org/content/10.1101/2023.07.06.547759v1)
- ✨ [Towards Accurate Differential Diagnosis with Large Language Models](http://arxiv.org/abs/2312.00164)
- ✨ [Towards a Benchmark for Scientific Understanding in Humans and Machines](http://arxiv.org/abs/2304.10327)
- [A Search Engine for Discovery of Scientific Challenges and Directions](https://arxiv.org/abs/2108.13751)
- [A full systematic review was completed in 2 weeks using automation tools: a case study](https://pubmed.ncbi.nlm.nih.gov/32004673/)
- [Fact or Fiction: Verifying Scientific Claims](https://arxiv.org/abs/2004.14974)
- [Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles](https://arxiv.org/abs/2010.14235)
- [PEER: A Collaborative Language Model](https://arxiv.org/abs/2208.11663)
- [PubMedQA: A Dataset for Biomedical Research Question Answering](https://arxiv.org/abs/1909.06146)
- [SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts](https://arxiv.org/abs/2104.08809)
- [SciTail: A Textual Entailment Dataset from Science Question Answering](http://ai2-website.s3.amazonaws.com/team/ashishs/scitail-aaai2018.pdf)
### Forecasting

**Tier 3**

- ✨ [AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy](https://arxiv.org/abs/2402.07862v1)
- ✨ [Approaching Human-Level Forecasting with Language Models](http://arxiv.org/abs/2402.18563)
- ✨ [Are Transformers Effective for Time Series Forecasting?](https://arxiv.org/abs/2205.13504)
- [Forecasting Future World Events with Neural Networks](https://arxiv.org/abs/2206.15474)

### Search and ranking

**Tier 2**

- [Learning Dense Representations of Phrases at Scale](https://arxiv.org/abs/2012.12624)
- [Text and Code Embeddings by Contrastive Pre-Training](https://arxiv.org/abs/2201.10005) (OpenAI embeddings)

**Tier 3**

- ✨ [Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting](http://arxiv.org/abs/2306.17563)
- [Not All Vector Databases Are Made Equal](https://dmitry-kan.medium.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696)
- [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909)
- [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)
- [Task-aware Retrieval with Instructions](https://arxiv.org/abs/2211.09260)
**Tier 4+**

- ✨ [RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!](http://arxiv.org/abs/2312.02724)
- ✨ [Some Common Mistakes In IR Evaluation, And How They Can Be Avoided](https://dl.acm.org/doi/10.1145/3190580.3190586)
- [Boosting Search Engines with Interactive Agents](https://arxiv.org/abs/2109.00527)
- [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/abs/2004.12832)
- [Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking](https://arxiv.org/abs/2212.01340)
- [UnifiedQA: Crossing Format Boundaries With a Single QA System](https://arxiv.org/abs/2005.00700)
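The dense retrieval papers above (phrase representations, OpenAI embeddings, REALM, RAG) share one core operation: embed the query and the documents into the same vector space and rank by similarity. A minimal sketch; `embed` is a hypothetical stand-in for any text-embedding model:

```python
# Minimal sketch of dense retrieval: rank documents by cosine similarity to a query.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError("return an array of shape (len(texts), dim)")

def rank(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    doc_vecs = embed(docs)
    q_vec = embed([query])[0]
    # Normalize so the dot product equals cosine similarity.
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q_vec = q_vec / np.linalg.norm(q_vec)
    scores = doc_vecs @ q_vec
    order = np.argsort(-scores)[:top_k]           # highest similarity first
    return [docs[i] for i in order]
```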
## ML in practice

### Production deployment

**Tier 1**

- [Machine Learning in Python: Main developments and technology trends in data science, machine learning, and AI](https://arxiv.org/abs/2002.04803v2)
- [Machine Learning: The High Interest Credit Card of Technical Debt](https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf)

**Tier 2**

- ✨ [Designing Data-Intensive Applications](https://dataintensive.net/)
- [A Recipe for Training Neural Networks](http://karpathy.github.io/2019/04/25/recipe/)

### Benchmarks

**Tier 2**

- ✨ [GPQA: A Graduate-Level Google-Proof Q&A Benchmark](http://arxiv.org/abs/2311.12022)
- ✨ [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770v1)
- [TruthfulQA: Measuring How Models Mimic Human Falsehoods](https://arxiv.org/abs/2109.07958)

**Tier 3**

- [FLEX: Unifying Evaluation for Few-Shot NLP](https://arxiv.org/abs/2107.07170)
- [Holistic Evaluation of Language Models](https://arxiv.org/abs/2211.09110) (HELM)
- [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300)
- [RAFT: A Real-World Few-Shot Text Classification Benchmark](https://arxiv.org/abs/2109.14076)
- [True Few-Shot Learning with Language Models](https://arxiv.org/abs/2105.11447)
**Tier 4+**

- ✨ [GAIA: a benchmark for General AI Assistants](http://arxiv.org/abs/2311.12983)
- [ConditionalQA: A Complex Reading Comprehension Dataset with Conditional Answers](https://arxiv.org/abs/2110.06884)
- [Measuring Mathematical Problem Solving With the MATH Dataset](https://arxiv.org/abs/2103.03874)
- [QuALITY: Question Answering with Long Input Texts, Yes!](https://arxiv.org/abs/2112.08608)
- [SCROLLS: Standardized CompaRison Over Long Language Sequences](https://arxiv.org/abs/2201.03533)
- [What Will it Take to Fix Benchmarking in Natural Language Understanding?](https://arxiv.org/abs/2104.02145)
### Datasets

**Tier 2**

- [Common Crawl](https://arxiv.org/abs/2105.02732)
- [The Pile: An 800GB Dataset of Diverse Text for Language Modeling](https://arxiv.org/abs/2101.00027)

**Tier 3**

- [Dialog Inpainting: Turning Documents into Dialogs](https://arxiv.org/abs/2205.09073)
- [MS MARCO: A Human Generated MAchine Reading COmprehension Dataset](https://arxiv.org/abs/1611.09268)
- [Microsoft Academic Graph](https://internal-journal.frontiersin.org/articles/10.3389/fdata.2019.00045/full)
- [TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts](https://arxiv.org/abs/2110.01159)

## Advanced topics

### World models and causality

**Tier 3**

- ✨ [Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task](http://arxiv.org/abs/2210.13382)
- ✨ [From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought](http://arxiv.org/abs/2306.12672)
- [Language Models Represent Space and Time](http://arxiv.org/abs/2310.02207)
**Tier 4+**

- ✨ [Amortizing intractable inference in large language models](http://arxiv.org/abs/2310.04363)
- ✨ [CLADDER: Assessing Causal Reasoning in Language Models](http://zhijing-jin.com/files/papers/CLadder_2023.pdf)
- ✨ [Causal Bayesian Optimization](https://proceedings.mlr.press/v108/aglietti20a.html)
- ✨ [Causal Reasoning and Large Language Models: Opening a New Frontier for Causality](http://arxiv.org/abs/2305.00050)
- ✨ [Generative Agents: Interactive Simulacra of Human Behavior](http://arxiv.org/abs/2304.03442)
- ✨ [Passive learning of active causal strategies in agents and language models](http://arxiv.org/abs/2305.16183)
### Planning
**Tier 4+**

- ✨ [Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping](http://arxiv.org/abs/2402.14083)
- ✨ [Cognitive Architectures for Language Agents](http://arxiv.org/abs/2309.02427)
### Uncertainty, calibration, and active learning

**Tier 2**

- ✨ [Experts Don't Cheat: Learning What You Don't Know By Predicting Pairs](http://arxiv.org/abs/2402.08733)
- [A Simple Baseline for Bayesian Uncertainty in Deep Learning](https://arxiv.org/abs/1902.02476)
- [Plex: Towards Reliability using Pretrained Large Model Extensions](https://arxiv.org/abs/2207.07411)

**Tier 3**

- ✨ [Active Preference Inference using Language Models and Probabilistic Reasoning](http://arxiv.org/abs/2312.12009)
- ✨ [Eliciting Human Preferences with Language Models](http://arxiv.org/abs/2310.11589)
- [Active Learning by Acquiring Contrastive Examples](https://arxiv.org/abs/2109.03764)
- [Describing Differences between Text Distributions with Natural Language](https://arxiv.org/abs/2201.12323)
- [Teaching Models to Express Their Uncertainty in Words](https://arxiv.org/abs/2205.14334)
**Tier 4+**

- ✨ [Doing Experiments and Revising Rules with Natural Language and Probabilistic Reasoning](http://arxiv.org/abs/2402.06025)
- ✨ [STaR-GATE: Teaching Language Models to Ask Clarifying Questions](http://arxiv.org/abs/2403.19154)
- [Active Testing: Sample-Efficient Model Evaluation](https://arxiv.org/abs/2103.05331)
- [Uncertainty Estimation for Language Reward Models](https://arxiv.org/abs/2203.07472)
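A recurring question in the calibration work above (e.g. "Teaching Models to Express Their Uncertainty in Words") is whether stated confidences match observed accuracy. Expected calibration error is the standard summary statistic; a minimal sketch, with an illustrative bin count and toy data:

```python
# Minimal sketch of expected calibration error (ECE): bin predictions by stated
# confidence and compare each bin's average confidence to its accuracy.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted probabilities in [0, 1]; correct: 0/1 outcomes."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight each bin by its share of samples
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))  # toy data
```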
### Interpretability and model editing

**Tier 2**

- [Discovering Latent Knowledge in Language Models Without Supervision](https://arxiv.org/abs/2212.03827v1)

**Tier 3**

- ✨ [Interpretability at Scale: Identifying Causal Mechanisms in Alpaca](http://arxiv.org/abs/2305.08809)
- ✨ [Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks](http://arxiv.org/abs/2311.12786)
- ✨ [Representation Engineering: A Top-Down Approach to AI Transparency](http://arxiv.org/abs/2310.01405)
- ✨ [Studying Large Language Model Generalization with Influence Functions](http://arxiv.org/abs/2308.03296)
- [Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small](https://arxiv.org/abs/2211.00593)
**Tier 4+**

- ✨ [Codebook Features: Sparse and Discrete Interpretability for Neural Networks](http://arxiv.org/abs/2310.17230)
- ✨ [Eliciting Latent Predictions from Transformers with the Tuned Lens](http://arxiv.org/abs/2303.08112)
- ✨ [How do Language Models Bind Entities in Context?](http://arxiv.org/abs/2310.17191)
- ✨ [Opening the AI black box: program synthesis via mechanistic interpretability](https://arxiv.org/abs/2402.05110v1)
- ✨ [Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models](http://arxiv.org/abs/2403.19647)
- ✨ [Uncovering mesa-optimization algorithms in Transformers](http://arxiv.org/abs/2309.05858)
- [Fast Model Editing at Scale](https://arxiv.org/abs/2110.11309)
- [Git Re-Basin: Merging Models modulo Permutation Symmetries](https://arxiv.org/abs/2209.04836)
- [Locating and Editing Factual Associations in GPT](https://arxiv.org/abs/2202.05262)
- [Mass-Editing Memory in a Transformer](https://arxiv.org/abs/2210.07229)
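The tuned-lens paper above builds on the simpler "logit lens" idea: decode an intermediate residual-stream state through the model's unembedding to see its current best guess at the next token. A schematic with random placeholder matrices rather than a real model (the tuned lens additionally learns a per-layer affine map, which is omitted here):

```python
# Schematic of the logit-lens idea: project a mid-layer residual-stream vector
# through layer norm and the unembedding matrix to get token logits.
# All matrices below are random placeholders, not weights from a real model.
import numpy as np

d_model, vocab = 64, 1000
rng = np.random.default_rng(0)
W_U = rng.standard_normal((d_model, vocab)) * 0.02   # stand-in unembedding matrix

def layer_norm(h, eps=1e-5):
    return (h - h.mean()) / np.sqrt(h.var() + eps)

def decode_intermediate(hidden_state):
    """hidden_state: (d_model,) residual stream after some layer -> token logits."""
    return layer_norm(hidden_state) @ W_U

h_layer_k = rng.standard_normal(d_model)   # stand-in for a mid-layer activation
logits = decode_intermediate(h_layer_k)
print(int(np.argmax(logits)))              # index of that layer's top token guess
```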
### Reinforcement learning

**Tier 2**

- ✨ [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](http://arxiv.org/abs/2305.18290)
- ✨ [Reflexion: Language Agents with Verbal Reinforcement Learning](http://arxiv.org/abs/2303.11366)
- [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm](https://arxiv.org/abs/1712.01815) (AlphaZero)
- [MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://arxiv.org/abs/1911.08265)

**Tier 3**

- ✨ [Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback](http://arxiv.org/abs/2307.15217)
- [AlphaStar: mastering the real-time strategy game StarCraft II](https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii)
- [Decision Transformer](https://arxiv.org/abs/2106.01345)
- [Mastering Atari Games with Limited Data](https://arxiv.org/abs/2111.00210) (EfficientZero)
- [Mastering Stratego, the classic game of imperfect information](https://www.science.org/doi/10.1126/science.add4679) (DeepNash)
**Tier 4+**

- ✨ [AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning](http://arxiv.org/abs/2308.03526)
- ✨ [Bayesian Reinforcement Learning with Limited Cognitive Load](http://arxiv.org/abs/2305.03263)
- ✨ [Contrastive Preference Learning: Learning from Human Feedback without RL](http://arxiv.org/abs/2310.13639)
- ✨ [Grandmaster-Level Chess Without Search](http://arxiv.org/abs/2402.04494)
- [A data-driven approach for learning to control computers](https://arxiv.org/abs/2202.08137)
- [Acquisition of Chess Knowledge in AlphaZero](https://arxiv.org/abs/2111.09259)
- [Player of Games](https://arxiv.org/abs/2112.03178)
- [Retrieval-Augmented Reinforcement Learning](https://arxiv.org/abs/2202.08417)
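The DPO paper (Tier 2 above) replaces an explicit reward model plus RL with a single classification-style loss on preference pairs. A minimal sketch of its per-example loss, written for summed token log-probabilities; beta is the paper's KL-strength hyperparameter and the numbers below are toy values:

```python
# Minimal sketch of the per-example Direct Preference Optimization loss:
# -log sigmoid(beta * ((logpi - logref) on chosen - (logpi - logref) on rejected)).
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy prefers each response than the frozen reference does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Logistic loss on the difference: push the chosen margin above the rejected one.
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1 / (1 + math.exp(-logits)))  # -log(sigmoid(logits))

print(round(dpo_loss(-12.0, -15.0, -13.0, -14.0), 4))  # toy log-probs
```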
## The big picture

### AI scaling

**Tier 1**

- [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
- [Takeoff speeds](https://sideways-view.com/2018/02/24/takeoff-speeds/)
- [The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)

**Tier 2**

- [AI and compute](https://openai.com/blog/ai-and-compute/)
- [Scaling Laws for Transfer](https://arxiv.org/abs/2102.01293)
- [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556) (Chinchilla)

**Tier 3**

- [Emergent Abilities of Large Language Models](https://arxiv.org/abs/2206.07682)
- [Transcending Scaling Laws with 0.1% Extra Compute](https://arxiv.org/abs/2210.11399) (U-PaLM)
**Tier 4+**

- ✨ [Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws](http://arxiv.org/abs/2404.05405)
- ✨ [Power Law Trends in Speedrunning and Machine Learning](http://arxiv.org/abs/2304.10004)
- ✨ [Scaling laws for single-agent reinforcement learning](http://arxiv.org/abs/2301.13442)
- [Beyond neural scaling laws: beating power law scaling via data pruning](https://arxiv.org/abs/2206.14486)
- [Scaling Scaling Laws with Board Games](https://arxiv.org/abs/2104.03113)
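The Chinchilla paper (Tier 2 above) fits loss as a function of parameter count N and training tokens D, L(N, D) ≈ E + A/N^alpha + B/D^beta, and then asks how to split a fixed compute budget C ≈ 6·N·D between the two. A sketch of that calculation with illustrative constants (not the paper's exact fits); with the real fitted values this is what yields the "train on far more tokens" conclusion:

```python
# Sketch of compute-optimal allocation under a Chinchilla-style parametric loss.
# The constants are illustrative placeholders, not the paper's fitted values.
E, A, B, alpha, beta = 1.7, 400.0, 410.0, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_split(C, num_points=200):
    candidates = []
    for i in range(num_points):
        N = 10 ** (7 + 4 * i / (num_points - 1))   # scan N from 1e7 to 1e11 params
        D = C / (6 * N)                            # tokens implied by C ≈ 6 * N * D
        candidates.append((loss(N, D), N, D))
    return min(candidates)                         # lowest loss within the budget

l, N, D = optimal_split(1e23)
print(f"N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")   # exact split depends on the fit
```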
### AI safety

**Tier 1**

- [Three impacts of machine intelligence](https://www.effectivealtruism.org/articles/three-impacts-of-machine-intelligence-paul-christiano/)
- [What failure looks like](https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like)
- [Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover](https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to)

**Tier 2**

- ✨ [An Overview of Catastrophic AI Risks](http://arxiv.org/abs/2306.12001)
- [Clarifying “What failure looks like” (part 1)](https://www.lesswrong.com/posts/v6Q7T335KCMxujhZu/clarifying-what-failure-looks-like-part-1)
- [Deep RL from human preferences](https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/)
- [The alignment problem from a deep learning perspective](https://arxiv.org/abs/2209.00626)

**Tier 3**

- ✨ [Scheming AIs: Will AIs fake alignment during training in order to get power?](http://arxiv.org/abs/2311.08379)
- [Measuring Progress on Scalable Oversight for Large Language Models](https://arxiv.org/abs/2211.03540)
- [Risks from Learned Optimization in Advanced Machine Learning Systems](https://arxiv.org/abs/1906.01820)
- [Scalable agent alignment via reward modelling](https://arxiv.org/abs/1811.07871)
**Tier 4+**

- ✨ [AI Deception: A Survey of Examples, Risks, and Potential Solutions](http://arxiv.org/abs/2308.14752)
- ✨ [Benchmarks for Detecting Measurement Tampering](http://arxiv.org/abs/2308.15605)
- ✨ [Chess as a Testing Grounds for the Oracle Approach to AI Safety](http://arxiv.org/abs/2010.02911)
- ✨ [Close the Gates to an Inhuman Future: How and why we should choose to not develop superhuman general-purpose artificial intelligence](https://papers.ssrn.com/abstract=4608505)
- ✨ [Model evaluation for extreme risks](http://arxiv.org/abs/2305.15324)
- ✨ [Responsible Reporting for Frontier AI Development](http://arxiv.org/abs/2404.02675)
- ✨ [Safety Cases: How to Justify the Safety of Advanced AI Systems](http://arxiv.org/abs/2403.10462)
- ✨ [Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training](http://arxiv.org/abs/2401.05566)
- ✨ [Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure](http://arxiv.org/abs/2311.07590)
- ✨ [Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game](http://arxiv.org/abs/2311.01011)
- ✨ [Tools for Verifying Neural Models' Training Data](http://arxiv.org/abs/2307.00682)
- ✨ [Towards a Cautious Scientist AI with Convergent Safety Bounds](https://yoshuabengio.org/2024/02/26/towards-a-cautious-scientist-ai-with-convergent-safety-bounds/)
- [Alignment of Language Agents](https://arxiv.org/abs/2103.14659)
- [Eliciting Latent Knowledge](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?usp=sharing)
- [Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://arxiv.org/abs/2209.07858)
- [Red Teaming Language Models with Language Models](https://storage.googleapis.com/deepmind-media/Red%20Teaming/Red%20Teaming.pdf)
- [Unsolved Problems in ML Safety](https://arxiv.org/abs/2109.13916)
### Economic and social impacts

**Tier 3**

- ✨ [Explosive growth from AI automation: A review of the arguments](http://arxiv.org/abs/2309.11690)
- ✨ [Language Models Can Reduce Asymmetry in Information Markets](http://arxiv.org/abs/2403.14443)
**Tier 4+**

- ✨ [Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero](http://arxiv.org/abs/2310.16410)
- ✨ [Foundation Models and Fair Use](https://arxiv.org/abs/2303.15715v1)
- ✨ [GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models](http://arxiv.org/abs/2303.10130)
- ✨ [Levels of AGI: Operationalizing Progress on the Path to AGI](http://arxiv.org/abs/2311.02462)
- ✨ [Opportunities and Risks of LLMs for Scalable Deliberation with Polis](http://arxiv.org/abs/2306.11932)
- [On the Opportunities and Risks of Foundation Models](https://arxiv.org/abs/2108.07258)
### Philosophy

**Tier 2**

- [Meaning without reference in large language models](https://arxiv.org/abs/2208.02957)
**Tier 4+**

- ✨ [Consciousness in Artificial Intelligence: Insights from the Science of Consciousness](http://arxiv.org/abs/2308.08708)
- ✨ [Philosophers Ought to Develop, Theorize About, and Use Philosophically Relevant AI](https://philarchive.org/archive/CLAPOT-16)
- ✨ [Towards Evaluating AI Systems for Moral Status Using Self-Reports](http://arxiv.org/abs/2311.08576)
## Maintainer

[andreas@elicit.com](mailto:andreas@elicit.com)