├── LICENSE
└── README.md
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 Mor Geva
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Interpretability of Large Language Models (0368.4264)
2 |
3 | This repository contains materials for the **Interpretability of Large Language Models** course (0368.4264) at Tel Aviv University. It is a graduate-level, active-learning course in which students learn about the interpretability of LLMs in the style of a collaborative research group. The course is structured around weekly paper readings, in-class discussions, role-playing, and hands-on exercises.[^1] Students are assumed to have prior background in natural language processing and machine learning.
4 |
5 | In this repository, you will find:
6 | * Schedule and reading lists
7 | * Coding exercises and challenges
8 |
9 | The course was developed by Dr. Mor Geva and Daniela Gottesman at Tel Aviv University. We also thank Amit Elhelo, Or Shafran, and Yoav Gur-Arieh for their contributions. We share these materials in the hope that they serve as a useful resource for anyone curious about, or working on, the interpretability of large language models.
10 |
11 | [^1]: The course format draws inspiration from the [paper-reading seminar by Alec Jacobson and Colin Raffel](https://colinraffel.com/blog/role-playing-seminar.html) and [The Science of Large Language Models course by Robin Jia](https://robinjia.github.io/classes/fall2024-csci699.html).
12 |
13 | ## Schedule and materials
14 |
15 | The schedule is subject to minor changes.
16 |
17 | | Week | Date | Topic and papers | Practicum |
18 | | ------- | ------ | ------- | ------- |
19 | | 1 | Oct 26 | **Introduction and role assignments**<br>Background and NLP refresher | [Exercise](https://colab.research.google.com/drive/1hAsTsbnwb6jZkfRyRPokTOWnoK34UdB1?usp=sharing) [Solution](https://colab.research.google.com/drive/1kTcuXuRjkzpmuacW-MSNK4bMKJVef5Sy?usp=sharing) |
20 | | 2 | Nov 2 | **Probing**<br>Main paper 1: [Language Models Represent Space and Time](https://arxiv.org/abs/2310.02207)<br>Main paper 2: [A Structural Probe for Finding Syntax in Word Representations](https://aclanthology.org/N19-1419/)<br>Bonus papers:<br>* [Not All Language Model Features Are One-Dimensionally Linear](https://arxiv.org/abs/2405.14860) | [Exercise](https://colab.research.google.com/drive/1NSWt5QutY-4hJtY2jG23LkUfY7mEFuub?usp=sharing) [Solution](https://colab.research.google.com/drive/13SEeFqWmjXZjSAWxk4YYd0P282G7-e2R?usp=sharing) |
21 | | 3 | Nov 9 | **Inspecting representations**<br>Main paper 1: [Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models](https://arxiv.org/abs/2401.06102)<br>Main paper 2: [Language Model Inversion](https://arxiv.org/abs/2311.13647)<br>Bonus papers:<br>* [SelfIE: Self-Interpretation of Large Language Model Embeddings](https://arxiv.org/abs/2403.10949)<br>* [LatentQA: Teaching LLMs to Decode Activations Into Natural Language](https://arxiv.org/abs/2412.08686)<br>* [Interpreting GPT: The Logit Lens](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens) | [Exercise](https://colab.research.google.com/drive/15W76ULQvgayXcXzMJ9MBeijrkDAq1k80?usp=sharing) [Solution](https://colab.research.google.com/drive/1Be42DpybrBxO3I-vZVyVuwo_acgIJucQ?usp=sharing) |
22 | | 4 | Nov 16 | **Attention heads**<br>Main paper 1: [Inferring Functionality of Attention Heads from their Parameters](https://arxiv.org/abs/2412.11965)<br>Main paper 2: [Talking Heads: Understanding Inter-layer Communication in Transformer Language Models](https://arxiv.org/abs/2406.09519)<br>Bonus papers:<br>* [In-context Learning and Induction Heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)<br>* [Attention Heads of Large Language Models: A Survey](https://arxiv.org/abs/2409.03752)<br>* [Analyzing Transformers in Embedding Space](https://arxiv.org/abs/2209.02535) | [Exercise](https://colab.research.google.com/drive/1KKTW04LBoNNnO-hF24GOOCgUs7QBiahW?usp=sharing) [Solution](https://colab.research.google.com/drive/1VpV2WM7TdCJPnhvBttaw0laBnjvpSfN3?usp=sharing) |
23 | | 5 | Nov 23 | **MLP layers**<br>Main paper 1: [Transformer Feed-Forward Layers Are Key-Value Memories](https://arxiv.org/abs/2012.14913)<br>Main paper 2: [Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization](https://arxiv.org/abs/2506.10920)<br>Bonus papers:<br>* [Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space](https://arxiv.org/abs/2203.14680)<br>* [Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions](https://arxiv.org/abs/2402.15055) | [Exercise](https://colab.research.google.com/drive/1VEBEW50HqzxabyAjU5ixM_juMNKCWMSK?usp=sharing) [Solution](https://colab.research.google.com/drive/152ufURekSGrojYXI0LaBzpiC6ivY7Gwr?usp=sharing) |
24 | | 6 | Nov 30 | **Neurons (are they the right unit?)**<br>Main paper 1: [Finding Neurons in a Haystack: Case Studies with Sparse Probing](https://arxiv.org/abs/2305.01610)<br>Main paper 2: [Confidence Regulation Neurons in Language Models](https://arxiv.org/abs/2406.16254)<br>Bonus papers:<br>* [An Interpretability Illusion for BERT](https://arxiv.org/abs/2104.07143)<br>* [The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability](https://arxiv.org/abs/2408.01416)<br>* [Neurons in Large Language Models: Dead, N-gram, Positional](https://arxiv.org/abs/2309.04827) | |
25 | | 7 | Dec 7 | **Feature representations**<br>Main paper 1: [Sparse Autoencoders Find Highly Interpretable Features in Language Models](https://arxiv.org/abs/2309.08600)<br>Main paper 2: [The Geometry of Categorical and Hierarchical Concepts in Large Language Models](https://arxiv.org/abs/2406.01506)<br>Bonus papers:<br>* [The Linear Representation Hypothesis and the Geometry of Large Language Models](https://arxiv.org/abs/2311.03658)<br>* [Transcoders Find Interpretable LLM Feature Circuits](https://arxiv.org/abs/2406.11944) | |
26 | | 8 | Dec 14 | **Describing features**<br>Main paper 1: [Automatically Interpreting Millions of Features in Large Language Models](https://arxiv.org/abs/2410.13928)<br>Main paper 2: [Enhancing Automated Interpretability with Output-Centric Feature Descriptions](https://arxiv.org/abs/2501.08319)<br>Bonus papers:<br>* [Language models can explain neurons in language models](https://openai.com/index/language-models-can-explain-neurons-in-language-models/)<br>* [Rigorously Assessing Natural Language Explanations of Neurons](https://arxiv.org/abs/2309.10312)<br>* [SAEs Are Good for Steering -- If You Select the Right Features](https://arxiv.org/abs/2505.20063) | |
27 | | 9 | Dec 28 | **Circuit discovery**<br>Main paper 1: [Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small](https://arxiv.org/abs/2211.00593)<br>Main paper 2: [Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms](https://arxiv.org/abs/2403.17806)<br>Bonus papers:<br>* [Towards Automated Circuit Discovery for Mechanistic Interpretability](https://arxiv.org/abs/2304.14997)<br>* [Position-aware Automatic Circuit Discovery](https://arxiv.org/abs/2502.04577)<br>* [Circuit Component Reuse Across Tasks in Transformer Language Models](https://arxiv.org/abs/2310.08744) | |
28 | | 10 | Jan 4 | **Binding mechanisms**<br>Main paper 1: [How do Language Models Bind Entities in Context?](https://arxiv.org/abs/2310.17191)<br>Main paper 2: [Language Models use Lookbacks to Track Beliefs](https://arxiv.org/abs/2505.14685)<br>Bonus papers:<br>* [Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context](https://arxiv.org/abs/2510.06182)<br>* [Monitoring Latent World States in Language Models with Propositional Probes](https://arxiv.org/abs/2406.19501)<br>* [Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking](https://arxiv.org/abs/2402.14811) | |
29 | | 11 | Jan 11 | **Factual knowledge recall and editing**<br>Main paper 1: [Locating and Editing Factual Associations in GPT](https://arxiv.org/abs/2202.05262)<br>Main paper 2: [Dissecting Recall of Factual Associations in Auto-Regressive Language Models](https://arxiv.org/abs/2304.14767)<br>Bonus papers:<br>* [Linearity of Relation Decoding in Transformer Language Models](https://arxiv.org/abs/2308.09124)<br>* [Characterizing Mechanisms for Factual Recall in Language Models](https://arxiv.org/abs/2310.15910)<br>* [Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models](https://arxiv.org/abs/2301.04213) | |
30 | | 12 | Jan 18 | **Training dynamics**<br>Main paper 1: [LLM Circuit Analyses Are Consistent Across Training and Scale](https://arxiv.org/abs/2407.10827)<br>Main paper 2: [What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation](https://arxiv.org/abs/2404.07129)<br>Bonus paper:<br>* [On Linear Representations and Pretraining Data Frequency in Language Models](https://arxiv.org/abs/2504.12459) | |
31 | | 13 | Jan 25 | **Project presentations**<br>**Conclusion** | |
32 |
33 |
34 | ## Questions and feedback
35 |
36 | If you have questions or suggestions, please open an issue in this repository.
37 |
--------------------------------------------------------------------------------