├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 Mor Geva

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Interpretability of Large Language Models (0368.4264)

This repository contains materials for the **Interpretability of Large Language Models** course (0368.4264) at Tel Aviv University. It is a graduate-level, active-learning course in which students learn about the interpretability of LLMs in the style of a collaborative research group. The course is structured around weekly paper readings, in-class discussions, role-playing, and hands-on exercises.[^1] Students are assumed to have prior background in natural language processing and machine learning.

In this repository, you will find:
* Schedule and reading lists
* Coding exercises and challenges

The course was developed by Dr. Mor Geva and Daniela Gottesman at Tel Aviv University. We also thank Amit Elhelo, Or Shafran, and Yoav Gur-Arieh for their contributions. We share these materials in the hope that they serve as a useful resource for anyone curious about or working on the interpretability of large language models.

[^1]: The course format draws inspiration from the [paper-reading seminar by Alec Jacobson and Colin Raffel](https://colinraffel.com/blog/role-playing-seminar.html) and [The Science of Large Language Models course by Robin Jia](https://robinjia.github.io/classes/fall2024-csci699.html).

## Schedule and materials

The schedule is subject to minor changes.

| Week | Date | Topic and papers | Practicum |
| ------- | ------ | ------- | ------- |
| 1 | Oct 26 | **Introduction and role assignments**<br>Background and NLP refresher | [Exercise](https://colab.research.google.com/drive/1hAsTsbnwb6jZkfRyRPokTOWnoK34UdB1?usp=sharing) [Solution](https://colab.research.google.com/drive/1kTcuXuRjkzpmuacW-MSNK4bMKJVef5Sy?usp=sharing) |
| 2 | Nov 2 | **Probing**<br>Main paper 1: [Language Models Represent Space and Time](https://arxiv.org/abs/2310.02207)<br>Main paper 2: [A Structural Probe for Finding Syntax in Word Representations](https://aclanthology.org/N19-1419/)<br>Bonus papers:<br>* [Not All Language Model Features Are One-Dimensionally Linear](https://arxiv.org/abs/2405.14860) | [Exercise](https://colab.research.google.com/drive/1NSWt5QutY-4hJtY2jG23LkUfY7mEFuub?usp=sharing) [Solution](https://colab.research.google.com/drive/13SEeFqWmjXZjSAWxk4YYd0P282G7-e2R?usp=sharing) |
| 3 | Nov 9 | **Inspecting representations**<br>Main paper 1: [Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models](https://arxiv.org/abs/2401.06102)<br>Main paper 2: [Language Model Inversion](https://arxiv.org/abs/2311.13647)<br>Bonus papers:<br>* [SelfIE: Self-Interpretation of Large Language Model Embeddings](https://arxiv.org/abs/2403.10949)<br>* [LatentQA: Teaching LLMs to Decode Activations Into Natural Language](https://arxiv.org/abs/2412.08686)<br>* [logit lens](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens) | [Exercise](https://colab.research.google.com/drive/15W76ULQvgayXcXzMJ9MBeijrkDAq1k80?usp=sharing) [Solution](https://colab.research.google.com/drive/1Be42DpybrBxO3I-vZVyVuwo_acgIJucQ?usp=sharing) |
| 4 | Nov 16 | **Attention heads**<br>Main paper 1: [Inferring Functionality of Attention Heads from their Parameters](https://arxiv.org/abs/2412.11965)<br>Main paper 2: [Talking Heads: Understanding Inter-layer Communication in Transformer Language Models](https://arxiv.org/abs/2406.09519)<br>Bonus papers:<br>* [In-context Learning and Induction Heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)<br>* [Attention Heads of Large Language Models: A Survey](https://arxiv.org/abs/2409.03752)<br>* [Analyzing Transformers in Embedding Space](https://arxiv.org/abs/2209.02535) | [Exercise](https://colab.research.google.com/drive/1KKTW04LBoNNnO-hF24GOOCgUs7QBiahW?usp=sharing) [Solution](https://colab.research.google.com/drive/1VpV2WM7TdCJPnhvBttaw0laBnjvpSfN3?usp=sharing) |
| 5 | Nov 23 | **MLP layers**<br>Main paper 1: [Transformer Feed-Forward Layers Are Key-Value Memories](https://arxiv.org/abs/2012.14913)<br>Main paper 2: [Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization](https://arxiv.org/abs/2506.10920)<br>Bonus papers:<br>* [Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space](https://arxiv.org/abs/2203.14680)<br>* [Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions](https://arxiv.org/abs/2402.15055) | [Exercise](https://colab.research.google.com/drive/1VEBEW50HqzxabyAjU5ixM_juMNKCWMSK?usp=sharing) [Solution](https://colab.research.google.com/drive/152ufURekSGrojYXI0LaBzpiC6ivY7Gwr?usp=sharing) |
| 6 | Nov 30 | **Neurons (are they the right unit?)**<br>Main paper 1: [Finding Neurons in a Haystack: Case Studies with Sparse Probing](https://arxiv.org/abs/2305.01610)<br>Main paper 2: [Confidence Regulation Neurons in Language Models](https://arxiv.org/abs/2406.16254)<br>Bonus papers:<br>* [An Interpretability Illusion for BERT](https://arxiv.org/abs/2104.07143)<br>* [The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability](https://arxiv.org/abs/2408.01416)<br>* [Neurons in Large Language Models: Dead, N-gram, Positional](https://arxiv.org/abs/2309.04827) | |
| 7 | Dec 7 | **Feature representations**<br>Main paper 1: [Sparse Autoencoders Find Highly Interpretable Features in Language Models](https://arxiv.org/abs/2309.08600)<br>Main paper 2: [The Geometry of Categorical and Hierarchical Concepts in Large Language Models](https://arxiv.org/abs/2406.01506)<br>Bonus papers:<br>* [The Linear Representation Hypothesis and the Geometry of Large Language Models](https://arxiv.org/abs/2311.03658)<br>* [Transcoders Find Interpretable LLM Feature Circuits](https://arxiv.org/abs/2406.11944) | |
| 8 | Dec 14 | **Describing features**<br>Main paper 1: [Automatically Interpreting Millions of Features in Large Language Models](https://arxiv.org/abs/2410.13928)<br>Main paper 2: [Enhancing Automated Interpretability with Output-Centric Feature Descriptions](https://arxiv.org/abs/2501.08319)<br>Bonus papers:<br>* [Language models can explain neurons in language models](https://openai.com/index/language-models-can-explain-neurons-in-language-models/)<br>* [Rigorously Assessing Natural Language Explanations of Neurons](https://arxiv.org/abs/2309.10312)<br>* [SAEs Are Good for Steering -- If You Select the Right Features](https://arxiv.org/abs/2505.20063) | |
| 9 | Dec 28 | **Circuit discovery**<br>Main paper 1: [Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small](https://arxiv.org/abs/2211.00593)<br>Main paper 2: [Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms](https://arxiv.org/abs/2403.17806)<br>Bonus papers:<br>* [Towards Automated Circuit Discovery for Mechanistic Interpretability](https://arxiv.org/abs/2304.14997)<br>* [Position-aware Automatic Circuit Discovery](https://arxiv.org/abs/2502.04577)<br>* [Circuit Component Reuse Across Tasks in Transformer Language Models](https://arxiv.org/abs/2310.08744) | |
| 10 | Jan 4 | **Binding mechanisms**<br>Main paper 1: [How do Language Models Bind Entities in Context?](https://arxiv.org/abs/2310.17191)<br>Main paper 2: [Language Models use Lookbacks to Track Beliefs](https://arxiv.org/abs/2505.14685)<br>Bonus papers:<br>* [Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context](https://arxiv.org/abs/2510.06182)<br>* [Monitoring Latent World States in Language Models with Propositional Probes](https://arxiv.org/abs/2406.19501)<br>* [Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking](https://arxiv.org/abs/2402.14811) | |
| 11 | Jan 11 | **Factual knowledge recall and editing**<br>Main paper 1: [Locating and Editing Factual Associations in GPT](https://arxiv.org/abs/2202.05262)<br>Main paper 2: [Dissecting Recall of Factual Associations in Auto-Regressive Language Models](https://arxiv.org/abs/2304.14767)<br>Bonus papers:<br>* [Linearity of Relation Decoding in Transformer Language Models](https://arxiv.org/abs/2308.09124)<br>* [Characterizing Mechanisms for Factual Recall in Language Models](https://arxiv.org/abs/2310.15910)<br>* [Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models](https://arxiv.org/abs/2301.04213) | |
| 12 | Jan 18 | **Training dynamics**<br>Main paper 1: [LLM Circuit Analyses Are Consistent Across Training and Scale](https://arxiv.org/abs/2407.10827)<br>Main paper 2: [What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation](https://arxiv.org/abs/2404.07129)<br>Bonus papers:<br>* [On Linear Representations and Pretraining Data Frequency in Language Models](https://arxiv.org/abs/2504.12459) | |
| 13 | Jan 25 | **Project presentations**<br>**Conclusion** | |

## Questions and feedback

If you have questions or suggestions, please open an issue in this repository.
--------------------------------------------------------------------------------