├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 Mor Geva

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Interpretability of Large Language Models (0368.4264)

This repository contains materials for the **Interpretability of Large Language Models** course (0368.4264) at Tel Aviv University. It is a graduate-level, active-learning course in which students learn about the interpretability of LLMs in the style of a collaborative research group. The course is structured around weekly paper readings, in-class discussions, role-playing, and hands-on exercises.[^1] Students are assumed to have prior background in natural language processing and machine learning.

In this repository, you will find:
* Schedule and reading lists
* Coding exercises and challenges

The course was developed by Dr. Mor Geva and Daniela Gottesman at Tel Aviv University. We also thank Amit Elhelo, Or Shafran, and Yoav Gur-Arieh for their contributions. We share these materials in the hope that they serve as a useful resource for anyone curious about or working on the interpretability of large language models.

[^1]: The course format draws inspiration from the [paper-reading seminar by Alec Jacobson and Colin Raffel](https://colinraffel.com/blog/role-playing-seminar.html) and [The Science of Large Language Models course by Robin Jia](https://robinjia.github.io/classes/fall2024-csci699.html).

## Schedule and materials

The schedule is subject to minor changes.

| Week | Date | Topic and papers | Practicum |
| ------- | ------ | ------- | ------- |
| 1 | Oct 26 | **Introduction and role assignments**<br>Background and NLP refresher | [Exercise](https://colab.research.google.com/drive/1hAsTsbnwb6jZkfRyRPokTOWnoK34UdB1?usp=sharing) [Solution](https://colab.research.google.com/drive/1kTcuXuRjkzpmuacW-MSNK4bMKJVef5Sy?usp=sharing) |
| 2 | Nov 2 | **Probing**<br>Main paper 1: [Language Models Represent Space and Time](https://arxiv.org/abs/2310.02207)<br>Main paper 2: [A Structural Probe for Finding Syntax in Word Representations](https://aclanthology.org/N19-1419/)<br>Bonus papers:<br>* [Not All Language Model Features Are One-Dimensionally Linear](https://arxiv.org/abs/2405.14860) | [Exercise](https://colab.research.google.com/drive/1NSWt5QutY-4hJtY2jG23LkUfY7mEFuub?usp=sharing) [Solution](https://colab.research.google.com/drive/13SEeFqWmjXZjSAWxk4YYd0P282G7-e2R?usp=sharing) |
| 3 | Nov 9 | **Inspecting representations**<br>Main paper 1: [Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models](https://arxiv.org/abs/2401.06102)<br>Main paper 2: [Language Model Inversion](https://arxiv.org/abs/2311.13647)<br>Bonus papers:<br>* [SelfIE: Self-Interpretation of Large Language Model Embeddings](https://arxiv.org/abs/2403.10949)<br>* [LatentQA: Teaching LLMs to Decode Activations Into Natural Language](https://arxiv.org/abs/2412.08686)<br>* [logit lens](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens) | [Exercise](https://colab.research.google.com/drive/15W76ULQvgayXcXzMJ9MBeijrkDAq1k80?usp=sharing) [Solution](https://colab.research.google.com/drive/1Be42DpybrBxO3I-vZVyVuwo_acgIJucQ?usp=sharing) |
| 4 | Nov 16 | **Attention heads**<br>Main paper 1: [Inferring Functionality of Attention Heads from their Parameters](https://arxiv.org/abs/2412.11965)<br>Main paper 2: [Talking Heads: Understanding Inter-layer Communication in Transformer Language Models](https://arxiv.org/abs/2406.09519)<br>Bonus papers:<br>* [In-context Learning and Induction Heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)<br>* [Attention Heads of Large Language Models: A Survey](https://arxiv.org/abs/2409.03752)<br>* [Analyzing Transformers in Embedding Space](https://arxiv.org/abs/2209.02535) | [Exercise](https://colab.research.google.com/drive/1KKTW04LBoNNnO-hF24GOOCgUs7QBiahW?usp=sharing) [Solution](https://colab.research.google.com/drive/1VpV2WM7TdCJPnhvBttaw0laBnjvpSfN3?usp=sharing) |
| 5 | Nov 23 | **MLP layers**<br>Main paper 1: [Transformer Feed-Forward Layers Are Key-Value Memories](https://arxiv.org/abs/2012.14913)<br>Main paper 2: [Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization](https://arxiv.org/abs/2506.10920)<br>Bonus papers:<br>* [Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space](https://arxiv.org/abs/2203.14680)<br>* [Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions](https://arxiv.org/abs/2402.15055) | [Exercise](https://colab.research.google.com/drive/1VEBEW50HqzxabyAjU5ixM_juMNKCWMSK?usp=sharing) [Solution](https://colab.research.google.com/drive/152ufURekSGrojYXI0LaBzpiC6ivY7Gwr?usp=sharing) |
| 6 | Nov 30 | **Neurons (are they the right unit?)**<br>Main paper 1: [Finding Neurons in a Haystack: Case Studies with Sparse Probing](https://arxiv.org/abs/2305.01610)<br>Main paper 2: [Confidence Regulation Neurons in Language Models](https://arxiv.org/abs/2406.16254)<br>Bonus papers:<br>* [An Interpretability Illusion for BERT](https://arxiv.org/abs/2104.07143)<br>* [The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability](https://arxiv.org/abs/2408.01416)<br>* [Neurons in Large Language Models: Dead, N-gram, Positional](https://arxiv.org/abs/2309.04827) | |
| 7 | Dec 7 | **Feature representations**<br>Main paper 1: [Sparse Autoencoders Find Highly Interpretable Features in Language Models](https://arxiv.org/abs/2309.08600)<br>Main paper 2: [The Geometry of Categorical and Hierarchical Concepts in Large Language Models](https://arxiv.org/abs/2406.01506)<br>Bonus papers:<br>* [The Linear Representation Hypothesis and the Geometry of Large Language Models](https://arxiv.org/abs/2311.03658)<br>* [Transcoders Find Interpretable LLM Feature Circuits](https://arxiv.org/abs/2406.11944) | |
| 8 | Dec 14 | **Describing features**<br>Main paper 1: [Automatically Interpreting Millions of Features in Large Language Models](https://arxiv.org/abs/2410.13928)<br>Main paper 2: [Enhancing Automated Interpretability with Output-Centric Feature Descriptions](https://arxiv.org/abs/2501.08319)<br>Bonus papers:<br>* [Language models can explain neurons in language models](https://openai.com/index/language-models-can-explain-neurons-in-language-models/)<br>* [Rigorously Assessing Natural Language Explanations of Neurons](https://arxiv.org/abs/2309.10312)<br>* [SAEs Are Good for Steering -- If You Select the Right Features](https://arxiv.org/abs/2505.20063) | |
| 9 | Dec 28 | **Circuit discovery**<br>Main paper 1: [Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small](https://arxiv.org/abs/2211.00593)<br>Main paper 2: [Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms](https://arxiv.org/abs/2403.17806)<br>Bonus papers:<br>* [Towards Automated Circuit Discovery for Mechanistic Interpretability](https://arxiv.org/abs/2304.14997)<br>* [Position-aware Automatic Circuit Discovery](https://arxiv.org/abs/2502.04577)<br>* [Circuit Component Reuse Across Tasks in Transformer Language Models](https://arxiv.org/abs/2310.08744) | |
| 10 | Jan 4 | **Binding mechanisms**<br>Main paper 1: [How do Language Models Bind Entities in Context?](https://arxiv.org/abs/2310.17191)<br>Main paper 2: [Language Models use Lookbacks to Track Beliefs](https://arxiv.org/abs/2505.14685)<br>Bonus papers:<br>* [Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context](https://arxiv.org/abs/2510.06182)<br>* [Monitoring Latent World States in Language Models with Propositional Probes](https://arxiv.org/abs/2406.19501)<br>* [Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking](https://arxiv.org/abs/2402.14811) | |
| 11 | Jan 11 | **Factual knowledge recall and editing**<br>Main paper 1: [Locating and Editing Factual Associations in GPT](https://arxiv.org/abs/2202.05262)<br>Main paper 2: [Dissecting Recall of Factual Associations in Auto-Regressive Language Models](https://arxiv.org/abs/2304.14767)<br>Bonus papers:<br>* [Linearity of Relation Decoding in Transformer Language Models](https://arxiv.org/abs/2308.09124)<br>* [Characterizing Mechanisms for Factual Recall in Language Models](https://arxiv.org/abs/2310.15910)<br>* [Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models](https://arxiv.org/abs/2301.04213) | |
| 12 | Jan 18 | **Training dynamics**<br>Main paper 1: [LLM Circuit Analyses Are Consistent Across Training and Scale](https://arxiv.org/abs/2407.10827)<br>Main paper 2: [What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation](https://arxiv.org/abs/2404.07129)<br>Bonus papers:<br>* [On Linear Representations and Pretraining Data Frequency in Language Models](https://arxiv.org/abs/2504.12459) | |
| 13 | Jan 25 | **Project presentations**<br>**Conclusion** | |

## Questions and feedback

If you have questions or suggestions, please open an issue in this repository.
--------------------------------------------------------------------------------