awesome-interpretability-in-large-language-models
github.com/ruizheliuoa/awesome-interpretability-in-large-language-models
This repository collects all relevant resources about interpretability in LLMs.
GitHub Stars: 401
Curated Resources: 239
Categories: 8
Last Refreshed: 5 hours ago
- Survey Papers
- Position Papers
- Interpretable Analysis of LLMs
- SAE, Dictionary Learning and Superposition
- Interpretability in Vision LLMs
- Benchmarking Interpretability
- Enhancing Interpretability
- Others
Use this list with your AI agent
Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:
"Show me resources from awesome-interpretability-in-large-language-models"
Installation instructions →
What's inside
Resources
- 200 Concrete Open Problems in Mechanistic Interpretability
- 3Blue1Brown: Attention in transformers, visually explained | Chapter 6, Deep Learning
- 3Blue1Brown: But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning
- 3Blue1Brown: How might LLMs store facts | Chapter 7, Deep Learning
- A Barebones Guide to Mechanistic Interpretability Prerequisites
- AI Alignment Forum
Benchmarking Interpretability
- A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
ACL
- Benchmarking Mental State Representations in Language Models
MechInterp@ICML
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks
arXiv
- RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
arXiv
Interpretable Analysis of LLMs
- Activation Addition: Steering Language Models Without Optimization
arXiv
- A Language Model's Guide Through Latent Space
arXiv
- A Mathematical Framework for Transformer Circuits
Anthropic
- A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task
arXiv
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
ICML
- Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions
arXiv
SAE, Dictionary Learning and Superposition
- Activation Steering with SAEs
LessWrong
- Automatically Identifying Local and Global Circuits with Linear Computation Graphs
arXiv
- Codebook Features: Sparse and Discrete Interpretability for Neural Networks
arXiv
- Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
arXiv
- Distributed Representations: Composition & Superposition
Anthropic
- Do sparse autoencoders find "true features"?
LessWrong
Interpretability in Vision LLMs
- Analyzing Vision Transformers for Image Classification in Class Embedding Space
NeurIPS
- Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
MechInterp@ICML
- Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models
MechInterp@ICML
- Dissecting Query-Key Interaction in Vision Transformers
MechInterp@ICML
- Don’t trust your eyes: on the (un)reliability of feature visualizations
ICML
- Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
arXiv
Others
Survey Papers
- A Primer on the Inner Workings of Transformer-based Language Models
arXiv
- Attention Heads of Large Language Models: A Survey
arXiv
- From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP
arXiv
- From Understanding to Utilization: A Survey on Explainability for Large Language Models
arXiv
- Internal Consistency and Self-Feedback in Large Language Models: A Survey
arXiv
- Knowledge Mechanisms in Large Language Models: A Survey and Perspective
EMNLP
Showing a sample of 239 resources. View the full list on GitHub →