awesome-llm-interpretability

A curated list of Large Language Model (LLM) Interpretability resources.

1.6k GitHub Stars

Curated Resources


200 Concrete Open Problems in Mechanistic Interpretability
Series of posts discussing open research problems in the field of Mechanistic Interpretability (MI), which focuses on reverse-engineering neural networks.
A circuit for Python docstrings in a 4-layer attention-only transformer
Describes a circuit, found in a 4-layer attention-only transformer, that predicts argument names in Python docstrings.
A Mechanistic Interpretability Analysis of Grokking
Explores the phenomenon of 'grokking' in deep learning, where models suddenly shift from memorization to generalization during training.
A New Approach to Computation Reimagines Artificial Intelligence
Discusses hyperdimensional computing, a novel method involving hyperdimensional vectors (hypervectors) for more efficient, transparent, and robust artificial intelligence.
Attribution Patching: Activation Patching At Industrial Scale
Method that uses gradients for a linear approximation of activation patching in neural networks.
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
Introduces causal scrubbing, a method for evaluating the quality of mechanistic interpretations in neural networks.
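
As a rough illustration of the attribution patching entry above, the sketch below compares exact activation patching (one forward pass per patched unit) with its first-order, gradient-based estimate. Everything here is an assumption made for illustration: the toy model, its weights `W` and `v`, the inputs, and the choice to patch corrupted activations into the clean run. It is not code from the linked post.

```python
import math

# Hypothetical toy model: hidden = tanh(W @ x), output = sigmoid(v . hidden).
W = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]]
v = [1.0, -2.0, 0.5]

def hidden(x):
    """Hidden activations for input x."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def readout(h):
    """Scalar output (the metric) computed from hidden activations."""
    z = sum(vi * hi for vi, hi in zip(v, h))
    return 1.0 / (1.0 + math.exp(-z))

def activation_patching(h_clean, h_corrupt):
    """Exact effect of patching each hidden unit on its own: one forward pass per unit."""
    base = readout(h_clean)
    effects = []
    for i in range(len(h_clean)):
        patched = list(h_clean)
        patched[i] = h_corrupt[i]
        effects.append(readout(patched) - base)
    return effects

def attribution_patching(h_clean, h_corrupt):
    """Linear estimate: (h_corrupt - h_clean) * d(output)/d(hidden), gradient taken on the clean run."""
    y = readout(h_clean)
    grad = [y * (1.0 - y) * vi for vi in v]  # derivative of sigmoid(v . h) with respect to h
    return [(hc - h0) * g for hc, h0, g in zip(h_corrupt, h_clean, grad)]

x_clean, x_corrupt = [1.0, 0.5], [0.9, 0.6]
h_clean, h_corrupt = hidden(x_clean), hidden(x_corrupt)
exact = activation_patching(h_clean, h_corrupt)
approx = attribution_patching(h_clean, h_corrupt)

for i, (e, a) in enumerate(zip(exact, approx)):
    print(f"unit {i}: exact={e:+.5f}  linear estimate={a:+.5f}")
```

In this toy setting the two columns agree closely because the clean and corrupted activations are near each other. The estimate needs only one gradient computation for all units, which is the source of the method's appeal at scale, but it can drift from the exact value when the activation difference is large.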

Alignment Lab AI
Group of researchers focusing on AI alignment.
EleutherAI
Non-profit AI research lab that focuses on interpretability and alignment of large models.
Nous Research
Research group discussing various topics on interpretability.
PAIR
at Google work on

A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task
Identifies backward chaining circuits in a transformer trained to perform pathfinding in a tree.
An Overview of Early Vision in InceptionV1
A comprehensive exploration of the initial five layers of the InceptionV1 neural network, focusing on early vision.
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
Examines small neural networks to understand how they learn group compositions, using representation theory.
Augmenting Interpretable Models with LLMs during Training
Uses LLMs to build interpretable classifiers for text data.
Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias
Causal mediation analysis as a method for interpreting neural models in natural language processing.
ChainPoll: A High Efficacy Method for LLM Hallucination Detection
Introduces ChainPoll, a hallucination detection method that the authors report substantially outperforms existing alternatives, and RealHall, a curated suite of benchmark datasets for evaluating hallucination detection metrics proposed in recent literature.

A Survey of Large Language Models
This survey paper provides an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers.

Attention Analysis
Analyzes attention maps from the BERT transformer.
Automated Interpretability
Code for automatically generating, simulating, and scoring explanations of neuron behavior.
Awesome-Attention-Heads
A carefully compiled list that summarizes the diverse functions of the attention heads.
Comgra
Comgra helps you analyze and debug neural networks in PyTorch.
Copy Suppression
Designed to help explore different prompts for GPT-2 Small, as part of a research project regarding copy-suppression in LLMs.
ecco
A Python library for exploring and explaining Natural Language Processing models using interactive visualizations.

Showing a sample of 89 resources. View the full list on GitHub →