FounderFiles·N°002·Interpretability
Filed 04.25.26

Subject · Christopher Olah · researcher · interpreter · cartographer
Chris Olah.
Co-founder & Head of Interpretability · Anthropic
He never finished a degree. He helped invent how the field looks at itself. From feature visualization at Google Brain, to Distill, to the Circuits thread, to the sparse-autoencoder wave that cracked polysemanticity open, Olah has spent a decade teaching the field how to see inside neural networks — and building the institutions that keep the field honest about what it finds.
The Self-Taught Path
Chris Olah grew up in Toronto and was, by the conventions of a research career, ineligible. He did not finish university. He did not have a PhD. He had, for several years in the early 2010s, what most academic recruiters would call a gap in his record — a period during which he was reading, programming, and writing on a personal blog about neural networks at a moment when nobody quite knew what neural networks were going to be.
The blog turned out to be the credential. Olah was hired into Google Brain on the strength of his portfolio — visualizations of convolutional neural networks that made the internals of an opaque system look, for the first time, like something a human could read. The argument was implicit but unmistakable: the inside of a neural network is not a black box if you bother to design the tools to look at it.
Distill, and the Standards Shift
In 2017, Olah co-founded Distill with Shan Carter and a small group of collaborators. Distill was a journal in the same way that an instrument is a journal — it published machine learning research as visual, interactive articles instead of the dense PDFs the field had grown up on. Feature Visualization, The Building Blocks of Interpretability, the Circuits thread: each one rewrote the expectation of what a research paper could communicate.
When Distill went on indefinite hiatus in 2021, Olah wrote that the journal had done what it was built to do. The standards had shifted. Other venues were publishing interactive work. The medium had moved.
Networks Have Mechanisms
The 2020 essay Zoom In: An Introduction to Circuits is the load-bearing claim of Olah’s career. The argument is simple to state and consequential to verify: trained neural networks are not inscrutable. They contain interpretable mechanisms — circuits — that compute identifiable features and combine them in ways a researcher can read.
Curve detectors. Edge detectors. Pose-invariant neurons. The 2021 Multimodal Neurons paper, written with collaborators at OpenAI, found that CLIP’s neurons activate on the abstract concept of, say, “Spider-Man,” whether presented as the comic-book panel or the literal word printed on a sign. The same unit. The same direction. Different surface forms.
The implication: networks form abstractions the way humans do. The disagreement is only about how legibly.
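One of the tools behind catalogues like these is feature visualization: start from noise and push an input up the gradient of a single unit's activation until something recognizable appears. A minimal sketch of that loop, assuming torchvision's pretrained GoogLeNet as a stand-in for the networks studied in the Circuits work; the layer and channel index are arbitrary illustrations, not the ones from the papers.

```python
# Feature visualization by activation maximization (illustrative sketch).
import torch
from torchvision import models

model = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT).eval()

# Capture the output of one mid-level layer with a forward hook.
activations = {}
def hook(module, inp, out):
    activations["target"] = out
model.inception4a.register_forward_hook(hook)

img = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
opt = torch.optim.Adam([img], lr=0.05)
channel = 97                                             # arbitrary channel to visualize

for _ in range(256):
    opt.zero_grad()
    model(img)
    loss = -activations["target"][0, channel].mean()     # ascend this channel's activation
    loss.backward()
    opt.step()
# `img` now roughly shows what channel 97 of inception4a responds to.
```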
“The inside of a neural network is not a black box. It is a city. Somebody has to draw the maps.”
The Polysemanticity Problem
In 2021, Olah co-founded Anthropic with Dario and Daniela Amodei and a small group of researchers from OpenAI. He became Head of Interpretability. The early years of the lab were dominated by one stubborn fact that the Circuits research had surfaced: features in real networks are rarely clean. A single neuron will fire on, say, “Christmas” AND “curve detectors” AND “the names of seventeen unrelated cities” — not because the network is confused but because, in a model with finite neurons and effectively infinite concepts to encode, neurons get reused.
The problem had a name: polysemanticity. And it was a wall. If you could not point to a single neuron and say “this one represents X,” the whole interpretability program lived under a question mark.
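To make the squeeze concrete, here is a toy sketch, invented for illustration and not taken from any of the papers: three concepts assigned directions in a two-neuron space. With fewer neurons than concepts, the directions cannot all stay orthogonal, so at least one neuron has to serve more than one concept.

```python
# Toy illustration of superposition: more concepts than neurons,
# so directions overlap and neurons become polysemantic.
import numpy as np

concept_dirs = np.array([
    [1.0, 0.0],   # concept A lives mostly in neuron 0
    [0.0, 1.0],   # concept B lives mostly in neuron 1
    [0.7, 0.7],   # concept C has nowhere left to go, so it shares both
])

for name, direction in zip("ABC", concept_dirs):
    print(f"concept {name}: neuron 0 = {direction[0]:.1f}, neuron 1 = {direction[1]:.1f}")

# Reading columns instead of rows: neuron 0 fires for A and C,
# neuron 1 fires for B and C. Neither neuron means any one thing.
```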
Sparse Autoencoders, At Scale
The paper Towards Monosemanticity (2023) and its follow-up Scaling Monosemanticity (2024) described the workaround. Train a wide, overcomplete autoencoder with a sparsity penalty — a sparse autoencoder, or SAE — on the activations of a real language model. The SAE’s job is to recover an enormous dictionary of features, each one corresponding to a single, interpretable concept. The polysemantic neurons of the original network decompose into clean, monosemantic features in the SAE.
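A minimal sketch of that recipe, assuming PyTorch; the layer widths, the ReLU encoder, and the L1 coefficient below are illustrative stand-ins rather than the published training setup.

```python
# Sparse autoencoder over language-model activations (illustrative sketch).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # overcomplete: n_features >> d_model
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        f = torch.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(f)              # reconstruction of the original activations
        return recon, f

sae = SparseAutoencoder(d_model=4096, n_features=65536)   # hypothetical widths
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                           # sparsity pressure: hypothetical value

def train_step(acts: torch.Tensor) -> float:
    """One step on a batch of model activations, shape [batch, d_model]."""
    recon, f = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The two terms of the loss are the whole trick: reconstruction keeps the dictionary faithful to the model, and the L1 penalty forces each activation vector to be explained by only a few features at a time.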
By 2024, Anthropic had scaled the technique to Claude 3 Sonnet and pulled tens of millions of features out of a frontier model. Features for countries, for emotions, for inner conflict, for code patterns, for the Golden Gate Bridge. Each one a direction. Each one editable.
Olah’s decade-long bet — that interpretability was a science with handles, not a hope — had a result you could clamp on the model’s activations and watch its behavior shift. The polysemanticity wall was, at minimum, a doorway.
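Clamping, in the same toy terms as the autoencoder sketch above: encode the activations, pin one feature to a chosen value, decode, and splice the result back into the forward pass. The function and its parameters below are hypothetical; only the general recipe comes from the published work.

```python
# Pinning a single SAE feature before decoding (illustrative sketch).
import torch

@torch.no_grad()
def clamp_feature(acts: torch.Tensor, sae, feature_idx: int, clamp_value: float) -> torch.Tensor:
    """Return edited activations with one feature forced to a fixed value."""
    f = torch.relu(sae.encoder(acts))   # feature activations for this batch
    f[:, feature_idx] = clamp_value     # pin the chosen feature high (or to zero)
    return sae.decoder(f)               # edited activations to splice back into the model
```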
- ~2010 · Self-taught in Toronto. Leaves university; writes blog posts on neural networks instead.
- 2014 · Joins Google Brain. Begins the Feature Visualization line of work.
- 2017 · Co-founds Distill — the visual, interactive ML journal.
- 2018 · Joins OpenAI. Continues the Circuits agenda.
- 2021 · Co-founds Anthropic with Dario Amodei, Daniela Amodei, and others. Becomes Head of Interpretability.
- 2023 · Anthropic publishes Towards Monosemanticity. Sparse autoencoders crack polysemanticity open.
- 2024 · Scaling Monosemanticity: SAEs at Claude scale. Features become a research substrate.
- 2017 · Feature Visualization · Distill
- 2018 · The Building Blocks of Interpretability · Distill
- 2020 · Zoom In: An Introduction to Circuits · Distill
- 2021 · Multimodal Neurons in Artificial Neural Networks · Distill
- 2023 · Towards Monosemanticity: Decomposing Language Models with Dictionary Learning · Anthropic / Transformer Circuits
- 2024 · Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet · Anthropic / Transformer Circuits