FounderFiles·N°002·Interpretability
Filed 04.25.26

Subject · Christopher Olah · researcher · interpreter · cartographer
Chris Olah.
Co-founder & Head of Interpretability · Anthropic
He never finished a degree. He helped invent how the field looks at itself. From feature visualization at Google Brain, to Distill, to the Circuits thread, to the sparse-autoencoder wave that cracked polysemanticity open, Olah has spent a decade teaching the field how to see inside neural networks — and building the institutions that keep the field honest about what it finds.
The Self-Taught Path
Chris Olah grew up in Toronto and was, by the conventions of a research career, ineligible. He did not finish university. He did not have a PhD. He had, for several years in the early 2010s, what most academic recruiters would call a gap in his record — a period during which he was reading, programming, and writing on a personal blog about neural networks at a moment when nobody quite knew what neural networks were going to be.
The blog turned out to be the credential. Olah was hired into Google Brain on the strength of his portfolio — visualizations of convolutional neural networks that made the internals of an opaque system look, for the first time, like something a human could read. The argument was implicit but unmistakable: the inside of a neural network is not a black box if you bother to design the tools to look at it.
Distill, and the Standards Shift
In 2017, Olah co-founded Distill with Shan Carter and a small group of collaborators. Distill was a journal in the same way that an instrument is a journal — it published machine learning research as visual, interactive articles instead of the dense PDFs the field had grown up on. Feature Visualization, The Building Blocks of Interpretability, the Circuits thread: each one rewrote the expectation of what a research paper could communicate.
When Distill went on indefinite hiatus in 2021, Olah wrote that the journal had done what it was built to do. The standards had shifted. Other venues were publishing interactive work. The medium had moved.
Networks Have Mechanisms
The 2020 essay Zoom In: An Introduction to Circuits is the load-bearing claim of Olah’s career. The argument is simple to state and consequential to verify: trained neural networks are not inscrutable. They contain interpretable mechanisms — circuits — that compute identifiable features and combine them in ways a researcher can read.
Curve detectors. Edge detectors. Pose-invariant neurons. The 2021 Multimodal Neurons paper, written with collaborators at OpenAI, found that CLIP’s neurons activate on the abstract concept of, say, “Spider-Man,” whether presented as the comic-book panel or the literal word printed on a sign. The same unit. The same direction. Different surface forms.
The implication: networks form abstractions the way humans do. The disagreement is only about how legibly.
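One of the tools behind catalogues like these is feature visualization: start from noise and push an input up the gradient of a single unit's activation until something recognizable appears. A minimal sketch of that loop, assuming torchvision's pretrained GoogLeNet as a stand-in for the networks studied in the Circuits work; the layer and channel index are arbitrary illustrations, not the ones from the papers.

```python
# Feature visualization by activation maximization (illustrative sketch).
import torch
from torchvision import models

model = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT).eval()

# Capture the output of one mid-level layer with a forward hook.
activations = {}
def hook(module, inp, out):
    activations["target"] = out
model.inception4a.register_forward_hook(hook)

img = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
opt = torch.optim.Adam([img], lr=0.05)
channel = 97                                             # arbitrary channel to visualize

for _ in range(256):
    opt.zero_grad()
    model(img)
    loss = -activations["target"][0, channel].mean()     # ascend this channel's activation
    loss.backward()
    opt.step()
# `img` now roughly shows what channel 97 of inception4a responds to.
```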
“The inside of a neural network is not a black box. It is a city. Somebody has to draw the maps.”
The Polysemanticity Problem
In 2021, Olah co-founded Anthropic with Dario and Daniela Amodei and a small group of researchers from OpenAI. He became Head of Interpretability. The early years of the lab were dominated by one stubborn fact that the Circuits research had surfaced: features in real networks are rarely clean. A single neuron will fire on, say, “Christmas” AND “curve detectors” AND “the names of seventeen unrelated cities” — not because the network is confused but because, in a model with finite neurons and effectively infinite concepts to encode, neurons get reused.
The problem had a name: polysemanticity. And it was a wall. If you could not point to a single neuron and say “this one represents X,” the whole interpretability program lived under a question mark.
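To make the squeeze concrete, here is a toy sketch, invented for illustration and not taken from any of the papers: three concepts assigned directions in a two-neuron space. With fewer neurons than concepts, the directions cannot all stay orthogonal, so at least one neuron has to serve more than one concept.

```python
# Toy illustration of superposition: more concepts than neurons,
# so directions overlap and neurons become polysemantic.
import numpy as np

concept_dirs = np.array([
    [1.0, 0.0],   # concept A lives mostly in neuron 0
    [0.0, 1.0],   # concept B lives mostly in neuron 1
    [0.7, 0.7],   # concept C has nowhere left to go, so it shares both
])

for name, direction in zip("ABC", concept_dirs):
    print(f"concept {name}: neuron 0 = {direction[0]:.1f}, neuron 1 = {direction[1]:.1f}")

# Reading columns instead of rows: neuron 0 fires for A and C,
# neuron 1 fires for B and C. Neither neuron means any one thing.
```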
Sparse Autoencoders, At Scale
The paper Towards Monosemanticity (2023) and its follow-up Scaling Monosemanticity (2024) described the workaround. Train a wide, overcomplete autoencoder with a sparsity penalty — a sparse autoencoder, or SAE — on the activations of a real language model. The SAE’s job is to recover an enormous dictionary of features, each one corresponding to a single, interpretable concept. The polysemantic neurons of the original network decompose into clean, monosemantic features in the SAE.
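A minimal sketch of that recipe, assuming PyTorch; the layer widths, the ReLU encoder, and the L1 coefficient below are illustrative stand-ins rather than the published training setup.

```python
# Sparse autoencoder over language-model activations (illustrative sketch).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # overcomplete: n_features >> d_model
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        f = torch.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(f)              # reconstruction of the original activations
        return recon, f

sae = SparseAutoencoder(d_model=4096, n_features=65536)   # hypothetical widths
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                           # sparsity pressure: hypothetical value

def train_step(acts: torch.Tensor) -> float:
    """One step on a batch of model activations, shape [batch, d_model]."""
    recon, f = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The two terms of the loss are the whole trick: reconstruction keeps the dictionary faithful to the model, and the L1 penalty forces each activation vector to be explained by only a few features at a time.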
By 2024, Anthropic had scaled the technique to Claude 3 Sonnet and pulled tens of millions of features out of a frontier model. Features for countries, for emotions, for inner conflict, for code patterns, for the Golden Gate Bridge. Each one a direction. Each one editable.
Olah’s decade-long bet — that interpretability was a science with handles, not a hope — had a result you could clamp on the model’s activations and watch its behavior shift. The polysemanticity wall was, at minimum, a doorway.
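Clamping, in the same toy terms as the autoencoder sketch above: encode the activations, pin one feature to a chosen value, decode, and splice the result back into the forward pass. The function and its parameters below are hypothetical; only the general recipe comes from the published work.

```python
# Pinning a single SAE feature before decoding (illustrative sketch).
import torch

@torch.no_grad()
def clamp_feature(acts: torch.Tensor, sae, feature_idx: int, clamp_value: float) -> torch.Tensor:
    """Return edited activations with one feature forced to a fixed value."""
    f = torch.relu(sae.encoder(acts))   # feature activations for this batch
    f[:, feature_idx] = clamp_value     # pin the chosen feature high (or to zero)
    return sae.decoder(f)               # edited activations to splice back into the model
```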
- ~2010 · Self-taught in Toronto. Leaves university; writes blog posts on neural networks instead.
- 2014 · Joins Google Brain. Begins the Feature Visualization line of work.
- 2017 · Co-founds Distill — the visual, interactive ML journal.
- 2018 · Joins OpenAI. Continues the Circuits agenda.
- 2021 · Co-founds Anthropic with Dario Amodei, Daniela Amodei, and others. Becomes Head of Interpretability.
- 2023 · Anthropic publishes Towards Monosemanticity. Sparse autoencoders crack polysemanticity open.
- 2024 · Scaling Monosemanticity: SAEs at Claude scale. Features become a research substrate.
- 2017 · Feature Visualization · Distill
- 2018 · The Building Blocks of Interpretability · Distill
- 2020 · Zoom In: An Introduction to Circuits · Distill
- 2021 · Multimodal Neurons in Artificial Neural Networks · Distill
- 2023 · Towards Monosemanticity: Decomposing Language Models with Dictionary Learning · Anthropic / Transformer Circuits
- 2024 · Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet · Anthropic / Transformer Circuits