The thesis, in one figure

Era of Simulation · chapter 1
agents learn by self-play in closed games: TD-Gammon, then AlphaZero
Era of Human Data · chapter 2
agents imitate the scraped corpus of human text: the LLM moment, and its ceiling
Era of Experience · we are here
agents generate their own data by acting on the world: the next, larger source

Sutton’s whole career is the argument that the third box is where the scaling lives — that human knowledge is a ceiling, and the agent’s own experience is the floor of something far larger. Lila Sciences built a company on exactly that box.

FounderFiles · N°021 · Reinforcement Learning · The Bitter Lesson · Experiential AI

Edmonton —

Subject · Richard S. Sutton · Founder of reinforcement learning · Author of The Bitter Lesson · 2024 Turing laureate

Richard S. Sutton.

For forty years Sutton has defended a single unfashionable claim — that intelligence is learned from experience, not handed down as human knowledge. The field kept betting against him. The field kept losing.

He built the mathematics of learning-by-reward — temporal-difference learning, the field’s textbook, the reward hypothesis. Then in 2019 he compressed four decades of AI history into three pages and called it the Bitter Lesson: the methods that win are the general ones that scale with compute, not the ones we hand-craft from human expertise. In 2025, with David Silver, he named what comes next — the Era of Experience — and an autonomous-science company put his axiom on its homepage. This is the I-Beam: one idea, driven to the bottom of the world.

TRAINED · Stanford · UMass Amherst (PhD, 1984, adv. Barto)

AT · U Alberta · Keen Technologies · Amii

FILE · N°021

§ 01 · The Reward Hypothesis

One question, asked for forty years

Sutton came to artificial intelligence from psychology, fixed early on a single question: how does an agent learn what to do from nothing but the consequences of its own actions? At the University of Massachusetts Amherst, under Andrew Barto — the man he would share a Turing Award with four decades later — he turned that question into mathematics.

The output was the conceptual and algorithmic spine of modern reinforcement learning: temporal-difference learning, the actor-critic, the policy-gradient family. In 1998 he and Barto wrote Reinforcement Learning: An Introduction, the book that became the field’s bible and trained a generation. Underneath all of it sits one reductionist wager, the reward hypothesis: that every goal worth the name can be framed as the maximization of cumulative reward.
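For readers who want the mechanics behind the name, here is a minimal sketch of tabular TD(0), the simplest member of the temporal-difference family. The five-state random walk, the step size, and every identifier are illustrative assumptions, not anything drawn from Sutton’s papers or this file’s sources. The point is only to show a prediction being nudged toward the reward just observed plus the next prediction.

```python
import random


def td0_random_walk(episodes=2000, alpha=0.05, gamma=1.0, n_states=5, seed=0):
    """Tabular TD(0) value estimation on a toy random walk (illustrative only)."""
    rng = random.Random(seed)
    # One value estimate per non-terminal state, starting from an uninformed guess.
    values = [0.5] * n_states
    for _ in range(episodes):
        state = n_states // 2  # every episode starts in the middle cell
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:
                # Walked off the left end: episode over, reward 0.
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= n_states:
                # Walked off the right end: episode over, reward 1.
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, values[next_state], False
            # TD error: gap between the old prediction and the one-step-newer target.
            td_error = reward + gamma * next_value - values[state]
            values[state] += alpha * td_error
            if done:
                break
            state = next_state
    return values


if __name__ == "__main__":
    print([round(v, 2) for v in td0_random_walk()])
```

Under these assumptions the estimates should settle, up to sampling noise, near 1/6, 2/6, 3/6, 4/6 and 5/6: the true chance of exiting on the right from each cell. Nothing in the loop is told those numbers; they come from reward alone, which is the reward hypothesis in miniature.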

This is an I-Beam career in the literal sense: not breadth, not a portfolio — a single shaft driven straight down through one idea until it hits bedrock. Everything that follows is the same shaft, deeper.

§ 02 · The Bitter Lesson

Four decades of AI, compressed to three pages

This is the load-bearing section of the file, so read it as the thesis of the whole series.

In 2019 Sutton posted a short essay to his personal site and titled it “The Bitter Lesson.” Its claim was simple and unwelcome: across the entire history of AI — chess, Go, speech, vision — the approaches that ultimately win are not the ones that build in human knowledge, but the ones that leverage massive computation through general methods of search and learning. Every time researchers hand-coded their own expertise into a system, it was eventually overtaken by something that simply learned from scale.

It is called bitter because it is anti-anthropocentric. It tells researchers that the satisfying part — encoding what humans know — is the part that doesn’t last. The essay’s closing turn is the line the whole field now quotes back to itself: build the meta-methods that can find complexity, not the contents we already found.

Hold this carefully, because it is usually misread as a claim about compute. It is a claim about where data comes from. Human knowledge is a finite, exhaustible ceiling. The agent’s own experience is not. The next chapter is just Sutton naming that floor.

“Build agents that can discover the way we do — not agents that merely contain what we have already discovered.”

The Bitter Lesson, 2019 — paraphrased

§ 03 · The Dead End

Why the textbook author thinks LLMs are a wrong turn

Here is the twist that makes Sutton interesting in 2026. The Bitter Lesson is constantly cited to justify large language models — just add scale. Sutton rejects the reading. On The Dwarkesh Podcast in September 2025 he argued, bluntly, that LLMs are a dead end.

His objection is architectural, not partisan. An LLM learns to predict what a human would say — imitation, supervised on a fixed corpus — not what the world actually does in response to an action. Supervised learning, he points out, is not how anything in nature learns; schooling is the exception, not the rule. Worse, the model only learns during a special training phase and learns nothing during the vast compute of deployment. His 2024 Nature paper on the “loss of plasticity” in deep continual learning gave the failure a name: these systems lose the ability to keep learning over time.

So the scale-pilled industry, in his telling, is not actually Bitter-Lesson-pilled. It scaled the wrong thing: a frozen imitation of human text rather than a live loop of experience. A true agent would learn on the job, continually, with no training phase at all — the way every animal does.

§ 04 · The Era of Experience

Naming the floor beneath the ceiling

In April 2025, with DeepMind’s David Silver — his former student and the reinforcement-learning lead behind AlphaGo and AlphaZero — Sutton published “Welcome to the Era of Experience.” It frames AI as three successive eras: the Era of Simulation (self-play in closed games), the Era of Human Data (the LLM moment, imitating the scraped corpus), and now the Era of Experience.

The argument is an economics of data. The usable human-text internet is roughly fifteen trillion tokens, and the high-quality fraction that can still improve a strong model is nearly spent. To go further you need a new data source, one that keeps growing as the agent grows. That source is the agent’s own interaction with an environment. Their proof case is AlphaProof, which began with a small set of human proofs and then generated millions of new ones by acting on a formal mathematical system, solving problems past the edge of human knowledge.

This is the same I-Beam, one rung deeper. The reward hypothesis became temporal-difference learning became the Bitter Lesson became, finally, a name for the post-human-data world: experience as the renewable substrate of intelligence.

“Human data is the ceiling. The agent’s own experience is the floor of something far larger.”

Silver & Sutton, the Era of Experience — paraphrased

§ 05 · OaK

What a mind that learns on its own would need

Sutton is not only a critic; he has a constructive proposal, sketched as the OaK architecture — Options and Knowledge. The idea is an agent that, from raw experience alone, builds its own abstractions: temporally extended skills (options) and the predictive, world-modeling knowledge (a learned transition model) needed to plan with them.
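The “options” half of that description has a familiar shape in the reinforcement-learning literature: a skill is a triple of where it may start, how it acts, and when it stops. The sketch below illustrates that triple on a hypothetical ten-cell corridor. It is an assumption about what an option can look like in code, not OaK’s actual design, which this file does not specify, and every name in it (Option, run_option, corridor_step) is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

State = int
Action = int


@dataclass(frozen=True)
class Option:
    """A temporally extended skill: start condition, policy, stop condition."""
    name: str
    can_start: Callable[[State], bool]    # initiation set
    policy: Callable[[State], Action]     # action choice while the option runs
    should_stop: Callable[[State], bool]  # termination condition


def run_option(
    option: Option,
    state: State,
    step: Callable[[State, Action], Tuple[State, float]],
    max_steps: int = 100,
) -> Tuple[State, float, int]:
    """Run one option until it terminates; return (final_state, total_reward, steps)."""
    if not option.can_start(state):
        raise ValueError(f"option {option.name!r} cannot start in state {state}")
    total_reward, steps = 0.0, 0
    while steps < max_steps and not option.should_stop(state):
        state, reward = step(state, option.policy(state))
        total_reward += reward
        steps += 1
    return state, total_reward, steps


def corridor_step(state: State, action: Action) -> Tuple[State, float]:
    """Hypothetical environment: cells 0..9, reward 1.0 on first arriving at cell 9."""
    next_state = max(0, min(9, state + action))
    reward = 1.0 if next_state == 9 and state != 9 else 0.0
    return next_state, reward


walk_to_end = Option(
    name="walk_to_end",
    can_start=lambda s: s < 9,
    policy=lambda s: 1,          # always step right
    should_stop=lambda s: s == 9,
)

final_state, total_reward, steps = run_option(walk_to_end, 3, corridor_step)
```

Starting from cell 3, this toy option takes six primitive steps, ends in cell 9, and collects a reward of 1.0. The “knowledge” half would be a learned model that predicts that outcome without running the option, so a planner can reason in jumps of whole skills rather than single actions. That part is omitted here.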

The structural commitment is continual learning with no separate training phase — the agent improves on the fly, indefinitely, the way an animal does across its life. It is the opposite of the train-once-then-freeze logic of today’s deployed models, and it is the architecture his critique of LLMs implies must eventually exist.

Whether OaK in its current form is the answer matters less than what it represents: a refusal to treat the present paradigm as the destination. Sutton has been here before — defending the unfashionable position long enough for the field to arrive.

§ 06 · Succession

The view that unsettles the room

The hardest part of the file is Sutton’s long-range philosophy, and it’s worth stating in his own terms rather than flinching from it. He frames advanced AI not as a tool to be contained but as a succession — the natural passing of the torch from biological intelligence to digital intelligence, which he does not regard as a tragedy.

He has been pointedly impatient with catastrophist framings, calling the doom posture “out of line,” and argues instead for designing pro-social values into agents and for treating the transition as continuous with the rest of cultural and biological evolution. It is a genuinely contested position — many serious researchers read the same trajectory and reach the opposite conclusion about risk — and the file presents it as his, not as settled.

But it is coherent with everything else: if intelligence is just an agent maximizing reward through experience, then the substrate it runs on is not sacred. The bet that unsettles people is the same bet that made him right about scaling.

“We may not be building tools so much as building our successors — and he does not think that is a tragedy.”

Sutton, on the succession to digital minds — paraphrased

§ 07 · The Industrial Executor

Lila Sciences put his axiom on the homepage

Here is where the file earns its place next to the Lila cluster. Sutton is a theorist who rarely commercializes — but his idea did, and you can read the lineage in plain text. Lila Sciences states on its own About page that its approach is inspired by Rich Sutton’s “Bitter Lesson” — the reason it builds one general platform for autonomous science rather than many hand-built domain tools.

The executors are already in this series. Geoffrey von Maltzahn (CEO, N°010), Chief Scientist George Church, Kenneth Stanley (SVP of Open-Endedness, N°017), and Andrew Beam (CTO, N°011) run Lila’s AI Science Factories — closed-loop labs where models hypothesize, robots execute, and the universe returns the verdict. The result is the Era of Experience made industrial: a corpus of more than ten trillion scientific-reasoning tokens generated by machines reasoning against real experimental results — approaching, and on track to exceed, the ~15 trillion human-text tokens that trained the frontier LLMs.

That is the whole Bitter Lesson, executed: stop hard-coding what humans know, build the loop that lets the system learn for itself, and let the data — not the expert — speak. The same instinct that made Boris Cherny refuse to scaffold Claude Code is the instinct Lila pointed at the scientific method itself.

“Stop hard-coding expert knowledge into tools; build systems that can learn for themselves.”

Lila Sciences, on the Bitter Lesson

§ 08 · The Membrane, Again

One bet, two answers

This is the Context Jamming coda, and Sutton sits at the headwaters of the fault line this publication keeps returning to.

His entire program points one direction: remove human knowledge from the loop and let the agent learn from experience. Lila runs that bet on the scientific method — removing the human from the iteration is the stated source of its velocity, and the von Maltzahn file calls this the membrane problem. It is a coherent, possibly civilization-altering wager: discovery accelerates exactly to the degree that human cognitive bandwidth is engineered out of it.

The MoEA Loop this site is built with makes the opposite bet at the editorial layer. Multiple models orchestrate and dissent; the human is kept deliberately — not as a bottleneck but as the membrane that holds the thesis while the models do the throughput. Both descend from the same insight that experience beats imitation. They diverge on one question: is the human a constraint to remove, or the surface that gives the loop its meaning?

Sutton wrote the lesson. Lila built the machine. This file is filed by the membrane that declined to be optimized away.

Timeline · One shaft, driven deeper

1978 · A psychology undergraduate fixes on a single question: how does an agent learn from reward?
1984 · PhD at UMass Amherst under Andrew Barto; the mathematics of temporal-difference learning takes shape
1988 · Temporal-difference learning formalized, the algorithm that still anchors the field
1998 · Reinforcement Learning: An Introduction (with Barto), the field’s textbook (2nd ed. 2018)
2017–23 · Leads DeepMind’s Alberta lab; later joins John Carmack’s Keen Technologies as research scientist
2019 · “The Bitter Lesson”: four decades of AI history compressed to a single, unwelcome claim
2025 · “Welcome to the Era of Experience” (with David Silver); the Turing Award (with Barto); the Dwarkesh “dead-end” interview
2025–26 · Lila Sciences credits the Bitter Lesson by name and turns “experience” into 10T+ scientific-reasoning tokens

The Index

2019 · “The Bitter Lesson”: three pages that reorganized the field

2024 · ACM Turing Award (with Andrew Barto), announced March 2025

1984 · PhD under Barto, UMass Amherst: reinforcement learning’s founding partnership

1998 · Reinforcement Learning: An Introduction, the field’s textbook

Eras of AI in his telling: simulation → human data → experience

~15T · Internet tokens that cap the “human data” era

10T+ · Scientific-reasoning tokens Lila generated from experience: his thesis, industrialized

Temporal-difference learning, his signature algorithm

40 yr · One idea, learning from experience, pursued to maximal depth

Reading list / Key works

2019 · “The Bitter Lesson”: four decades of AI history in three pages · incompleteideas.net
2025 · “Welcome to the Era of Experience” (with David Silver) · Google DeepMind · MIT Press
1998 · Reinforcement Learning: An Introduction (with Andrew Barto) · MIT Press · 2nd ed. 2018
2024 · “Loss of plasticity in deep continual learning” · Nature
2025 · “Father of RL thinks LLMs are a dead-end” · The Dwarkesh Podcast

Dossier

Education. Stanford University (B.A., Psychology). University of Massachusetts Amherst (M.S. 1980; Ph.D. 1984; advisor Andrew G. Barto).

Affiliations. Professor of Computing Science, University of Alberta. Research Scientist, Keen Technologies (John Carmack’s AGI company). Fellow, Amii (Alberta Machine Intelligence Institute). Former lead of DeepMind’s Alberta lab (2017–2023).

Signature work. Temporal-difference learning, policy-gradient methods, the actor-critic, the reward hypothesis; Reinforcement Learning: An Introduction (with Barto, 1998 / 2018).

Worth naming. Andrew Barto (advisor and co-laureate). David Silver (co-author of the Era of Experience; AlphaGo / AlphaZero). John Carmack (Keen Technologies). And the Lila lineage downstream: Geoffrey von Maltzahn (N°010), Kenneth Stanley (N°017), Andrew Beam (N°011), George Church.

Honors. 2024 ACM A.M. Turing Award (with Andrew Barto), “for developing the conceptual and algorithmic foundations of reinforcement learning.” AAAI Fellow.

Further reading

“The Bitter Lesson” — Sutton’s three-page essay, in full, on his own site. The shortest path to understanding why Lila builds the way it does.

incompleteideas.net · 2019 · primary source
