The thesis, in one figure
  1. Era of Simulation · chapter 1

    agents learn by self-play in closed games — TD-Gammon, then AlphaZero

  2. Era of Human Data · chapter 2

    agents imitate the scraped corpus of human text — the LLM moment, and its ceiling

  3. Era of Experience · we are here

    agents generate their own data by acting on the world — the next, larger source

Sutton’s whole career is the argument that the third box is where the scaling lives — that human knowledge is a ceiling, and the agent’s own experience is the floor of something far larger. Lila Sciences built a company on exactly that box.

FounderFiles · N°021 · Reinforcement Learning · The Bitter Lesson · Experiential AI

Edmonton —

Richard S. Sutton — Turing Award–winning founder of reinforcement learning; author of The Bitter Lesson
Fig. · The patient theorist · U Alberta · Keen · Amii

Subject · Richard S. Sutton · Founder of reinforcement learning · Author of The Bitter Lesson · 2024 Turing laureate

Richard S. Sutton.

For forty years Sutton has defended a single unfashionable claim — that intelligence is learned from experience, not handed down as human knowledge. The field kept betting against him. The field kept losing.

He built the mathematics of learning-by-reward — temporal-difference learning, the field’s textbook, the reward hypothesis. Then in 2019 he compressed four decades of AI history into three pages and called it the Bitter Lesson: the methods that win are the general ones that scale with compute, not the ones we hand-craft from human expertise. In 2025, with David Silver, he named what comes next — the Era of Experience — and an autonomous-science company put his axiom on its homepage. This is the I-Beam: one idea, driven to the bottom of the world.

TRAINED: Stanford · UMass Amherst (PhD, 1984, adv. Barto)
AT: U Alberta · Keen Technologies · Amii
FILE: N°021
§ 01 · The Reward Hypothesis

One question, asked for forty years

Sutton came to artificial intelligence from psychology, fixed early on a single question: how does an agent learn what to do from nothing but the consequences of its own actions? At the University of Massachusetts Amherst, under Andrew Barto — the man he would share a Turing Award with four decades later — he turned that question into mathematics.

The output was the conceptual and algorithmic spine of modern reinforcement learning: temporal-difference learning, the actor-critic, the policy-gradient family. In 1998 he and Barto wrote Reinforcement Learning: An Introduction, the book that became the field’s bible and trained a generation. Underneath all of it sits one reductionist wager, the reward hypothesis: that every goal worth the name can be framed as the maximization of cumulative reward.
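To make "maximization of cumulative reward" and temporal-difference learning concrete, here is a minimal sketch of tabular TD(0) on a toy five-state random walk. The environment, the function names, and the step-size and episode counts are illustrative assumptions chosen for this file, not anything taken from Sutton's own code.

```python
import random


def discounted_return(rewards, gamma):
    """Cumulative reward: r1 + gamma*r2 + gamma^2*r3 + ... (computed backwards)."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def td0_random_walk(episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Tabular TD(0) on a toy walk: states 1..5, terminal states 0 and 6.

    Hypothetical setup: every episode starts in state 3, moves left or right
    with equal probability, and pays reward 1 only on reaching state 6.
    """
    rng = random.Random(seed)
    values = [0.0] * 7  # values[0] and values[6] are terminal and stay at 0
    for _ in range(episodes):
        state = 3
        while state not in (0, 6):
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == 6 else 0.0
            # TD error: one-step target (reward + discounted next estimate)
            # minus the current estimate for this state.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state
    return values


if __name__ == "__main__":
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # about 0.81
    print([round(v, 2) for v in td0_random_walk()[1:6]])
```

The point of the sketch is the update line: the agent never sees a labeled answer, only the consequence of its own step, and it nudges its estimate toward that one-step target. In this toy walk the true values are 1/6 through 5/6, so the printed estimates should land near those fractions.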

This is an I-Beam career in the literal sense: not breadth, not a portfolio — a single shaft driven straight down through one idea until it hits bedrock. Everything that follows is the same shaft, deeper.

§ 02 · The Bitter Lesson

Four decades of AI, compressed to three pages

This is the load-bearing section of the file, so read it as the thesis of the whole series.

In 2019 Sutton posted a short essay to his personal site and titled it “The Bitter Lesson.” Its claim was simple and unwelcome: across the entire history of AI — chess, Go, speech, vision — the approaches that ultimately win are not the ones that build in human knowledge, but the ones that leverage massive computation through general methods of search and learning. Every time researchers hand-coded their own expertise into a system, it was eventually overtaken by something that simply learned from scale.

It is called bitter because it is anti-anthropocentric. It tells researchers that the satisfying part — encoding what humans know — is the part that doesn’t last. The essay’s closing turn is the line the whole field now quotes back to itself: build the meta-methods that can find complexity, not the contents we already found.

Hold this carefully, because it is usually misread as a claim about compute. It is a claim about where data comes from. Human knowledge is a finite, exhaustible ceiling. The agent’s own experience is not. The next chapter is just Sutton naming that floor.

Build agents that can discover the way we do — not agents that merely contain what we have already discovered.
The Bitter Lesson, 2019 — paraphrased
§ 03 · The Dead End

Why the textbook author thinks LLMs are a wrong turn

Here is the twist that makes Sutton interesting in 2026. The Bitter Lesson is constantly cited to justify large language models — just add scale. Sutton rejects the reading. On The Dwarkesh Podcast in September 2025 he argued, bluntly, that LLMs are a dead end.

His objection is architectural, not partisan. An LLM learns to predict what a human would say — imitation, supervised on a fixed corpus — not what the world actually does in response to an action. Supervised learning, he points out, is not how anything in nature learns; schooling is the exception, not the rule. Worse, the model only learns during a special training phase and learns nothing during the vast compute of deployment. His 2024 Nature paper on the “loss of plasticity” in deep continual learning gave the failure a name: these systems lose the ability to keep learning over time.

So the scale-pilled industry, in his telling, is not actually Bitter-Lesson-pilled. It scaled the wrong thing: a frozen imitation of human text rather than a live loop of experience. A true agent would learn on the job, continually, with no training phase at all — the way every animal does.

§ 04 · The Era of Experience

Naming the floor beneath the ceiling

In April 2025, with DeepMind’s David Silver — his former student and the reinforcement-learning lead behind AlphaGo and AlphaZero — Sutton published “Welcome to the Era of Experience.” It frames AI as three successive eras: the Era of Simulation (self-play in closed games), the Era of Human Data (the LLM moment, imitating the scraped corpus), and now the Era of Experience.

The argument is an economics of data. The usable human-text internet is roughly fifteen trillion tokens, and the high-quality fraction that can still improve a strong model is nearly spent. To go further you need a new data source, one that keeps growing as the agent grows. That source is the agent’s own interaction with an environment. Their proof case is AlphaProof, which began with a small set of human proofs and then generated millions of new ones by acting on a formal mathematical system, solving problems past the edge of human knowledge.

This is the same I-Beam, one rung deeper. The reward hypothesis became temporal-difference learning became the Bitter Lesson became, finally, a name for the post-human-data world: experience as the renewable substrate of intelligence.

Human data is the ceiling. The agent’s own experience is the floor of something far larger.
Silver & Sutton, the Era of Experience — paraphrased
§ 05 · OaK

What a mind that learns on its own would need

Sutton is not only a critic; he has a constructive proposal, sketched as the OaK architecture (Options and Knowledge). The idea is an agent that, from raw experience alone, builds its own abstractions: temporally extended skills (options) and the predictive, world-modeling knowledge (a learned transition model) needed to plan with them.
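For readers who want the "option" half of that made concrete, here is a rough sketch using the standard options formalism from the reinforcement-learning literature: an option is a place it can start, a policy it follows, and a condition under which it stops. This is not the OaK implementation. The corridor environment and every name below are hypothetical, chosen only to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

State = int
Action = int


@dataclass
class Option:
    """A temporally extended skill: where it may start, how it acts, when it ends."""
    name: str
    can_start: Callable[[State], bool]
    policy: Callable[[State], Action]
    should_stop: Callable[[State], bool]


def corridor_step(state: State, action: Action) -> Tuple[State, float]:
    """Hypothetical toy world: cells 0..10, each move costs 1 unit of reward."""
    next_state = max(0, min(10, state + action))
    return next_state, -1.0


def run_option(option: Option, state: State, step, max_steps: int = 100):
    """Follow the option's policy until it terminates.

    Returns (final_state, total_reward, steps_taken). A tuple like this is the
    kind of outcome a learned model of the option would try to predict.
    """
    if not option.can_start(state):
        raise ValueError(f"option {option.name!r} cannot start in state {state}")
    total, steps = 0.0, 0
    while steps < max_steps:
        state, reward = step(state, option.policy(state))
        total += reward
        steps += 1
        if option.should_stop(state):
            break
    return state, total, steps


# One skill: from anywhere left of cell 5, walk right until reaching cell 5.
walk_to_door = Option(
    name="walk_to_door",
    can_start=lambda s: s < 5,
    policy=lambda s: 1,
    should_stop=lambda s: s == 5,
)

if __name__ == "__main__":
    print(run_option(walk_to_door, 2, corridor_step))  # (5, -3.0, 3)
```

The "knowledge" half would then be a learned prediction of that returned tuple, so the agent can plan over whole skills instead of single steps. In this sketch the option is hand-written; the proposal as described above is that the agent builds such skills itself from raw experience.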

The structural commitment is continual learning with no separate training phase — the agent improves on the fly, indefinitely, the way an animal does across its life. It is the opposite of the train-once-then-freeze logic of today’s deployed models, and it is the architecture his critique of LLMs implies must eventually exist.

Whether OaK in its current form is the answer matters less than what it represents: a refusal to treat the present paradigm as the destination. Sutton has been here before — defending the unfashionable position long enough for the field to arrive.

§ 06 · Succession

The view that unsettles the room

The hardest part of the file is Sutton’s long-range philosophy, and it’s worth stating in his own terms rather than flinching from it. He frames advanced AI not as a tool to be contained but as a succession — the natural passing of the torch from biological intelligence to digital intelligence, which he does not regard as a tragedy.

He has been pointedly impatient with catastrophist framings, calling the doom posture “out of line,” and argues instead for designing pro-social values into agents and for treating the transition as continuous with the rest of cultural and biological evolution. It is a genuinely contested position — many serious researchers read the same trajectory and reach the opposite conclusion about risk — and the file presents it as his, not as settled.

But it is coherent with everything else: if intelligence is just an agent maximizing reward through experience, then the substrate it runs on is not sacred. The bet that unsettles people is the same bet that made him right about scaling.

We may not be building tools so much as building our successors — and he does not think that is a tragedy.
Sutton, on the succession to digital minds — paraphrased
§ 07 · The Industrial Executor

Lila Sciences put his axiom on the homepage

Here is where the file earns its place next to the Lila cluster. Sutton is a theorist who rarely commercializes — but his idea did, and you can read the lineage in plain text. Lila Sciences states on its own About page that its approach is inspired by Rich Sutton’s “Bitter Lesson” — the reason it builds one general platform for autonomous science rather than many hand-built domain tools.

The executors are already in this series. Geoffrey von Maltzahn (CEO, N°010), Chief Scientist George Church, Kenneth Stanley (SVP of Open-Endedness, N°017), and Andrew Beam (CTO, N°011) run Lila’s AI Science Factories — closed-loop labs where models hypothesize, robots execute, and the universe returns the verdict. The result is the Era of Experience made industrial: a corpus of more than ten trillion scientific-reasoning tokens generated by machines reasoning against real experimental results — approaching, and on track to exceed, the ~15 trillion human-text tokens that trained the frontier LLMs.

That is the whole Bitter Lesson, executed: stop hard-coding what humans know, build the loop that lets the system learn for itself, and let the data — not the expert — speak. The same instinct that made Boris Cherny refuse to scaffold Claude Code is the instinct Lila pointed at the scientific method itself.

Stop hard-coding expert knowledge into tools; build systems that can learn for themselves.
Lila Sciences, on the Bitter Lesson
§ 08 · The Membrane, Again

One bet, two answers

This is the Context Jamming coda, and Sutton sits at the headwaters of the fault line this publication keeps returning to.

His entire program points one direction: remove human knowledge from the loop and let the agent learn from experience. Lila runs that bet on the scientific method — removing the human from the iteration is the stated source of its velocity, and the von Maltzahn file calls this the membrane problem. It is a coherent, possibly civilization-altering wager: discovery accelerates exactly to the degree that human cognitive bandwidth is engineered out of it.

The MoEA Loop this site is built with makes the opposite bet at the editorial layer. Multiple models orchestrate and dissent; the human is kept deliberately — not as a bottleneck but as the membrane that holds the thesis while the models do the throughput. Both descend from the same insight that experience beats imitation. They diverge on one question: is the human a constraint to remove, or the surface that gives the loop its meaning?

Sutton wrote the lesson. Lila built the machine. This file is filed by the membrane that declined to be optimized away.

Timeline · One shaft, driven deeper
  1. 1978 · A psychology undergraduate fixes on a single question: how does an agent learn from reward?
  2. 1984 · PhD at UMass Amherst under Andrew Barto; the mathematics of temporal-difference learning takes shape
  3. 1988 · Temporal-difference learning formalized, the algorithm that still anchors the field
  4. 1998 · Reinforcement Learning: An Introduction (with Barto), the field’s textbook (2nd ed. 2018)
  5. 2017–23 · Leads DeepMind’s Alberta lab; later joins John Carmack’s Keen Technologies as research scientist
  6. 2019 · “The Bitter Lesson”: four decades of AI history compressed to a single, unwelcome claim
  7. 2025 · “Welcome to the Era of Experience” (with David Silver); the Turing Award (with Barto); the Dwarkesh “dead-end” interview
  8. 2025–26 · Lila Sciences credits the Bitter Lesson by name and turns “experience” into 10T+ scientific-reasoning tokens
The Index
2019 · “The Bitter Lesson”: three pages that reorganized the field
2024 · ACM Turing Award (with Andrew Barto), announced March 2025
1984 · PhD under Barto, UMass Amherst: reinforcement learning’s founding partnership
1998 · Reinforcement Learning: An Introduction, the field’s textbook
3 · Eras of AI in his telling: simulation → human data → experience
~15T · Internet tokens that cap the “human data” era
10T+ · Scientific-reasoning tokens Lila generated from experience: his thesis, industrialized
TD · Temporal-difference learning, his signature algorithm
40 yr · One idea, learning from experience, pursued to maximal depth
Dossier

Education. Stanford University (B.A., Psychology). University of Massachusetts Amherst (M.S. 1980; Ph.D. 1984; advisor Andrew G. Barto).

Affiliations. Professor of Computing Science, University of Alberta. Research Scientist, Keen Technologies (John Carmack’s AGI company). Fellow, Amii (Alberta Machine Intelligence Institute). Former lead of DeepMind’s Alberta lab (2017–2023).

Signature work. Temporal-difference learning, policy-gradient methods, the actor-critic, the reward hypothesis; Reinforcement Learning: An Introduction (with Barto, 1998 / 2018).

Worth naming. Andrew Barto (advisor and co-laureate). David Silver (co-author of the Era of Experience; AlphaGo / AlphaZero). John Carmack (Keen Technologies). And the Lila lineage downstream: Geoffrey von Maltzahn (N°010), Kenneth Stanley (N°017), Andrew Beam (N°011), George Church.

Honors. 2024 ACM A.M. Turing Award (with Andrew Barto), “for developing the conceptual and algorithmic foundations of reinforcement learning.” AAAI Fellow.

FounderFiles N°021 · Richard S. Sutton
Filed by Bret Kerr · ACRA Insight LLC · Franklin, MA
contextjamming.com · @bretkerr

§ · Invoice No. 001 · The Build Ledger

The Ledger.

Filed · contextjamming.com

What a conservative mid-market digital agency would have quoted for the same scope, itemized against what this site actually cost. Agency numbers are the floor — not the premium brand-studio tier.

TIME · 12 weeks (agency) · 2 days (this site) · ~42× faster
COST · ~$150,000 (agency) · ~$300 (this site) · ~500× cheaper
TEAM · 5-person agency · 1 human + 3 models · Same deliverable

§ Itemized — what a mid-market agency SOW would have billed

Line item | Hours | Fee
Discovery · brand positioning · workshops | 40–80 hr | $10,000
Design system · Figma tokens · 3 rounds | 60–120 hr | $18,000
Wavesurfer audio carousel · single-track context | 60–100 hr | $16,000
Dual lightbox systems · focus trap · keyboard | 30–50 hr | $8,000
LLM product flows · streaming · state machine | 80–160 hr | $26,000
Stripe · checkout · webhooks · env hardening | 40–80 hr | $10,000
Editorial routes · 6 sub-pages · templates | 60–100 hr | $14,000
Accessibility pass · aria · reduced-motion | 40–80 hr | $10,000
QA · cross-browser · mobile matrix | 60–100 hr | $14,000
Cross-publication rebrand · masthead + IA · 2026-04-28 | 20–40 hr | $6,000
Subtotal | ~700 hr | $126,000
Project management · 18% overhead | – | $24,000
Agency total · conservative floor | ~700 hr | ~$150,000
Actually spent · Claude + Gemini stack | ~20 hr | ~$300

Agency figure assumes ~700 billable hours at $200/hr blended, plus ~18% PM overhead — the conservative floor of a mid-market SOW. Premium brand studios would have quoted 2–3× that. Stack: Antigravity (orchestrator), Claude Opus 4.8 (auditor), Codex (adversary), Cloudflare Workers / OpenNext.
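The headline multiples can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in this ledger. The variable names are made up for illustration, and counting a week as seven calendar days is an assumption.

```python
# Headline figures as stated in the ledger.
AGENCY_WEEKS = 12
ACTUAL_DAYS = 2
AGENCY_COST_USD = 150_000
ACTUAL_COST_USD = 300

# Hour ranges (low, high) for the ten itemized lines, in ledger order.
HOUR_RANGES = [
    (40, 80), (60, 120), (60, 100), (30, 50), (80, 160),
    (40, 80), (60, 100), (40, 80), (60, 100), (20, 40),
]

# Assumption: a week is counted as 7 calendar days, not 5 working days.
speedup = AGENCY_WEEKS * 7 / ACTUAL_DAYS
cost_ratio = AGENCY_COST_USD / ACTUAL_COST_USD

low_hours = sum(low for low, _ in HOUR_RANGES)
high_hours = sum(high for _, high in HOUR_RANGES)
midpoint_hours = (low_hours + high_hours) / 2

if __name__ == "__main__":
    print(f"time: {speedup:.0f}x faster")        # 84 days / 2 days
    print(f"cost: {cost_ratio:.0f}x cheaper")    # 150,000 / 300
    print(f"hours: {low_hours}-{high_hours}, midpoint {midpoint_hours:.0f}")
```

On these inputs the time multiple is 42, the cost multiple is 500, and the itemized hour ranges span 490 to 910 with a midpoint of 700, which is where the "~700 hr" figure sits.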

§ · Colophon

How this site is made.

Vol. 26 · build log

Every page on contextjamming.com is the output of a real-time, three-body Mixture-of-Experts loop. One model orchestrates. Two consult. The human holds the thesis. No single model commits alone.


Orchestrator

Antigravity

Google DeepMind

  • Primary author
  • Terminal-native, direct push to Cloudflare
  • Audit trail to GitHub on every commit
  • Adaptive thinking · effort: extra-high

Auditor

Claude Opus 4.8

1M context

  • Editorial critic
  • Code review before merge
  • Backup-of-record
  • Co-signs every commit

Adversary

Codex

Cross-model MoE

  • Factual adjudication
  • Structural dissent
  • Deep Research → semantic triples
  • Caught the Donelan incident

Stack

Next.js: 16.2 · App Router
React: 19.2
TypeScript: 5
Tailwind: v4 · @theme inline
@opennextjs/cloudflare: adapter
wrangler: Pages deploy
framer-motion: transitions
wavesurfer.js: audio waveforms

Typeset in

Fraunces: variable · opsz + SOFT
Playfair Display: debate display
IBM Plex Mono: editorial metadata
Geist Mono: utility mono
Caveat: grease-pencil marginalia
All via: next/font/google
Palette: single @theme block
No dupe tokens: ever

Infrastructure

Deploy: Cloudflare Workers / OpenNext
ISR: 30-min revalidate · Cloudflare-served
Repo: github.com/BretKerrAI/founderfile
Branch: main
Analytics: Google Tag Manager
Apex: contextjamming.com
Runtime: Node 24
Build tool: Turbopack
       human intent
            │
            ▼
   ┌────────────────────┐         ┌─────────────────┐
   │    Antigravity     │  ◄────► │ Claude Opus 4.8 │      ← auditor loop
   │    (orchestrator)  │         │     (auditor)   │
   └─────────┬──────────┘         └─────────────────┘
             │  ◄───────────┐
             ▼              │
       ┌──────────┐    ┌────┴───────┐
       │Cloudflare│    │   Codex    │          ← adversarial loop
       │ Workers  │    │            │
       └─────┬────┘    └────────────┘
             │
             ▼
       contextjamming.com
             │
             ▼
       ┌──────────────┐
       │   Git push   │         ← audit trail
       └──────────────┘
Assembled on Mac in Terminal · Filed from Franklin, MA
Context Jamming · ACRA Insight LLC · MIT License · FounderFile.ai · RelationalIntelligence.xyz