FounderFiles·N°006·Physics · Scaling · Policy
2009 —
Subject·Dr. Jared Kaplan·Theoretical physicist · Co-founder & CSO, Anthropic · Architect of the scaling laws
Dr. Jared Kaplan.
Kaplan turned machine learning into physics — found the smooth power laws governing compute, parameters, and data, and ran the playbook until the field crossed biological scale.
He was a theoretical physicist working on holographic gravity at Johns Hopkins. He walked into a machine-learning lab and started running dimensional analysis on the loss landscape. Within five years he had co-written the scaling laws paper, co-founded Anthropic, and become the Chief Science Officer warning that the curves he plotted in 2020 were going to put a Fields-medalist intelligence on a desktop before 2030.
From aspects of holography to a loss landscape
He went to Stanford for physics and mathematics. He went to Harvard for a PhD in physics, where his advisor was Nima Arkani-Hamed and his thesis was titled Aspects of Holography. He defended in 2009. Holography, in the AdS/CFT sense, is the formal claim that a lower-dimensional boundary surface can encode the full information content of a higher-dimensional bulk space. Black-hole information theory. Conformal field theory. Quantum gravity in negative-curvature toy universes.
He did postdocs at SLAC and Stanford. He joined Johns Hopkins as faculty in the Department of Physics and Astronomy in 2012. He worked on quantum gravity, the conformal bootstrap, and scattering amplitudes — the kind of physics where the test of an idea is whether it generates a precise calculation that survives.
His own statement of method, on his Hopkins faculty page: “Genuinely new ideas should lead to new equations.” And a few sentences later, the line that should make every Feynman reader sit up: “without precise results it’s very difficult to avoid fooling yourself, and others.”
In 2019 — after fifteen years as a theoretical physicist in academia — he joined OpenAI as a researcher.
What he brought through the door was a single instinct. Whatever neural networks were, they were a physical system. Physical systems have laws. Laws are findable.
Three pillars and an unedited camera
January 2020. Kaplan delivered a three-part lecture series at the Israel Institute for Advanced Studies titled, with characteristic flatness, Machine Learning I, II, and III. The audience was theoretical physicists. The format was unforgiving — single static camera, ninety-eight minutes, dense whiteboard equations, no edutainment scaffolding.
He stripped the field to three pillars. Define a function class — typically dense linear algebra and a non-linearity like ReLU. Define a learning goal — a loss function the system minimizes. Define an optimization strategy — almost always a variant of gradient descent. That was it. Everything else was implementation detail.
He demystified backpropagation. It was not, he told the room, a novel algorithm specific to AI. It was the chain rule, applied backward through a cached forward pass. The mystique of the field, in his hands, evaporated.
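The whole frame fits in a page of numpy. What follows is a minimal sketch of the three pillars and of backpropagation as the chain rule run backward through a cached forward pass; the two-layer network, the learning rate, and the toy sine-fitting task are illustrative choices, not reconstructions of the lecture’s examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pillar 1: a function class — dense linear algebra plus a ReLU non-linearity.
W1 = rng.normal(0, 0.5, (16, 1)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (1, 16)); b2 = np.zeros(1)

x = rng.uniform(-1, 1, (128, 1))
y = np.sin(3 * x)                         # toy regression target

for step in range(2000):
    # Forward pass, caching every intermediate for the backward pass.
    z1 = x @ W1.T + b1                    # pre-activation
    h1 = np.maximum(z1, 0.0)              # ReLU
    yhat = h1 @ W2.T + b2

    # Pillar 2: a learning goal — mean squared error.
    loss = np.mean((yhat - y) ** 2)

    # Backpropagation: the chain rule, applied backward through the cache.
    g_yhat = 2 * (yhat - y) / len(x)      # dL/dyhat
    g_W2 = g_yhat.T @ h1                  # dL/dW2
    g_b2 = g_yhat.sum(axis=0)
    g_h1 = g_yhat @ W2                    # chain rule through the linear layer
    g_z1 = g_h1 * (z1 > 0)                # chain rule through the ReLU
    g_W1 = g_z1.T @ x
    g_b1 = g_z1.sum(axis=0)

    # Pillar 3: an optimization strategy — plain gradient descent.
    lr = 0.05
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

print(f"final loss: {loss:.4f}")
```

Every gradient line is one application of the chain rule to a cached intermediate. That is the entire algorithm.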
He critiqued stochastic gradient descent through dimensional analysis. The learning rate has no natural unit, which is why it behaves erratically in ill-conditioned loss landscapes. He explained Adam — the optimizer that became the industry standard — as an ad-hoc but elegant approximation of natural gradients, with momentum to push through flat regions and RMSProp scaling to dampen the steep ones. He found the physics under the engineering.
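Adam’s structure is compact enough to state directly. Here is a minimal sketch of the update under the standard Kingma–Ba defaults; the ill-conditioned quadratic used as a test function is an illustrative choice, not the lecture’s example.

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. The first moment is momentum; the second moment
    rescales each coordinate by its recent gradient magnitude (RMSProp)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad        # momentum: pushes through flat regions
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSProp term: dampens steep directions
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

# Usage: minimize a badly conditioned quadratic, where raw SGD struggles.
theta = np.array([5.0, 5.0])
state = (np.zeros(2), np.zeros(2), 0)
for _ in range(5000):
    grad = np.array([2 * theta[0], 200 * theta[1]])   # f = x^2 + 100 y^2
    theta, state = adam_step(theta, grad, state, lr=0.05)
print(theta)   # near the origin despite the 100:1 curvature ratio
```

The dimensional-analysis point lives in the last line of the update: dividing the first moment by the square root of the second cancels the gradient’s units, so the step size behaves like a dimensionless fraction per coordinate.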
The lecture has 480 views. We will return to this.
Compute ≈ 100 × D × P
The most physical moment of the lecture was the Fermi estimate. Kaplan offered a heuristic for the total compute cost of training a frontier neural network: roughly 100 × D × P, where D is the number of training tokens and P is the parameter count. A back-of-the-envelope kind of number, the kind theoretical physicists use to know whether they are in the right order of magnitude.
He plugged in 2020’s frontier — models in the 10⁹ to 10¹⁰ parameter range, trained on 10¹⁰ to 10¹¹ tokens. The training runs were costing roughly 10²² floating-point operations.
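The arithmetic, spelled out with the lecture’s own heuristic. The constant k = 100 is the lecture’s; the plug-in values are the frontier figures above.

```python
def train_flops(tokens, params, k=100.0):
    """Kaplan's lecture heuristic for total training compute: C ≈ k · D · P."""
    return k * tokens * params

# 2020 frontier: ~10^10 parameters trained on ~10^10 tokens.
print(f"{train_flops(1e10, 1e10):.0e}")   # ~1e+22 FLOPs
```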
Then he turned to biology.
A human brain runs at roughly one petaflop, 10¹⁵ operations per second, by Fermi estimate from neuron density, synaptic count, and firing rates. A hundred-year human life is therefore roughly 3 × 10²⁴ operations of total cognitive throughput. This is what a person’s biological hardware does, all in.
The 2020 frontier training runs sat two to three orders of magnitude below that: within a factor of a few hundred of a single human lifetime, in compute. A very exciting time, Kaplan told the room, with the dryness of a man who knew what he was saying. If general intelligence was a function of running a lifetime’s worth of diverse data through a sufficiently large neural substrate, the silicon was approaching the threshold from below.
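The biological side of the estimate, made explicit; both inputs are the Fermi values above.

```python
brain_ops_per_s = 1e15                             # ~1 petaflop, Fermi estimate
seconds_per_century = 100 * 365.25 * 24 * 3600     # ≈ 3.16e9 s
lifetime_ops = brain_ops_per_s * seconds_per_century

print(f"lifetime throughput: {lifetime_ops:.1e} ops")        # ≈ 3.2e+24
print(f"gap vs a 1e22-FLOP training run: {lifetime_ops / 1e22:.0f}x")  # a few hundred
```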
Eight months later, OpenAI released GPT-3.
“Without precise results it’s very difficult to avoid fooling yourself, and others.”
Power laws, and the Chinchilla correction
January 23, 2020. Kaplan and his OpenAI co-authors posted Scaling Laws for Neural Language Models to arXiv. The empirical claim was simple and devastating: language model performance, measured in cross-entropy loss, improves as a smooth power-law function of model size, dataset size, and compute. No phase transitions. No mysteries. Just curves, across more than seven orders of magnitude.
The paper specified the optimal allocation. If your compute budget grew by 10x, the majority of the new compute should go to parameters, not data: N ∝ C⁰·⁷³, D ∝ C⁰·²⁷. Big models, undertrained. Vast, heavily parameterized giants fed comparatively little data.
In 2022, DeepMind’s Hoffmann et al., the Chinchilla paper, corrected the ratio. They trained hundreds of models across a wider hyperparameter space and showed that Kaplan’s coefficients had been distorted by embedding-parameter accounting and a fixed cosine learning-rate schedule. Re-derived correctly, N ∝ D: model size and data should scale in equal proportion. Roughly twenty tokens per parameter, not five.
Meta’s Llama 3 in 2024 went further. They trained an 8-billion-parameter model on 15 trillion tokens, far past compute-optimal, on the bet that inference cost matters more than training cost in deployment. The smaller, over-trained model kept improving log-linearly anyway.
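The two prescriptions diverge fast as budgets grow. A sketch comparing them, under stated assumptions: the compute convention reuses the lecture’s k = 100 (the scaling papers themselves use C ≈ 6·N·D), the Kaplan curve is anchored at an illustrative 2020-frontier reference point rather than the paper’s fitted constants, and the Chinchilla side uses the rough twenty-tokens-per-parameter rule.

```python
import math

def chinchilla_alloc(C, k=100.0, tokens_per_param=20.0):
    """Chinchilla: N ∝ D, grow parameters and tokens together.
    With C = k·N·D and D = 20·N, solve for N."""
    N = math.sqrt(C / (k * tokens_per_param))
    return N, tokens_per_param * N

def kaplan_alloc(C, C0=1e22, N0=1e10, k=100.0):
    """Kaplan 2020: N ∝ C^0.73, D ∝ C^0.27, anchored at an illustrative
    reference point (C0, N0) — an assumption, not the paper's constant."""
    N = N0 * (C / C0) ** 0.73
    return N, C / (k * N)

for C in (1e22, 1e24):
    kn, kd = kaplan_alloc(C)
    cn, cd = chinchilla_alloc(C)
    print(f"C={C:.0e}  Kaplan: N={kn:.1e}, D={kd:.1e}  "
          f"Chinchilla: N={cn:.1e}, D={cd:.1e}")
```

At C = 10²⁴, under these assumptions, the Kaplan rule wants roughly ten times the parameters and Chinchilla roughly ten times the tokens.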
Kaplan’s specific coefficients were wrong. The shape of the claim, that there are smooth power laws governing this whole thing, was the most consequential physics result in machine learning history. The field is still operating inside it.
The mathematics under the magic
For two years after the Kaplan paper, the AI press ran on a different story. Models were exhibiting emergent capabilities (multi-step arithmetic, Persian translation, logical reasoning) that appeared to switch on suddenly at specific parameter thresholds. Zero capability at 62 billion parameters; high proficiency at 540 billion. The narrative was discontinuity, surprise, danger. The policy class panicked accordingly.
Kaplan’s underlying claim cut against it. The cross-entropy loss was smooth all the way down. If the underlying metric was smooth and the surface capabilities looked discontinuous, something had to give.
In 2023 and 2024, a series of papers gave it. Researchers showed that emergence was largely an artifact of the evaluation metric. Rigid pass-fail scoring (exact string match, all-or-nothing arithmetic) produced the appearance of phase transitions. Continuous metrics, measuring how close the model’s answer was to the correct token distribution, recovered the smooth curves.
The magic dissolved. The mathematics underneath was the mathematics Kaplan had said it was in 2020. Smooth. Predictable. Log-linear. The field had spent two years scaring itself with discretization artifacts of its own benchmarks.
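The artifact is easy to reproduce in a toy model. A sketch in the spirit of the mirage papers; the power-law form, the exponent, and the ten-token answer length are all invented for illustration.

```python
import numpy as np

# Smoothly improving per-token accuracy as a function of scale (toy power law).
scales = np.logspace(8, 12, 9)                          # parameters, 1e8 .. 1e12
p_token = 1.0 - 0.9 * (scales / scales[0]) ** -0.25     # smooth, no jumps

k = 10                                  # task requires 10 exactly correct tokens
exact_match = p_token ** k              # all-or-nothing metric
log_like = np.log(p_token)              # continuous per-token metric

for s, em, ll in zip(scales, exact_match, log_like):
    print(f"N={s:.0e}  exact-match={em:6.3f}  per-token log-lik={ll:7.4f}")
```

The per-token metric climbs smoothly across the whole range. The exact-match column sits near zero for most of it and then appears to switch on, because raising to the tenth power compresses everything below roughly p ≈ 0.7 toward zero. Same model, same loss, two opposite headlines.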
The policy implications are still being absorbed. If the curves are smooth, capability gain is forecastable. If capability gain is forecastable, the responsibility of the labs to publish their forecasts is harder to deflect. Kaplan’s vindication became, by 2025, the structural argument for the Responsible Scaling Policy he himself drafted at Anthropic.
“Machines could match the level of the greatest physicists within years.”
The boundary encodes the bulk
In 2021 Kaplan co-founded Anthropic with Dario Amodei, Daniela Amodei, Sam McCandlish, and the rest of the senior research and policy contingent who left OpenAI in the same wave. He became Chief Science Officer. In October 2024 he additionally became Responsible Scaling Officer.
The intellectual through-line is the load-bearing claim of this file. Kaplan’s PhD thesis, Aspects of Holography, was a study of AdS/CFT, the principle that a lower-dimensional boundary surface encodes the full information content of a higher-dimensional bulk space. The whole structure of the bulk is recoverable from the boundary alone. The boundary is the simpler, more legible object; the bulk is what the boundary’s degrees of freedom do.
Anthropic’s Constitutional AI is structurally that. A surface of explicit principles — a relatively low-dimensional, human-readable constitution — encodes the high-dimensional space of model behavior. The boundary is what the developer writes. The bulk is what the model does. The thesis is that the boundary is enough — that consistent behavior across an enormous space of inputs can be generated from a small, well-chosen set of explicit constraints.
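The published mechanism, in the supervised phase of Bai et al., is a critique-and-revise loop driven by the written principles. A schematic of the loop’s shape only: every function below is a trivial stand-in for a language-model call, not Anthropic’s API, and the two principles are paraphrases.

```python
# Schematic of the Constitutional AI supervised loop (Bai et al., 2022).
# `respond`, `critique`, and `revise` are placeholder stand-ins for
# language-model calls — not Anthropic's actual interfaces.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest.",
]

def respond(prompt: str) -> str:
    return f"draft answer to: {prompt}"           # stand-in for model generation

def critique(response: str, principle: str) -> str:
    return f"checked against: {principle}"        # stand-in for model self-critique

def revise(response: str, crit: str) -> str:
    return response + f" [revised per {crit}]"    # stand-in for model revision

def constitutional_pass(prompt: str) -> str:
    """The boundary (a short list of principles) is iterated over the bulk
    (the model's output) until the output conforms."""
    response = respond(prompt)
    for principle in CONSTITUTION:
        response = revise(response, critique(response, principle))
    return response   # (prompt, response) pairs become finetuning data

print(constitutional_pass("example prompt"))
```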
The student of Nima Arkani-Hamed has spent his post-physics career building, in software, the same structural relationship his physics dissertation studied in geometry.
That same student, in 2025, publicly warned that machines could match the level of the world’s greatest theoretical physicists, citing, by name, Edward Witten and Nima Arkani-Hamed. The student is warning that the technology will catch up to the teacher. The 2027–2030 window is, in Kaplan’s frame, when humanity will face the biggest decision: whether to allow recursive self-improvement, the moment past which the boundary may no longer encode the bulk.
The attention paradox
Machine Learning II, the lecture in which Kaplan offered the biological-benchmark Fermi estimate and the Adam-optimizer dimensional analysis, the lecture that, together with the arXiv paper that followed it eight days later, is the Rosetta stone of the scaling era, has 480 views as of early 2026.
Machine Learning I sits at roughly 1,500. Machine Learning III at 360. The cumulative viewership of the three lectures that wrote the playbook of frontier AI development is approximately 2,340.
In the same window, a single AI-slop YouTube channel called Bandar Apna Dost, which produces fully automated synthetic videos of an anthropomorphic monkey fighting demons, has accumulated 2.4 billion views.
The disparity is six orders of magnitude.
The architects of the scaling era are drowned out, on the platforms their architecture made possible, by the noise generated by the architecture itself. The scaling laws Kaplan plotted in 2020 are now being run, in industrial bulk, to flood the recommendation algorithm with content that the algorithm cannot distinguish from human-made. The Rosetta stone is in the same building as the slop. They are filed next to each other.
This is the Context Jamming problem stated in YouTube-metrics form. The signal is available, sourced, free, primary. The noise is derivative, automated, hypnotic, and orders of magnitude louder.
The architects know.
They are, mostly, choosing to keep teaching anyway.
“2,340 views for the lectures that wrote the scaling era. 2.4 billion for the monkey.”
- ~2003 · Stanford: B.S. in physics and mathematics.
- 2009 · Harvard: Ph.D. in physics, Aspects of Holography; advisor Nima Arkani-Hamed.
- 2009–12 · SLAC and Stanford: postdoctoral fellow.
- 2012 · Johns Hopkins: joins the Department of Physics and Astronomy.
- 2019 · OpenAI: joins as researcher.
- Jan 2020 · Israel Institute for Advanced Studies: three-part lecture series, Machine Learning I, II, III.
- Jan 23, 2020 · arXiv:2001.08361: Scaling Laws for Neural Language Models posted.
- 2020 · OpenAI: instrumental in GPT-3 and Codex development.
- 2021 · Anthropic: co-founds with Dario Amodei, Daniela Amodei, Sam McCandlish, et al.
- 2022 · DeepMind: Chinchilla paper (Hoffmann et al.) corrects the scaling-law coefficients.
- 2023–24 · Industry: “emergence as mirage” papers vindicate the smooth power-law thesis.
- Oct 2024 · Anthropic: appointed Responsible Scaling Officer.
- 2025 · Public warnings: Witten / Arkani-Hamed-level AI within years; 2027–2030 framed as humanity’s biggest decision.
- 2009 · Aspects of Holography · Ph.D. dissertation, Harvard; advisor: Nima Arkani-Hamed
- Jan 2020 · Machine Learning I, II, III · Israel Institute for Advanced Studies, Lecture I
- Jan 2020 · Scaling Laws for Neural Language Models · arXiv:2001.08361, with McCandlish, Henighan, Brown, Child, Gray, Radford, Wu, Amodei
- 2022 · Constitutional AI: Harmlessness from AI Feedback · Anthropic (Bai et al.), the boundary-encodes-bulk paper
- 2023 · Anthropic’s Responsible Scaling Policy · Anthropic, the operational policy descended from the curves
- 2025 · Are we ready for human-level AI by 2030? · Kaplan interview, YouTube
- 2025+ · Relative-Based Scaling Law for Neural Language Models · arXiv:2510.20387, the smoothness-vindication line
Education. Stanford University (B.S., physics and mathematics). Harvard University (Ph.D., physics, 2009; thesis: Aspects of Holography; advisor: Nima Arkani-Hamed). Postdoctoral fellow at SLAC and Stanford.
Affiliations. Johns Hopkins University, Department of Physics and Astronomy (Associate Professor since 2012; currently on leave). OpenAI (researcher, 2019–2020). Anthropic (co-founder; Chief Science Officer; Responsible Scaling Officer since October 2024). Currently based in Pacifica, California.
Mentor. Nima Arkani-Hamed, whom Kaplan now publicly cites, by name, as the level AI may match within years.
Collaborators / peers worth naming. Dario Amodei (Hertz Fellow; Anthropic co-founder, CEO). Daniela Amodei (Anthropic co-founder, President). Sam McCandlish (Anthropic co-founder; Scaling Laws co-author). The full Scaling Laws author list: Sam McCandlish, Tom Henighan, Tom B. Brown, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei.
Honors. Sloan Foundation Fellowship. NSF CAREER grant. Hertz Fellow. Simons Collaboration on the Nonperturbative Bootstrap.
