CONTEXT JAMMING

Field notes from inside the context window.

Stupid LLM Tricks™
Context Jamming  /  Vol. 26  ·  Dispatch №007

Context Jamming·Vol. 26 · Dispatch·ACRA Insight LLC

Analysis × Biology's Infrastructure

The Paved Jungle and the Walled Factory.

Two of the most ambitious companies in AI looked at the same problem in biology and built opposite machines. Only one of them thinks the public record is worth saving.

Bret KerrACRA Insight · Context Jamming · Franklin, MA09 June 2026
Hero art: The Paved Jungle and the Walled Factory
Hero · The Paved Jungle vs. The Walled FactoryVol. 26 · Dispatch №007

Sometime in late 2025, inside an evaluation log, an AI agent was handed a simple instruction. Then it was handed the same instruction again. Then a third time. The prompt never changed — down to the punctuation, it asked the agent to pull the complete set of Zaire ebolavirus sequences matching a fixed list of metadata filters.

The first run returned 106 sequences. The second returned 15. The third returned 5.

Nothing had moved. The database was identical, the model was identical, the query was identical. The agent had simply wandered off somewhere different each time, like a visitor who keeps getting lost in the same old city and comes back insisting the streets must have rearranged themselves overnight.

The punchline arrived downstream. When researchers fed those three datasets into a phylogenetic tree — the workhorse analysis for estimating when and where an outbreak actually began — the answers weren’t merely off. They were impossible. One thin dataset shoved the inferred origin of the 2014 West African Ebola epidemic back to the year 1922. Another, which looked entirely respectable to an untrained eye, quietly dropped every viral sequence collected in Guinea and slid the epidemic’s birthday forward to April 2014 — long after it had really started, in a country the dataset had erased.

The industry likes to tell a story about itself in which the only thing that matters is intelligence: bigger models, more parameters, longer context windows. The Ebola logs tell a different story. Across VirBench — a hand-curated benchmark of 120 realistic viral queries spanning 40 pathogens — the agents did not fail because they were dumb. The execution traces are unambiguous on this point. They understood the biology perfectly. They failed because the data they needed was locked inside a city built for pedestrians, and they were trying to drive through it at highway speed.

§ 01

The Click Tax & The Walkable Medieval Town

To understand why a state-of-the-art model returns three different answers to one question, you have to understand the shape of the place it’s driving through. Biology’s data infrastructure was eloquently described, by the people trying to fix it, as an old historical city — thoughtfully built, dense with value, and laced with winding, narrow streets that were never meant for anything moving fast. The value is real. The navigability, for a machine, is close to zero.

Software engineering has spent decades paving itself: documented APIs, version control, package managers, outputs you can compile and test. Computational biology stayed a walkable medieval town. Its records live behind web portals and graphical interfaces, its metadata conventions are implicit, its filtering logic frequently exists only inside the front end of a website and nowhere a program can reach it. In working virology labs, the recipe for a clean dataset still circulates as a long list of point-and-click filters that researchers reproduce by hand.

Drop an autonomous agent into that environment and it pays what amounts to a tax on every click it can’t make. To build one valid viral dataset — right host organism, right collection site, right date window, minimum sequence length, no lab-passaged samples — an agent has to stitch together NCBI’s REST, Datasets, and E-utilities APIs, reconcile clashing taxonomic IDs, manage pagination by hand, and haul down hundreds of gigabytes only to throw most of it away. And even then, the APIs frequently don’t expose the same filters the website offers a human. The agent isn’t reasoning poorly. It’s being asked to commute through a city that actively resists vehicles.

§ 02

The Number That Should Worry the Scaling Maximalists

Here is the result that reframes the entire debate. Turned loose on VirBench with nothing but web search and a Python interpreter, Claude Sonnet 4 resolved the queries with a mean retrieval accuracy of 16.9 percent. The errors split into two species — under-counting, when an agent botched pagination on a huge dataset like SARS-CoV-2 or Influenza A and simply stopped early, and over-counting, when it failed to exclude lab-passaged samples through a badly documented endpoint.

Then Anthropic’s team, working directly with NCBI’s own database architects, gave the agents a tool called gget virus — a deterministic query layer that formalizes the website’s hidden filtering logic into something a machine can call reproducibly. It stages retrieval so the agent downloads metadata first and full records only for sequences that survive the filter, cutting raw data transfer by more than 98 percent. It handles pagination natively. It returns auditable logs.

Sonnet 4’s accuracy went from 16.9 percent to north of 90. Run-to-run stability climbed into a near-perfect 0.92-to-1.00 band. The reproducibility collapse — 106, then 15, then 5 — simply stopped happening.

But the line that should keep the “just scale the model” crowd up at night is what happened to the gap between models. Once the rote work of deciphering the database was offloaded to reliable tooling, the choice of underlying LLM stopped mattering very much. GPT-5.5, already strong at 91.3 percent bare, reached 99.7 with the tool. A cheap mid-tier model paired with gget virus now does what an expensive frontier model does alone. Anthropic’s own framing for this is almost subversive coming from a frontier lab: build the context engine, the boringly reliable layer that does the mechanical retrieval, and you can stop paying for the most expensive reasoning on earth just to read a database correctly.

§ 03

The Infrastructure Denial Trap

Bret Kerr has a name for the failure mode underneath all of this — the Infrastructure Denial Trap. A powerful reasoning model is functionally neutralized when it’s denied fluid, native access to a domain’s plumbing. In biology, the trap doesn’t need a malicious operator. The decades of siloed, unstandardized public data — the Ghost of the Digital Jungle haunting the INSDC ecosystem — act as a passive denial-of-service against any agent that comes near them. You can train your model with all the Constitutional AI rigor you like; if the streets aren’t paved, it crashes.

Anthropic’s answer is to pave them. Building gget virus alongside NCBI is, in effect, a public-works project for the scientific commons — retrofitting the jungle with roads so the public, open databases stay the foundation everyone builds on. It is an unfashionably civic move: keep the commons usable, and let even a resource-limited lab in the path of an outbreak query global viral knowledge accurately.

Lila Sciences looked at the same trap and decided not to fix the city at all. Why retrofit broken legacy infrastructure when you can build a parallel one from scratch? That’s the second route out — not paving the jungle but a coordinated land grab around it.

§ 04

The Walled Factory & The Bitter Lesson

Lila Sciences, founded in 2023 out of Flagship Pioneering, is not in the business of reading public databases. It is in the business of making the public database irrelevant. The company has raised against this idea with startling velocity — a $200 million seed, a $350 million Series A, and a reported Series B near $2 billion at roughly a $10.5 billion post-money valuation, anchored by CalPERS and NVentures. Its stated goal is not a better model. It is “Scientific Superintelligence”: a system that runs every step of the scientific method — observe, hypothesize, design, test, refine — autonomously, faster than people can.

The mechanism is a closed cyber-physical loop. A proprietary reasoning model called Lila Iris proposes experiments and writes the protocols; a fleet of AI Science Factories — robotic, end-to-end wet-labs built for machines rather than human hands — runs them physically; and the results flow straight back into Iris in real time to refine the next iteration. CEO Geoffrey von Maltzahn and CTO Andrew Beam preside over a bench that reads like a who’s-who of the field — Noubar Afeyan on the board, robotics chief Julie Shah, open-endedness SVP Kenneth Stanley, Rafael Gomez-Bombarelli, TW Killian. The whole apparatus is a bet on one of the oldest and most uncomfortable ideas in AI.

That idea is Rich Sutton’s Bitter Lesson: methods leaning on hand-engineered human knowledge are, eventually and inevitably, beaten by general methods that ride raw compute and continuous learning. Gomez-Bombarelli calls its life-sciences version a bittersweet lesson — scalable simulation is wildly powerful, but a theoretical molecule only matters once it’s physically synthesized and test-checked, and that step has always been agonizingly slow. Lila’s wager is that an AI Science Factory can run self-play in the physical world: the system designs reagents and sequences, the robots build and test them, the verified outcomes train the policy, and an auto-curriculum emerges with no human-curated dataset in the loop. The company reports its DNA Design agent has hit a 900-fold increase in validation throughput at a verified 100 percent reliability rate for specific synthesis tasks. Where almost every other AI system stops at prediction and hands a human months of lab work, Lila uses the factory as a physical verifier and never stops.

This is a moat made of pristine, machine-native data that competitors simply cannot replicate with open-source tools — and it comes with its own toll booth. As Lila generates validated proprietary data through endless physical self-play, the relative value of human-curated public databases dilutes, and everyone outside the walls pays what Kerr calls a Competitor Tax for access to a frictionless ecosystem they can’t build themselves. The landscape bifurcates cleanly: ride open models plus deterministic bridges to navigate the historical record, or migrate into a closed, capitalized, superintelligent factory where the data is born for the machine.

§ 05

The Physical Governance Gap

Both paths point at the same destination — Dario Amodei's “compressed 21st century,” in which agentic AI folds 50 to 100 years of biological progress into 5 to 10 by killing latency at both ends of the pipeline. Anthropic’s tools erase the latency of gathering data; Lila’s factories erase the latency of generating it. The human scientist, in this telling, stops pipetting and stops reproducing browser filters and moves up the stack to intention, prioritization, and judgment.

But run the same logic forward and the unsolved problem comes into focus, and it is not a productivity problem. The very thing that lets a small lab accurately track the 2026 Bundibugyo outbreak — perfect, reproducible, programmatic access to all of human viral knowledge — is also the thing that lowers the barrier for someone who wants to analyze or engineer a pathogen. Pave the roads for the good agents and you pave them for the bad ones. And as autonomous wet-labs scale, the question stops being what does the model say and becomes what is the robot physically synthesizing, right now, with no human in the room.

That’s the white space — the gap between a governance philosophy that lives in language and a future where the consequential decisions are made in glassware. The companies have spent their genius paving the jungle and walling the factory. The harder construction project — deciding who is allowed to drive, and where — hasn’t broken ground.

Filed from the field

Bret Kerr

Context Jamming is a dispatch from ACRA Insight LLC on cross-model orchestration, AI safety, and the economics of the new cognitive stack.

GemClaw  ·  Semantic Triple Transformation  ·  LMaaS  ·  Architectural Determinism

Subscribe at contextjamming.substack.com