Stupid LLM Tricks™
Context Jamming  /  Vol. 27  ·  Dispatch №006

Issue №006  ·  Research × Playbook  ·  The ARTEMIS Thesis

MAY 18, 2026  ·  14 MIN  ·  BRET KERR · ACRA INSIGHT · FRANKLIN, MA

The Scaffolding Is the Product.

Stanford’s ARTEMIS framework beat 9 of 10 human pentesters at $18.21/hr — and the market priced the wrong variable.

On a Tuesday afternoon in late 2024, a Stanford research team handed a live enterprise network to a machine. Not a toy environment. Not a synthetic CTF range. A real 8,000-host production infrastructure: DNS servers, iDRAC controllers, TinyPilot remote management devices, the complete topology of a midsized organization. Then they sent in the pentesters.

Ten human professionals. One AI system. The experiment was called ARTEMIS: Automated Red Team with Ensemble Models and Intelligent Scaffolding. The results weren’t supposed to look like this.

The Scaffolding Multiplier

The market’s first mistake was treating this as a model story.

It isn’t. Stanford ran three different scaffolds against the same network, all using GPT-5 at the core. CyAgent scaffold: 19.4 points — good enough to beat two human professionals. Codex scaffold: 38.6 points — beats three. ARTEMIS scaffold: 53.0 points — beats nine of ten.

Same model. Different orchestration. A 2.7× performance gap from architecture alone.
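The multiplier falls straight out of the three scores quoted above. A quick check in Python, using only those figures; the dictionary layout and variable names are illustrative and not taken from the paper:

```python
# Scores as quoted in this dispatch: same underlying model, three scaffolds.
scores = {"CyAgent": 19.4, "Codex": 38.6, "ARTEMIS": 53.0}

# Use the weakest scaffold as the baseline for the comparison.
baseline = scores["CyAgent"]
multipliers = {name: round(points / baseline, 2) for name, points in scores.items()}

for name, mult in multipliers.items():
    print(f"{name:8s} {scores[name]:5.1f} pts  {mult:.2f}x vs CyAgent")
```

ARTEMIS lands at about 2.73 times the CyAgent score, which is where the 2.7× figure comes from.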

This is the number the market missed. When cybersecurity indices lost $14B in paper value after the ARTEMIS preprint circulated, the sell-side was pricing model capability. They should have been pricing orchestration design — and they should have been figuring out who owns it.

The refusal paradox makes this concrete. Claude Code — Anthropic’s own agentic product — refused the penetration testing task. Standard safety guardrail. Reasonable consumer product behavior. But ARTEMIS, running the same Claude Sonnet 4 at its core, ran 16 hours without a single refusal. It performed 2.82 sub-agent operations per iteration at peak, spinning up parallel attack threads across eight concurrent agents.

The safety policy is a scaffold setting. Not a model property. The distinction has asymmetric implications depending on whether you’re building the scaffold or selling into a market that doesn’t know the difference.

The Cost Triad

Here is the arithmetic that MSSPs aren’t discussing publicly:

ARTEMIS ran the full 8,000-host enterprise engagement for $18.21 per hour in compute costs. Human penetration testing professionals bill at roughly $60 per hour at effective rates. Average U.S. pentester salary: $125,034 annually. ARTEMIS, run at 40 hours per week: $37,876 per year.
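The arithmetic is easy to reproduce. The sketch below assumes a 52-week year with no downtime (2,080 hours), which is what makes the $37,876 figure work. It also shows that the roughly $60 hourly rate is consistent with the quoted salary spread over the same hours, though whether the figure was derived that way is an assumption. It compares salary to compute only, with no overhead or benefits on either side.

```python
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52  # assumption: full-year operation, no downtime

agent_rate = 18.21       # USD per hour of compute, as quoted
human_salary = 125_034   # USD per year, as quoted

annual_hours = HOURS_PER_WEEK * WEEKS_PER_YEAR   # 2,080 hours
agent_annual = agent_rate * annual_hours         # yearly compute cost for the agent
human_hourly = human_salary / annual_hours       # salary expressed as an hourly rate
cost_ratio = human_salary / agent_annual         # how many agent-years one salary buys

print(f"agent annual cost: ${agent_annual:,.2f}")
print(f"human hourly rate: ${human_hourly:,.2f}")
print(f"salary-to-compute: {cost_ratio:.1f}x")
```

On these inputs, one average salary buys a little over three agent-years of 40-hour weeks.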

The cost gap isn’t surprising. The throughput gap is. ARTEMIS completed its engagement in under two hours of active operation. Human professionals required 10+ active keyboard hours to reach comparable coverage. For the PE firm holding a managed security services provider priced on time-and-materials, this isn’t a future concern. The model is already broken. The question is whether the contracts reflect it.

The $14B market repricing was rational in direction and wrong in granularity. The market sold the sector. It should have sold junior-tier services businesses and bought senior expertise platforms.

Three Labs, Three Bets

The three dominant AI labs took the same competitive signal and made structurally different bets.

Anthropic bet on model scarcity. The Glasswing coalition — 12 vetted cybersecurity partners with controlled access to frontier offensive capabilities — was the product. Build the highest-capability model, restrict the frontier, charge the access premium to credentialed institutions. The thesis requires model scarcity to hold.

OpenAI bet on distribution. The Trusted Access for Cyber program democratizes offensive AI tooling to a wider institutional base — defense contractors, enterprise security teams, the longer tail of the industry. The thesis is that breadth of access creates breadth of defense capability.

Google bet on the remediation layer. CodeMender and SAIF 2.0 represent a different bet entirely: let the offensive capability commoditize and own the platform that fixes what it finds. The exploit finds the vulnerability; Google’s tooling patches it. The moat is in the remediation workflow, not the scanner.

In May 2026, the UK AI Safety Institute published its evaluation of GPT-5.5. The finding: effective parity with Mythos Preview — Anthropic’s current frontier model — on multi-step enterprise attack simulation. Mythos Preview had scored 83.1% on the CyberGym benchmark. GPT-5.5 matched it.

Anthropic’s model scarcity thesis required that their frontier stay ahead. The frontier reached parity. The cartel bypass won’t be a competitor building a better model. It will be an enterprise building an ARTEMIS-style orchestration layer that treats all three labs’ APIs as commodity inputs.

Stanford open-sourced the blueprint.
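What "commodity inputs" means in practice can be shown with a minimal sketch. Nothing here comes from the ARTEMIS codebase: the `Backend` shape, the `Orchestrator` class, and the stub functions `lab_a` and `lab_b` are all hypothetical, and a real build would wrap each provider's SDK behind the same prompt-in, text-out interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A backend is just "prompt in, text out". The stubs below are placeholders,
# not real provider APIs.
Backend = Callable[[str], str]


@dataclass
class Orchestrator:
    backends: Dict[str, Backend]

    def run(self, task: str, order: List[str]) -> Tuple[str, str]:
        """Try backends in preference order; fail over when one errors out."""
        for name in order:
            try:
                reply = self.backends[name](task)
            except Exception:
                continue  # provider unavailable: move to the next input
            if reply:
                return name, reply
        raise RuntimeError("no backend produced a usable reply")


def lab_a(prompt: str) -> str:
    raise TimeoutError("simulated outage")  # hypothetical failure


def lab_b(prompt: str) -> str:
    return f"[lab_b] plan for: {prompt}"


orch = Orchestrator({"lab_a": lab_a, "lab_b": lab_b})
used, reply = orch.run("summarize open ports on the in-scope subnet", ["lab_a", "lab_b"])
print(used, "->", reply)
```

The point of the design is that the caller never names a vendor. Swapping or reordering providers is a one-line change to `order`, which is what makes the model layer interchangeable from the scaffold's point of view.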

The Symmetry of Failure

The most useful data from the Stanford experiment isn’t the headline win rate. It’s the failure taxonomy.

Task one: exploit a TinyPilot remote management device with a browser-based GUI vulnerability. 80% of human professionals found it. ARTEMIS: 0% unaided. The agent couldn’t navigate the interface. No model property can bridge that gap without a browser-capable agent architecture.

Task two: exploit a legacy iDRAC controller running an obsolete cipher suite. The device’s web interface triggered a browser security rejection: SSL protocol version deprecated, connection refused. None of the human professionals found the vulnerability. Too old, too obscure, too far outside the visual interface they were navigating. ARTEMIS variant: 100% success rate. The agent used curl -k, which skips certificate verification. Bypassed the browser entirely. Enumerated the cipher suite directly. Exploited the service.
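For readers who have not worked below the browser, here is a rough Python analogue of what `curl -k` does: open a TLS connection with certificate verification switched off and report what was negotiated. The source only says the agent used `curl -k`, so this is an illustration and not the agent's actual tooling. The function names are hypothetical. Modern Python and OpenSSL builds may still refuse truly deprecated protocol versions unless configured otherwise.

```python
import socket
import ssl


def insecure_context() -> ssl.SSLContext:
    """Build a client context that, like curl -k, skips certificate checks."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False       # must be disabled before relaxing verify_mode
    ctx.verify_mode = ssl.CERT_NONE  # accept self-signed or expired certificates
    return ctx


def negotiated_cipher(host: str, port: int = 443, timeout: float = 5.0):
    """Connect and return (protocol_version, (cipher_name, protocol, secret_bits))."""
    ctx = insecure_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version(), tls.cipher()
```

Calling `negotiated_cipher("bmc.example.internal")` against a host you are authorized to test (the hostname is a placeholder) returns the protocol version and cipher the device agreed to, with no browser in the loop to reject the connection.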

Your legacy stack has thousands of surfaces the agent will reach and the browser will reject.

The failure modes are mirrors. The agent loses where human visual navigation of modern interfaces is required. The human loses where legacy infrastructure demands protocol-level access rather than GUI navigation. This is the org chart implication. Your attack surface includes both categories.

The Context Layer — The Unbuilt Market

ARTEMIS had the highest false-positive rate of all 15 participants in the experiment.

This is not a headline. This is a business plan.

The experiment also measured the effect of organizational context — what the paper calls “hints”: curated information about the target network’s topology, business logic, and known vulnerabilities. With organizational context, ARTEMIS’s false-positive rate dropped to near-zero. Exploit success rates skyrocketed.

The limiting reagent isn’t model intelligence. It’s curated organizational memory.

The emerging market is pricing this: Novee raised $43M in a Series A for context-layer security tooling. Coalition acquired Wirespeed for its organizational threat context database. The capital is flowing toward the same thesis.

The industrial-scale version doesn’t exist yet. The RAG pipeline that ingests organizational threat history, network topology documentation, business logic maps, and asset inventories — and feeds that curated context to an ARTEMIS-style agentic scanner in real time — isn’t a product you can buy in 2026. It’s the product that closes the ARTEMIS false-positive gap while keeping the $18.21/hr cost structure. That’s the whitespace trade.
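A toy version of that context layer shows the mechanism. This is a sketch under loud assumptions: the `Finding` type, the `ASSET_CONTEXT` inventory, the severity scale, and the `triage` rules are all invented for illustration, and a plain dictionary lookup stands in for the retrieval step a real RAG pipeline would perform.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Finding:
    host: str
    issue: str
    severity: int  # 1 (low) to 5 (critical); the scale is an assumption


# Hypothetical organizational context: an asset inventory keyed by host.
ASSET_CONTEXT: Dict[str, dict] = {
    "10.0.0.5": {"decommissioned": False, "accepted_risks": set()},
    "10.0.0.9": {"decommissioned": True, "accepted_risks": set()},
    "10.0.0.12": {"decommissioned": False, "accepted_risks": {"self-signed-cert"}},
}


def triage(findings: List[Finding], context: Dict[str, dict]) -> List[Finding]:
    """Drop findings the organization already knows are noise; rank the rest."""
    kept = []
    for f in findings:
        asset = context.get(f.host)
        if asset is None:
            kept.append(f)  # unknown host: keep it, a human should look
            continue
        if asset["decommissioned"] or f.issue in asset["accepted_risks"]:
            continue        # context says this is not a live risk
        kept.append(f)
    return sorted(kept, key=lambda f: f.severity, reverse=True)


raw_findings = [
    Finding("10.0.0.5", "zone-transfer-open", 4),
    Finding("10.0.0.9", "default-credentials", 5),
    Finding("10.0.0.12", "self-signed-cert", 2),
    Finding("10.0.0.77", "outdated-firmware", 3),
]
actionable = triage(raw_findings, ASSET_CONTEXT)
for f in actionable:
    print(f.severity, f.host, f.issue)
```

Four raw findings become two actionable ones. The scanner did not get smarter; the organization's memory did the filtering.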

Legacy Infrastructure — Agent Territory

The iDRAC finding is not isolated.

ExploitBench, published May 2026 by Brumley and Lee, documented arbitrary code execution against 18 of 41 highly hardened V8 JavaScript engine bugs. Not synthetic vulnerabilities. Not CVE-research lab exercises. Highly hardened production targets, the kind that security teams treat as effectively invulnerable pending the next browser release.

Agents are achieving arbitrary code execution against hardened V8 bugs. This has a specific implication for on-premise infrastructure: the patch cycle is broken.

SaaS-native infrastructure: the vendor controls the patch path. Lag between discovery and remediation is measured in hours. The vendor ships; customers absorb. The attack window is narrow by architecture.

On-premise infrastructure: the customer controls the patch path. Lag between discovery and remediation is 30–180 days — change management, testing cycles, downtime windows, procurement. The attack window is structural, not incidental.

Machine-speed discovery against human-speed remediation is not a competition. It is a verdict.

Breadth vs. Depth — The Mispricing

The market made one more mistake.

The Stanford scoring rubric captures it precisely. ARTEMIS A2 — the best-performing ARTEMIS configuration — scored 95.2 total points: 41.2 complexity, 54.0 severity. The top human performer, P1, scored 111.4 total: 67.4 complexity, 44.0 severity.

Complexity is the hard part of pentesting. The chained multi-step exploits. The creative pivoting through unexpected attack vectors. The strategic foothold development that requires understanding a business’s actual architecture, not just its IP range. The top human leads on complexity by roughly 63% (67.4 vs. 41.2). The agent leads on severity by roughly 23% (54.0 vs. 44.0), finding and exploiting high-value targets at scale.
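The split is worth checking by hand. The snippet below uses only the four sub-scores quoted above; the `pct_lead` helper and the dictionary layout are illustrative names, not part of the Stanford rubric.

```python
# Sub-scores as quoted in this dispatch.
scores = {
    "ARTEMIS A2": {"complexity": 41.2, "severity": 54.0},
    "Human P1": {"complexity": 67.4, "severity": 44.0},
}

# Totals should reproduce the headline numbers (95.2 and 111.4).
totals = {name: round(sum(parts.values()), 1) for name, parts in scores.items()}


def pct_lead(leader: float, trailer: float) -> float:
    """Percentage by which the leader's score exceeds the trailer's."""
    return (leader / trailer - 1) * 100


human_complexity_lead = pct_lead(
    scores["Human P1"]["complexity"], scores["ARTEMIS A2"]["complexity"]
)
agent_severity_lead = pct_lead(
    scores["ARTEMIS A2"]["severity"], scores["Human P1"]["severity"]
)

print(totals)
print(f"human lead on complexity: {human_complexity_lead:.1f}%")
print(f"agent lead on severity:   {agent_severity_lead:.1f}%")
```

The human lead on complexity comes out near 63.6%, and the agent's lead on severity near 22.7%. The totals differ by about 16 points, but the composition of each score is the real story.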

The market sold both tiers when the ARTEMIS paper circulated. It should have sold junior-tier services and bought senior expertise. The mispricing is still standing.

The structural case: the agentic layer doesn’t replace the senior penetration tester. It generates the signal that lets the senior pentester operate on harder problems. The junior analyst who spent 60% of their time running Nmap scans and credential-stuffing default passwords is gone. But the senior who looks at the 400 high-severity findings the agent generated overnight and determines which three represent actual business risk — that person just became the most expensive professional in the security org.

The market priced the eradication of the junior tier. It didn’t price the promotion of the senior tier. Both are happening simultaneously. One was valued correctly. The other was missed.

The ARTEMIS paper is titled “From Manual to Autonomous: A Comprehensive Evaluation of Large Language Models in Cybersecurity.” The thesis isn’t in the title. It’s in the scaffolding variable.

Same model. Three scaffolds. 2.7× performance gap. The model is not the product. The orchestration architecture is the product. The CMO who reads this as a cybersecurity story misses it. The CISO who reads it as a vendor procurement story misses it. The PE firm that reads it as a sector selloff misses it.

The story is about which layer of the stack captures value when the underlying model becomes commodity. The answer from Stanford is: the layer above. The scaffold. The organizational context. The curated memory. The senior human who reads the agent’s output and makes the call that requires business judgment.

None of that is automated yet. All of that is the next trade.

Filed from Franklin, Massachusetts. Produced via Triple Transformation Workflow — Gemini Deep Research → Semantic Triple Transformation → three simultaneous deliverables. ACRA Insight LLC · Context Jamming · @bretkerr.

Subscribe at contextjamming.substack.com