Stupid LLM Tricks™
Context Jamming  /  Vol. 26  ·  Dispatch №003


Playbook · Generative Engine Optimization

For the Weights, Not the Algorithm.

Five moves that matter for AI crawlers — most of them free, most of them uncrowded, all of them shippable this weekend. Plus a forecast: LLM recommendations are going to eat organic search alive inside 24 months, and the agencies that see it are already repositioning.

Bret Kerr · ACRA Insight · Franklin, MA · 18 April 2026 · 8 min read

Google used to be the only referral that mattered. You wrote for the algorithm, you chased ten blue links, you traded backlinks like baseball cards. That game is ending. The next version of “I searched for it” is “I asked Claude.” And when Claude answers, it either cites you, quotes you, or it doesn’t — and if it doesn’t, you don’t exist in the conversation. There is no page two. There is no SERP. There is one answer, and your only job is to be in it.

This is the beginning of Generative Engine Optimization — GEO, if you prefer the clean acronym — and the surface area you optimize for is no longer a ranking algorithm. It’s a training corpus and a retrieval index. Both of which you can actually influence, if you know where the levers are.

I spent a weekend turning my own site into an experiment in AI-crawler-friendliness. Here are the five moves that actually moved the needle, ranked by leverage.

§ 01

Write an llms.txt and put it at your root.

This is the single most-underused file on the modern web.

In September 2024, Jeremy Howard — of fast.ai and Answer.AI — proposed a new standard: /llms.txt, a markdown file living at your domain’s root that does for LLMs what /sitemap.xml does for search crawlers. A curated, human-written index of your site, in the format the model natively reads best: plain markdown, with a title, a blockquote summary, and link lists grouped by section.

Why it matters: an LLM crawler that arrives at your homepage has to parse HTML, strip navigation, guess at what’s important, and build a mental model from noise. An llms.txt hands it the map. Here’s the start of mine, which lives at contextjamming.com/llms.txt:

# GemClaw Debate Society

> Where machines argue and humans decide.
> An editorial portfolio and cross-model
> research engine...

## Core experiences
- [Homepage](https://www.contextjamming.com/): Two AI
  models. Three rounds. One proposition...
- [Start a Debate](.../debate/new): ...

## Editorial
- [Dispatches](.../dispatches): ...

That’s it. A title, a summary, curated sections of [link](url): description pairs. A growing list of AI crawlers, answer engines, and retrieval pipelines check for this file on arrival. Almost no one has it yet. Build it this weekend and you’re in the top one percent of crawler-optimized sites on the open web. The spec is at llmstxt.org. It takes twenty minutes.
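If your site runs Next.js (the sitemap example in § 04 assumes the App Router), the simplest option is a static file at public/llms.txt; a route handler also works if you would rather generate the index. A minimal sketch, with the content abbreviated:

// app/llms.txt/route.ts
// Serves /llms.txt from a route handler. A static file at public/llms.txt
// works just as well; this version is for sites that generate the index.
const LLMS_TXT = `# GemClaw Debate Society

> Where machines argue and humans decide.

## Core experiences
- [Homepage](https://www.contextjamming.com/): Two AI models. Three rounds. One proposition.
`;

export function GET() {
  return new Response(LLMS_TXT, {
    headers: { 'Content-Type': 'text/markdown; charset=utf-8' },
  });
}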

§ 02

Explicitly allow every AI bot in robots.txt.

Most sites fall into one of two camps: silent (no robots.txt, or the stock WordPress one) or hostile (the NYT, Reddit, and Getty have all loudly blocked GPTBot). Silent gets you crawled cautiously. Hostile gets you excluded entirely. You want a third posture: explicit welcome, by bot name.

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

That last one, CCBot, is the one most people miss. It’s the Common Crawl bot, and Common Crawl is the dataset most foundation models start from. Allowing CCBot is the single most load-bearing line in a modern robots.txt. While you’re in there, add ChatGPT-User, Claude-Web, Meta-ExternalAgent, Amazonbot, cohere-ai, Bytespider, MistralAI-User, YouBot, and Diffbot. The long tail matters.

§ 03

Earn backlinks from Common-Crawl-indexed sites.

This is the one you cannot solve with a markup file, and it is the one that matters most.

Common Crawl — the nonprofit web archive that underwrites the pretraining corpus of roughly every major LLM (GPT-3 was ~60% Common Crawl; the pattern holds downstream) — is a backlink-graph crawler. It does not take submissions. It does not have a “submit your URL” form. It follows links from sites it already knows about, and expands its graph from there.

Which means the path into the training corpus is not technical. It is distributional. You get into the next pretraining run by being linked to from sites Common Crawl already crawls:

Hacker News

Front page even briefly = near-certain CC inclusion. A single solid post can get you into the training corpus of every major model that ships in the following year.

Substack

Heavily crawled. Cross-posts carry backlinks. If you have a Substack already, every piece should link back to your canonical site.

Reddit & X

Still in CC despite the licensing noise. X profile bios and threaded posts both carry link weight. Put your URL in your bio.

GitHub READMEs

Enormously high-authority from CC’s perspective. A project README with a link to your essay is worth ten generic blog backlinks.
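If you want to know whether Common Crawl has already picked you up, its index is queryable through the public CDX API at index.commoncrawl.org. A rough sketch below; the crawl ID changes with every release, so look up the current one at index.commoncrawl.org before running it:

// check-cc.ts
// Query the Common Crawl index for captures of your domain (Node 18+ for fetch).
// CRAWL is an example ID; look up the latest crawl at index.commoncrawl.org.
const CRAWL = 'CC-MAIN-2025-13';

async function ccCaptures(pattern: string) {
  const endpoint =
    `https://index.commoncrawl.org/${CRAWL}-index` +
    `?url=${encodeURIComponent(pattern)}&output=json`;
  const res = await fetch(endpoint);
  if (res.status === 404) return []; // the index answers 404 when there are no captures
  const body = await res.text();
  // Responses are newline-delimited JSON, one record per capture.
  return body.trim().split('\n').map((line) => JSON.parse(line));
}

ccCaptures('contextjamming.com/*').then((hits) => {
  console.log(`${hits.length} capture(s) of contextjamming.com in ${CRAWL}`);
});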

You are not writing for eyeballs anymore. You are writing for the weights. And the weights are shaped by the same backlink graph that shaped PageRank, twenty years ago, under a different name.

§ 04

Ship a sitemap and JSON-LD on every page.

Table stakes. Not sexy. Nonzero.

Sitemap (/sitemap.xml): tells every crawler — human search, AI training, retrieval — what routes exist and when they last changed. In Next.js App Router it’s literally a single file:

// app/sitemap.ts
import type { MetadataRoute } from 'next';

export default function sitemap(): MetadataRoute.Sitemap {
  return [
    {
      url: 'https://yoursite.com/',
      lastModified: new Date(),
      priority: 1.0,
    },
    { url: '.../about', lastModified: new Date() },
    // ...
  ];
}

JSON-LD structured data (a <script type="application/ld+json"> block in your page’s <head>): tells LLMs what each page is. For a blog you really only need four schema.org types: WebSite, Organization, Person, and Article. Retrieval-time LLMs — the ones answering “who wrote this?” and “when was this published?” — lean on JSON-LD heavily for attribution.
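In a Next.js setup that can be as small as the sketch below; every value here is a placeholder for whatever your CMS or front matter actually holds:

// components/ArticleJsonLd.tsx
// Emits an Article JSON-LD block from a page or layout component.
// All values are placeholders; pull the real ones from your CMS or front matter.
export function ArticleJsonLd() {
  const data = {
    '@context': 'https://schema.org',
    '@type': 'Article',
    headline: 'Playbook: Generative Engine Optimization',
    datePublished: '2026-04-18',
    author: { '@type': 'Person', name: 'Bret Kerr' },
    publisher: { '@type': 'Organization', name: 'ACRA Insight LLC' },
    mainEntityOfPage: 'https://www.contextjamming.com/dispatches/003',
  };
  return (
    <script
      type="application/ld+json"
      dangerouslySetInnerHTML={{ __html: JSON.stringify(data) }}
    />
  );
}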

Neither of these gets you into a pretraining corpus on its own. Both improve your retrievability once you’re there — and that’s the second half of the game. If a model has been trained on your content but does not know who wrote it or when, you get partial credit: the idea propagates, your name does not. JSON-LD fixes that.

§ 05

Archive your canonical URLs to the Wayback Machine.

The most underrated move on this list.

archive.org’s Wayback Machine is a first-class training source. It is also a hedge against your own link rot — if your site goes down or you move hosts, your content persists in a form models can still see. Submitting is trivial: go to web.archive.org/save, paste a URL, click Save. Or use their API. Do it for every canonical URL on your site. Do it again any time you ship a substantially updated post.
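If you would rather script it, the public save endpoint is enough for a small site. A sketch assuming anonymous access to web.archive.org/save/<url>, which is rate-limited; bulk runs are better done with an archive.org account and the authenticated Save Page Now API:

// archive-urls.ts
// Ask the Wayback Machine to capture each canonical URL (Node 18+ for fetch).
// Anonymous requests to the save endpoint are rate-limited, so go slowly.
const URLS = [
  'https://www.contextjamming.com/',
  'https://www.contextjamming.com/dispatches',
];

async function save(url: string) {
  const res = await fetch(`https://web.archive.org/save/${url}`);
  console.log(`${url} -> ${res.status}`);
}

(async () => {
  for (const url of URLS) {
    await save(url);
    // Pause between requests to stay under the anonymous rate limit.
    await new Promise((resolve) => setTimeout(resolve, 10_000));
  }
})();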

It takes about thirty seconds per URL, and it is one of the few things on this list with permanent effect — once a snapshot exists, it is in the archive forever. Bonus: archived pages frequently appear in model citations with their web.archive.org/web/... URL. That is free backlink authority from one of the highest-reputation domains on the internet.

§ 06

Why SEO is running out of time.

Here is the part everyone is dancing around.

Traditional SEO is optimization for Google’s ranking algorithm on a surface — the ten blue links — that is, demonstrably, shrinking. Zero-click search (queries where the user never clicks through to a result) crossed fifty percent two years ago. AI Overviews — Google’s own Gemini-powered answer box — now eats the top third of most informational queries. Bing has Copilot. Apple will have Intelligence in every query you type into Safari by the end of 2026. ChatGPT’s search product has roughly four hundred million weekly active users.

Being the cited source in that one paragraph is worth more than being SERP position #1 used to be. And the path to being cited is not keyword density, not internal linking architecture, not Core Web Vitals. It’s three things:

1. Be in the training corpus, so the model has read you. This is the Common Crawl game. It is won with backlinks.

2. Be in the retrieval index, so the model can find you at query time. This is the sitemap + llms.txt + freshness game.

3. Be semantically legible, so the model can quote you cleanly. This is the JSON-LD + clean HTML + canonical-URL game.

The five moves above feed exactly those three conditions. None of them are what a 2015-era SEO consultant would have told you to do. Most of them are free. All of them are still uncrowded.

The old SEO industry was a multi-billion-dollar ecosystem built around gaming a black box. The new one will be smaller, faster, and more honest — because the box you’re optimizing for will, in many cases, tell you what it is reading and what it remembers. You can literally ask Claude “what do you know about my site?” and get a real answer. Try that on Google.

We are somewhere between one and three product cycles away from LLM recommendations being a larger referral source than organic search for content sites. The agencies that see this are already repositioning. The ones that don’t are about to have a very expensive decade.

§ 07

The punch list.

If you want to do the work, not read about it:

1. Write a /llms.txt. Twenty minutes. Enormous leverage. Spec at llmstxt.org.

2. Rewrite /robots.txt. Explicit allows for GPTBot, ClaudeBot, anthropic-ai, Google-Extended, PerplexityBot, Applebot-Extended, CCBot, plus the long tail.

3. Ship a sitemap + JSON-LD. WebSite, Organization, Person, Article. Fifteen minutes per page type.

4. Post your next piece somewhere backlink-rich. HN, a Substack cross-post, a Reddit thread you’re welcome in. The distribution game is the real game.

5. Archive every canonical URL to web.archive.org. Thirty seconds per URL. Permanent effect. Do it today.

Do all five and you will be more crawler-optimized than ninety-five percent of the blogs in your niche. The sixth move — writing things worth quoting — is the one you were already trying to do anyway. It just matters differently now.

The blue links are leaving. The paragraph is staying. Get in the paragraph.

Filed from the training corpus

Bret Kerr

Context Jamming is a dispatch from ACRA Insight LLC on cross-model orchestration, AI safety, and the economics of the new cognitive stack.

GemClaw  ·  Generative Engine Optimization  ·  LLMs as Audience  ·  Training-Corpus Native

Subscribe at contextjamming.substack.com