CONTEXT JAMMING

Field notes from inside the context window.

§ · ACRA Insight · MoEA Pipeline (Mixture of Expert Agents)

The bottleneck is human judgment,
not silicon.

Expert preference labels routinely cost 50–500× the GPU compute of the DPO run they feed: expert labels run $1.50–$100 each, while a 7B DPO pass is ~$8–$32. One illustrative data point: $60k in labels (600 at $100 each) against $360 in compute (roughly a 70B pass) is 167× in that scenario, not a universal constant.

50–500×
Expert label cost vs. the compute it trains
Scenario-dependent, not a constant. A single data point: $60k labels / $360 compute run ≈ 167×.

64–68%
LLM-judge agreement with subject-matter experts
This is the ceiling synthetic-only labeling cannot cross (Szymanski et al., IUI 2025). A human gold set is non-optional.

~5–20×
Defensible human-vs-MoEA labeling saving
This is the number to defend to an engineer. Assumes 5–15% expert gold set retained. Model the Ledger below.

§ 02 · The Harness

Compose an anchored research prompt

Configure domain, dimensions, and gold-set parameters below. The XML prompt regenerates live: the output is a deterministic function of the inputs, and no API key is required.

Describes what model behavior this preference data will improve.
Preference dimensions
Expert review for 10% of pairs. Below 5%: inter-annotator drift risk. Above 20%: cost advantage shrinks.
Judge ensemble size
3 diverse families ≈ PoLL (Verga et al., EMNLP 2024) — 7–8× cheaper than a single GPT-4 judge while reducing intra-model bias.
deep_research_prompt.xml · 5 anchors
<deep_research_prompt domain="Clinical Summarization" objective="expert-domain-preference-labeling">
  <research_objective>Evaluate AI-generated clinical summaries for factual fidelity, completeness, and calibrated uncertainty — for use in post-training a clinical summarization model.</research_objective>

  <semantic_anchors>
    <!-- One LABEL-ANCHOR per dimension. The id is the downstream join key. -->
    <anchor id="clinical-summarization.factual-fidelity.01" dimension="Factual Fidelity" criterion="Does the summary accurately represent the source document without introducing false facts?" tier="escalate"/>
    <anchor id="clinical-summarization.contraindication-omission.02" dimension="Contraindication Omission" criterion="Are known contraindications, drug interactions, or safety warnings present and correctly stated?" tier="escalate"/>
    <anchor id="clinical-summarization.hedging-calibration.03" dimension="Hedging Calibration" criterion="Are uncertainty statements proportionate to evidence strength — no overclaiming or underclaiming?" tier="escalate"/>
    <anchor id="clinical-summarization.clinical-completeness.04" dimension="Clinical Completeness" criterion="Are all clinically actionable findings included without omission?" tier="escalate"/>
    <anchor id="clinical-summarization.readability-for-audience.05" dimension="Readability for Audience" criterion="Is the language appropriate for the intended clinical audience (attending, resident, or patient)?" tier="screen"/>
  </semantic_anchors>

  <retrieval_directives>
    <directive>Ground every claim in a retrievable, citable primary source. No source, no claim.</directive>
    <directive>Attach the matching LABEL-ANCHOR id to every finding so it ports to the labeling stage.</directive>
    <directive>For each anchored finding, emit a one-line rationale a domain expert could audit in under 30 seconds.</directive>
  </retrieval_directives>

  <labeling_handoff>
    <schema>For each prompt, emit a chosen/rejected pair scored on the anchored dimensions, with the supporting LABEL-ANCHOR ids and sources carried through.</schema>
    <output_format>JSONL: {"prompt","chosen","rejected","anchors":[...],"sources":[...],"rationale","dimension","tier"}</output_format>
  </labeling_handoff>

  <validation_gate>
    <gold_fraction>10%</gold_fraction>
    <judge_ensemble families="3" debias="position-swap"/>
    <kappa_threshold>0.60</kappa_threshold>
    <escalation_rule>Any pair whose dimension tier is "escalate", or where ensemble judges disagree, routes to the human gold set.</escalation_rule>
  </validation_gate>
</deep_research_prompt>

Stage 1 of the pipeline. Paste into Gemini Deep Research → anchored output ports into Stage 2 labeling.
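
The anchor ids in the prompt above appear to follow a `domain.dimension.NN` pattern. A minimal sketch of how such ids and `<anchor>` elements could be derived deterministically from the inputs is below. The function names (`slugify`, `anchor_id`, `render_anchors`) are hypothetical and are not taken from the moea-labeler skill; the slug rule is an assumption inferred from the five ids shown.

```python
import re
from xml.sax.saxutils import quoteattr


def slugify(text: str) -> str:
    """Lowercase the text and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def anchor_id(domain: str, dimension: str, index: int) -> str:
    """Build an id shaped like 'clinical-summarization.factual-fidelity.01'."""
    return f"{slugify(domain)}.{slugify(dimension)}.{index:02d}"


def render_anchors(domain: str, dimensions: list[tuple[str, str, str]]) -> list[str]:
    """Render one <anchor/> element per (dimension, criterion, tier) tuple.

    quoteattr escapes and quotes attribute values so the XML stays well-formed.
    """
    lines = []
    for i, (name, criterion, tier) in enumerate(dimensions, start=1):
        lines.append(
            f"<anchor id={quoteattr(anchor_id(domain, name, i))} "
            f"dimension={quoteattr(name)} "
            f"criterion={quoteattr(criterion)} "
            f"tier={quoteattr(tier)}/>"
        )
    return lines


if __name__ == "__main__":
    dims = [
        ("Factual Fidelity", "Does the summary accurately represent the source?", "escalate"),
        ("Readability for Audience", "Is the language appropriate?", "screen"),
    ]
    for line in render_anchors("Clinical Summarization", dims):
        print(line)
```

Because the id is a pure function of domain, dimension, and position, the same inputs always yield the same join key, which is the property the labeling stage relies on.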

§ 03 · The Handoff

Anchor → label join

The anchor id is what makes every label auditable — it traces a preference pick to a cited finding, which is the property human-only pipelines usually cannot produce at scale.

labeling_handoff_preview.jsonl · First 2 dimensions · placeholder chosen/rejected
{"prompt":"[Example prompt for Clinical Summarization — Factual Fidelity]","chosen":"[Response that correctly satisfies: \"Does the summary accurately represent the source document without introducing false facts?\"]","rejected":"[Plausible response that fails: \"Does the summary accurately represent the source document without introducing false facts?\"]","anchors":["clinical-summarization.factual-fidelity.01"],"sources":["[Primary citable source for Factual Fidelity finding]"],"rationale":"Chosen satisfies anchor clinical-summarization.factual-fidelity.01; rejected violates it in a way a domain expert can verify against the cited source.","dimension":"Factual Fidelity","tier":"escalate"}
{"prompt":"[Example prompt for Clinical Summarization — Contraindication Omission]","chosen":"[Response that correctly satisfies: \"Are known contraindications, drug interactions, or safety warnings present and correctly stated?\"]","rejected":"[Plausible response that fails: \"Are known contraindications, drug interactions, or safety warnings present and correctly stated?\"]","anchors":["clinical-summarization.contraindication-omission.02"],"sources":["[Primary citable source for Contraindication Omission finding]"],"rationale":"Chosen satisfies anchor clinical-summarization.contraindication-omission.02; rejected violates it in a way a domain expert can verify against the cited source.","dimension":"Contraindication Omission","tier":"escalate"}
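
A minimal sketch of how a consumer might check a handoff record and apply the escalation rule from the validation gate follows. The required keys come from the `output_format` line in the prompt; `parse_record`, `route`, and the `judge_votes` list (one pick per ensemble judge) are hypothetical names introduced for illustration.

```python
import json

# Keys listed in the <output_format> of the deep-research prompt.
REQUIRED_KEYS = {
    "prompt", "chosen", "rejected", "anchors",
    "sources", "rationale", "dimension", "tier",
}


def parse_record(line: str) -> dict:
    """Parse one JSONL line and reject records that cannot be audited."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not record["anchors"] or not record["sources"]:
        raise ValueError("each label needs at least one anchor id and one source")
    return record


def route(record: dict, judge_votes: list[str]) -> str:
    """Apply the escalation rule: 'escalate' tier or judge disagreement goes to humans.

    judge_votes is an assumed representation: one "chosen"/"rejected"-style
    pick per ensemble judge for this pair.
    """
    judges_disagree = len(set(judge_votes)) > 1
    if record["tier"] == "escalate" or judges_disagree:
        return "human_gold"
    return "synthetic"


if __name__ == "__main__":
    sample = json.dumps({
        "prompt": "p", "chosen": "c", "rejected": "r",
        "anchors": ["clinical-summarization.readability-for-audience.05"],
        "sources": ["s"], "rationale": "why",
        "dimension": "Readability for Audience", "tier": "screen",
    })
    rec = parse_record(sample)
    print(route(rec, ["A", "A", "A"]))  # unanimous judges on a screen-tier pair
    print(route(rec, ["A", "B", "A"]))  # any disagreement escalates
```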
§ 04 · The Ledger

Invoice No. — MoEA vs. Human Labeling

Adjust the inputs; the receipt recalculates. At defaults (600 labels · $60/label · $0.30 synthetic · 10% gold) the defensible saving is ~9.5×. At $100/label the headline vs. DPO compute reaches ~167× — clearly labeled illustrative.

Labels: 600
Human cost per label: $60.00 ($1.50 skilled annotator → $100 medical/legal expert)
Synthetic cost per label: $0.30 (deep-research + ensemble judging, amortized)
Gold-set fraction: 10%
DPO compute: $360 (7B DPO pass: $8–$32. 70B: $360–$1,440)

Filed · contextjamming.com · Invoice No. MoEA-600-10

Human-only labeling     600 × $60.00          $36.0k
MoEA synthetic          600 × $0.30           $180.00
MoEA gold-set audit     600 × 10% × $60.00    $3.60k
MoEA total                                    $3.78k

Defensible saving: ~9.5× (Human ÷ MoEA total · the number to defend to an engineer)
Headline vs. compute: ~100.0× (Human ÷ DPO compute · ILLUSTRATIVE, compare only when context is clear)

DPO compute ($8–$32 for 7B, ~$360 for 70B per Hugging Face estimates) is trivial and separate from labeling cost — the saving comes from expert review over only the audited fraction. Per-label rate reference: NextWealth / Lightly market surveys, 2024–2025.
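
The receipt arithmetic can be reproduced in a few lines. This is a sketch of the calculation as described by the invoice rows, not the site's actual calculator code; `LedgerInputs` and `ledger` are hypothetical names, and the defaults mirror the stated default inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerInputs:
    n_labels: int = 600
    human_cost_per_label: float = 60.00
    synthetic_cost_per_label: float = 0.30
    gold_fraction: float = 0.10       # share of pairs audited by experts
    dpo_compute_cost: float = 360.00  # illustrative compute figure


def ledger(inp: LedgerInputs) -> dict[str, float]:
    """Recompute the invoice rows from the inputs."""
    human_only = inp.n_labels * inp.human_cost_per_label
    synthetic = inp.n_labels * inp.synthetic_cost_per_label
    gold_audit = inp.n_labels * inp.gold_fraction * inp.human_cost_per_label
    moea_total = synthetic + gold_audit
    return {
        "human_only": human_only,
        "moea_synthetic": synthetic,
        "moea_gold_audit": gold_audit,
        "moea_total": moea_total,
        # Human ÷ MoEA total: the saving to defend to an engineer.
        "defensible_saving": human_only / moea_total,
        # Human ÷ DPO compute: illustrative headline only.
        "headline_vs_compute": human_only / inp.dpo_compute_cost,
    }


if __name__ == "__main__":
    for name, value in ledger(LedgerInputs()).items():
        print(f"{name:>20}: {value:,.2f}")
```

At the defaults this gives $36,000 human-only against $3,780 MoEA total, about 9.5×, and 100× against compute. Raising `human_cost_per_label` to 100 moves the headline ratio to about 167×, while the defensible saving stays in the same range because the gold-set audit scales with the same rate.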

§ 05 · Validation Backbone

The credibility artifact a technical buyer needs

Synthetic labeling without a validation backbone is a liability, not an asset. These four checks are the minimum for a regulated-industry pitch.

01

Agreement vs. gold set

Cohen's / Fleiss' κ and Krippendorff's α against a held-out expert set. Ship target: κ ≥ 0.60. Always report raw percentage agreement alongside kappa: kappa discounts the agreement expected by chance, so when one label dominates, high raw agreement can coexist with a low κ. Below 0.60: raise gold fraction toward 15–20%.
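
A minimal sketch of the agreement check for two raters (one judge verdict series against one expert series), using the standard Cohen's κ formula, κ = (p_o − p_e) / (1 − p_e). The function names and the `SHIP_KAPPA` constant are illustrative; Fleiss' κ and Krippendorff's α for more than two raters are not covered here.

```python
from collections import Counter

SHIP_KAPPA = 0.60  # ship target stated in the validation gate


def raw_agreement(a: list[str], b: list[str]) -> float:
    """Fraction of items on which the two raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    p_o = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    expert = ["A"] * 10
    judge = ["A"] * 9 + ["B"]
    print(raw_agreement(judge, expert))             # 0.9
    print(cohens_kappa(judge, expert))              # 0.0
    print(cohens_kappa(judge, expert) >= SHIP_KAPPA)  # False
```

The toy data shows why both numbers belong in the report: 90% raw agreement here carries no information beyond chance, because the expert never picked "B".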

02

Judge-quality benchmarks

Run ensemble models through RewardBench 2, JudgeBench, and MT-Bench before deploying. A 3-family ensemble (e.g. Claude + Gemini + open-weight) ≈ PoLL (Verga et al., EMNLP 2024) and is 7–8× cheaper than a single GPT-4 judge while reducing intra-model bias.

03

Bias audit

Known biases and mitigations: Position bias (5–15%) → position-swap. Verbosity bias (10–20%) → length-normalization. Self-preference (10–25%) → cross-family judges. Without mitigations, a single-family ensemble amplifies these artifacts directly into the DPO data.
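
Position-swap, the first mitigation listed, can be sketched as follows: ask the judge twice with the response order reversed and keep the verdict only if it survives the swap. The `Judge` callable signature and the toy judges are assumptions for illustration; a real judge would wrap a model call.

```python
from typing import Callable, Optional

# Assumed judge interface: (prompt, first_response, second_response) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def position_swapped_vote(judge: Judge, prompt: str, resp_a: str, resp_b: str) -> Optional[str]:
    """Return 'A' or 'B' if the judge is order-consistent, else None."""
    forward = judge(prompt, resp_a, resp_b)   # A shown first
    backward = judge(prompt, resp_b, resp_a)  # B shown first
    pick_forward = "A" if forward == "first" else "B"
    pick_backward = "B" if backward == "first" else "A"
    # A flipped verdict signals position bias; treat it as no vote.
    return pick_forward if pick_forward == pick_backward else None


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge that is order-consistent (but verbosity-biased)."""
    return "first" if len(first) >= len(second) else "second"


if __name__ == "__main__":
    print(position_swapped_vote(always_first, "q", "short", "a much longer answer"))  # None
    print(position_swapped_vote(longer_wins, "q", "short", "a much longer answer"))   # B
```

The second toy judge passes the swap check while still favoring length, which is why position-swap alone does not replace length normalization or cross-family judging.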

04

Downstream proof

Compare synthetic-labeled DPO model vs. human-labeled control on AlpacaEval 2 LC / Arena-Hard. Threshold: if the synthetic model underperforms the control by >3 points, raise gold fraction or narrow dimension scope before shipping.

§ 06 · Honest Limits

What this approach cannot do

Overclaiming kills the pitch faster than modest numbers do. Lead with these.

Every claim below has a citation. If you cannot defend it to a skeptical ML engineer or a regulated-industry buyer, do not make it.
  • Expert ceiling

    LLM judges agree with SMEs at 64–68% (Szymanski et al., IUI 2025). Synthetic-only labeling is unsafe for high-stakes expert judgment. Scope to screening + human escalation, not replacement.

    Szymanski et al., "Is Your LLM a Good Evaluator?" IUI 2025
  • Model collapse

    Training on purely synthetic preference data risks progressive quality degradation (Shumailov et al., Nature 2024). Apply labels to human-authored or retrieval-grounded content; mix real data; never fully close the loop.

    Shumailov et al., "AI models collapse when trained on recursively generated data", Nature 2024
  • Reward hacking

    Naive multi-model preference data can worsen DPO safety alignment ("More is Less", arXiv 2504.02193). A swarm is not automatically better than one well-benchmarked judge. Validate every new judge family before adding it to the ensemble.

    arXiv 2504.02193, "More is Less: Scaling Multi-Agent DPO…"
  • Sycophancy

    RLHF / DPO pipelines can amplify sycophancy — models that sound agreeable rather than accurate. Audit chosen/rejected pairs specifically for this pattern (Sharma et al. 2024).

    Sharma et al., "Towards Understanding Sycophancy in Language Models", 2024
  • Regulated domains

    FDA SaMD guidance and legal-practice regulations require human-in-the-loop for consequential decisions. Pitch MoEA-labeler as pre-screening only in clinical/legal contexts — not as a replacement for licensed expert review.

    FDA SaMD guidance; ABA Model Rules on competent supervision
  • Open evidence gap

    No published study has measured a multi-LLM judge swarm closing the expert-agreement gap specifically in medicine or law at production scale. We say so. The gap between the 68% synthetic ceiling and what regulated buyers need is the active research frontier.

    State of field as of June 2026
§ 07 · Skill & Credits

Download the SKILL.md

Apache 2.0. Plug into Claude Code, Cursor, Copilot, or Gemini CLI to compose anchored deep-research prompts from a domain + labeling objective.

moea-labeler

Given a domain and labeling objective, emits an XML deep-research prompt with LABEL-ANCHOR semantic anchors and the downstream JSONL labeling schema with validation gate. Deterministic — no API key required.

Domain + labeling objective
   ↓
[moea-labeler]  ← Stage 1 (Claude Code)
   ↓
XML prompt w/ LABEL-ANCHORs → Gemini Deep Research
   ↓
Anchored, cited findings
   ↓
[preference-labeling handoff] → DPO-ready JSONL  ← Stage 2
   ↓
Validation gate (κ ≥ 0.6 vs human gold) → ship or escalate
↓ Download moea-labeler.skill.md

System position: the Labeler is the preference-data layer of the MoEA stack — MoEA Loop is the typed-recursion orchestration layer, and the skill arsenal is the open-source distribution layer. Same anchor taxonomy end to end.

Built by Bret Kerr · ACRA Insight LLC · Franklin, MA

MoEA-labeler is a preference-data layer that complements Red Hat InstructLab's instruction/knowledge layer — open, on-prem (IBM Granite via Ollama as local validation runtime), auditable, regulated-industry-ready. Gemini API credits fit via the Google for Startups program at The Open Accelerator.

Apache 2.0 · github.com/BretKerrAI/founderfile

§ · Invoice No. 001 · The Build Ledger

The Ledger.

Filed · contextjamming.com

What a conservative mid-market digital agency would have quoted for the same scope, itemized against what this site actually cost. Agency numbers are the floor — not the premium brand-studio tier.

TIME

12 weeks

2 days

~42× faster

COST

~$150,000

~$300

~500× cheaper

TEAM

5-person agency

1 human + 3 models

Same deliverable

§ Itemized — what a mid-market agency SOW would have billed

Item                                                      Hours        Cost
Discovery · brand positioning · workshops                 40–80 hr     $10,000
Design system · Figma tokens · 3 rounds                   60–120 hr    $18,000
Wavesurfer audio carousel · single-track context          60–100 hr    $16,000
Dual lightbox systems · focus trap · keyboard             30–50 hr     $8,000
LLM product flows · streaming · state machine             80–160 hr    $26,000
Stripe · checkout · webhooks · env hardening              40–80 hr     $10,000
Editorial routes · 6 sub-pages · templates                60–100 hr    $14,000
Accessibility pass · aria · reduced-motion                40–80 hr     $10,000
QA · cross-browser · mobile matrix                        60–100 hr    $14,000
Cross-publication rebrand · masthead + IA · 2026-04-28    20–40 hr     $6,000
Subtotal                                                  ~700 hr      $126,000
Project management · 18% overhead                                      $24,000
Agency total · conservative floor                         ~700 hr      ~$150,000
Actually spent · Claude + Gemini stack                    ~20 hr       ~$300

Agency figure assumes ~700 billable hours at $200/hr blended, plus ~18% PM overhead — the conservative floor of a mid-market SOW. Premium brand studios would have quoted 2–3× that. Stack: Antigravity (orchestrator), Claude Opus 4.8 (auditor), Codex (adversary), Cloudflare Workers / OpenNext.

§   Colophon

How this site is made.

Vol. 26 · build log

Every page on contextjamming.com is the output of a real-time, three-body Mixture-of-Experts loop. One model orchestrates. Two consult. The human holds the thesis. No single model commits alone.

View Redesign Assessment →

Orchestrator

Antigravity

Google DeepMind

  • Primary author
  • Terminal-native, direct push to Cloudflare
  • Audit trail to GitHub on every commit
  • Adaptive thinking · effort: extra-high

Auditor

Claude Opus 4.8

1M context

  • Editorial critic
  • Code review before merge
  • Backup-of-record
  • Co-signs every commit

Adversary

Codex

Cross-model MoE

  • Factual adjudication
  • Structural dissent
  • Deep Research → semantic triples
  • Caught the Donelan incident

Stack

Next.js: 16.2 · App Router
React: 19.2
TypeScript: 5
Tailwind: v4 · @theme inline
@opennextjs/cloudflare: adapter
wrangler: Pages deploy
framer-motion: transitions
wavesurfer.js: audio waveforms

Typeset in

Fraunces: variable · opsz + SOFT
Playfair Display: debate display
IBM Plex Mono: editorial metadata
Geist Mono: utility mono
Caveat: grease-pencil marginalia
All via: next/font/google
Palette: single @theme block
No dupe tokens: ever

Infrastructure

Deploy: Cloudflare Workers / OpenNext
ISR: 30-min revalidate · Cloudflare-served
Repo: github.com/BretKerrAI/founderfile
Branch: main
Analytics: Google Tag Manager
Apex: contextjamming.com
Runtime: Node 24
Build tool: Turbopack

       human intent
            │
            ▼
   ┌────────────────────┐         ┌─────────────────┐
   │    Antigravity     │  ◄────► │ Claude Opus 4.8 │      ← auditor loop
   │    (orchestrator)  │         │     (auditor)   │
   └─────────┬──────────┘         └─────────────────┘
             │  ◄───────────┐
             ▼              │
       ┌──────────┐    ┌────┴───────┐
       │Cloudflare│    │   Codex    │          ← adversarial loop
       │ Workers  │    │            │
       └─────┬────┘    └────────────┘
             │
             ▼
       contextjamming.com
             │
             ▼
       ┌──────────────┐
       │   Git push   │         ← audit trail
       └──────────────┘

Assembled on Mac in Terminal · Filed from Franklin, MA
Context Jamming · ACRA Insight LLC · MIT License · FounderFile.ai · RelationalIntelligence.xyz · Commission a Dispatch →