§ · ACRA Insight · MoEA Pipeline (Mixture of Expert Agents)

The bottleneck is human judgment,
not silicon.

Expert preference labels routinely cost 50–500× the GPU compute of the DPO run they feed — because expert labels run $1.50–$100 each while a 7B DPO pass is ~$8–$32. One illustrative data point: $60k in labels, $360 in compute — 167× in that scenario, not a universal constant.

50–500×
Expert label cost vs. the compute it trains
Scenario-dependent, not a constant. A single data point: $60k labels / $360 compute run ≈ 167×.

64–68%
LLM-judge agreement with subject-matter experts
This is the ceiling synthetic-only labeling cannot cross (Szymanski et al., IUI 2025). A human gold set is non-optional.

~5–20×
Defensible human-vs-MoEA labeling saving
This is the number to defend to an engineer. Assumes 5–15% expert gold set retained. Model the Ledger below.

§ 02 · The Harness

Compose an anchored research prompt

Configure domain, dimensions, and gold-set parameters below. The XML prompt regenerates live on every input change; generation is deterministic and requires no API key.

Domain

Labeling objective
Describes what model behavior this preference data will improve.

Preference dimensions

Does the summary accurately represent the source document without introducing false facts?
Are known contraindications, drug interactions, or safety warnings present and correctly stated?
Are uncertainty statements proportionate to evidence strength — no overclaiming or underclaiming?
Are all clinically actionable findings included without omission?
Is the language appropriate for the intended clinical audience (attending, resident, or patient)?

Gold-set fraction: 10%
Expert review for 10% of pairs. Below 5%: inter-annotator drift risk. Above 20%: cost advantage shrinks.

Judge ensemble size

Single · 3 families ★ · 5 families

3 diverse families ≈ PoLL (Verga et al., EMNLP 2024) — 7–8× cheaper than a single GPT-4 judge while reducing intra-model bias.

Agentic compose (BYO key)

deep_research_prompt.xml · 5 anchors

<deep_research_prompt domain="Clinical Summarization" objective="expert-domain-preference-labeling">
  <research_objective>Evaluate AI-generated clinical summaries for factual fidelity, completeness, and calibrated uncertainty — for use in post-training a clinical summarization model.</research_objective>

  <semantic_anchors>
    <!-- One LABEL-ANCHOR per dimension. The id is the downstream join key. -->
    <anchor id="clinical-summarization.factual-fidelity.01" dimension="Factual Fidelity" criterion="Does the summary accurately represent the source document without introducing false facts?" tier="escalate"/>
    <anchor id="clinical-summarization.contraindication-omission.02" dimension="Contraindication Omission" criterion="Are known contraindications, drug interactions, or safety warnings present and correctly stated?" tier="escalate"/>
    <anchor id="clinical-summarization.hedging-calibration.03" dimension="Hedging Calibration" criterion="Are uncertainty statements proportionate to evidence strength — no overclaiming or underclaiming?" tier="escalate"/>
    <anchor id="clinical-summarization.clinical-completeness.04" dimension="Clinical Completeness" criterion="Are all clinically actionable findings included without omission?" tier="escalate"/>
    <anchor id="clinical-summarization.readability-for-audience.05" dimension="Readability for Audience" criterion="Is the language appropriate for the intended clinical audience (attending, resident, or patient)?" tier="screen"/>
  </semantic_anchors>

  <retrieval_directives>
    <directive>Ground every claim in a retrievable, citable primary source. No source, no claim.</directive>
    <directive>Attach the matching LABEL-ANCHOR id to every finding so it ports to the labeling stage.</directive>
    <directive>For each anchored finding, emit a one-line rationale a domain expert could audit in under 30 seconds.</directive>
  </retrieval_directives>

  <labeling_handoff>
    <schema>For each prompt, emit a chosen/rejected pair scored on the anchored dimensions, with the supporting LABEL-ANCHOR ids and sources carried through.</schema>
    <output_format>JSONL: {"prompt","chosen","rejected","anchors":[...],"sources":[...],"rationale","dimension","tier"}</output_format>
  </labeling_handoff>

  <validation_gate>
    <gold_fraction>10%</gold_fraction>
    <judge_ensemble families="3" debias="position-swap"/>
    <kappa_threshold>0.60</kappa_threshold>
    <escalation_rule>Any pair whose dimension tier is "escalate", or where ensemble judges disagree, routes to the human gold set.</escalation_rule>
  </validation_gate>
</deep_research_prompt>

Stage 1 of the pipeline. Paste into Gemini Deep Research → anchored output ports into Stage 2 labeling.
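As a rough illustration of why the prompt can regenerate deterministically without an API key, the sketch below rebuilds the anchor ids and the validation gate from plain inputs. The `slugify` rule is an assumption inferred from the ids shown above (lowercase, hyphen-separated, two-digit index); the page does not state how ids are actually generated, and `build_prompt` is a hypothetical helper that omits the `retrieval_directives` and `labeling_handoff` blocks for brevity.

```python
import re
import xml.etree.ElementTree as ET


def slugify(text: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def anchor_id(domain: str, dimension: str, index: int) -> str:
    """Build the join key, e.g. clinical-summarization.factual-fidelity.01 (assumed rule)."""
    return f"{slugify(domain)}.{slugify(dimension)}.{index:02d}"


def build_prompt(domain, objective, dimensions, gold_fraction=0.10, families=3):
    """dimensions: list of (name, criterion, tier) tuples, in display order."""
    root = ET.Element("deep_research_prompt", domain=domain,
                      objective="expert-domain-preference-labeling")
    ET.SubElement(root, "research_objective").text = objective

    anchors = ET.SubElement(root, "semantic_anchors")
    for i, (name, criterion, tier) in enumerate(dimensions, start=1):
        ET.SubElement(anchors, "anchor", id=anchor_id(domain, name, i),
                      dimension=name, criterion=criterion, tier=tier)

    gate = ET.SubElement(root, "validation_gate")
    ET.SubElement(gate, "gold_fraction").text = f"{gold_fraction:.0%}"
    ET.SubElement(gate, "judge_ensemble", families=str(families),
                  debias="position-swap")
    ET.SubElement(gate, "kappa_threshold").text = "0.60"

    # Same inputs always serialize to the same string: no model call involved.
    return ET.tostring(root, encoding="unicode")
```

Because the output is a pure function of the inputs, two calls with the same domain and dimensions should produce byte-identical prompts, which is what keeps the anchor id usable as a stable join key downstream.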

§ 03 · The Handoff

Anchor → label join

The anchor id is what makes every label auditable — it traces a preference pick to a cited finding, which is the property human-only pipelines usually cannot produce at scale.

labeling_handoff_preview.jsonl · First 2 dimensions · placeholder chosen/rejected

{"prompt":"[Example prompt for Clinical Summarization — Factual Fidelity]","chosen":"[Response that correctly satisfies: \"Does the summary accurately represent the source document without introducing false facts?\"]","rejected":"[Plausible response that fails: \"Does the summary accurately represent the source document without introducing false facts?\"]","anchors":["clinical-summarization.factual-fidelity.01"],"sources":["[Primary citable source for Factual Fidelity finding]"],"rationale":"Chosen satisfies anchor clinical-summarization.factual-fidelity.01; rejected violates it in a way a domain expert can verify against the cited source.","dimension":"Factual Fidelity","tier":"escalate"}
{"prompt":"[Example prompt for Clinical Summarization — Contraindication Omission]","chosen":"[Response that correctly satisfies: \"Are known contraindications, drug interactions, or safety warnings present and correctly stated?\"]","rejected":"[Plausible response that fails: \"Are known contraindications, drug interactions, or safety warnings present and correctly stated?\"]","anchors":["clinical-summarization.contraindication-omission.02"],"sources":["[Primary citable source for Contraindication Omission finding]"],"rationale":"Chosen satisfies anchor clinical-summarization.contraindication-omission.02; rejected violates it in a way a domain expert can verify against the cited source.","dimension":"Contraindication Omission","tier":"escalate"}
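One way to make the join concrete is to check each JSONL row against the anchors declared in the Stage 1 prompt. The sketch below is a minimal example under assumptions: `load_anchors` and `check_row` are hypothetical helper names, and the rule that a row's `dimension` and `tier` must match its anchor is inferred from the preview rows rather than stated by the page.

```python
import json
import xml.etree.ElementTree as ET

REQUIRED = ("prompt", "chosen", "rejected", "anchors", "sources",
            "rationale", "dimension", "tier")


def load_anchors(prompt_xml: str) -> dict:
    """Map anchor id -> (dimension, tier) from the Stage 1 XML prompt."""
    root = ET.fromstring(prompt_xml)
    return {a.get("id"): (a.get("dimension"), a.get("tier"))
            for a in root.iter("anchor")}


def check_row(line: str, anchors: dict) -> list:
    """Return a list of problems for one JSONL row; an empty list means it joins cleanly."""
    row = json.loads(line)
    problems = [f"missing field: {k}" for k in REQUIRED if k not in row]
    for aid in row.get("anchors", []):
        if aid not in anchors:
            problems.append(f"unknown anchor: {aid}")
        elif anchors[aid] != (row.get("dimension"), row.get("tier")):
            problems.append(f"dimension/tier mismatch for {aid}")
    if not row.get("anchors"):
        problems.append("no anchors")   # a label with no anchor cannot be audited
    if not row.get("sources"):
        problems.append("no sources")   # "No source, no claim."
    return problems
```

Rows that come back with a non-empty problem list would be candidates for rejection or escalation before they reach the DPO dataset.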

§ 04 · The Ledger

Invoice No. — MoEA vs. Human Labeling

Adjust the inputs; the receipt recalculates. At defaults (600 labels · $60/label · $0.30 synthetic · 10% gold) the defensible saving is ~9.5×. At $100/label the headline vs. DPO compute reaches ~167× — clearly labeled illustrative.

Label count: 600

Expert rate ($/label): $60.00
$1.50 skilled annotator → $100 medical/legal expert

Synthetic cost ($/label): $0.30
Deep-research + ensemble judging, amortized

Gold-set fraction: 10%

DPO compute run ($): $360
7B DPO pass: $8–$32. 70B: $360–$1,440

Filed · contextjamming.com · Invoice No. MoEA-600-10

Human-only labeling: 600 × $60.00 = $36.0k
MoEA synthetic: 600 × $0.30 = $180.00
MoEA gold-set audit: 600 × 10% × $60.00 = $3.60k
MoEA total: $3.78k

Defensible saving (Human ÷ MoEA total · the number to defend to an engineer): ~9.5×

Headline vs. compute (Human ÷ DPO compute · ILLUSTRATIVE; compare only when context is clear): ~100×

DPO compute ($8–$32 for 7B, ~$360 for 70B per Hugging Face estimates) is trivial and separate from labeling cost — the saving comes from expert review over only the audited fraction. Per-label rate reference: NextWealth / Lightly market surveys, 2024–2025.
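The receipt arithmetic is simple enough to reproduce by hand. The sketch below uses the Ledger's formulas and defaults; the function name `ledger` and its parameter names are illustrative, not part of the site's code.

```python
def ledger(labels=600, expert_rate=60.0, synthetic_rate=0.30,
           gold_fraction=0.10, dpo_compute=360.0):
    """Recompute the Ledger receipt from its five inputs (all costs in dollars)."""
    human_only = labels * expert_rate                  # every label by an expert
    synthetic = labels * synthetic_rate                # every label by the ensemble
    gold_audit = labels * gold_fraction * expert_rate  # expert review of the gold set
    moea_total = synthetic + gold_audit
    return {
        "human_only": human_only,
        "synthetic": synthetic,
        "gold_audit": gold_audit,
        "moea_total": moea_total,
        "defensible_saving": human_only / moea_total,    # Human ÷ MoEA total
        "headline_vs_compute": human_only / dpo_compute, # Human ÷ DPO compute
    }


if __name__ == "__main__":
    for name, value in ledger().items():
        print(f"{name}: {value:,.2f}")
```

At the defaults this gives a MoEA total of $3,780 and a defensible saving of about 9.5×. Raising `expert_rate` to 100 moves the headline ratio to roughly 167×, which is the illustrative data point quoted at the top of the page.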

§ 05 · Validation Backbone

The credibility artifact a technical buyer needs

Synthetic labeling without a validation backbone is a liability, not an asset. These four checks are the minimum for a regulated-industry pitch.

Agreement vs. gold set

Cohen's / Fleiss' κ and Krippendorff's α against a held-out expert set. Ship target: κ ≥ 0.60. Always report raw percentage agreement alongside kappa: kappa corrects for chance agreement, so it can be low even when raw agreement is high if one label dominates. Below 0.60: raise gold fraction toward 15–20%.
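To show why raw agreement and kappa should be reported together, here is a minimal pure-Python sketch of Cohen's κ for one judge against the expert gold set. The function name and the toy labels are illustrative; a production pipeline would more likely use an established statistics library and would add Fleiss' κ or Krippendorff's α for more than two raters.

```python
from collections import Counter


def agreement_and_kappa(judge, gold):
    """Return (raw agreement, Cohen's kappa) for two equal-length label sequences."""
    if len(judge) != len(gold) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(gold)
    p_o = sum(j == g for j, g in zip(judge, gold)) / n  # observed agreement
    judge_freq, gold_freq = Counter(judge), Counter(gold)
    # Agreement expected by chance from each rater's label frequencies.
    p_e = sum(judge_freq[c] * gold_freq[c] for c in judge_freq) / (n * n)
    if p_e == 1.0:
        return p_o, float("nan")  # kappa is undefined when both raters use one label
    return p_o, (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    gold = ["A"] * 9 + ["B"]   # skewed gold set
    judge = ["A"] * 10         # judge that always answers "A"
    print(agreement_and_kappa(judge, gold))
```

In the skewed example the judge matches the experts on 90% of pairs yet scores κ = 0, because always answering "A" is exactly what chance would predict. That gap is what the 0.60 ship target is meant to catch.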

Judge-quality benchmarks

Run ensemble models through RewardBench 2, JudgeBench, and MT-Bench before deploying. A 3-family ensemble (e.g. Claude + Gemini + open-weight) ≈ PoLL (Verga et al., EMNLP 2024) and is 7–8× cheaper than a single GPT-4 judge while reducing intra-model bias.

Bias audit

Known biases and mitigations: Position bias (5–15%) → position-swap. Verbosity bias (10–20%) → length-normalization. Self-preference (10–25%) → cross-family judges. Without mitigations, a single-family ensemble amplifies these artifacts directly into the DPO data.
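The position-swap mitigation and the ensemble escalation rule can be sketched together. In the example below, a judge is any callable that returns "first" or "second"; that interface, the names `debiased_vote` and `route`, and the choice to escalate when a judge flips its answer after the swap are all assumptions made for illustration, not the pipeline's documented behavior.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical judge interface: (prompt, first, second) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def debiased_vote(judge: Judge, prompt: str, a: str, b: str) -> Optional[str]:
    """Ask twice with the order swapped; return "a", "b", or None if the judge flips."""
    forward = judge(prompt, a, b)    # a is shown first
    backward = judge(prompt, b, a)   # b is shown first
    pick_fwd = "a" if forward == "first" else "b"
    pick_bwd = "b" if backward == "first" else "a"
    return pick_fwd if pick_fwd == pick_bwd else None


def route(judges: Sequence[Judge], prompt: str, a: str, b: str,
          tier: str) -> Tuple[str, str]:
    """Return ("label", winner) or ("escalate", reason) for one response pair."""
    if tier == "escalate":
        return ("escalate", "tier")
    votes = [debiased_vote(j, prompt, a, b) for j in judges]
    if None in votes:
        return ("escalate", "position-inconsistent judge")
    if len(set(votes)) != 1:
        return ("escalate", "ensemble disagreement")
    return ("label", votes[0])
```

A judge that always prefers whichever response it sees first produces contradictory picks across the two orderings, so its vote is discarded here rather than averaged into the label.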

Downstream proof

Compare synthetic-labeled DPO model vs. human-labeled control on AlpacaEval 2 LC / Arena-Hard. Threshold: if the synthetic model underperforms the control by >3 points, raise gold fraction or narrow dimension scope before shipping.

§ 06 · Honest Limits

What this approach cannot do

Overclaiming kills the pitch faster than modest numbers do. Lead with these.

Every claim below has a citation. If you cannot defend it to a skeptical ML engineer or a regulated-industry buyer, do not make it.

Expert ceiling
LLM judges agree with SMEs at 64–68% (Szymanski et al., IUI 2025). Synthetic-only labeling is unsafe for high-stakes expert judgment. Scope to screening + human escalation, not replacement.
Szymanski et al., "Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks", IUI 2025
Model collapse
Training on purely synthetic preference data risks progressive quality degradation (Shumailov et al., Nature 2024). Apply labels to human-authored or retrieval-grounded content; mix real data; never fully close the loop.
Shumailov et al., "AI models collapse when trained on recursively generated data", Nature 2024
Reward hacking
Naive multi-model preference data can worsen DPO safety alignment ("More is Less", arXiv 2504.02193). A swarm is not automatically better than one well-benchmarked judge. Validate every new judge family before adding it to the ensemble.
arXiv 2504.02193, "More is Less: Scaling Multi-Agent DPO…"
Sycophancy
RLHF / DPO pipelines can amplify sycophancy — models that sound agreeable rather than accurate. Audit chosen/rejected pairs specifically for this pattern (Sharma et al. 2024).
Sharma et al., "Towards Understanding Sycophancy in Language Models", 2024
Regulated domains
FDA SaMD guidance and legal-practice regulations require human-in-the-loop for consequential decisions. Pitch MoEA-labeler as pre-screening only in clinical/legal contexts — not as a replacement for licensed expert review.
FDA SaMD guidance; ABA Model Rules on competent supervision
Open evidence gap
No published study has measured a multi-LLM judge swarm closing the expert-agreement gap specifically in medicine or law at production scale. We say so. The gap between the 68% synthetic ceiling and what regulated buyers need is the active research frontier.
State of field as of June 2026

§ 07 · Skill & Credits

Download the SKILL.md

Apache 2.0. Plug into Claude Code, Cursor, Copilot, or Gemini CLI to compose anchored deep-research prompts from a domain + labeling objective.

moea-labeler

Given a domain and labeling objective, emits an XML deep-research prompt with LABEL-ANCHOR semantic anchors and the downstream JSONL labeling schema with validation gate. Deterministic — no API key required.

Domain + labeling objective
   ↓
[moea-labeler]  ← Stage 1 (Claude Code)
   ↓
XML prompt w/ LABEL-ANCHORs → Gemini Deep Research
   ↓
Anchored, cited findings
   ↓
[preference-labeling handoff] → DPO-ready JSONL  ← Stage 2
   ↓
Validation gate (κ ≥ 0.6 vs human gold) → ship or escalate

↓ Download moea-labeler.skill.md

System position: the Labeler is the preference-data layer of the MoEA stack — MoEA Loop is the typed-recursion orchestration layer, and the skill arsenal is the open-source distribution layer. Same anchor taxonomy end to end.

Built by Bret Kerr · ACRA Insight LLC · Franklin, MA

MoEA-labeler is a preference-data layer that complements Red Hat InstructLab's instruction/knowledge layer — open, on-prem (IBM Granite via Ollama as local validation runtime), auditable, regulated-industry-ready. Gemini API credits fit via the Google for Startups program at The Open Accelerator.

Apache 2.0 · github.com/BretKerrAI/founderfile

CONTEXT JAMMING

How this site is made.

Antigravity

Claude Opus 4.8

Codex
