The bottleneck is human judgment, not silicon.
Expert preference labels routinely cost 50–500× the GPU compute of the DPO run they feed, because expert labels run $1.50–$100 each while a 7B DPO pass costs ~$8–$32. One illustrative data point: $60k in labels against $360 in compute (the 70B estimate) works out to 167× in that scenario, not a universal constant.
Compose an anchored research prompt
Configure domain, dimensions, and gold-set parameters below. The XML prompt regenerates live on every input change; generation is deterministic and requires no API key.
<deep_research_prompt domain="Clinical Summarization" objective="expert-domain-preference-labeling">
<research_objective>Evaluate AI-generated clinical summaries for factual fidelity, completeness, and calibrated uncertainty — for use in post-training a clinical summarization model.</research_objective>
<semantic_anchors>
<!-- One LABEL-ANCHOR per dimension. The id is the downstream join key. -->
<anchor id="clinical-summarization.factual-fidelity.01" dimension="Factual Fidelity" criterion="Does the summary accurately represent the source document without introducing false facts?" tier="escalate"/>
<anchor id="clinical-summarization.contraindication-omission.02" dimension="Contraindication Omission" criterion="Are known contraindications, drug interactions, or safety warnings present and correctly stated?" tier="escalate"/>
<anchor id="clinical-summarization.hedging-calibration.03" dimension="Hedging Calibration" criterion="Are uncertainty statements proportionate to evidence strength — no overclaiming or underclaiming?" tier="escalate"/>
<anchor id="clinical-summarization.clinical-completeness.04" dimension="Clinical Completeness" criterion="Are all clinically actionable findings included without omission?" tier="escalate"/>
<anchor id="clinical-summarization.readability-for-audience.05" dimension="Readability for Audience" criterion="Is the language appropriate for the intended clinical audience (attending, resident, or patient)?" tier="screen"/>
</semantic_anchors>
<retrieval_directives>
<directive>Ground every claim in a retrievable, citable primary source. No source, no claim.</directive>
<directive>Attach the matching LABEL-ANCHOR id to every finding so it ports to the labeling stage.</directive>
<directive>For each anchored finding, emit a one-line rationale a domain expert could audit in under 30 seconds.</directive>
</retrieval_directives>
<labeling_handoff>
<schema>For each prompt, emit a chosen/rejected pair scored on the anchored dimensions, with the supporting LABEL-ANCHOR ids and sources carried through.</schema>
<output_format>JSONL: {"prompt","chosen","rejected","anchors":[...],"sources":[...],"rationale","dimension","tier"}</output_format>
</labeling_handoff>
<validation_gate>
<gold_fraction>10%</gold_fraction>
<judge_ensemble families="3" debias="position-swap"/>
<kappa_threshold>0.60</kappa_threshold>
<escalation_rule>Any pair whose dimension tier is "escalate", or where ensemble judges disagree, routes to the human gold set.</escalation_rule>
</validation_gate>
</deep_research_prompt>
Stage 1 of the pipeline. Paste into Gemini Deep Research → anchored output ports into Stage 2 labeling.
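The escalation rule in the validation gate can be read as a small routing function. The sketch below is one possible reading, not the tool's implementation: the function name `route_pair`, the vote encoding, and the return labels `"human_gold"` and `"auto_accept"` are all hypothetical.

```python
from typing import Sequence


def route_pair(tier: str, judge_votes: Sequence[str]) -> str:
    """Decide where one chosen/rejected pair goes next.

    tier        -- the anchor tier carried on the record ("escalate" or "screen")
    judge_votes -- one vote per judge family, e.g. "A" or "B" for the preferred response
    Returns "human_gold" or "auto_accept" (hypothetical labels).
    """
    if not judge_votes:
        raise ValueError("at least one judge vote is required")
    # Any split vote counts as disagreement; unanimity is required to skip review.
    judges_disagree = len(set(judge_votes)) > 1
    if tier == "escalate" or judges_disagree:
        return "human_gold"
    return "auto_accept"
```

Under this reading, and with the five anchors shown in the prompt, only "Readability for Audience" pairs can bypass the human gold set, and only when all three judge families agree.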
Anchor → label join
The anchor id is what makes every label auditable — it traces a preference pick to a cited finding, which is the property human-only pipelines usually cannot produce at scale.
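As a rough illustration of that join, the sketch below checks that every anchor id on a JSONL record exists in the prompt's anchor set. The helper name `unjoined_anchors` and the `KNOWN_ANCHORS` constant are assumptions made for this example; only the anchor ids and the `anchors` field come from the prompt and output format shown here.

```python
import json

# Anchor ids copied from the <semantic_anchors> block of the prompt.
KNOWN_ANCHORS = {
    "clinical-summarization.factual-fidelity.01",
    "clinical-summarization.contraindication-omission.02",
    "clinical-summarization.hedging-calibration.03",
    "clinical-summarization.clinical-completeness.04",
    "clinical-summarization.readability-for-audience.05",
}


def unjoined_anchors(jsonl_text: str, known: set[str]) -> list[tuple[int, str]]:
    """Return (line_number, anchor_id) for each anchor id with no match in `known`."""
    problems = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for anchor_id in record.get("anchors", []):
            if anchor_id not in known:
                problems.append((line_no, anchor_id))
    return problems
```

An empty result means every label traces back to a declared anchor; any entry in the list points at a record that cannot be audited through the join.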
{"prompt":"[Example prompt for Clinical Summarization — Factual Fidelity]","chosen":"[Response that correctly satisfies: \"Does the summary accurately represent the source document without introducing false facts?\"]","rejected":"[Plausible response that fails: \"Does the summary accurately represent the source document without introducing false facts?\"]","anchors":["clinical-summarization.factual-fidelity.01"],"sources":["[Primary citable source for Factual Fidelity finding]"],"rationale":"Chosen satisfies anchor clinical-summarization.factual-fidelity.01; rejected violates it in a way a domain expert can verify against the cited source.","dimension":"Factual Fidelity","tier":"escalate"}
{"prompt":"[Example prompt for Clinical Summarization — Contraindication Omission]","chosen":"[Response that correctly satisfies: \"Are known contraindications, drug interactions, or safety warnings present and correctly stated?\"]","rejected":"[Plausible response that fails: \"Are known contraindications, drug interactions, or safety warnings present and correctly stated?\"]","anchors":["clinical-summarization.contraindication-omission.02"],"sources":["[Primary citable source for Contraindication Omission finding]"],"rationale":"Chosen satisfies anchor clinical-summarization.contraindication-omission.02; rejected violates it in a way a domain expert can verify against the cited source.","dimension":"Contraindication Omission","tier":"escalate"}
Invoice: MoEA vs. Human Labeling
Adjust the inputs; the receipt recalculates. At defaults (600 labels · $60/label · $0.30 synthetic · 10% gold) the defensible saving is ~9.5×. At $100/label the headline ratio against DPO compute reaches ~167× (600 × $100 = $60,000 in labels vs. ~$360 in compute); treat that figure as illustrative.
DPO compute ($8–$32 for 7B, ~$360 for 70B per Hugging Face estimates) is trivial and separate from labeling cost — the saving comes from expert review over only the audited fraction. Per-label rate reference: NextWealth / Lightly market surveys, 2024–2025.
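The receipt arithmetic can be sketched as follows. The cost model is an assumption chosen because it reproduces the ~9.5× default: every item gets a synthetic label, and the gold fraction additionally gets expert review at the human rate. The function name `labeling_receipt` is hypothetical.

```python
def labeling_receipt(n_labels: int, human_rate: float,
                     synthetic_rate: float, gold_fraction: float) -> dict:
    """Compare human-only labeling cost with the assumed hybrid cost model."""
    human_only = n_labels * human_rate
    # Assumed model: synthetic labels on everything, expert review on the gold fraction.
    hybrid = n_labels * synthetic_rate + gold_fraction * n_labels * human_rate
    return {"human_only": human_only, "hybrid": hybrid, "saving": human_only / hybrid}


# Defaults from the receipt: 600 labels, $60/label, $0.30 synthetic, 10% gold.
receipt = labeling_receipt(600, 60.0, 0.30, 0.10)
# human_only = $36,000; hybrid = $180 + $3,600 = $3,780; saving is about 9.52x

# Headline ratio: 600 labels at $100 each against ~$360 of DPO compute.
headline_ratio = (600 * 100.0) / 360.0  # about 166.7
```

The two ratios measure different things: the first compares labeling pipelines with each other, the second compares labeling spend with compute spend.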
The credibility artifact a technical buyer needs
Synthetic labeling without a validation backbone is a liability, not an asset. These four checks are the minimum for a regulated-industry pitch.
Agreement vs. gold set
Cohen's / Fleiss' κ and Krippendorff's α against a held-out expert set. Ship target: κ ≥ 0.60. Always report raw percentage agreement alongside kappa: kappa corrects for chance agreement, so it can sit well below raw agreement when label prevalence is skewed, and the two numbers together show which effect dominates. Below 0.60: raise gold fraction toward 15–20%.
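For two raters (the judge ensemble's pooled label and the expert gold label), Cohen's κ is (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. A minimal pure-Python sketch, with a made-up ten-item sample and a hypothetical function name:

```python
import math
from collections import Counter
from typing import Sequence


def agreement_and_kappa(judge: Sequence[str], gold: Sequence[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(judge) != len(gold) or not gold:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(gold)
    p_o = sum(j == g for j, g in zip(judge, gold)) / n
    judge_freq, gold_freq = Counter(judge), Counter(gold)
    # Chance agreement: product of each rater's marginal frequency, summed over labels.
    p_e = sum(judge_freq[c] * gold_freq[c] for c in set(judge) | set(gold)) / (n * n)
    if p_e == 1.0:
        return p_o, math.nan  # both raters used one identical label; kappa is undefined
    return p_o, (p_o - p_e) / (1 - p_e)


# Illustrative labels only ("A" = first response preferred, "B" = second).
judge = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
gold = ["A", "A", "B", "A", "A", "B", "B", "A", "B", "A"]
raw, kappa = agreement_and_kappa(judge, gold)
# raw = 0.80, kappa is about 0.583: 80% agreement still misses the 0.60 ship target
```

This is why both numbers belong in the report: the sample above would look acceptable on raw agreement alone.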
Judge-quality benchmarks
Run ensemble models through RewardBench 2, JudgeBench, and MT-Bench before deploying. A 3-family ensemble (e.g. Claude + Gemini + open-weight) approximates the PoLL panel-of-judges setup (Verga et al., EMNLP 2024), which is 7–8× cheaper than a single GPT-4 judge while reducing intra-model bias.
Bias audit
Known biases and mitigations: Position bias (5–15%) → position-swap. Verbosity bias (10–20%) → length-normalization. Self-preference (10–25%) → cross-family judges. Without mitigations, a single-family ensemble amplifies these artifacts directly into the DPO data.
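Of the three mitigations, position-swap is the simplest to sketch: query the judge twice with the response order reversed and keep the verdict only if it survives the swap. The `Judge` callable signature and its `"first"`/`"second"` return values are assumptions for this example, not any vendor's API.

```python
from typing import Callable, Optional

# Hypothetical judge interface: (prompt, first_response, second_response) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def position_swapped_preference(judge: Judge, prompt: str,
                                response_a: str, response_b: str) -> Optional[str]:
    """Return "a" or "b" if the judge is order-consistent, else None."""
    forward = judge(prompt, response_a, response_b)   # a shown first
    backward = judge(prompt, response_b, response_a)  # b shown first
    forward_pick = "a" if forward == "first" else "b"
    backward_pick = "b" if backward == "first" else "a"
    # Disagreement across orderings signals position bias; withhold the verdict.
    return forward_pick if forward_pick == backward_pick else None
```

A `None` result is a natural candidate for escalation rather than a coin flip, since the judge's verdict depended on presentation order rather than content.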
Downstream proof
Compare synthetic-labeled DPO model vs. human-labeled control on AlpacaEval 2 LC / Arena-Hard. Threshold: if the synthetic model underperforms the control by >3 points, raise gold fraction or narrow dimension scope before shipping.
What this approach cannot do
Overclaiming kills the pitch faster than modest numbers do. Lead with these.
- Expert ceiling
LLM judges agree with SMEs at 64–68% (Szymanski et al., IUI 2025). Synthetic-only labeling is unsafe for high-stakes expert judgment. Scope to screening + human escalation, not replacement.
Szymanski et al., "Is Your LLM a Good Evaluator?" IUI 2025
- Model collapse
Training on purely synthetic preference data risks progressive quality degradation (Shumailov et al., Nature 2024). Apply labels to human-authored or retrieval-grounded content; mix real data; never fully close the loop.
Shumailov et al., "AI models collapse when trained on recursively generated data", Nature 2024
- Reward hacking
Naive multi-model preference data can worsen DPO safety alignment ("More is Less", arXiv 2504.02193). A swarm is not automatically better than one well-benchmarked judge. Validate every new judge family before adding it to the ensemble.
arXiv 2504.02193, "More is Less: Scaling Multi-Agent DPO…"
- Sycophancy
RLHF / DPO pipelines can amplify sycophancy, producing models that sound agreeable rather than accurate. Audit chosen/rejected pairs specifically for this pattern (Sharma et al. 2024).
Sharma et al., "Towards Understanding Sycophancy in Language Models", 2024
- Regulated domains
FDA SaMD guidance and legal-practice regulations require human-in-the-loop for consequential decisions. Pitch MoEA-labeler as pre-screening only in clinical/legal contexts, not as a replacement for licensed expert review.
FDA SaMD guidance; ABA Model Rules on competent supervision
- Open evidence gap
No published study has measured a multi-LLM judge swarm closing the expert-agreement gap specifically in medicine or law at production scale. We say so. The gap between 68% synthetic ceiling and what regulated buyers need is the active research frontier.
State of field as of June 2026
Download the SKILL.md
Apache 2.0. Plug into Claude Code, Cursor, Copilot, or Gemini CLI to compose anchored deep-research prompts from a domain + labeling objective.
Given a domain and labeling objective, emits an XML deep-research prompt with LABEL-ANCHOR semantic anchors and the downstream JSONL labeling schema with validation gate. Deterministic — no API key required.
Domain + labeling objective
  ↓
[moea-labeler]  ← Stage 1 (Claude Code)
  ↓
XML prompt w/ LABEL-ANCHORs → Gemini Deep Research
  ↓
Anchored, cited findings
  ↓
[preference-labeling handoff] → DPO-ready JSONL  ← Stage 2
  ↓
Validation gate (κ ≥ 0.6 vs human gold) → ship or escalate