Skills actually help. The numbers.
50 senior-engineering prompts. Four arms. Anthropic Sonnet 4.5 generating, Opus 4.5 judging blind with random position swaps. The results below are reproducible from this repo for ~$15.
The four arms
A: vanilla Sonnet. B: Sonnet + grafted skills. C: vanilla Sonnet with a structured "what next?" prompt. D: structured plan + grafted skills.
The judge sees the prompt and two responses, anonymized as "Response 1" and "Response 2", with the position coin-flipped per pair. It scores five criteria: addresses the actual problem, correctness, respects conventions, avoids hallucinations, actionable.
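The blinding step can be sketched in a few lines of TypeScript. This is a hypothetical illustration, not the repo's actual code — `pairForJudge`, `unswap`, and the `Verdict` shape are invented names:

```typescript
// Hypothetical sketch of blind pairwise judging: the judge never learns
// which arm produced which response, and position is coin-flipped per pair.
type Verdict = { criterion: string; winner: 0 | 1 | 2 }; // 0 = tie

function pairForJudge(
  armX: string,
  armY: string,
  rng: () => number = Math.random,
): { response1: string; response2: string; swapped: boolean } {
  const swapped = rng() < 0.5; // coin-flip the presentation order
  return swapped
    ? { response1: armY, response2: armX, swapped }
    : { response1: armX, response2: armY, swapped };
}

// After judging, un-swap so that winner === 1 always refers to armX.
function unswap(v: Verdict, swapped: boolean): Verdict {
  if (!swapped || v.winner === 0) return v;
  return { ...v, winner: v.winner === 1 ? 2 : 1 };
}
```

Scoring each criterion separately (rather than one holistic verdict) is what makes the per-criterion charts below possible.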
A vs B — by criterion
Vanilla Sonnet (left, gray) vs Sonnet + grafted skills (right, lime). The gap on respects conventions is where the cascade earns its keep — stopping the model from suggesting OFFSET pagination, manual proration, or 30 useState hooks when the canonical pattern exists.
C vs D — by criterion
Vanilla "what next?" (left, gray) vs structured plan + grafted skills (right, coral). The structure-plus-skills version never lost on correctness — 19 wins, 0 losses, 31 ties. On respects-conventions it's nearly a sweep.
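A tally like "19 wins, 0 losses, 31 ties" is just a reduction over the per-prompt verdicts for one criterion. A minimal sketch, with hypothetical names (the repo's real aggregation may differ):

```typescript
// Hypothetical win/loss/tie tally from the arm-under-test's point of view:
// winner 1 = that arm won the criterion, 2 = the other arm, 0 = tie.
function tally(winners: (0 | 1 | 2)[]): {
  wins: number;
  losses: number;
  ties: number;
} {
  return {
    wins: winners.filter((w) => w === 1).length,
    losses: winners.filter((w) => w === 2).length,
    ties: winners.filter((w) => w === 0).length,
  };
}
```

With 50 prompts, a 19–0–31 line means the arm never lost the criterion outright and tied the rest.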
By category
5 prompts per category. B (or D) wins or ties every category. Stripe is 5–0 for both pairings. Auth is the only category where C and D split.
A vs B
C vs D
Sweeps — when one arm took ≥4/5 criteria
The judge's reasoning, verbatim, on the most lopsided wins. The pattern is consistent: without grafting, Sonnet picks the plausible-but-wrong answer. With grafting, it avoids the answer the skill explicitly warns against.
“Response 1 provides a more complete and technically accurate hybrid search implementation. It correctly uses Qdrant's native sparse vector support…”
“Response 2 conflates multiple issues and suggests DeferrableOperator for 5-hour compute tasks, which is a misapplication — deferrables are for waiting on external events, not CPU-bound work.”
“Response 2 suggests manually calculating proration, which reinvents the wheel, ignores that Stripe handles this automatically, and introduces potential bugs around edge cases that Stripe already solves.”
“Response 2's approach of splitting into 30 separate useState hooks is unconventional, harder to maintain, and doesn't solve the fundamental controlled-input problem — it just distributes it.”
“Response 1 contains a subtle error suggesting restartPolicy can restart 'only the failed sidecar' when Kubernetes restartPolicy applies at pod level, not per-container.”
“Response 2 contains a subtle hallucination — the incrementalCacheHandlerPath config doesn't enable incremental builds in the way described, and Cloudflare Pages doesn't persist build cache between deploys.”
Cascade hit-rate vs win-rate
The cascade hit the reference skill (or any acceptable skill in its top-5) on 41 of 50 prompts. But here's the surprise: among the 9 misses, B still won 8 to 1. When the cascade picks a different-but-adjacent skill, it's still useful enough to beat vanilla. The reference label is one acceptable answer; the cascade is finding the neighborhood.
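The hit-rate metric reduces to top-5 containment. A sketch with hypothetical field names (the repo's actual result shape is not shown here):

```typescript
// Hypothetical hit-rate computation: a prompt counts as a hit if any
// acceptable skill appears in the cascade's top-5 retrieved skills.
type PromptResult = {
  top5: string[]; // skill IDs the cascade retrieved, best first
  acceptable: string[]; // hand-labeled acceptable skill IDs
};

function hitRate(results: PromptResult[]): number {
  const hits = results.filter((r) =>
    r.top5.some((skill) => r.acceptable.includes(skill)),
  ).length;
  return hits / results.length;
}
```

On this run the numerator would be 41 and the denominator 50, giving 0.82 — and, per the paragraph above, the hit-rate understates usefulness, since near-miss retrievals still beat vanilla.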
Reproduce it
~20 minutes wall clock. ~$15 in API costs. Anthropic + OpenAI keys in .env.local.
# 1. Build Tool2Vec cache (5 min, ~$0.20 — gpt-4o-mini + text-embedding-3-small)
pnpm tsx scripts/bench/build-tool2vec-cache.ts --concurrency 10

# 2. Run + judge (15 min, ~$15 — Sonnet 4.5 generates, Opus 4.5 judges)
pnpm tsx scripts/bench/bench.ts --concurrency 4

# Outputs:
#   bench/runs/<timestamp>/<prompt-id>/A.json B.json C.json D.json
#   bench/runs/<timestamp>/verdicts.json
#   bench/runs/<timestamp>/summary_judge.json
Dataset: scripts/bench/dataset.ts — patch in your own prompts to test grafting on your domain. The runner auto-builds the Tool2Vec cache on first run if it's missing.
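If you patch in your own prompts, a dataset entry presumably needs at least a prompt, a category, and the acceptable skills for the hit-rate metric. The real schema lives in scripts/bench/dataset.ts; the shape and field names below are illustrative assumptions only:

```typescript
// Hypothetical dataset-entry shape — check scripts/bench/dataset.ts for
// the real one before patching in your own prompts.
type BenchPrompt = {
  id: string;
  category: string; // e.g. "stripe", "auth"
  prompt: string; // the senior-engineering question
  referenceSkills: string[]; // acceptable answers; feeds hit-rate only,
  // the judge never sees them
};

const example: BenchPrompt = {
  id: "stripe-proration-01",
  category: "stripe",
  prompt: "How should we handle mid-cycle plan upgrades without double-billing?",
  referenceSkills: ["stripe-billing"],
};
```

Keeping `referenceSkills` out of the judge's view (as the caveats note) is what keeps the win-rate and hit-rate metrics independent.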
This run: bench/runs/2026-04-29T05-47-32-655Z
Caveats
- n=50. Big enough for the headline (B's 74% is far outside random noise). Not big enough for confident category-level claims when n=5 per category.
- Judge is Opus. Same model family as Sonnet. Position swap and per-criterion scoring mitigate this, but it's worth re-running with a different judge family.
- Catalog gaps. Stripe has one skill covering five prompts. B still won every Stripe prompt 5–0, which suggests one good skill beats none. A richer catalog would let us measure differentiation inside a category.
- Reference skills are opinions. We hand-picked "what a senior engineer would reach for." They feed the cascade hit-rate metric only — the judge never sees them.
- Sonnet 4.5 only. The graft mechanism works on any model. Expect bigger lifts on weaker models, smaller on Opus.