Vanilla Sonnet vs Sonnet + grafted skills: B wins 74%. Structured planning: D wins 88%.
Skills Actually Help: The Numbers
We ran 50 senior-engineer prompts through vanilla Sonnet and Sonnet-with-grafted-skills, then had Opus judge the answers blind. Skill grafting won 74% of decided pairs. Structured planning won 88%. Here's the full study, the methodology, and the categories where it made the biggest difference.
A reasonable question: does any of this matter?
We've written a lot about the cascade that finds the right skill, the graft mechanism that injects it into the agent prompt, and the feedback loops that sharpen which skills get reached for. The premise the whole product rests on is that doing this is materially better than just calling Sonnet with the user's prompt unchanged.
So we ran the study. 50 hand-curated senior-engineering prompts across 10 categories — GraphQL pagination, Postgres failover, Stripe dunning, k8s sidecar patterns, RSC hydration, embedding pipelines, Airflow scheduler pressure, the kind of thing where there's a textbook answer and a reinvented-the-wheel answer. Four arms. Anthropic Sonnet 4.5 generating, Anthropic Opus 4.5 judging blind with random position swaps.
Headline: B beats A 37–11 (74% of decided pairs). D beats C 44–4 (88%). The grafted-skill arms win every category. Total study cost: ~$15.
The four arms
| Arm | What it is |
|---|---|
| A | Vanilla Sonnet. The user's prompt, no system prompt, no grafted skills. |
| B | Sonnet + grafted skills. The cascade picks the top-2 skills for the prompt; the full SKILL.md bodies are prepended as a system prompt (sketched just below this table). |
| C | Vanilla "/next-move" framing. Same prompt, but wrapped in "what specifically should I work on next?" framing — no skills. |
| D | Structured plan-then-graft. Same wrapping as C, plus the cascade's top-2 skills grafted in. |
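For concreteness, here's a minimal sketch of what arm B's generation call looks like, assuming the Anthropic TypeScript SDK; the model ID, token budget, and file handling are placeholders, not the benchmark's actual code.

```ts
import { readFile } from "node:fs/promises";
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Arm B: graft the cascade's top-2 skills by prepending their full SKILL.md
// bodies as the system prompt, then send the user's prompt unchanged.
// Arm A is the same call with the `system` field omitted.
async function runGraftedArm(prompt: string, topSkillPaths: string[]) {
  const skillBodies = await Promise.all(
    topSkillPaths.map((p) => readFile(p, "utf8")),
  );

  return anthropic.messages.create({
    model: "claude-sonnet-4-5", // placeholder model ID
    max_tokens: 2048,           // placeholder budget
    system: skillBodies.join("\n\n---\n\n"),
    messages: [{ role: "user", content: prompt }],
  });
}
```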
A vs B isolates the value of skill grafting. C vs D isolates the value of skill grafting under structured planning. The judge sees the prompt and two responses anonymized as "Response 1" and "Response 2", with the position coin-flipped per pair to neutralize position bias. The rubric scores five criteria: addresses the actual problem, correctness, respects conventions, avoids hallucinations, actionable.
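A sketch of that pairing step, reduced to a single overall verdict (the real rubric scores five criteria); the judge-call signature here is an assumption, not the repo's code.

```ts
// Blind pairwise judging: coin-flip which arm appears as "Response 1" so the
// judge can't develop a positional or arm-specific preference, then map the
// positional verdict back to the arm it actually refers to.
type Pick = "1" | "2" | "tie";
type JudgeCall = (prompt: string, first: string, second: string) => Promise<Pick>;

async function judgePair(
  prompt: string,
  armA: string,
  armB: string,
  judge: JudgeCall, // e.g. a wrapper around an Opus call that parses the verdict
): Promise<"A" | "B" | "tie"> {
  const swapped = Math.random() < 0.5; // position coin-flip per pair
  const [first, second] = swapped ? [armB, armA] : [armA, armB];

  const pick = await judge(prompt, first, second);
  if (pick === "tie") return "tie";
  return (pick === "1") !== swapped ? "A" : "B";
}
```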
The skill catalog is the 503 SKILL.md files in the repo. The cascade is BM25 → Tool2Vec → RRF fusion → cross-encoder rerank → attribution k-NN. The Tool2Vec stage was actually firing in this run, on a fresh 17.95 MB cache covering all 547 skills.
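Of those stages, reciprocal rank fusion is the simplest to show. A generic sketch, with k=60 as the conventional constant and the skill IDs invented for illustration; the repo's implementation may differ.

```ts
// Reciprocal rank fusion: merge ranked lists from BM25 and Tool2Vec into one
// score per skill, so the two retrievers' raw scores never need calibrating.
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((skillId, rank) => {
      scores.set(skillId, (scores.get(skillId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return scores;
}

// Hypothetical rankings for one prompt; the fused top slice then goes to the
// cross-encoder reranker and attribution k-NN stages (not shown).
const bm25 = ["graphql-server-architect", "rest-api-designer", "postgres-tuner"];
const tool2vec = ["graphql-server-architect", "pagination-patterns", "rest-api-designer"];
const top = [...rrfFuse([bm25, tool2vec]).entries()]
  .sort((a, b) => b[1] - a[1])
  .slice(0, 20);
```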
A vs B: skills win 74% of decided pairs
| | A (vanilla) | B (grafted) | tie |
|---|---|---|---|
| Overall | 11 | 37 | 2 |
Out of 48 decided pairs, B took 37. Per-criterion is where the picture sharpens:
| Criterion | A | B | tie |
|---|---|---|---|
| Respects conventions | 4 | 38 | 8 |
| Actionable | 12 | 34 | 4 |
| Correctness | 4 | 23 | 23 |
| Addresses actual problem | 7 | 18 | 25 |
| Avoids hallucinations | 2 | 3 | 45 |
The biggest delta is respects conventions (B wins 38 to 4). Skill grafting's strongest effect is preventing the model from reaching for a generic-but-wrong solution when an industry-standard one exists. Sonnet 4.5 alone is capable of suggesting cursor pagination — it just doesn't always reach for it. Graft graphql-server-architect into the system prompt and it does.
The smallest delta is avoids hallucinations (B 3, A 2, tie 45). This is honest: Sonnet 4.5 doesn't hallucinate much in either condition. If you came here looking for a "skills cure hallucinations" story, the data doesn't support it — what they cure is generic answers when a specific one is available.
C vs D: structured planning + skills wins 88%
| C (vanilla "what next?") | D (plan + graft) | tie | |
|---|---|---|---|
| Overall | 4 | 44 | 2 |
D wins every category. On respects_conventions it's almost a sweep — D 45, C 3, tie 2. On correctness D never lost a decision: 19 wins, 0 losses, 31 ties.
The interesting comparison isn't D vs C; it's D vs B. Both arms get the same grafted skills, and adding the structured "identify the problem class → name the specific change → flag the risk a senior engineer would" scaffold lifted the win rate over the respective vanilla baseline from 74% (B vs A) to 88% (D vs C). Skills give you the right knowledge; the structured prompt gives the model a place to put it.
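A hedged sketch of the shape of that wrapper; the exact wording the benchmark uses differs, and arm C is this same wrapper with no grafted skills.

```ts
// Arm D: wrap the user's prompt in the structured "what next?" scaffold, then
// graft the cascade's top-2 skills as the system prompt (arm C skips the graft).
function wrapForNextMove(userPrompt: string): string {
  return [
    userPrompt,
    "",
    "Given the above, what specifically should I work on next?",
    "1. Identify the class of problem this is.",
    "2. Name the specific change to make (files, APIs, config), not a general direction.",
    "3. Flag the risk a senior engineer would call out before shipping it.",
  ].join("\n");
}
```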
This is the proxy we use for /next-move at full strength. The full meta-DAG (sensemaker → decomposer → skill-selector → premortem → synthesizer) is more expensive to run as a benchmark, so D is the cheap-stand-in version. If the cheap version already wins 88%, the full pipeline isn't going to do worse.
What the judge actually said
Verbatim from the judge's reasoning on the prompts where one arm took ≥4/5 criteria:
pay-002 (subscription proration) — Response 1 correctly leverages Stripe's built-in proration handling via `stripe.subscriptions.update()` with `proration_behavior`, which is the standard approach for payment systems. Response 2 suggests manually calculating proration, which reinvents the wheel, ignores that Stripe handles this automatically, and introduces potential bugs around edge cases that Stripe already solves.
data-003 (Airflow scheduler pressure) — Response 1 directly addresses the core issue of long-running tasks holding worker slots with a concrete, actionable fix (KubernetesPodOperator/pools). Response 2 conflates multiple issues and suggests DeferrableOperator for 5-hour compute tasks, which is a misapplication — deferrables are for waiting on external events, not CPU-bound work.
fe-001 (React form re-renders) — Response 1 directly addresses the root cause (controlled inputs causing full tree re-renders) with the industry-standard solution (react-hook-form + uncontrolled inputs)... Response 2's approach of splitting into 30 separate useState hooks is unconventional, harder to maintain, and doesn't solve the fundamental controlled-input problem — it just distributes it.
k8s-004 (sidecar pattern) — Response 1 contains a subtle error suggesting `restartPolicy` can restart "only the failed sidecar" when Kubernetes `restartPolicy` applies at pod level, not per-container.
The pattern: the grafted-skill arms catch named failure modes the skills explicitly warn against. mobile-payment-integration-specialist says "use Stripe's built-in proration." airflow-dag-orchestrator says "DeferrableOperator is for I/O waits, not CPU." react-performance-optimizer says "react-hook-form, not 30 useStates." Without the graft, Sonnet picks the plausible-sounding wrong answer about a third of the time. With the graft, it picks the right one.
Where the cascade missed and why it didn't matter much
Cascade hit-rate (reference skill or any acceptable in top-5) was 82% — 41 of 50. Not 100%. Some of the misses are catalog gaps (we don't have a "Stripe webhook idempotency" skill specifically, just one general payments skill). Some are taxonomy disagreements between us and the judge.
But here's the thing: among the 9 cascade misses, B still won 8 to 1. When the cascade picks a "different but adjacent" skill, it's still useful enough to beat vanilla. The reference label is one acceptable answer; the cascade is finding the neighborhood, not threading a needle.
This matters for productizing. We don't need 100% hit rate to ship. We need the cascade to be in the right neighborhood, and the grafted skill to bring conventions and concrete file paths the model wouldn't otherwise reach for.
Caveats. The honest version.
- n=50. Big enough to see real differences (B's 74% win rate is far outside random noise). Not big enough to make confident category-level claims, especially in categories where one arm only got 5 chances.
- Judge is Opus. Opus and Sonnet are from the same family — there's some risk the judge prefers Sonnet-flavored prose. We mitigated with position swap and per-criterion scoring (the model has to commit to why B won, not just that B won). Worth re-running with a different judge family.
- Catalog gaps. Stripe has one skill covering five prompts. Same with k8s. The B arm still won every Stripe prompt 5–0, which suggests even one good skill beats none — but a richer catalog would let us measure differentiation across topics inside a category.
- Reference skills. We picked them by hand. They're "what a senior engineer would reach for" — opinions, not ground truth. The judge doesn't see them; they only feed the cascade hit-rate metric.
- Sonnet 4.5 is the only model under test. The graft mechanism works on any model that takes a system prompt. We expect bigger lifts on weaker models and smaller lifts on Opus.
Reproduce it
Everything is in the repo. Run:
```bash
pnpm tsx scripts/bench/build-tool2vec-cache.ts --concurrency 10
pnpm tsx scripts/bench/bench.ts --concurrency 4
```
The first command builds the Tool2Vec cache (5 min, ~$0.20 via gpt-4o-mini for synthetic tasks + text-embedding-3-small for embeddings). The second runs all 50 prompts × 4 arms via Anthropic, then judges 100 pairs via Opus, and writes verdicts.json + summary_judge.json. End-to-end: ~20 minutes, ~$15.
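Conceptually the cache build is one embedding pass over the catalog. A sketch assuming the OpenAI SDK, with the gpt-4o-mini synthetic-task generation, batching, and the real cache path elided:

```ts
import { writeFile } from "node:fs/promises";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Embed one synthetic task per skill with text-embedding-3-small and persist
// the vectors; the real script generates several tasks per skill first and
// batches the embedding calls.
async function buildTool2VecCache(skills: { id: string; syntheticTask: string }[]) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: skills.map((s) => s.syntheticTask),
  });
  const cache = skills.map((s, i) => ({ id: s.id, embedding: res.data[i].embedding }));
  await writeFile("tool2vec-cache.json", JSON.stringify(cache)); // placeholder path
}
```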
You'll need ANTHROPIC_API_KEY and OPENAI_API_KEY in .env.local. The dataset is in scripts/bench/dataset.ts — patch in your own prompts to test grafting on your domain.
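If you do patch in your own prompts, an entry is roughly one object per prompt. The field names below are illustrative only; check the actual types in scripts/bench/dataset.ts before copying.

```ts
// Illustrative entry shape, not the repo's real type: a prompt, its category,
// and the skill(s) you'd accept as a cascade hit for the hit-rate metric.
const webhookPrompt = {
  id: "pay-006", // hypothetical ID following the pay-00N pattern
  category: "payments",
  prompt:
    "Our Stripe webhooks occasionally double-fire and we credit customers twice. " +
    "How should we make the handler idempotent?",
  acceptableSkills: ["mobile-payment-integration-specialist"],
};
```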
The full per-prompt breakdown lives at /engineering/benchmarks.
What this changes for us
This is the first study where we have a number we trust. 74% is the headline we'd give a hiring manager. The fact that it gets to 88% with structured planning is the headline we'd give a customer.
The next study — the one we're building toward — replaces the hand-curated 50 prompts with the actual .windags/triples of real users (private-first; opt-in to share). Studying ourselves is a good first step. Studying the things people actually ask is the one that matters.