ENGINEERING · BENCHMARKS

Skills actually help. The numbers.

50 senior-engineering prompts. Four arms. Anthropic Sonnet 4.5 generating, Opus 4.5 judging blind with random position swaps. The results below are reproducible from this repo for ~$15.

74% · B beats A
37–11–2 across 50 prompts. Vanilla Sonnet vs Sonnet + grafted skills.

88% · D beats C
44–4–2. Vanilla "what next?" vs structured plan-then-graft.

82% · cascade hit-rate
41/50. Reference skill, or any acceptable skill, in the cascade's top-5.

$15 · total study cost
200 Sonnet generations + 100 Opus pair judgments. ~20 min wall clock.

The four arms

A — vanilla Sonnet
User prompt. No system prompt. No grafted skills. The control.
B — Sonnet + grafted skills
Cascade picks top-2 skills for the prompt. Full SKILL.md bodies prepended as system prompt.
C — vanilla "/next-move" framing
Prompt wrapped in "what specifically should I work on next?". Still no skills. Tests whether the framing alone moves the needle.
D — plan-then-graft
C's framing + B's grafted skills + structured "identify problem class → name specific change → flag risks" scaffold.
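The four arms differ only in how the prompt is assembled. A minimal sketch of that assembly, under assumed names (`buildArm`, `Skill`, and the exact framing/scaffold strings are illustrative, not the repo's actual API):

```typescript
// Illustrative sketch of the four benchmark arms. The framing and
// scaffold wording below paraphrases the description above; check the
// repo for the exact strings.
type Skill = { name: string; body: string };
type Arm = "A" | "B" | "C" | "D";

const NEXT_MOVE =
  "Given the situation below, what specifically should I work on next?";
const SCAFFOLD =
  "First identify the problem class, then name the specific change, then flag risks.";

function buildArm(arm: Arm, userPrompt: string, skills: Skill[]) {
  // Full SKILL.md bodies, concatenated, become the system prompt.
  const grafted = skills.map((s) => s.body).join("\n\n");
  switch (arm) {
    case "A": // control: bare user prompt
      return { system: undefined, user: userPrompt };
    case "B": // grafted skills only
      return { system: grafted, user: userPrompt };
    case "C": // "/next-move" framing only
      return { system: undefined, user: `${NEXT_MOVE}\n\n${userPrompt}` };
    case "D": // framing + skills + structured scaffold
      return {
        system: `${grafted}\n\n${SCAFFOLD}`,
        user: `${NEXT_MOVE}\n\n${userPrompt}`,
      };
  }
}
```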

Judge sees the prompt and two responses anonymized as "Response 1" and "Response 2", with the position coin-flipped per pair. Scores five criteria: addresses the actual problem, correctness, respects conventions, avoids hallucinations, actionable.
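The position coin-flip and de-swapping can be sketched in a few lines. Here `judgeFn` is an assumed callback that sees the anonymized pair and returns, per criterion, 1, 2, or 0 for a tie; only the coin-flip mechanics come from the description above:

```typescript
// Blind pairwise judging with a per-pair position swap. The verdict is
// mapped back to arms so the tally never depends on which response the
// judge saw first.
const CRITERIA = [
  "addresses", "correctness", "conventions", "hallucinations", "actionable",
] as const;

type Verdict = Record<(typeof CRITERIA)[number], 0 | 1 | 2>;

async function judgePair(
  respA: string,
  respB: string,
  judgeFn: (r1: string, r2: string) => Promise<Verdict>,
  rng: () => number = Math.random,
) {
  const swapped = rng() < 0.5; // coin-flip which response is "Response 1"
  const [r1, r2] = swapped ? [respB, respA] : [respA, respB];
  const raw = await judgeFn(r1, r2);
  // Undo the swap: "Response 1 wins" means A only if we didn't swap.
  const unswap = (v: 0 | 1 | 2): "A" | "B" | "tie" =>
    v === 0 ? "tie" : (v === 1) !== swapped ? "A" : "B";
  return Object.fromEntries(CRITERIA.map((c) => [c, unswap(raw[c])]));
}
```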

A vs B — by criterion

Vanilla Sonnet (left, gray) vs Sonnet + grafted skills (right, lime). The gap on respects conventions is where the cascade earns its keep — stopping the model from suggesting OFFSET pagination, manual proration, or 30 useState hooks when the canonical pattern exists.

Respects conventions: A 4 · tie 8 · B 38
Actionable: A 12 · tie 4 · B 34
Correctness: A 4 · tie 23 · B 23
Addresses actual problem: A 7 · tie 25 · B 18
Avoids hallucinations: A 2 · tie 45 · B 3

C vs D — by criterion

Vanilla "what next?" (left, gray) vs structured plan + grafted skills (right, coral). The structure-plus-skills version never lost on correctness — 19 wins, 0 losses, 31 ties. On respects-conventions it's nearly a sweep.

Respects conventions: C 3 · tie 2 · D 45
Actionable: C 5 · tie 9 · D 36
Correctness: C 0 · tie 31 · D 19
Addresses actual problem: C 3 · tie 35 · D 12
Avoids hallucinations: C 0 · tie 46 · D 4
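The per-criterion tallies above can be recomputed from the per-prompt verdicts in a few lines. A sketch, assuming one verdict record per prompt mapping each criterion to its winner (the actual verdicts.json schema may differ):

```typescript
// Tally per-criterion wins/ties/losses across all judged prompts.
type CriterionVerdict = "A" | "B" | "tie";
type PromptVerdicts = Record<string, CriterionVerdict>; // criterion -> winner

function tally(verdicts: PromptVerdicts[]) {
  const out: Record<string, { A: number; tie: number; B: number }> = {};
  for (const v of verdicts) {
    for (const [criterion, winner] of Object.entries(v)) {
      out[criterion] ??= { A: 0, tie: 0, B: 0 };
      out[criterion][winner]++;
    }
  }
  return out;
}
```

Feeding it the 50 per-prompt verdicts reproduces the five lines in each chart above.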

By category

5 prompts per category. B (or D) wins or ties every category. Stripe is 5–0 for both pairings. Auth is the only category where C and D split.

A vs B

stripe-payments: A 0 · B 5
observability: A 0 · tie 1 · B 4
auth-oauth: A 1 · B 4
data-pipelines: A 1 · B 4
frontend: A 1 · B 4
ml-pipelines: A 1 · B 4
graphql-rest-apis: A 1 · tie 1 · B 3
build-deploy: A 2 · B 3
k8s-ops: A 2 · B 3
postgres-perf: A 2 · B 3

C vs D

stripe-payments: C 0 · D 5
build-deploy: C 0 · D 5
frontend: C 0 · D 5
k8s-ops: C 0 · D 5
observability: C 0 · D 5
postgres-perf: C 0 · D 5
graphql-rest-apis: C 0 · tie 1 · D 4
data-pipelines: C 1 · D 4
ml-pipelines: C 1 · D 4
auth-oauth: C 2 · tie 1 · D 2

Sweeps — when one arm took ≥4/5 criteria

The judge's reasoning, verbatim, on the most lopsided wins. The pattern is consistent: without grafting, Sonnet picks the plausible-but-wrong answer. With grafting, it avoids it, because the skill explicitly warns against the wrong one.
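Flagging these is mechanical once per-criterion verdicts exist. A sketch, with assumed names (`findSweeps` and the input shape are illustrative):

```typescript
// A "sweep" is a prompt where one arm took >= 4 of the 5 criteria.
type Winner = "A" | "B" | "tie";

function findSweeps(
  perPrompt: Record<string, Winner[]>, // prompt id -> 5 criterion winners
  threshold = 4,
) {
  const sweeps: { promptId: string; arm: "A" | "B" }[] = [];
  for (const [promptId, criteria] of Object.entries(perPrompt)) {
    for (const arm of ["A", "B"] as const) {
      if (criteria.filter((w) => w === arm).length >= threshold) {
        sweeps.push({ promptId, arm });
      }
    }
  }
  return sweeps;
}
```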

B swept 5/5 · ml-002 · ml-pipelines

Response 1 provides a more complete and technically accurate hybrid search implementation. It correctly uses Qdrant's native sparse vector support…

D swept 5/5 · data-003 · data-pipelines

Response 2 conflates multiple issues and suggests DeferrableOperator for 5-hour compute tasks, which is a misapplication — deferrables are for waiting on external events, not CPU-bound work.

D swept 4/5 · pay-002 · stripe-payments

Response 2 suggests manually calculating proration, which reinvents the wheel, ignores that Stripe handles this automatically, and introduces potential bugs around edge cases that Stripe already solves.

D swept 4/5 · fe-001 · frontend

Response 2's approach of splitting into 30 separate useState hooks is unconventional, harder to maintain, and doesn't solve the fundamental controlled-input problem — it just distributes it.

D swept 4/5 · k8s-004 · k8s-ops

Response 1 contains a subtle error suggesting restartPolicy can restart 'only the failed sidecar' when Kubernetes restartPolicy applies at pod level, not per-container.

B swept 4/5 · build-004 · build-deploy

Response 2 contains a subtle hallucination — the incrementalCacheHandlerPath config doesn't enable incremental builds in the way described, and Cloudflare Pages doesn't persist build cache between deploys.

Cascade hit-rate vs win-rate

The cascade hit the reference skill (or any acceptable skill in its top-5) on 41 of 50 prompts. But here's the surprise: among the 9 misses, B still won 8 to 1. When the cascade picks a different-but-adjacent skill, that skill is still useful enough to beat vanilla. The reference label is one acceptable answer; the cascade is finding the neighborhood.

Cascade hits (41): A 10 · tie 2 · B 29 (B wins 71%)
Cascade misses (9): A 1 · tie 0 · B 8 (B wins 89%)
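The hit/miss split above is a simple conditional win-rate. A sketch, assuming each judged prompt carries a cascade-hit flag and an overall winner:

```typescript
// Split B's win-rate by whether the cascade found an acceptable skill.
type Row = { hit: boolean; winner: "A" | "B" | "tie" };

function splitByHit(rows: Row[]) {
  const bucket = (hit: boolean) => {
    const sub = rows.filter((r) => r.hit === hit);
    const bWins = sub.filter((r) => r.winner === "B").length;
    // Note: bWinRate is NaN for an empty bucket; fine for a sketch.
    return { n: sub.length, bWins, bWinRate: bWins / sub.length };
  };
  return { hits: bucket(true), misses: bucket(false) };
}
```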

Reproduce it

~20 minutes wall clock. ~$15 in API costs. Put Anthropic and OpenAI keys in .env.local.

# 1. Build Tool2Vec cache (5 min, ~$0.20 — gpt-4o-mini + text-embedding-3-small)
pnpm tsx scripts/bench/build-tool2vec-cache.ts --concurrency 10

# 2. Run + judge (15 min, ~$15 — Sonnet 4.5 generates, Opus 4.5 judges)
pnpm tsx scripts/bench/bench.ts --concurrency 4

# Outputs:
#   bench/runs/<timestamp>/<prompt-id>/A.json B.json C.json D.json
#   bench/runs/<timestamp>/verdicts.json
#   bench/runs/<timestamp>/summary_judge.json

Dataset: scripts/bench/dataset.ts — patch in your own prompts to test grafting on your domain. The runner auto-builds the Tool2Vec cache on first run if it's missing.
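An entry in your patched dataset might look like the following; the field names here are assumptions for illustration, so check scripts/bench/dataset.ts for the actual schema:

```typescript
// Hypothetical shape of one benchmark prompt. referenceSkills feed only
// the cascade hit-rate metric -- the judge never sees them.
type BenchPrompt = {
  id: string;                // e.g. "pay-002"
  category: string;          // e.g. "stripe-payments"
  prompt: string;            // the senior-engineering question
  referenceSkills: string[]; // acceptable skills for the hit-rate metric
};

const example: BenchPrompt = {
  id: "my-001",
  category: "my-domain",
  prompt:
    "Our nightly ETL doubled in runtime after the schema migration. " +
    "What should we change first?",
  referenceSkills: ["my-domain/etl-tuning"],
};
```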

This run: bench/runs/2026-04-29T05-47-32-655Z

Caveats

  • n=50. Big enough for the headline (B's 74% is far outside random noise). Not big enough for confident category-level claims at n=5 per category.
  • Judge is Opus. Same model family as Sonnet. Position swap and per-criterion scoring mitigate, but worth re-running with a different judge family.
  • Catalog gaps. Stripe has one skill covering five prompts. B still won every Stripe prompt 5–0, which suggests one good skill beats none. A richer catalog would let us measure differentiation inside a category.
  • Reference skills are opinions. We hand-picked "what a senior engineer would reach for." They feed the cascade hit-rate metric only; the judge never sees them.
  • Sonnet 4.5 only. The graft mechanism works on any model. Expect bigger lifts on weaker models, smaller on Opus.