ENGINEERING · BENCHMARKS

Skills actually help. The numbers.

50 senior-engineering prompts. Four arms. Anthropic Sonnet 4.5 generating, Opus 4.5 judging blind with random position swaps. The results below are reproducible from this repo for ~$15.

Read the writeup scripts/bench

74%

B beats A

37–11–2 across 50 prompts. Vanilla Sonnet vs Sonnet + grafted skills.

88%

D beats C

44–4–2. Vanilla "what next?" vs structured plan-then-graft.

82%

Cascade hit-rate

41/50. Reference skill or any acceptable in top-5 of cascade.

$15

Total study cost

200 Sonnet generations + 100 Opus pair judgments. ~20 min wall clock.

The four arms

A — vanilla Sonnet

User prompt. No system prompt. No grafted skills. The control.

B — Sonnet + grafted skills

Cascade picks top-2 skills for the prompt. Full SKILL.md bodies prepended as system prompt.

C — vanilla "/next-move" framing

Prompt wrapped in "what specifically should I work on next?". Still no skills. Tests whether the framing alone moves the needle.

D — plan-then-graft

C's framing + B's grafted skills + structured "identify problem class → name specific change → flag risks" scaffold.

Judge sees the prompt and two responses anonymized as "Response 1" and "Response 2", with the position coin-flipped per pair. Scores five criteria: addresses the actual problem, correctness, respects conventions, avoids hallucinations, actionable.

A vs B — by criterion

Vanilla Sonnet (left, gray) vs Sonnet + grafted skills (right, lime). The gap on respects conventions is where the cascade earns its keep — stopping the model from suggesting OFFSET pagination, manual proration, or 30 useState hooks when the canonical pattern exists.

Respects conventions

A 4

tie 8

B 38

Actionable

A 12

tie 4

B 34

Correctness

A 4

tie 23

B 23

Addresses actual problem

A 7

tie 25

B 18

Avoids hallucinations

A 2

tie 45

B 3

C vs D — by criterion

Vanilla "what next?" (left, gray) vs structured plan + grafted skills (right, coral). The structure-plus-skills version never lost on correctness — 19 wins, 0 losses, 31 ties. On respects-conventions it's nearly a sweep.

Respects conventions

C 3

tie 2

D 45

Actionable

C 5

tie 9

D 36

Correctness

tie 31

D 19

Addresses actual problem

C 3

tie 35

D 12

Avoids hallucinations

tie 46

D 4

By category

5 prompts per category. B (or D) wins or ties every category. Stripe is 5–0 for both pairings. Auth is the only category where C and D split.

A vs B

stripe-payments

B 5

observability

B 4

auth-oauth

A 1

B 4

data-pipelines

A 1

B 4

frontend

A 1

B 4

ml-pipelines

A 1

B 4

graphql-rest-apis

A 1

B 3

build-deploy

A 2

B 3

k8s-ops

A 2

B 3

postgres-perf

A 2

B 3

C vs D

stripe-payments

D 5

build-deploy

D 5

frontend

D 5

k8s-ops

D 5

observability

D 5

postgres-perf

D 5

graphql-rest-apis

D 4

data-pipelines

C 1

D 4

ml-pipelines

C 1

D 4

auth-oauth

C 2

D 2

Sweeps — when one arm took ≥4/5 criteria

The judge's reasoning, verbatim, on the most lopsided wins. The pattern is consistent: without grafting, Sonnet picks the plausible-but-wrong answer. With grafting, it picks the answer the skill explicitly warns against the wrong one.

B swept 5/5ml-002· ml-pipelines

“Response 1 provides a more complete and technically accurate hybrid search implementation. It correctly uses Qdrant's native sparse vector support…”

D swept 5/5data-003· data-pipelines

“Response 2 conflates multiple issues and suggests DeferrableOperator for 5-hour compute tasks, which is a misapplication — deferrables are for waiting on external events, not CPU-bound work.”

D swept 4/5pay-002· stripe-payments

“Response 2 suggests manually calculating proration, which reinvents the wheel, ignores that Stripe handles this automatically, and introduces potential bugs around edge cases that Stripe already solves.”

D swept 4/5fe-001· frontend

“Response 2's approach of splitting into 30 separate useState hooks is unconventional, harder to maintain, and doesn't solve the fundamental controlled-input problem — it just distributes it.”

D swept 4/5k8s-004· k8s-ops

“Response 1 contains a subtle error suggesting restartPolicy can restart 'only the failed sidecar' when Kubernetes restartPolicy applies at pod level, not per-container.”

B swept 4/5build-004· build-deploy

“Response 2 contains a subtle hallucination — the incrementalCacheHandlerPath config doesn't enable incremental builds in the way described, and Cloudflare Pages doesn't persist build cache between deploys.”

Cascade hit-rate vs win-rate

The cascade hit the reference skill (or any acceptable in top-5) on 41 of 50 prompts. But here's the surprise: among the 9 misses, B still won 8 to 1.When the cascade picks a different-but-adjacent skill, it's still useful enough to beat vanilla. The reference label is one acceptable answer; the cascade is finding the neighborhood.

Cascade hits (41)

B wins 71%

A 10

tie 2

B 29

Cascade misses (9)

B wins 89%

A 1

B 8

Reproduce it

~20 minutes wall clock. ~$15 in API costs. Anthropic + OpenAI keys in.env.local.

# 1. Build Tool2Vec cache (5 min, ~$0.20 — gpt-4o-mini + text-embedding-3-small)
pnpm tsx scripts/bench/build-tool2vec-cache.ts --concurrency 10

# 2. Run + judge (15 min, ~$15 — Sonnet 4.5 generates, Opus 4.5 judges)
pnpm tsx scripts/bench/bench.ts --concurrency 4

# Outputs:
#   bench/runs/<timestamp>/<prompt-id>/A.json B.json C.json D.json
#   bench/runs/<timestamp>/verdicts.json
#   bench/runs/<timestamp>/summary_judge.json

Dataset: scripts/bench/dataset.ts — patch in your own prompts to test grafting on your domain. The runner auto-builds the Tool2Vec cache on first run if it's missing.

This run: bench/runs/2026-04-29T05-47-32-655Z

Caveats

·n=50. Big enough for the headline (B's 74% is far outside random noise). Not big enough for confident category-level claims when n=5 per category.
·Judge is Opus. Same model family as Sonnet. Position swap and per-criterion scoring mitigate, but worth re-running with a different judge family.
·Catalog gaps. Stripe has one skill covering five prompts. B still won every Stripe prompt 5–0, which suggests one good skill beats none. A richer catalog would let us measure differentiation inside a category.
·Reference skills are opinions. We hand-picked “what a senior engineer would reach for.” They feed the cascade hit-rate metric only — the judge never sees them.
·Sonnet 4.5 only. The graft mechanism works on any model. Expect bigger lifts on weaker models, smaller on Opus.

← Back to engineering surface·Read the writeup·How the cascade works