Skill Compression · Part 2

Less is more — especially when Claude already knows.

skills · compression · embeddings · evaluation · rate-distortion · tooling

WinDAGSZip — Compressing Skills with Embeddings

A 23MB embedding model finds 25% of tokens are self-duplicating in code-heavy skills — for free, no API calls. An LLM judge finds another 20-40% overlaps with training data. We built the tool, measured everything across 10 skills, and shipped it.

In The 191-Skill Quality Pass, we graded every skill in our library against a 10-axis rubric and fixed the universal gap (163 skills were missing output contracts). That pass made skills better. This one makes them smaller.

Every token in a skill prompt costs money and context window. But how much of a skill is actually doing work? How much is the model repeating back to itself — code examples it's seen a thousand times, best practices baked into its training data, boilerplate that appears in three different sections?

We built a tool to answer that. It uses a 23MB embedding model to detect intra-skill duplication, an LLM judge to detect pretraining overlap, and rate-distortion theory to find the optimal compression point. We ran it across 10 skills totaling 61,408 tokens. Here's what we found.

The punchline: skills have two kinds of fat, and they need different tools to find them. One is free. The other costs about a penny per test case. Together they cut 25-46% of tokens with zero measured quality loss.

UMAP embedding projections of four skills — code-heavy skills show dense colored clusters of redundant chunks, knowledge-dense skills show scattered gray dots with minimal clustering.
What skills look like in embedding space — each colored cluster is redundant text we can safely cut.
Code-heavy skills
Ship lots of code examples. The same patterns show up in multiple snippets — similar imports, boilerplate, repeated config blocks.
Compress with: A 23MB embedding model. Free. No API calls. Runs in ~20 seconds. Savings: 20-25%.
Knowledge-dense skills
Teach domain expertise — concepts, best practices, decision frameworks. Low self-duplication, but much of the content overlaps with Claude's training data.
Compress with: A graded eval that ablates sections and checks if quality drops. Cost: ~$0.01/test case.

How We Measure "Quality"

Before going further: compression is only useful if quality holds. Here's the eval rubric we use throughout this post.

The Two-Phase LLM Judge

For each (skill variant, test case) pair, we run a two-phase evaluation:

| Phase | Model | Role | Input | Output |
|---|---|---|---|---|
| Executor | Sonnet | Generate a response | Skill text injected via <skill> tag + query from test suite | A skill-informed response (no tool use allowed) |
| Grader | Haiku | Score the response | Executor's response + expected_behavior from test suite | {pass, score, evidence} |

The pipeline flows: Ablated Skill + Query → Executor (Sonnet) → Grader (Haiku) → {pass, score, evidence}.

The grader is strict. A response that touches on the topic but lacks specific depth from the skill gets a FAIL — generic advice that any AI could produce without the skill doesn't count. This matters because we're testing whether the skill adds value beyond Claude's pretraining, not whether Claude can answer the question at all.

What Defines "Expected Behavior"

Each test case in the eval suite specifies an expected_behavior string written by the skill author. These are the assertions the grader evaluates against. For example:

Query: "What's the similarity threshold I should use for catching duplicates in my skill?"

Expected behavior: "Explains the default 0.70 threshold, recommends lowering to 0.60 for aggressive deduplication with warning about false positives, and suggests validating with graded eval after any threshold change"

The grader must find evidence of each specific claim in the executor's response. Surface-level mentions don't count.
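Concretely, a test case can be sketched as a small record. Only expected_behavior is a field name confirmed by the post; id and query are illustrative:

```python
# Hypothetical shape of one eval test case. Field names other than
# "expected_behavior" are illustrative, not the tool's actual schema.
test_case = {
    "id": "threshold-001",
    "query": "What's the similarity threshold I should use for "
             "catching duplicates in my skill?",
    "expected_behavior": (
        "Explains the default 0.70 threshold, recommends lowering to "
        "0.60 for aggressive deduplication with warning about false "
        "positives, and suggests validating with graded eval after "
        "any threshold change"
    ),
}

def validate_case(case: dict) -> bool:
    """A case is usable only if it has both a query and a non-empty
    expected_behavior for the grader to check evidence against."""
    return bool(case.get("query")) and bool(case.get("expected_behavior"))
```

The grader then treats each specific claim inside expected_behavior as something it must find evidence for in the executor's response.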

Interpreting Scores

| Metric | Meaning |
|---|---|
| Score (0.0-1.0) | Continuous quality rating per test case |
| Pass/Fail | Binary — did the response meet the expected behavior? |
| Score drop (variant vs baseline) | Quality delta from removing a chunk |
| Score drop < 0.05 | Within measurement noise — safe to compress |
| Score drop > 0.10 | Real quality loss — keep the chunk |
| Negative score drop | Quality improved — the chunk was hurting performance |

Relationship to Anthropic's Skill-Creator Eval

Anthropic's official skill-creator uses a similar architecture: executor generates output, grader agent scores against structured expectations, and an aggregator computes benchmark statistics. The key differences:

| | Anthropic skill-creator | Our eval (eval_judge.py) |
|---|---|---|
| Grading | Multiple named assertions per case, each pass/fail | Single expected_behavior string, scalar score |
| Comparison | With-skill vs without-skill (or old vs new) | Baseline vs ablated variant |
| Extra features | Blind comparator, benchmark viewer, analyst agent | R-D curve computation, chunk importance ranking |
| Purpose | "Is this skill good?" | "Does removing this chunk change quality?" |

Our eval is a focused subset designed for compression testing. The skill-creator eval answers "is the skill useful?" Our eval answers "is this chunk useful?" Both use the same core insight: an LLM grader judging executor output against human-authored expectations.

Cost

| Component | Cost per test case |
|---|---|
| Executor (sonnet) | ~$0.010 |
| Grader (haiku) | ~$0.001 |
| Total | ~$0.011 |

For windagszip (14 positive test cases, 6 variants): ~$0.92 total. For a full 10-skill audit: ~$60.

Measurement Noise

LLM-as-judge has ±0.15 variance per test case. The same case scored 0.48 in one run and 0.83 in another. Aggregate across 10+ cases for stable signal. Score drops under ±0.05 are noise.
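A minimal sketch of why aggregation matters: per-case scores swing wildly between runs, but the mean over 10+ cases is stable. The score lists below are synthetic, constructed to mimic the ±0.15 per-case variance described above:

```python
import statistics

# Synthetic per-case scores for the same skill variant across two
# eval runs, illustrating ~±0.15 per-case judge variance.
run_a = [0.48, 0.71, 0.65, 0.80, 0.62, 0.74, 0.69, 0.77, 0.58, 0.70]
run_b = [0.83, 0.60, 0.72, 0.68, 0.75, 0.64, 0.70, 0.66, 0.73, 0.61]

# Individual cases disagree badly (case 0: 0.48 vs 0.83)...
case_gap = abs(run_a[0] - run_b[0])

# ...but the aggregate signal is stable: the mean difference sits
# well inside the 0.05 noise band.
delta = abs(statistics.mean(run_a) - statistics.mean(run_b))
```

This is why a single-case score drop is never treated as signal on its own.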

The Problem: Skills Are Expensive Real Estate

Every skill injected into an agent's prompt has three costs: money (input tokens billed per million), attention (the model must parse all of it), and latency (more tokens = slower time-to-first-token). Our skill library ranges from 795 tokens (code-review-checklist) to 14,245 tokens (mcp-creator). When a WinDAGs workflow injects 3-5 skills per agent across 5 waves of parallel execution, token budgets matter.

The question is simple: how much of each skill is actually load-bearing?

Two Types of Fat

We found that skill redundancy comes in two distinct flavors, and they require completely different tools to detect.

Intra-Skill Redundancy

Chunk A says the same thing as chunk B within the same skill. Think of a CSS tutorial that shows position: absolute; inset: 0 in eleven different code blocks — aurora containers, atmosphere layers, fog containers, cloud layers, rain overlays. Each block demonstrates the technique in a slightly different context, but the embedding model sees them as near-identical.

Detection: Embed every chunk with all-MiniLM-L6-v2 (384 dimensions, 23MB ONNX quantized). Compute pairwise cosine similarity. Cluster connected components above 0.70. Within each cluster, keep the canonical version (highest token count) and mark the rest as redundant.

Cost: Zero. The model runs locally. No API calls.

Pretraining Overlap

A chunk teaches something Claude already knows from training data. The PBR shading reference file in metal-shader-expert walks through Physically Based Rendering concepts that Claude learned from thousands of graphics programming resources during training. The reference confirms what Claude knows rather than teaching anything new.

Detection: Remove the chunk and run the LLM-judged eval. If the score doesn't drop, the chunk was pretraining overlap. This requires an executor model (sonnet) to generate a response using the ablated skill, and a grader model (haiku) to score against expected behavior.

Cost: ~$0.011 per test case. About $0.15-0.60 per skill depending on test suite size.
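The ablation decision reduces to comparing score drops against the noise thresholds from the rubric above. A sketch, using the metal-shader-expert scores reported later in this post (the real pipeline gets these scores from the executor+grader round trip):

```python
NOISE = 0.05       # drops below this are measurement noise
REAL_LOSS = 0.10   # drops above this mean the chunk is load-bearing

def classify_chunks(baseline_score: float, ablated_scores: dict) -> dict:
    """ablated_scores maps chunk_id -> mean eval score with that
    chunk removed. Returns a verdict per chunk."""
    verdicts = {}
    for chunk_id, score in ablated_scores.items():
        drop = baseline_score - score
        if drop < NOISE:
            verdicts[chunk_id] = "cut"     # pretraining overlap (or harmful)
        elif drop > REAL_LOSS:
            verdicts[chunk_id] = "keep"    # real quality loss
        else:
            verdicts[chunk_id] = "review"  # gray zone between thresholds
    return verdicts

# Scores from the metal-shader-expert ablations in this post:
verdicts = classify_chunks(0.860, {
    "pbr-shaders.md": 0.872,    # quality improved when removed
    "noise-effects.md": 0.860,  # no change
    "debug-tools.md": 0.793,    # 0.067 drop: between the two thresholds
})
```

Note that debug-tools.md lands in the gray zone rather than a clean cut, which matches the post's conclusion that only two of the three reference files are dispensable.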

Chunk Similarity Matrix for web-weather-creator — top 20 chunks by token count. Red borders indicate pairs above the 0.70 clustering threshold.
web-weather-creator similarity matrix. Red borders = chunk pairs above 0.70 threshold.

Here's what intra-skill redundancy actually looks like. This is the similarity matrix for web-weather-creator — every red-bordered cell is a pair of chunks saying nearly the same thing.

The diagonal is self-similarity (always 1.0). The off-diagonal red-bordered cells are the interesting ones: one comprehensive reference file subsumes eleven scattered code examples, and the embedding model sees through surface variation to their shared semantic structure.

These two types explain a clean bifurcation in the data.

The Embedding Pipeline

Chunking

We reuse the semantic chunker from the quality pass, which identifies 12 chunk types. The main ones:

| Type | Ablatable | Notes |
|---|---|---|
| FRONTMATTER_FIELD | Some | name/description: never remove |
| SECTION / SUBSECTION | No | Structural anchors |
| COMPOUND | Yes | Paragraph + code block bonded together |
| PARAGRAPH | Yes | Prose content |
| LIST_BLOCK | Yes | Complete list (ablated as unit) |
| CODE_BLOCK | Yes | Code examples |
| REFERENCE | Yes | External reference files (often >1,000 tokens) |
| MERMAID | Yes | Diagram blocks |

Compound units matter: a paragraph that introduces a code block ("Here's how fog works:") must be ablated with its code block. Without this rule, you'd remove the paragraph and orphan the code block — or vice versa.
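The bonding rule can be sketched as a single pass over the chunk stream. This is a simplified stand-in for the real chunker, assuming chunks arrive as (type, text) pairs in document order and using "paragraph ends with a colon" as the bonding heuristic:

```python
def bond_compounds(chunks: list) -> list:
    """Merge a lead-in paragraph with the code block it introduces so
    the pair is ablated as one unit. Heuristic (an assumption, not the
    tool's exact rule): a paragraph ending in ':' immediately before a
    code block forms a compound."""
    out = []
    i = 0
    while i < len(chunks):
        kind, text = chunks[i]
        nxt = chunks[i + 1] if i + 1 < len(chunks) else None
        if (kind == "paragraph" and text.rstrip().endswith(":")
                and nxt is not None and nxt[0] == "code_block"):
            out.append(("compound", text + "\n" + nxt[1]))
            i += 2  # consume both halves
        else:
            out.append((kind, text))
            i += 1
    return out

chunks = [
    ("paragraph", "Here's how fog works:"),
    ("code_block", ".fog-layer { position: absolute; inset: 0; }"),
    ("paragraph", "Fog opacity should stay under 0.4."),
]
bonded = bond_compounds(chunks)  # first two chunks merge into a compound
```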

Similarity Matrix

Every ablatable chunk gets embedded into a 384-dimensional vector. We L2-normalize all embeddings, compute cosine similarity via matrix multiplication, and apply a degenerate-embedding guard (norm < epsilon → skip) to avoid NaN from empty/tiny chunks.

The result is an N×N similarity matrix where each cell is the cosine similarity between two chunks — shown above for web-weather-creator.
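The matrix computation is a few lines of NumPy. A sketch of the steps just described, with toy 2-D vectors standing in for the 384-dimensional embeddings:

```python
import numpy as np

def similarity_matrix(embeddings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pairwise cosine similarity via L2-normalize + matmul.
    Degenerate embeddings (norm < eps) are zeroed so they never
    cluster with anything, avoiding NaN from empty/tiny chunks."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    safe = np.where(norms < eps, 1.0, norms)            # avoid divide-by-zero
    unit = np.where(norms < eps, 0.0, embeddings / safe)
    return unit @ unit.T

# Toy "embeddings": chunks 0 and 1 identical, chunk 2 orthogonal,
# chunk 3 degenerate (all zeros).
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
sim = similarity_matrix(emb)
```

The degenerate row comes back as all zeros rather than NaN, which is exactly the guard behavior described above.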

Clustering

We build an adjacency graph: edge between chunks when similarity > 0.70. Connected components via BFS. Within each component, the chunk with the highest token count is the canonical version. Everything else is redundant.

```
Cluster 1 (avg sim: 0.847, redundant tokens: 2,127)
  KEEP [reference   ] 1821tk  Full CSS layering reference...
  CUT  [code_block  ]  312tk  .aurora-container { position: absol...
  CUT  [code_block  ]  287tk  .atmosphere-layer { position: absol...
  CUT  [compound    ]  198tk  Here's how fog works: ...
  ...
```
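The clustering step itself is plain BFS over a thresholded graph. A sketch under the rules just described (similarity matrix and token counts here are illustrative):

```python
from collections import deque

def redundancy_clusters(sim, tokens, threshold=0.70):
    """Connected components over the 'similar enough' graph; within
    each component, the highest-token chunk is the canonical keeper."""
    n = len(tokens)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:  # BFS along edges with similarity > threshold
            i = queue.popleft()
            comp.append(i)
            for j in range(n):
                if j not in seen and i != j and sim[i][j] > threshold:
                    seen.add(j)
                    queue.append(j)
        if len(comp) > 1:  # singletons are not redundancy clusters
            keep = max(comp, key=lambda i: tokens[i])
            clusters.append({"keep": keep,
                             "cut": [i for i in comp if i != keep]})
    return clusters

# Chunks 0-2 mutually similar; chunk 3 stands alone.
sim = [[1.00, 0.85, 0.78, 0.10],
       [0.85, 1.00, 0.82, 0.05],
       [0.78, 0.82, 1.00, 0.12],
       [0.10, 0.05, 0.12, 1.00]]
tokens = [1821, 312, 287, 500]
clusters = redundancy_clusters(sim, tokens)
```

Chunk 0 (the 1,821-token reference) becomes the keeper; chunks 1 and 2 are marked redundant, mirroring the KEEP/CUT report above.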

Results Across 10 Skills

Embedding-detected intra-skill redundancy across all 10 specimen skills. The two code-heavy skills (web-weather-creator and modern-auth-2026) show >20% self-duplication, while the remaining 8 fall below 5%.

The 10 Specimen Skills

We selected 10 skills spanning both compression regimes — from a 14,245-token MCP construction manual to a 795-token checklist generator. Here's what each one does and what the embeddings found inside.

web-weather-creator 13,009 tk
Stylized atmospheric weather effects — aurora, fog, rain, clouds — using CSS animations, SVG filters, and layered compositions.
25.2% redundant · 7 clusters · CODE-HEAVY

modern-auth-2026 7,114 tk
Modern authentication patterns — passkeys, WebAuthn, OAuth 2.0, session management, and credential storage.
24.4% redundant · 13 clusters · CODE-HEAVY

typescript-adv-patterns 8,753 tk
Advanced TypeScript — branded types, discriminated unions, conditional types, Zod schema inference, exhaustive checking.
4.2% redundant · 5 clusters · KNOWLEDGE-DENSE

metal-shader-expert 4,787 tk
Apple Metal GPU shaders — PBR rendering, noise effects, debug tools, GPU Family 9 specifics, half vs float precision.
2.4% redundant · 3 clusters · KNOWLEDGE-DENSE

mcp-creator 14,245 tk
Building Model Context Protocol servers — tools, resources, prompts, transports, testing, and deployment patterns.
0.4% redundant · 2 clusters · KNOWLEDGE-DENSE

security-auditor 4,852 tk
Security vulnerability scanning — OWASP top 10, dependency analysis, secrets detection, code hardening patterns.
0.3% redundant · 3 clusters · KNOWLEDGE-DENSE

diagramming-expert 2,180 tk
Text-based diagrams — Mermaid, PlantUML, and ASCII art for technical documentation and architecture visuals.
0.8% redundant · 2 clusters · KNOWLEDGE-DENSE

prompt-engineer 1,438 tk
Prompt optimization — chain-of-thought, few-shot examples, structured outputs, temperature tuning, evaluation.
0.2% redundant · 1 cluster · KNOWLEDGE-DENSE

competitive-cartographer 4,235 tk
Strategic competitive analysis — market mapping, positioning matrices, moat identification, and threat assessment.
0.1% redundant · 1 cluster · KNOWLEDGE-DENSE

code-review-checklist 795 tk
Context-aware code review checklists — security, performance, maintainability, test coverage, accessibility.
0.0% redundant · 0 clusters · KNOWLEDGE-DENSE

The data splits cleanly into two regimes:

| Skill | Tokens | Clusters | Redundant tokens | Redundant % |
|---|---|---|---|---|
| web-weather-creator | 13,009 | 7 | 3,276 | 25.2% |
| modern-auth-2026 | 7,114 | 13 | 1,737 | 24.4% |
| typescript-adv-patterns | 8,753 | 5 | 364 | 4.2% |
| metal-shader-expert | 4,787 | 3 | 113 | 2.4% |
| diagramming-expert | 2,180 | 2 | 18 | 0.8% |
| mcp-creator | 14,245 | 2 | 55 | 0.4% |
| security-auditor | 4,852 | 3 | 15 | 0.3% |
| prompt-engineer | 1,438 | 1 | 3 | 0.2% |
| competitive-cartographer | 4,235 | 1 | 3 | 0.1% |
| code-review-checklist | 795 | 0 | 0 | 0.0% |

Code-heavy skills (web-weather-creator, modern-auth-2026) have 20-25% intra-skill redundancy. These are skills stuffed with code examples showing variations of the same pattern. Compression is free — embeddings alone recover thousands of tokens at zero API cost.

Knowledge-dense skills (the other 8) have less than 5% intra-skill redundancy. Each chunk is distinct from every other chunk within the skill. But the graded eval showed these skills still compress 30-46%. Where does that compression come from? Pretraining overlap. The chunks duplicate what Claude already knows, not what other chunks in the skill already say.

What This Looks Like in Embedding Space

The 2D UMAP projection below shows what redundancy looks like when you flatten 384-dimensional chunk embeddings into two dimensions. Each dot is a chunk from the skill. Nearby dots are semantically similar. Colored clusters are the redundancy groups — within each, the green-ringed dot is the canonical keeper, and the colored dots are cuttable duplicates.

Interactive: hover for a preview, click any dot for the full chunk text. Top row shows code-heavy skills (25% redundant); bottom row shows knowledge-dense skills (<5% redundant).

The contrast is immediate. web-weather-creator's plot is dominated by overlapping colored clusters — seven groups of near-identical CSS code. metal-shader-expert is scattered gray dots with barely any clustering. The embedding model sees what the similarity matrix confirmed: code-heavy skills repeat themselves, knowledge-dense skills don't.

web-weather-creator Deep Dive

Seven clusters. 3,276 redundant tokens (25.2%).

The biggest cluster (2,127 tokens) tells a clean story: one reference file on "Layering Strategies" comprehensively covers the CSS positioning pattern. But eleven orphaned code blocks throughout the skill demonstrate the same technique in context — aurora containers, atmosphere layers, fog effects, cloud systems. The embedding model sees them as semantically identical (average similarity: 0.847). The reference file is the canonical version; the eleven blocks are fragments.

Cluster 2 (698 tokens): four SVG filter code blocks (wave, beach, fog, rain) all use <filter><feTurbulence><feDisplacementMap>. One compound example subsumes the other three.

modern-auth-2026 Deep Dive

Thirteen clusters. 1,737 redundant tokens (24.4%).

The dominant cluster (1,458 tokens) shows seven authentication code blocks all sharing similar patterns with a canonical passkey implementation. The code looks different enough to a human reader — different variable names, different error handling — but the embedding model sees through the surface variation to the shared semantic structure.

The LLM-Judged Pipeline

Embeddings handle intra-skill duplication. But 8 of 10 skills have less than 5% of that. To find pretraining overlap, we need the LLM.

Our initial approach was random ablation — remove one chunk at a time, run the full eval suite, measure the quality delta. This works conceptually but scales terribly: 240 API calls per skill ($1.32), ~28 minutes of wall time, and results within ±0.15 measurement noise. Across 10 skills, that's 2,880 calls, ~$13, and roughly five hours of wall time. The semantic embedding pass does the same job for $0 in 20 seconds, so we use it as the first pass and reserve the LLM judge for the targeted second pass — only evaluating chunks that survived embedding dedup.

We built eval_judge.py: a two-phase LLM-judged pipeline.

Phase 1 — Executor: Inject the ablated skill text into a sonnet prompt. Ask the model to demonstrate skill-specific knowledge. The model must answer from the injected skill alone — no tool use allowed.

Phase 2 — Grader: A haiku model evaluates the response against expected behavior assertions. Each assertion gets PASS/FAIL, and the grader produces a continuous 0-1 quality score. The grader is strict: generic advice that any AI could produce without the skill gets a FAIL, even if it's technically correct.

This grader design matters. We're not testing whether Claude can review code (it can). We're testing whether the code-review-checklist skill adds value beyond what Claude already knows. If removing a chunk causes no quality drop, the chunk was either genuinely redundant or covered by training data. Both mean it's safe to remove.
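The two phases can be sketched as prompt assembly plus grade parsing. The model IDs, prompt wording, and grader output handling below are illustrative assumptions (the real tool is eval_judge.py); the actual API round trip would use the Anthropic SDK's messages.create, which is omitted here so the helpers stay testable offline:

```python
import json
import re

EXECUTOR_MODEL = "sonnet-model-id"  # placeholders: substitute real model IDs
GRADER_MODEL = "haiku-model-id"

def executor_prompt(skill_text: str, query: str) -> str:
    # Phase 1: the (possibly ablated) skill is injected via a <skill>
    # tag; the model must answer from it alone, with no tool use.
    return (f"<skill>\n{skill_text}\n</skill>\n\n"
            f"{query}\n\n"
            "Answer using only the skill above. Do not use tools.")

def grader_prompt(response: str, expected_behavior: str) -> str:
    # Phase 2: strict grading against the author's expected behavior.
    return ("You are a strict grader. Generic advice that any AI could "
            "produce without the skill is a FAIL.\n\n"
            f"Response:\n{response}\n\n"
            f"Expected behavior:\n{expected_behavior}\n\n"
            'Reply with JSON: {"pass": bool, "score": float, "evidence": str}')

def parse_grade(raw: str) -> dict:
    """Pull the first JSON object out of the grader's reply."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("no JSON in grader output")
    return json.loads(match.group(0))

grade = parse_grade('Verdict: {"pass": true, "score": 0.82, '
                    '"evidence": "cites the 0.70 default threshold"}')
```

Keeping the prompts and the parser as pure functions means the expensive API calls are the only untested seam.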

Rate-Distortion Theory

We frame compression as a rate-distortion problem. Rate is tokens consumed in the context window. Distortion is quality drop from compression. The R(D) curve traces the Pareto frontier: minimum tokens for a given quality level. The knee of the curve is the optimal compression point.

Rate-distortion curves from graded eval. Left: code-review-checklist shows quality IMPROVING as tokens are removed. Right: metal-shader-expert achieves 46% compression with near-zero quality loss.
R-D curves: green = quality improved, red = degraded. The knee of the curve is the optimal compression point.
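Finding the knee is mechanical once the curve exists. A sketch using the windagszip R-D points reported later in this post, where "knee" is taken to mean the most compressed variant whose cumulative distortion stays inside the ±0.05 noise band (that operational definition is our assumption):

```python
NOISE = 0.05  # distortion below this is measurement noise

def find_knee(rd_points):
    """rd_points: list of (tokens, cumulative_distortion) pairs.
    Return the lowest-token variant still within measurement noise."""
    safe = [(t, d) for t, d in rd_points if d < NOISE]
    return min(safe, key=lambda p: p[0])

# (tokens, cumulative distortion) for windagszip, 0-4 chunks removed.
rd = [(2363, 0.000), (2245, -0.016), (2138, -0.011),
      (2031, -0.004), (1912, 0.003)]
knee = find_knee(rd)  # the 4-chunks-removed variant: 19.1% compression
```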

code-review-checklist: The Anti-Compression Finding

| Variant | Tokens | Score | Change |
|---|---|---|---|
| baseline | 522 | 0.646 | |
| -list_block (46tk) | 476 | 0.731 | +0.085 |
| -list_block (101tk) | 421 | 0.669 | +0.023 |
| -list_block (53tk) | 469 | 0.654 | +0.008 |

Every chunk removal improved quality. The R-D curve goes down — compression reduces distortion.

Why? code-review-checklist is 63 lines about generating code review checklists. Claude already knows how to review code. The skill's list blocks (specific checklist categories) constrain Claude to match those particular items rather than thinking freely. Remove the constraints, and Claude draws from its broader knowledge to produce better reviews.

Implication: Skills about well-known topics should be thin. Their value is in routing ("this user needs a code review checklist"), not content (providing the actual items). Adding detailed content about topics Claude already knows can actively hurt quality.

metal-shader-expert: 46% Compression, Zero Loss

| Variant | Tokens | Score | Score drop |
|---|---|---|---|
| baseline | 4,781 | 0.860 | |
| -pbr-shaders.md | 3,636 | 0.872 | -0.012 |
| -noise-effects.md | 3,739 | 0.860 | 0.000 |
| -debug-tools.md | 3,532 | 0.793 | +0.067 |

Two of three reference files are dispensable. The skill compresses from 4,781 to ~2,600 tokens (46%) with zero quality loss. Claude's training data covers PBR shading and noise functions. The references confirm rather than teach.

The real value is in the SKILL.md body: the shibboleths table (half vs float precision, TBDR tile-based deferred rendering, intersector API), the anti-pattern warnings, and the Apple GPU Family 9 specifics. That's ~1,400 tokens of irreducible signal — the stuff Claude doesn't know from training.

windagszip: The Skill That Compresses Itself

The most meta test: running windagszip's own eval pipeline on the windagszip skill itself.

| Chunk removed | Tokens | Score | Drop | Pass% |
|---|---|---|---|---|
| baseline (full skill) | 2,363 | 0.892 | | 100% |
| "Step 3: Validate Quality" paragraph + code block | -118 | 0.908 | -0.016 | 100% |
| "Human intuition is unreliable" paragraph | -107 | 0.888 | +0.004 | 100% |
| "Two compression regimes" paragraph + code block | -107 | 0.885 | +0.007 | 93% |
| "More detail always helps" anti-pattern paragraph | -119 | 0.885 | +0.007 | 93% |
| Output contract list (4 items) | -141 | 0.831 | +0.061 | 93% |

The R-D curve tells the compression story:

| Chunks removed | Tokens | Tokens remaining | Distortion |
|---|---|---|---|
| 0 (baseline) | 2,363 | 100% | 0.000 |
| 1 | 2,245 | 95.0% | -0.016 |
| 2 | 2,138 | 90.5% | -0.011 |
| 3 | 2,031 | 86.0% | -0.004 |
| 4 | 1,912 | 80.9% | +0.003 |

Four chunks removed. 451 tokens saved. 19.1% compression. And the cumulative distortion is +0.003 — indistinguishable from noise.

The removed chunks are fascinating:

ablate-016 (quality improved when removed): Instructions on how to run eval_judge.py — the exact tool being used to evaluate it. The executor model doesn't need eval CLI commands to demonstrate compression methodology. The chunk is meta-operational, not content-bearing.

ablate-028, ablate-019, ablate-027 (noise-level drops): Paragraphs explaining why the approach works — "human intuition is unreliable," "two compression regimes," "more detail doesn't always help." These are persuasive prose, not executable knowledge. Claude already knows information theory.

ablate-042 (real quality loss): The output contract — the four-bullet list specifying what windagszip produces (redundancy analysis report, compressed variants, variants manifest, quality evaluation results). This chunk carries the most load because it defines the skill's interface. Without it, the executor produces vague responses that miss concrete deliverables. The grader flags these as fails.

The pattern: Interface definitions and domain-specific details are load-bearing. Explanatory prose about well-known concepts is not. The skill's value isn't in explaining why compression works — it's in specifying what to do and what to produce.

The Two Regimes

Two compression regimes — code-heavy skills detected by embeddings for free; knowledge-dense skills need graded eval.
Code-heavy on the left, knowledge-dense on the right. Different tools for different fat.

Skills fall into two compression regimes:

Code-heavy (>20% intra-skill redundancy): web-weather-creator, modern-auth-2026. These are fat with duplicate code examples. Compress them with embeddings alone — no API costs.

Knowledge-dense (<5% intra-skill): the other 8 skills. Each chunk is unique within the skill, but duplicates Claude's training data. Skip embeddings (there's nothing to find). Go straight to graded eval. The compression savings come from pretraining overlap.

The Optimal Pipeline

WinDAGSZip algorithm — Pass 1 (embedding dedup) is free and local. Pass 2 (graded eval) uses API calls on survivors only.
The two-pass pipeline. Pass 1 is free. Pass 2 runs only on survivors.

Always run embeddings first. The free pass reduces the surface area for the expensive pass. On web-weather-creator, this means graded eval runs on 75% of the original tokens instead of 100% — saving ~25% of API costs for Pass 2.
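The saving compounds mechanically, because Pass 2 cost scales with surviving tokens. A sketch with the web-weather-creator numbers from this post:

```python
def pass2_token_fraction(total_tokens: int, redundant_tokens: int) -> float:
    """Fraction of the skill the graded eval still has to cover after
    the free embedding pass removes intra-skill duplicates. Pass 2
    API cost scales roughly with this fraction."""
    return (total_tokens - redundant_tokens) / total_tokens

# web-weather-creator: 13,009 tokens, 3,276 flagged redundant by embeddings.
frac = pass2_token_fraction(13_009, 3_276)  # just under 0.75
```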

Packaging as WinDAGSZip

We packaged the pipeline as a distributable skill: skills/windagszip/SKILL.md. A skill about compressing skills — properly meta.

The embedding model ships with WinDAGs as a first-pass compressor. The graded eval is optional — it requires API calls, so it's opt-in for users who want to trace the full R(D) curve.

Running It

WinDAGSZip terminal demo — analyzing web-weather-creator (25% redundancy) and the 10-skill summary table

```shell
# Analyze redundancy (free)
cd tools/skill-compression
python embed_ablate.py web-weather-creator

# Generate compressed variants
python embed_ablate.py web-weather-creator --generate

# Validate quality (costs ~$0.011/case)
python eval_judge.py web-weather-creator \
  --variants ablations/web-weather-creator/redundancy-variants.jsonl
```

Distribution

WinDAGSZip ships alongside next-move and agentic-patterns — three skills in the first wave of public distribution:

  • next-move: Analyzes the current project state and recommends what to do next
  • agentic-patterns: The five pillars of effective agent behavior (decomposition, orchestration, recovery, context management, quality self-assessment)
  • windagszip: Compress any skill with embeddings + optional graded eval

Practical Guidance

When to Compress

  • Before deployment: Shrink your skill library before shipping
  • After major skill edits: Re-run to catch newly introduced redundancy
  • During library audits: The redundancy overview chart is a dashboard for skill health

What the Numbers Mean

| Redundancy | Meaning | Action |
|---|---|---|
| >20% | Heavy self-duplication | Embeddings alone |
| 5-20% | Moderate | Embeddings first, then graded eval |
| <5% | Minimal self-duplication | Skip to graded eval |
| 0% | No intra-skill redundancy | All savings from pretraining overlap |
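The table above collapses into a small triage function, with thresholds taken directly from it:

```python
def triage(redundancy_pct: float) -> str:
    """Map embedding-detected redundancy to a compression strategy."""
    if redundancy_pct > 20:
        return "embeddings alone"
    if redundancy_pct >= 5:
        return "embeddings first, then graded eval"
    if redundancy_pct > 0:
        return "skip to graded eval"
    return "all savings from pretraining overlap"

plans = {name: triage(pct) for name, pct in [
    ("web-weather-creator", 25.2),
    ("metal-shader-expert", 2.4),
    ("code-review-checklist", 0.0),
]}
```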

Quality Thresholds

See the eval rubric at the top of this post for score interpretation, measurement noise, and the relationship to Anthropic's skill-creator evaluation framework.

The 23MB embedding model is sufficient for intra-document comparison. You don't need SOTA. All chunks share topical context (they're from the same skill), so even a small model detects semantic overlap.

What's Next

This post measured redundancy within individual skills. Post 3 looks across the boundary: if two skills produce nearly identical eval vectors on a shared test suite, they're behavioral duplicates regardless of how different their text looks.

The plan:

  • Cross-skill deduplication: Build the full behavior matrix (195 skills × N test cases). PCA reveals the natural dimensionality of the skill library. If 195 skills collapse to ~30 principal components, we have massive inter-skill redundancy.
  • Skill interpolation: Given related skills A and B, compress as shared_base + delta_A + delta_B. Delta encoding for knowledge.

The tools from this post — the chunker, the embedder, the graded eval — carry forward. The compression pipeline gets more powerful as we move from within-skill to between-skill analysis.


The WinDAGSZip skill and other public skills are available at github.com/curiositech/some_claude_skills. The compression tools are under tools/skill-compression/.