When Meta-Skills Collide
We built a cross-evaluation agent, pointed Anthropic's skill-creator and our skill-architect at each other, and recorded what happened. Real transcripts. Real diffs. Real analysis of what two different philosophies of skill-building value.
Two AI skill-creation tools. Each thinks it knows how to build skills. We pointed them at each other and recorded what happened.
One is Anthropic's, open-sourced under Apache 2.0. One is ours. Both are meta-skills -- skills whose entire job is to create, evaluate, and improve other skills. We built an agent to wield each one against the other, captured every word, and now we're going to show you exactly what they said.
No summaries. No paraphrasing. The transcripts are right here.
Thank You, Anthropic
This experiment exists because Anthropic open-sourced their skill creation tooling under the Apache 2.0 license. They didn't have to. The license permits reproduction, derivative works, and public display with attribution. We include their complete skill-creator with full provenance -- every file, every script, every agent definition.
For the latest version, go to their repository. Everything here is a snapshot from commit b0cbd3df, March 7, 2026.
The Contenders: See For Yourself
Before we tell you what we found, look at both skills. Not a summary -- the actual folders. Every file, with line counts.
Anthropic's skill-creator
skill-creator/ (485 lines SKILL.md, 33,168 bytes)
├── SKILL.md 485 lines
├── LICENSE.txt 201 lines
├── agents/
│ ├── analyzer.md 275 lines
│ ├── comparator.md 203 lines
│ └── grader.md 224 lines
├── assets/
│ └── eval_review.html 45 lines
├── eval-viewer/
│ ├── generate_review.py 312 lines
│ └── viewer.html 1,247 lines
├── references/
│ └── schemas.md 306 lines
└── scripts/
├── __init__.py 0 lines
├── aggregate_benchmark.py 187 lines
├── generate_report.py 89 lines
├── improve_description.py 142 lines
├── package_skill.py 98 lines
├── quick_validate.py 72 lines
├── run_eval.py 156 lines
├── run_loop.py 234 lines
└── utils.py 47 lines
Total: 19 files. 9 Python scripts that actually run. 3 agent definitions for evaluation pipelines. An HTML eval viewer that generates standalone review pages. This is a factory floor.
WinDAGs' skill-architect
skill-architect/ (503 lines SKILL.md, 23,647 bytes)
├── SKILL.md 503 lines
├── CHANGELOG.md 139 lines
├── README.md 132 lines
├── agents/
│ └── cross-evaluator.md 87 lines
├── scripts/
│ ├── validate_mermaid.py 649 lines
│ ├── validate_skill.py 310 lines
│ ├── check_self_contained.py 210 lines
│ └── init_skill.py 193 lines
└── references/
├── antipatterns.md 308 lines
├── claude-extension-taxonomy.md 344 lines
├── description-guide.md 188 lines
├── knowledge-engineering.md 290 lines
├── mcp-template.md 118 lines
├── plugin-architecture.md 220 lines
├── scoring-rubric.md 82 lines
├── self-contained-tools.md 209 lines
├── skill-composition.md 87 lines
├── skill-lifecycle.md 95 lines
├── subagent-design.md 248 lines
├── subagent-template.md 196 lines
└── visual-artifacts.md 428 lines
Total: 22 files. 13 reference documents spanning knowledge engineering to plugin architecture. 4 Python scripts for validation and scaffolding. A scoring rubric. An anti-pattern catalog with shibboleth templates. This is a library.
What Skill-Creator's Scripts Actually Do
Anthropic shipped real tooling, not templates. Here's the pipeline:
| Script | Lines | What It Does |
|---|---|---|
| run_eval.py | 310 | Tests whether a skill's description triggers correctly. Spawns claude -p subprocesses for each eval query, captures whether the skill fired. JSON output. |
| run_loop.py | 328 | The main loop: run_eval + improve_description, iterating until all pass or max iterations. Tracks history, supports train/test split to prevent overfitting. |
| improve_description.py | 247 | Takes eval results and generates an improved description by calling claude -p. The improvement is guided by which queries failed to trigger. |
| aggregate_benchmark.py | 401 | Reads grading.json files from run directories, produces mean/stddev/min/max for each metric, computes deltas between with-skill and without-skill configurations. |
| generate_report.py | 326 | Generates a visual HTML report from run_loop output. Shows each description attempt with pass/fail for every test case, distinguishing train vs test. |
| package_skill.py | 136 | Creates a distributable .skill file (zip archive) from a skill folder. Validates frontmatter before packaging. |
| quick_validate.py | 102 | Checks frontmatter completeness: name, description, valid field names. Uses PyYAML. |
| utils.py | 47 | Shared SKILL.md parser -- extracts name, description, and full content from frontmatter. |
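The frontmatter check is simple enough to sketch. Here is a stdlib-only reconstruction of what a validator like quick_validate.py does (the real script uses PyYAML; the set of valid field names below is our assumption, not Anthropic's):

```python
# Stdlib-only sketch of a frontmatter completeness check (our reconstruction;
# the real quick_validate.py uses PyYAML). VALID_FIELDS is an assumed set.
import re

VALID_FIELDS = {"name", "description", "allowed-tools", "version"}

def quick_validate(skill_md: str) -> list[str]:
    errors = []
    match = re.match(r"^---\n(.*?)\n---\n", skill_md, re.DOTALL)
    if not match:
        return ["missing frontmatter block"]
    fields = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    for required in ("name", "description"):
        if not fields.get(required):
            errors.append(f"missing required field: {required}")
    for key in fields:
        if key not in VALID_FIELDS:
            errors.append(f"unknown field: {key}")
    return errors
```

Real YAML parsing handles nesting and quoting that this line-splitter doesn't; this is the shape of the check, not the script.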
The eval viewer (eval-viewer/generate_review.py, 312 lines + viewer.html, 1,247 lines) generates a standalone HTML page with two tabs: Outputs (review each test case, leave feedback) and Benchmark (quantitative comparison with pass rates, timing, token usage). It's a complete review workstation in a single HTML file.
The three agent definitions (agents/analyzer.md, agents/comparator.md, agents/grader.md) are prompts for specialized subagents. The grader evaluates assertions against outputs. The comparator does blind A/B comparison. The analyzer explains why one version beat another.
This is a factory floor. It's meant to be run, not read.
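The loop those scripts implement can be paraphrased in a few lines. This is our sketch of run_loop.py's control flow, not its code; run_eval and improve_description here are placeholders standing in for the claude -p subprocess calls:

```python
# Outline of the eval/improve loop (our paraphrase of run_loop.py's job).
# run_eval and improve_description stand in for claude -p subprocess calls.
def optimize_description(skill, queries, run_eval, improve_description,
                         max_iters=5):
    history = []
    for _ in range(max_iters):
        results = run_eval(skill, queries)          # did each query trigger?
        failed = [q for q, fired in results.items() if not fired]
        history.append((skill["description"], len(failed)))
        if not failed:                              # every query triggers: done
            break
        skill = improve_description(skill, failed)  # rewrite guided by failures
    return skill, history
```

The real run_loop.py additionally tracks a train/test split so the description is not overfit to the eval queries; that detail is omitted here.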
The Algebra
We have two operators and two artifacts. Here's the map of what we did:
Composition Algebra
SA and SC are functions. SA(SC) means "skill-architect evaluates and improves skill-creator." The ∘ is function composition.
The question that drives this entire experiment: what happens when you iterate?
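The notation can be made concrete with a toy model. Everything below is invented for illustration; the real operators are agents, not pure functions:

```python
# Toy model of the composition algebra. Invented for illustration only;
# the real SA and SC are agents, not pure functions.
def SA(skill):
    return {**skill, "trace": skill["trace"] + ["SA"]}

def SC(skill):
    return {**skill, "trace": skill["trace"] + ["SC"]}

sc0 = {"name": "skill-creator", "trace": []}
sc1 = SA(sc0)        # SA(SC): the architect improves the creator
sc2 = SA(sc1)        # Path A: the same evaluator, iterated

# The commutation question: is SA(SC(x)) the same artifact as SC(SA(x))?
left = SA(SC(sc0))   # trace: ["SC", "SA"]
right = SC(SA(sc0))  # trace: ["SA", "SC"]
```

Each operator leaves its fingerprint, so the order of composition is visible in the result, which is exactly why commutation is in question.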
The Critiques
Each skill got its own source folder, the target folder (read-only), and a writable output copy. Tools: Read, Write, Edit, Glob, Grep, Bash. (Anthropic's skill-creator ships with three specialized agent definitions — agents/analyzer.md, agents/comparator.md, agents/grader.md. SC read and loaded them as part of reading its own folder.)
```shell
claude -p "You are Skill Architect. Your source folder: {sa_path}.
Evaluate and improve skill-creator at: {sc_path}.
Write all improvements to: {output_path}." \
  --allowed-tools Read,Write,Edit,Glob,Grep,Bash(python:*) \
  --permission-mode bypassPermissions
```
Each agent read its own references and ran its scripts before scoring. Here's the initial scorecard and side-by-side breakdown:
Evaluation Scorecard
| Criterion | skill-creator (baseline) | skill-architect (WinDAGs) |
|---|---|---|
| Frontmatter | 6 | 7 |
| Progressive Disclosure | 5 | 6 |
| Anti-Patterns | 4 | 4 |
| Visual Artifacts | 3 | 7 |
| Shibboleths | 7 | 4 |
| Self-Containment | 5 | 7 |
| Activation Quality | 6 | 8 |
| Total | 36 | 43 |
Layer 1: The Critiques
What skill-architect found in skill-creator:
- Missing allowed-tools, missing NOT clause
- 485 lines, duplicated JSON schemas in body + references
- Commits 3 of 10 anti-patterns it teaches others to avoid
- Zero Mermaid diagrams for 4+ complex workflows
- Some implicit signals, no systematic encoding
- Scripts are real and functional; no requirements.txt
- Good adaptive behavior, but weak explicit triggers
Verbatim from transcript
There are **zero** Mermaid diagrams or visual artifacts anywhere in the SKILL.md or its references.
The skill's own validation script (quick_validate.py) checks allowed-tools as a valid field, yet the skill itself doesn't use it.
The skill teaches others to avoid these exact anti-patterns but commits three of them itself.
What it changed
- Added 2 Mermaid diagrams (lifecycle flowchart + eval sequence)
- Added NOT clause: "NOT for installing skills, general coding, skill browsing"
- Added allowed-tools: Read,Write,Edit,Bash,Glob,Grep,TodoWrite,WebFetch
- Reorganized into 7 numbered Phases matching lifecycle diagram
- Added Common Mistakes shibboleth table
- Removed "Cool? Cool." and informal tone markers
What skill-creator found in skill-architect:
- Practices what it preaches but uses jargon like "distribution surfaces"
- Genuinely expert anti-patterns, novel shibboleth template
- 503 lines — violates its own <500 rule
- Four working scripts, cross-evaluator agent
- Exemplary CHANGELOG, but no version in frontmatter
Verbatim from transcript
The SKILL.md is 503 lines. The skill's own rule says '<500 lines.' This is a do-as-I-say-not-as-I-do violation.
'expert-level progressive disclosure' is insider jargon that no user will type. A plumber opening their terminal won't query about 'progressive disclosure.'
Some 'expert knowledge' is really just API documentation. These are facts, not shibboleths.
What it changed
- Description rewritten: "my skill doesn't trigger" instead of "expert-level progressive disclosure"
- Line count reduced from 503 to ~430 (under the 500-line rule!)
- 5 sections moved to references (Mermaid 23-type table, Rejection Causes, Platform Constraints, etc.)
- Anti-pattern table gained "Why It Matters" column
- Added version: 2.2.0 and last-reviewed date to frontmatter
- Proposed new references/frontmatter-reference.md
What Skill-Architect Found in Skill-Creator
Score: 5.1/10 — then 6.0 → 8.0 with full folder access
The architect applied its seven-dimension rubric with forensic precision:
| Criterion | Score | Key Finding |
|---|---|---|
| Frontmatter | 6/10 | Missing NOT clause; description not precisely pushy |
| Progressive Disclosure | 5/10 | Near 500-line limit; platform-specific sections bloating SKILL.md |
| Anti-Patterns | 4/10 | No NOT clause, no Mermaid diagrams, platform bloat, no shibboleth section |
| Visual Artifacts | 3/10 | Only the directory tree; workflow, eval loop, and trigger optimization are prose |
| Shibboleths | 7/10 | Genuine domain expertise: trigger rates, when quantitative evals help, overfitting risk |
| Self-Containment | 5/10 | Scripts likely real but no explicit bundled resources list |
| Activation Quality | 6/10 | Correct triggers, but false-positive risk on adjacent "prompt engineering" queries |
On the visual artifacts finding:
"The only structural artifact is the directory tree. The creation workflow, the eval loop (spawn → draft assertions → grade → viewer → feedback → iterate), and the trigger optimization loop are all described as numbered steps in prose. These are textbook Mermaid diagram candidates."
Notably, skill-architect gave skill-creator a 7/10 on shibboleths — higher than skill-architect's own self-evaluation score on that dimension (4/10). skill-creator encodes real expertise about when quantitative evals help vs. don't, the overfitting risk in description optimization loops, and the importance of reading transcripts not just metrics. The architect recognized expertise it couldn't represent in its own format.
With its source folder loaded, SA ran its own validators and rubric before writing a single change. Score moved to 8.0. What changed in SC's folder:
| Addition | What it is |
|---|---|
| description-optimization.md | SA's description craft methodology — didn't exist in SC |
| platform-notes.md | Platform adaptation guide — was buried in SKILL.md body |
| CHANGELOG.md | Version history — didn't exist |
SA also added the two Mermaid diagrams it flagged as missing (a high-level creation-to-packaging flowchart and a sequence diagram of the eval loop) and a proper anti-pattern section with Novice/Expert/Timeline shibboleth templates.
SA's honest comparison:
"skill-creator is a more operationally sophisticated skill than skill-architect. It ships a complete eval pipeline, working scripts, three specialized agent definitions, a JSON schema reference, and a working HTML viewer — none of which sa0 ships in comparable completeness. In terms of self-contained tooling, sc0 is arguably better. Where sa0 exceeds sc0: structural rigor."
An evaluator acknowledging the target is better-equipped than itself.
What Skill-Creator Found in Skill-Architect
Score: 6.7/10 — then 7.6/10 with full folder access
The creator used its own seven-dimension rubric and found real gaps on the operational side:
| Dimension | Score | Key Finding |
|---|---|---|
| Triggering | 6/10 | "Design, create, audit, improve" misses natural phrasings: "write a skill", "build a skill" |
| Output Quality | 5/10 | Create has a template; audit, improve, debug have no output contract |
| Eval Loop Readiness | 7/10 | Validation checklist is assertion-ready; success metrics are measurable |
| Iteration Support | 7/10 | Clear metrics (>90% activation, <5% false positive); scoring rubric lives in references, not inline |
| Communication Clarity | 8/10 | Strong tables and diagrams; "shibboleth" appears before it's defined |
| Description Optimization | 6/10 | Lists specific operations; should lead with what users get, not what operations exist |
| Self-Containment | 8/10 | References indexed; one path ambiguity in script commands |
The output quality gap:
"Four operations are implied (create, audit, improve, debug) but only `create` has a defined output format. Audit, improve, and debug produce... something. An eval harness would struggle to grade audit or debug outputs automatically."
And the irony finding:
"The SKILL.md is 503 lines. The skill's own rule says '<500 lines.' This is a do-as-I-say-not-as-I-do violation."
With full folder access, SC ran SA's own validators against SA:
validate_skill.py:

```
[size] SKILL.md is 505 lines (max 500). ERROR.
```

check_self_contained.py:

```
Phantom reference: references/server-components-deep-dive.md (does not exist)
```
SA's SKILL.md violates its own 500-line rule. SA's own checker caught it.
Then SC went into the reference files. skill-lifecycle.md — lifecycle state machine rendered as ASCII box-drawing. Anti-pattern #10 in SA's own catalog says "use Mermaid." SC replaced it with a proper stateDiagram-v2. skill-composition.md — five ASCII art dependency diagrams, all converted to Mermaid flowcharts.
SC also found a real script bug in validate_mermaid.py:
```python
# Line 565 — both branches produce identical empty strings
icon = " " if issue.severity == "error" else " "
```
Error and warning were visually identical in output. SC fixed it.
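The fix only has to make the two branches distinguishable. A plausible reconstruction (the actual icon strings aren't in the transcript excerpt, so the markers below are ours):

```python
# Hypothetical shape of the fix: the two severities must render differently.
# "[ERROR]"/"[WARN]" are our stand-ins, not the strings SC actually used.
def severity_icon(severity: str) -> str:
    return "[ERROR]" if severity == "error" else "[WARN]"
```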
SC's biggest move: Output Contracts — a new section defining what each operation produces (create → SKILL.md file; audit → structured report with dimension scores; improve → complete rewritten SKILL.md with diff; debug → diagnosed root cause and concrete fix). This transforms the skill from instruction-only to assertion-ready.
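An output contract is just data: operation in, required artifact out. Our sketch of the idea, with the structure assumed rather than copied from SC's new section:

```python
# Our sketch of SC's "Output Contracts" idea: each operation declares what
# it must produce, so an eval harness can assert on the output. The dict
# structure is our assumption; the artifact list follows the article.
OUTPUT_CONTRACTS = {
    "create":  {"artifact": "SKILL.md file"},
    "audit":   {"artifact": "structured report", "must_include": ["dimension scores"]},
    "improve": {"artifact": "rewritten SKILL.md", "must_include": ["diff"]},
    "debug":   {"artifact": "diagnosis", "must_include": ["root cause", "concrete fix"]},
}

def assertable(operation: str) -> bool:
    """An operation is eval-ready only if it declares a concrete artifact."""
    contract = OUTPUT_CONTRACTS.get(operation)
    return bool(contract and contract.get("artifact"))
```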
What changed in SA's folder:
| Change | Type |
|---|---|
| SKILL.md: 505 → 476 lines | Compressed |
| Fixed false NOT clause (MCP excluded, but SA teaches MCP via 3 ref files) | Consistency |
validate_mermaid.py: error icon fixed |
Bug fix |
| Phantom reference removed from antipatterns.md | Phantom |
ASCII art → Mermaid in skill-lifecycle.md |
Diagram conversion |
ASCII art → Mermaid x5 in skill-composition.md |
Diagram conversion |
troubleshooting.md |
New file |
SC's honest comparison:
"skill-architect is more comprehensive than skill-creator in raw content — more shibboleths, more reference files, working scripts. skill-creator's advantage is tighter discipline around eval methodology and assertion-based quality measurement. If these two skills were composed, the combined quality would exceed either alone."
Self-Evaluations: Mirrors Turned Inward
We also ran SA on itself and SC on itself with full folder access.
SA on SA: SA ran check_self_contained.py against itself. The script returned 7 "phantom" references. 5 were false positives — the checker was matching reference patterns inside illustrative prose. SA's quality gate was fundamentally broken: it was flagging its own documentation examples as missing files. SA fixed check_self_contained.py with an ILLUSTRATIVE_MARKERS regex, then added activation-debugging.md — a gap it found by noticing the skill listed activation debugging as a use case but shipped no content for it.
Grade: 7.3 → 8.8/10 (B → B+)
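The false-positive fix reduces to one exemption in the checker. A reconstruction under assumptions: the real ILLUSTRATIVE_MARKERS regex isn't shown in the transcripts, so the marker list here is our guess:

```python
# Our reconstruction of SA's check_self_contained.py fix: skip references
# that appear in clearly illustrative prose. Marker list is assumed.
import re

ILLUSTRATIVE_MARKERS = re.compile(
    r"(for example|e\.g\.|such as|false positive|illustrative|hypothetical)",
    re.IGNORECASE,
)
REF_PATTERN = re.compile(r"(?:references|scripts)/[\w.-]+\.(?:md|py)")

def find_phantoms(markdown: str, existing_files: set[str]) -> list[str]:
    phantoms = []
    for line in markdown.splitlines():
        if ILLUSTRATIVE_MARKERS.search(line):
            continue  # documentation example, not a real dependency
        for ref in REF_PATTERN.findall(line):
            if ref not in existing_files:
                phantoms.append(ref)
    return phantoms
```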
SC on SC: SC found a functional path bug. The SKILL.md tells users to save outputs to eval-<ID>/with_skill/outputs/. But aggregate_benchmark.py expects grading.json at eval-<ID>/with_skill/run-*/grading.json. The aggregator skips directories with no run-* subdirs. Result: benchmark.json would always be empty. A silent bug that a user would only discover after running a complete eval cycle. SC fixed it, rewrote the description to follow its own "pushy principle," and added eval-patterns.md.
Score: 8.2/10
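The aggregator bug is a one-level glob mismatch. A minimal illustration (the paths follow the article; the function is our sketch, not aggregate_benchmark.py):

```python
# Sketch of the path mismatch (our reconstruction). SKILL.md told users to
# write to with_skill/outputs/, but the aggregator globs one level deeper,
# under with_skill/run-*/. With no run-* directories, nothing is found.
from pathlib import Path

def find_grading_files(eval_dir: str) -> list[Path]:
    return sorted(Path(eval_dir).glob("with_skill/run-*/grading.json"))
```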
What the Tools Revealed
Each evaluator's tools revealed a different dimension of quality.
SA's tools are architectural probes. They find what's missing: no diagram, no NOT clause, no anti-pattern section. They don't fire on existing content.
SC's tools are consistency probes. They find what's wrong: script bugs, path inconsistencies, self-contradictions. They don't add new architectural layers.
Both revealed something text couldn't. SA's own check_self_contained.py was generating false positives from its own illustrative prose — the skill was failing the very quality gate it ships. SC's aggregate_benchmark.py would always produce empty output because of a missing directory level. Neither failure was visible from reading SKILL.md alone. Both were invisible until you ran the code.
A skill you can't run is less self-contained than it appears.
The Iteration Paths
The Braid: Layer 1 Crossings
Now that we have four artifacts (see the algebra diagram above), there are three paths for iteration:
Path A: Fixed-Base -- SA(SC₁) = SC₂, SA(SC₂) = SC₃... Same evaluator grinding. Measures convergence.
Path B: Self-Reflective -- SA(SA) = SA₁, then SA₁(SC). Does self-improvement make you a better evaluator?
Path C: Cross-Spiral -- Each generation uses the OTHER's latest version. This is the braid. Does the diagram commute? SC(SA(SC)) vs SA(SC(SA))? Almost certainly not. The difference reveals what each evaluator cannot see about itself.
Round 3: The Cross-Spiral
In Round 2, each skill evaluated the other from scratch. In Round 3, we let each skill improve itself first — then sent its improved version to evaluate the other's improved version. SA₁ (SA after self-evaluating its full folder) evaluated SC₁ (SC after self-evaluating its full folder). SC₁ evaluated SA₁. The question: what does a cross-evaluator find after both sides have already cleaned their own houses?
SA₁ Evaluates SC₁
Score: 8.2 → 8.8/10 (A−)
SA₁ ran its validation scripts against SC₁ and found three things SC₁'s self-evaluation missed:
526 lines. SC₁ had grown past 500 lines during self-improvement (iter-2 added a workspace diagram and a grader prompt template without extracting anything). SC's scripts don't enforce the 500-line rule — SA's do. SC₁ couldn't catch itself violating a rule it doesn't measure.
No Mermaid diagram for the core loop. SC₁'s central concept — the create/test/grade/improve eval loop — is described in prose. SA₁'s cross-evaluator spotted this immediately; it's one of SA's seven evaluation dimensions (Visual Artifacts). SC₁ evaluated itself against its own rubric, which doesn't include that dimension. The diagram that was missing was the one SC was most accustomed to not having.
No NOT clause. SC₁'s description still lacked the exclusion clause SA's rubric requires. Again: SC₁'s rubric doesn't mandate NOT clauses. It couldn't flag their absence in itself.
SA₁'s summary of the finding:
"self-evaluators don't notice what they're used to reading."
SA₁'s convergence assessment: the diff from iter-2 to iter-3 is "meaningfully smaller" than iter-1 to iter-2. Convergence confidence: high.
SC₁ Evaluates SA₁
Score: 8.8 → 8.97/10 (A−)
SC₁ ran SA₁'s validation scripts cold and immediately found something the self-evaluation had missed — because the self-evaluation couldn't have seen it.
SA₁'s EVALUATION.md described the phantom reference fixes it made. To explain which paths had been false positives, it cited them inline:
- "`scripts/analyze.py` (false positive — illustrative)"
- "`references/X.md` (false positive — illustrative placeholder)"
- "referenced `references/api-guide.md`"
- "cited `scripts/validate.py`"
SC₁ ran check_self_contained.py on the finished output and it failed — on exactly those four lines. The evaluation document describing the phantom-detection fix was itself triggering phantom detection. SA₁ had declared the checker now passed. The checker was failing on SA₁'s own words about passing.
SC₁'s diagnosis:
"The evaluation was written after the fix, so the author wasn't running the checker against the completed EVALUATION.md. This is a workflow gap: self-evaluation only validates the skill body, not the evaluation artifact itself."
SC₁ also found HTML entities in four reference files — `&gt;`, `&lt;`, `&amp;` rendering as literal text in the agent-loaded markdown. SA₁'s validate_skill.py only checked SKILL.md. SC₁ extended the validator to scan all .md files recursively, catching the entities the tool had always missed.
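The recursive scan is a few lines. Our sketch of the extended check (the entity list and function name are ours, not SC₁'s):

```python
# Our sketch of SC₁'s extension: scan every .md file under a skill folder
# for literal HTML entities that would render as text, not markup.
import re
from pathlib import Path

ENTITY = re.compile(r"&(?:gt|lt|amp|quot|#\d+);")

def find_entity_leaks(skill_dir: str) -> list[tuple[str, int]]:
    leaks = []
    for md in sorted(Path(skill_dir).rglob("*.md")):
        for lineno, line in enumerate(md.read_text(encoding="utf-8").splitlines(), 1):
            if ENTITY.search(line):
                leaks.append((md.name, lineno))
    return leaks
```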
SC₁'s convergence assessment:
"Diminishing returns visible. Each iteration yields smaller gains."
What the Cross-Spiral Reveals
Two kinds of evaluator blindness emerged clearly in Round 3:
You can't see what you don't measure. SA₁ found SC₁'s missing visual artifact and line-count violation because SA's rubric measures those. SC₁ wouldn't have flagged either — they're not in SC's quality framework. Each cross-evaluator finds a different shadow: the shadow of its own values, projected onto the target.
You can't see what you just wrote. SA₁'s EVALUATION.md described the fix in the very language that would trigger the failure. This isn't carelessness — it's structural. You write the evaluation after fixing the files, you reference the paths you fixed, and you don't think to run the checker again on the document you're in the middle of writing. A cross-evaluator reads the finished artifact cold. The self-evaluator never gets that perspective.
Both rounds converged to the same composite score: A−. But they got there by finding different things.
What We Learned
1. Tools inherit their creator's values
Skill-architect was built by someone who values knowledge architecture: layered references, progressive disclosure, shibboleth encoding, Mermaid diagrams. Skill-creator was built by someone who values measurement infrastructure: scripts, benchmarks, automated grading, iteration loops.
When each evaluates the other, they find what's missing from their own perspective. The architect gave skill-creator 3/10 on visual artifacts because diagrams are sacred to the architect. The creator scored skill-architect 5/10 on output quality because assertability is sacred to the creator — if you can't write a test for it, it doesn't exist.
Neither is wrong. They're applying different value systems.
2. Both skills violate their own rules
The architect's SKILL.md is 503 lines. Its own rule says <500.
The creator teaches "pushy descriptions" and "always include a NOT clause." Its own description is neutral with no exclusions.
The creator's quick_validate.py checks for allowed-tools in frontmatter. The creator's own frontmatter doesn't have it.
This isn't a gotcha. This is the fundamental problem of meta-tools: the cobbler's children go barefoot.
3. NOT clauses are contextual, not universal
With 15 skills, NOT clauses are hygiene. With 191 skills, they're architecture. The right answer depends on how many skills are competing for activation in your namespace.
4. The scorecard had a surprise
The architect gave skill-creator 7/10 on shibboleths — its highest score, and higher than skill-architect's own self-evaluation on that dimension (4/10). This is real: skill-creator encodes subtle expertise about when quantitative evals help vs. don't, the overfitting risk in description optimization loops, and the importance of reading transcripts instead of just metrics. The architect recognized expertise it couldn't represent in its own format.
Meanwhile, both evaluators independently found the same description problem: internal vocabulary that users don't type. "Expert-level progressive disclosure" became "my skill doesn't trigger." "Measure skill performance" became "run skill evals." The semantic matching engine doesn't care about your internal vocabulary. Both skills failed this test on themselves.
5. The best meta-skill would be both
An encyclopedia with a factory floor. The architect's knowledge depth (13 reference files, anti-pattern catalogs, shibboleth templates, 23-type Mermaid guide) combined with the creator's measurement infrastructure (9 scripts, 3 evaluation agents, HTML viewer, benchmark aggregation). Neither covers the full space alone.
Both evaluators concluded this independently. SA said SC's tooling is "arguably better" on self-containment. SC said if the two skills "were composed, the combined quality would exceed either alone." Two different philosophies. Same answer.
6. Tools are more honest than text
In Layer 1, each skill described what it valued. In Layer 2, each skill demonstrated what it valued by using its tools.
SA's tools are architectural probes. Running them finds what's structurally absent. SC's tools are consistency probes. Running them finds what's mechanically broken.
Both revealed something text wouldn't. SA's own check_self_contained.py was generating false positives from its own illustrative prose — the skill was failing the very quality gate it ships. SC's aggregate_benchmark.py would always produce empty output because of a missing directory level in the path structure. Neither failure was visible from reading SKILL.md alone. Both were invisible until you ran the code.
The implication: a skill you can't run is less self-contained than it appears.
7. The evaluation artifact is also subject to evaluation
In Round 3, SA₁'s EVALUATION.md — the document that said "check_self_contained.py now passes" — caused check_self_contained.py to fail. The self-evaluator writes the evaluation after fixing the files, references the paths it just fixed, and doesn't think to run the checker against the document it's in the middle of writing.
SC₁, reading the finished output cold, ran the checker and found the failure immediately.
This is a general principle: the act of documenting a fix can reintroduce the problem being fixed. A self-evaluator can't get outside its own output to notice this. A cross-evaluator can.
It also suggests that evaluation infrastructure needs its own testing. SC₁'s fix — adding <!-- phantom-ok --> annotations and extending ILLUSTRATIVE_MARKERS with evaluation-document prose patterns — was itself a form of meta-evaluation: auditing the evaluator's assumptions about its own output.
What Comes Next
Three rounds of cross-evaluation are done. Both evaluators are now at A− and agree they're near convergence. The interesting remaining questions aren't about improving these skills further — they're structural questions about the evaluation process itself.
Does the braid commute? The cross-spiral ran SA₁(SC₁) and SC₁(SA₁) — two directions of the same crossing. Both found different things and both landed at the same grade. But if you computed SA(SC(SA)) vs SC(SA(SC)) from the originals, would they converge to the same point? The two evaluators apply different rubrics and find different failures. The limit might depend on which direction you travel.
What's the fixed point of the composition SA ∘ SC? We have SA(SC₁) = SC₂ and SC(SA₁) = SA₂. But what about (SA ∘ SC)(SA) — applying both evaluators in sequence to the same target? Round 2 and Round 3 applied them independently. What would the composed skill look like if you ran both evaluators together, letting each build on the other's findings?
Can you compose the skills? Both evaluators independently concluded that the ideal meta-skill would have SA's knowledge depth and SC's measurement infrastructure. That's a hypothesis, not a skill. Building it would require merging 13 reference files with 9 evaluation scripts, unifying two different rubrics, and resolving the NOT-clause philosophy difference. That's a design problem, not an evaluation problem.
The experiment continues. Transcripts and diffs are in the eval-data directory. The eval-viewer (shipped as part of skill-creator's tooling) can render the benchmark results as a standalone HTML review page if you want to explore the grading data yourself.
Anthropic's skill-creator is included under Apache 2.0 with full attribution (see PROVENANCE.md).