
When Meta-Skills Collide
What happens when two skill-improvers improve each other?
We built a cross-evaluation agent, pointed Anthropic's skill-creator and our skill-architect at each other, and recorded what happened. Real transcripts. Real diffs. Real analysis of what two different philosophies of skill-building value.
TL;DR: We ran two competing AI skill-evaluation tools against each other — Anthropic's open-source skill-creator and our skill-architect. Both violated their own rules. Neither caught what the other caught. Self-evaluation has a structural blind spot that cross-evaluation fills. The experiment converged to A- from both directions.
- Contenders — factory vs. library
- Evaluate — initial scorecards
- Improve — mutual fixes
- Cross-Spiral — the blind spots
- Conclusion — 7 learnings
You're picking a "skill builder" for the weekend. Anthropic ships one. We ship one. They're both open-source, both promise to help you build better Claude skills, both look serious. You don't have time to evaluate both. Which one's better?
So we did the experiment for you. We pointed each at the other, three rounds, and recorded everything. The transcripts and diffs are in the tabs below — but the short version is that neither is simply better. They're philosophically different, in a way that becomes obvious when you watch them critique each other. One values the factory floor (run the eval, ship the score). One values the library (encode the failure modes, keep the rubric). When they cross, each catches what the other can't see.
Here's how that happened.
In November 2025, right after I launched someclaudeskills.com, I noticed a gap. Some skills I knew from the inside out — hard-won career knowledge I'd been carrying around for years. (Production systems frequently do better with logistic regression on simple obvious features than with carefully-built bespoke ML. If you're rigging 3D avatar blendshapes, plan your correctives before you're done with the base shapes, not after.) Others I had basically no idea how to build. I was just writing down what I thought I knew and hoping it would hold up when an agent actually tried to use it.
That gap bothered me. So I built Skill-Coach — a meta-skill whose whole job is to look at other skills and make them better. And then immediately: why not run it on itself?

Our first meta-skilling run, November 2025: Skill-Coach Improves Itself: 5 Iterations of Meta-Improvement
A skill (in the Claude Code skills framework) is a SKILL.md file — a structured document that injects domain expertise into an agent before it runs. A meta-skill is a skill whose job is to evaluate and improve other skills. We have two of them. One is Anthropic's, open-sourced under Apache 2.0. One is ours. We pointed them at each other and recorded what happened.
Skill Creator (SC) is Anthropic's meta-skill: 10 Python scripts, 3 evaluation agents, 1,325 lines of interactive HTML viewer. A factory floor. Skill Architect (SA) is WinDAGs' meta-skill: 13 reference documents, a scoring rubric, an anti-pattern catalog, 4 validation scripts. A library.
Here's the map — SA and SC are functions, SA(SC) means "skill-architect evaluates and improves skill-creator," ∘ is composition. Click any node to see the file tree and diffs for that version:
Composition Algebra: Click a Node
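Read literally, the node labels compose like the Skill-Coach generations further down: SC₁ = SA(SC₀) is skill-creator after one skill-architect pass, SA₁ = SC(SA₀) is the reverse crossing, and (SA ∘ SC)(X) = SA(SC(X)) chains the two. (That's just the notation, assuming the obvious reading of the labels; the actual interleaving of self- and cross-rounds is laid out in the evaluation matrix below.)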
The Original Meta-Skill
Skill-Coach was the experiment that started all of this. Its job: look at any skill and make it better. Better triggering, cleaner structure, more honest about what it can and can't do. Eight reference files, a recursive self-improvement workflow, and then: run it on itself five times.
skill-coach/
├── SKILL.md                       (~400 lines)
├── CHANGELOG.md
├── scripts/
│   ├── validate_skill.py
│   ├── check_self_contained.py
│   └── test_activation.py
└── references/
    ├── antipatterns.md            anti-pattern catalog with case studies
    ├── shibboleths.md             expert vs. novice vocabulary patterns
    ├── validation-checklist.md    complete review and testing guide
    ├── self-contained-tools.md    scripts, MCPs, and subagent patterns
    ├── scoring-rubric.md          quantitative 0–10 skill evaluation
    ├── skill-composition.md       cross-skill dependencies
    ├── skill-lifecycle.md         versioning and deprecation
    └── mcp_vs_scripts.md          when to use Skills vs. Agents vs. MCPs
How It Works
Skill-Coach applies a six-step creation process and a progressive disclosure philosophy:
- Phase 1 (~100 tokens): Metadata — "should I activate?"
- Phase 2 (<5k tokens): SKILL.md — "how do I do this?"
- Phase 3 (as needed): References — "show me the details"
The description formula: [What] [Use for] [Keywords] NOT for [Exclusions]. Its own description is the example: "Guides creation of high-quality Agent Skills... Activate on: create skill, review skill, skill quality... NOT for general coding advice, slash commands, MCP development."
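Put together, a bare-bones skill folder looks something like this. A minimal sketch, not any of the skills in this experiment: the name/description frontmatter keys are the standard ones, and everything else (the skill name, description, and workflow content) is invented for illustration.
mkdir -p migration-reviewer/references   # hypothetical example skill
cat > migration-reviewer/SKILL.md <<'EOF'
---
name: migration-reviewer
description: >-
  Reviews SQL migrations for lock-heavy DDL and unsafe backfills.
  Activate on: review migration, check DDL safety, zero-downtime schema change.
  NOT for query tuning, ORM modeling, or application code review.
---

## Workflow
1. Read the migration file the user points at.
2. Flag operations that hold long locks (Phase 2 knowledge lives here).
3. For the full lock matrix, read references/locks.md (Phase 3, loaded only when needed).
EOF
The frontmatter is Phase 1, the body is Phase 2, and the references/ pointer is Phase 3 of the progressive disclosure ladder described above.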
The recursive self-improvement workflow uses its own scripts:
python scripts/validate_skill.py <path> # structural check
python scripts/check_self_contained.py <path> # phantom reference check
python scripts/test_activation.py <path> # activation rate check
Address ERRORS first, then WARNINGS, then SUGGESTIONS. Update CHANGELOG.md. Re-run until clean.
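The "re-run until clean" step is literally a loop. A minimal sketch, assuming the three scripts exit non-zero while anything is still flagged (their real exit-code behavior may differ) and using a hypothetical path:
SKILL=path/to/skill-coach            # hypothetical path
until python scripts/validate_skill.py "$SKILL" &&
      python scripts/check_self_contained.py "$SKILL" &&
      python scripts/test_activation.py "$SKILL"
do
  # The agent (or you) edits SKILL.md and references/ here: ERRORS first,
  # then WARNINGS, then SUGGESTIONS, logging each change in CHANGELOG.md.
  read -r -p "Fix the findings above, then press Enter to re-run... "
done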
What Skill-Coach^5 Found About Itself
Each generation found something the previous one was too close to see:
| Generation | What It Found | What Changed |
|---|---|---|
| SK₀ → SK₁ | Description triggered on "make my prompt better" | Narrowed to skill-specific vocabulary |
| SK₁ → SK₂ | Improvement workflow assumed full folder access without saying so | Added explicit folder-reading step |
| SK₂ → SK₃ | No NOT clause — fired on generic "quality review" queries | Added exclusion for non-skill content |
| SK₃ → SK₄ | scoring-rubric.md referenced criteria not defined in it | Added definitions, linked to examples |
| SK₄ → SK₅ | Shibboleths section wasn't itself written using shibboleths | Rewrote using domain vocabulary throughout |
By SK₅: tighter triggering, self-consistent examples, a workflow that matched its own structure. Cleaner, not longer.
Why It Matters Here
Skill-Coach established that a meta-skill can improve itself. SA and SC are meta-skills with different improvement philosophies. The question this experiment asks: what happens when you cross-apply them instead of self-applying them? Do they find the same things?
They don't.
Thank You, Anthropic
This experiment exists because Anthropic open-sourced their skill creation tooling under the Apache 2.0 license. They didn't have to. The license permits reproduction, derivative works, and public display with attribution. We include their complete skill-creator with full provenance — every file, every script, every agent definition. For the latest version: their repository. Everything here is a snapshot from commit b0cbd3df, March 7, 2026.
The Setup
Each skill got its own source folder, the target folder (read-only), and a writable output copy. Tools: Read, Write, Edit, Glob, Grep, Bash.
claude -p "You are Skill Architect. Your source folder: {sa_path}.
Evaluate and improve skill-creator at: {sc_path}.
Write all improvements to: {output_path}." \
--allowed-tools "Read,Write,Edit,Glob,Grep,Bash(python:*)" \
--permission-mode bypassPermissions
Each agent read its own skill folder — references, scripts, examples — before scoring the target. Round 1: SKILL.md only. Round 2: full folder access. Round 3: after each skill self-improved, cross-evaluate again.
Three rounds. Six crossings. The braid is below.
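For reference, one round of the harness is just a loop over the four evaluator × target pairs. A rough sketch with hypothetical folder names and output layout, not the actual orchestration code:
ROUND=1                                 # hypothetical layout: skills/{sa,sc}, out/round-N/
for PAIR in "sa sc" "sc sa" "sa sa" "sc sc"; do
  set -- $PAIR; EVALUATOR=$1; TARGET=$2
  OUT="out/round-$ROUND/$EVALUATOR-on-$TARGET"
  mkdir -p "$OUT" && cp -r "skills/$TARGET/." "$OUT"   # writable copy of the target
  claude -p "You are $EVALUATOR. Your source folder: skills/$EVALUATOR.
Evaluate and improve the skill at: skills/$TARGET (read-only).
Write all improvements to: $OUT." \
    --allowed-tools "Read,Write,Edit,Glob,Grep,Bash(python:*)" \
    --permission-mode bypassPermissions
done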
Evaluation Matrix
All four evaluator × target combinations. Cross-evaluations found what self-evaluations missed. Score shown as before → after.
| Evaluator | Target | Round 1 | Round 2 | Round 3 |
|---|---|---|---|---|
| SA skill-architect | SC skill-creator | 5.1/10 | 6.0→8.0/10 | 8.2→8.8/10 |
| SC skill-creator | SA skill-architect | 6.7/10 | 7.6/10 | 8.8→8.97/10 |
| SA skill-architect | SA (self) | 7.3/10 | 7.3→8.8/10 | 8.8→8.9/10 |
| SC skill-creator | SC (self) | 7.0/10 | 7.0→8.2/10 | 8.2→8.6/10 |
Cross-Evaluation
- Added 2 Mermaid diagrams (creation flow + eval loop)
- Added NOT clause to description
- Added shibboleth section with templates
- Moved platform sections → references/ (~70 lines)
- Added description-optimization.md
- Added platform-notes.md
- Reduced SKILL.md 485→383 lines
- Added reference index
- Found 525-line violation (SC's own 500-line rule)
- Added missing eval-loop Mermaid diagram
- Added NOT clause to description
- Created CHANGELOG.md
- Rewrote description for trigger clarity
- Added Output Contracts section
- Added activation flowchart
- Identified 505-line violation
- Fixed validate_mermaid.py bug (identical error/warning icons)
- Removed phantom reference in antipatterns.md
- Converted 6 ASCII diagrams → Mermaid
- Added troubleshooting.md
- Found EVALUATION.md phantom self-contamination
- Extended validate_skill.py to scan all .md files
- Fixed HTML entities in 4 reference files
- Added <!-- phantom-ok --> annotation support
Self-Evaluation
- Unrestricted Bash violates own least-privilege rule
- Anti-patterns section doesn't use own shibboleth template
- 23-type Mermaid table bloating SKILL.md (belongs in references)
- Progressive Disclosure scored 6/10 — content at wrong layer
- Scoped Bash to Bash(python:*) — fixed own least-privilege violation
- Self-Containment 6→9 (+3): fixed all 7 phantom references
- Visual Artifacts 5→9 (+4): added progressive disclosure diagram
- Reduced 504→467 lines, still missed EVALUATION.md phantom
- R2's 8.8 was inflated: re-assessed as 7.4 (broken Mermaid, invented keys, phantoms)
- Fixed 31+ HTML entities breaking Mermaid rendering in 5 reference files
- Removed invented frontmatter keys contradicting own Invalid Keys guidance
- Compressed SKILL.md 466→381 lines; added Self-Consistency as 7th dimension
- Description not 'pushy' — violates own optimization advice (ironic)
- run-1/ path missing breaks aggregate_benchmark.py (functional bug)
- No grader subagent prompt template despite documenting grader flow
- Self-Containment 4/10 — no resource inventory or graceful degradation
- Rewrote description imperative and 'pushy' (own medicine)
- Added resources inventory with graceful degradation paths
- Fixed run-1/ paths in all output templates
- Reordered grader steps 7↔8 (can't write timing before reading it)
- Fixed description voice: second-person → imperative (own medicine, again)
- Fixed aggregate_benchmark.py eval_id parsing — root cause, not just docs
- Removed "Cool? Cool." colloquialism breaking instructional tone
- Added eval_metadata.json schema to schemas.md (referenced but undefined)
Score Evolution
Four evaluation paths, one per evaluator × target pair, showing who is evaluating whom and how each score moved across the three rounds.
File Evolution Explorer
Pick a journey: who evaluated whom? Then browse every file across all rounds. Colored dots show which versions contain each file.