When Meta-Skills Collide
We built a cross-evaluation agent, pointed Anthropic's skill-creator and our skill-architect at each other, and recorded what happened. Real transcripts. Real diffs. Real analysis of what two different philosophies of skill-building value.
Two AI skill-creation tools. Each thinks it knows how to build skills. We pointed them at each other and recorded what happened.
One is Anthropic's, open-sourced under Apache 2.0. One is ours. Both are meta-skills -- skills whose entire job is to create, evaluate, and improve other skills. We built an agent to wield each one against the other, captured every word, and now we're going to show you exactly what they said.
No summaries. No paraphrasing. The transcripts are right here.
Thank You, Anthropic
This experiment exists because Anthropic open-sourced their skill creation tooling under the Apache 2.0 license. They didn't have to. The license permits reproduction, derivative works, and public display with attribution. We include their complete skill-creator with full provenance -- every file, every script, every agent definition.
For the latest version, go to their repository. Everything here is a snapshot from commit b0cbd3df, March 7, 2026.
The Contenders: See For Yourself
Before we tell you what we found, look at both skills. Not a summary -- the actual folders. Every file, with line counts.
Anthropic's skill-creator
skill-creator/ (485 lines SKILL.md, 33,168 bytes)
├── SKILL.md 485 lines
├── LICENSE.txt 201 lines
├── agents/
│ ├── analyzer.md 275 lines
│ ├── comparator.md 203 lines
│ └── grader.md 224 lines
├── assets/
│ └── eval_review.html 45 lines
├── eval-viewer/
│ ├── generate_review.py 312 lines
│ └── viewer.html 1,247 lines
├── references/
│ └── schemas.md 306 lines
└── scripts/
├── __init__.py 0 lines
├── aggregate_benchmark.py 187 lines
├── generate_report.py 89 lines
├── improve_description.py 142 lines
├── package_skill.py 98 lines
├── quick_validate.py 72 lines
├── run_eval.py 156 lines
├── run_loop.py 234 lines
└── utils.py 47 lines
Total: 19 files. 9 Python scripts that actually run. 3 agent definitions for evaluation pipelines. An HTML eval viewer that generates standalone review pages. This is a factory floor.
WinDAGs' skill-architect
skill-architect/ (503 lines SKILL.md, 23,647 bytes)
├── SKILL.md 503 lines
├── CHANGELOG.md 139 lines
├── README.md 132 lines
├── agents/
│ └── cross-evaluator.md 87 lines
├── scripts/
│ ├── validate_mermaid.py 649 lines
│ ├── validate_skill.py 310 lines
│ ├── check_self_contained.py 210 lines
│ └── init_skill.py 193 lines
└── references/
├── antipatterns.md 308 lines
├── claude-extension-taxonomy.md 344 lines
├── description-guide.md 188 lines
├── knowledge-engineering.md 290 lines
├── mcp-template.md 118 lines
├── plugin-architecture.md 220 lines
├── scoring-rubric.md 82 lines
├── self-contained-tools.md 209 lines
├── skill-composition.md 87 lines
├── skill-lifecycle.md 95 lines
├── subagent-design.md 248 lines
├── subagent-template.md 196 lines
└── visual-artifacts.md 428 lines
Total: 22 files. 13 reference documents spanning knowledge engineering to plugin architecture. 4 Python scripts for validation and scaffolding. A scoring rubric. An anti-pattern catalog with shibboleth templates. This is a library.
What Skill-Creator's Scripts Actually Do
Anthropic shipped real tooling, not templates. Here's the pipeline:
| Script | Lines | What It Does |
|---|---|---|
| run_eval.py | 310 | Tests whether a skill's description triggers correctly. Spawns claude -p subprocesses for each eval query, captures whether the skill fired. JSON output. |
| run_loop.py | 328 | The main loop: run_eval + improve_description, iterating until all pass or max iterations. Tracks history, supports train/test split to prevent overfitting. |
| improve_description.py | 247 | Takes eval results and generates an improved description by calling claude -p. The improvement is guided by which queries failed to trigger. |
| aggregate_benchmark.py | 401 | Reads grading.json files from run directories, produces mean/stddev/min/max for each metric, computes deltas between with-skill and without-skill configurations. |
| generate_report.py | 326 | Generates a visual HTML report from run_loop output. Shows each description attempt with pass/fail for every test case, distinguishing train vs test. |
| package_skill.py | 136 | Creates a distributable .skill file (zip archive) from a skill folder. Validates frontmatter before packaging. |
| quick_validate.py | 102 | Checks frontmatter completeness: name, description, valid field names. Uses PyYAML. |
| utils.py | 47 | Shared SKILL.md parser -- extracts name, description, and full content from frontmatter. |
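The frontmatter check is simple enough to sketch. Here is a stdlib-only reconstruction of what a validator like quick_validate.py does (the real script uses PyYAML; the set of valid field names below is our assumption, not Anthropic's):

```python
# Stdlib-only sketch of a frontmatter completeness check (our reconstruction;
# the real quick_validate.py uses PyYAML). VALID_FIELDS is an assumed set.
import re

VALID_FIELDS = {"name", "description", "allowed-tools", "version"}

def quick_validate(skill_md: str) -> list[str]:
    errors = []
    match = re.match(r"^---\n(.*?)\n---\n", skill_md, re.DOTALL)
    if not match:
        return ["missing frontmatter block"]
    fields = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    for required in ("name", "description"):
        if not fields.get(required):
            errors.append(f"missing required field: {required}")
    for key in fields:
        if key not in VALID_FIELDS:
            errors.append(f"unknown field: {key}")
    return errors
```

Real YAML parsing handles nesting and quoting that this line-splitter doesn't; this is the shape of the check, not the script.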
The eval viewer (eval-viewer/generate_review.py, 312 lines + viewer.html, 1,247 lines) generates a standalone HTML page with two tabs: Outputs (review each test case, leave feedback) and Benchmark (quantitative comparison with pass rates, timing, token usage). It's a complete review workstation in a single HTML file.
The three agent definitions (agents/analyzer.md, agents/comparator.md, agents/grader.md) are prompts for specialized subagents. The grader evaluates assertions against outputs. The comparator does blind A/B comparison. The analyzer explains why one version beat another.
This is a factory floor. It's meant to be run, not read.
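The loop those scripts implement can be paraphrased in a few lines. This is our sketch of run_loop.py's control flow, not its code; run_eval and improve_description here are placeholders standing in for the claude -p subprocess calls:

```python
# Outline of the eval/improve loop (our paraphrase of run_loop.py's job).
# run_eval and improve_description stand in for claude -p subprocess calls.
def optimize_description(skill, queries, run_eval, improve_description,
                         max_iters=5):
    history = []
    for _ in range(max_iters):
        results = run_eval(skill, queries)          # did each query trigger?
        failed = [q for q, fired in results.items() if not fired]
        history.append((skill["description"], len(failed)))
        if not failed:                              # every query triggers: done
            break
        skill = improve_description(skill, failed)  # rewrite guided by failures
    return skill, history
```

The real run_loop.py additionally tracks a train/test split so the description is not overfit to the eval queries; that detail is omitted here.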
The Algebra
We have two operators and two artifacts. Here's the map of what we did:
Composition Algebra
SA and SC are functions. SA(SC) means "skill-architect evaluates and improves skill-creator." The ∘ is function composition.
The question that drives this entire experiment: what happens when you iterate?
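The notation can be made concrete with a toy model. Everything below is invented for illustration; the real operators are agents, not pure functions:

```python
# Toy model of the composition algebra. Invented for illustration only;
# the real SA and SC are agents, not pure functions.
def SA(skill):
    return {**skill, "trace": skill["trace"] + ["SA"]}

def SC(skill):
    return {**skill, "trace": skill["trace"] + ["SC"]}

sc0 = {"name": "skill-creator", "trace": []}
sc1 = SA(sc0)        # SA(SC): the architect improves the creator
sc2 = SA(sc1)        # Path A: the same evaluator, iterated

# The commutation question: is SA(SC(x)) the same artifact as SC(SA(x))?
left = SA(SC(sc0))   # trace: ["SC", "SA"]
right = SC(SA(sc0))  # trace: ["SA", "SC"]
```

Each operator leaves its fingerprint, so the order of composition is visible in the result, which is exactly why commutation is in question.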
The Critiques
Each skill got its own source folder, the target folder (read-only), and a writable output copy. Tools: Read, Write, Edit, Glob, Grep, Bash. (Anthropic's skill-creator ships with three specialized agent definitions — agents/analyzer.md, agents/comparator.md, agents/grader.md. SC read and loaded them as part of reading its own folder.)
```shell
claude -p "You are Skill Architect. Your source folder: {sa_path}.
Evaluate and improve skill-creator at: {sc_path}.
Write all improvements to: {output_path}." \
  --allowed-tools Read,Write,Edit,Glob,Grep,Bash(python:*) \
  --permission-mode bypassPermissions
```
Each agent read its own references and ran its scripts before scoring. Here's the initial scorecard and side-by-side breakdown:
Evaluation Scorecard
| Criterion | skill-creator (baseline) | skill-architect (WinDAGs) |
|---|---|---|
| Frontmatter | 6 | 7 |
| Progressive Disclosure | 5 | 6 |
| Anti-Patterns | 4 | 4 |
| Visual Artifacts | 3 | 7 |
| Shibboleths | 7 | 4 |
| Self-Containment | 5 | 7 |
| Activation Quality | 6 | 8 |
| Total | 36 | 43 |
Layer 1: The Critiques
What skill-architect found in skill-creator:
- Missing allowed-tools, missing NOT clause
- 485 lines, duplicated JSON schemas in body + references
- Commits 3 of 10 anti-patterns it teaches others to avoid
- Zero Mermaid diagrams for 4+ complex workflows
- Some implicit signals, no systematic encoding
- Scripts are real and functional; no requirements.txt
- Good adaptive behavior, but weak explicit triggers
Verbatim from transcript
There are **zero** Mermaid diagrams or visual artifacts anywhere in the SKILL.md or its references.
The skill's own validation script (quick_validate.py) checks allowed-tools as a valid field, yet the skill itself doesn't use it.
The skill teaches others to avoid these exact anti-patterns but commits three of them itself.
What it changed
- Added 2 Mermaid diagrams (lifecycle flowchart + eval sequence)
- Added NOT clause: "NOT for installing skills, general coding, skill browsing"
- Added allowed-tools: Read,Write,Edit,Bash,Glob,Grep,TodoWrite,WebFetch
- Reorganized into 7 numbered Phases matching lifecycle diagram
- Added Common Mistakes shibboleth table
- Removed "Cool? Cool." and informal tone markers
What skill-creator found in skill-architect:
- Practices what it preaches but uses jargon like "distribution surfaces"
- Genuinely expert anti-patterns, novel shibboleth template
- 503 lines — violates its own <500 rule
- Four working scripts, cross-evaluator agent
- Exemplary CHANGELOG, but no version in frontmatter
Verbatim from transcript
The SKILL.md is 503 lines. The skill's own rule says '<500 lines.' This is a do-as-I-say-not-as-I-do violation.
'expert-level progressive disclosure' is insider jargon that no user will type. A plumber opening their terminal won't query about 'progressive disclosure.'
Some 'expert knowledge' is really just API documentation. These are facts, not shibboleths.
What it changed
- Description rewritten: "my skill doesn't trigger" instead of "expert-level progressive disclosure"
- Line count reduced from 503 to ~430 (under the 500-line rule!)
- 5 sections moved to references (Mermaid 23-type table, Rejection Causes, Platform Constraints, etc.)
- Anti-pattern table gained "Why It Matters" column
- Added version: 2.2.0 and last-reviewed date to frontmatter
- Proposed new references/frontmatter-reference.md
What Skill-Architect Found in Skill-Creator
Score: 5.1/10 — then 6.0 → 8.0 with full folder access
The architect applied its seven-dimension rubric with forensic precision:
| Criterion | Score | Key Finding |
|---|---|---|
| Frontmatter | 6/10 | Missing NOT clause; description not precisely pushy |
| Progressive Disclosure | 5/10 | Near 500-line limit; platform-specific sections bloating SKILL.md |
| Anti-Patterns | 4/10 | No NOT clause, no Mermaid diagrams, platform bloat, no shibboleth section |
| Visual Artifacts | 3/10 | Only the directory tree; workflow, eval loop, and trigger optimization are prose |
| Shibboleths | 7/10 | Genuine domain expertise: trigger rates, when quantitative evals help, overfitting risk |
| Self-Containment | 5/10 | Scripts likely real but no explicit bundled resources list |
| Activation Quality | 6/10 | Correct triggers, but false-positive risk on adjacent "prompt engineering" queries |
On the visual artifacts finding:
"The only structural artifact is the directory tree. The creation workflow, the eval loop (spawn → draft assertions → grade → viewer → feedback → iterate), and the trigger optimization loop are all described as numbered steps in prose. These are textbook Mermaid diagram candidates."
Notably, skill-architect gave skill-creator a 7/10 on shibboleths — higher than skill-architect's own self-evaluation score on that dimension (4/10). skill-creator encodes real expertise about when quantitative evals help vs. don't, the overfitting risk in description optimization loops, and the importance of reading transcripts not just metrics. The architect recognized expertise it couldn't represent in its own format.
With its source folder loaded, SA ran its own validators and rubric before writing a single change. Score moved to 8.0. What changed in SC's folder:
| Addition | What it is |
|---|---|
| description-optimization.md | SA's description craft methodology — didn't exist in SC |
| platform-notes.md | Platform adaptation guide — was buried in SKILL.md body |
| CHANGELOG.md | Version history — didn't exist |
SA also added the two Mermaid diagrams it flagged as missing (a high-level creation-to-packaging flowchart and a sequence diagram of the eval loop) and a proper anti-pattern section with Novice/Expert/Timeline shibboleth templates.
SA's honest comparison:
"skill-creator is a more operationally sophisticated skill than skill-architect. It ships a complete eval pipeline, working scripts, three specialized agent definitions, a JSON schema reference, and a working HTML viewer — none of which sa0 ships in comparable completeness. In terms of self-contained tooling, sc0 is arguably better. Where sa0 exceeds sc0: structural rigor."
An evaluator acknowledging the target is better-equipped than itself.
What Skill-Creator Found in Skill-Architect
Score: 6.7/10 — then 7.6/10 with full folder access
The creator used its own seven-dimension rubric and found real gaps on the operational side:
| Dimension | Score | Key Finding |
|---|---|---|
| Triggering | 6/10 | "Design, create, audit, improve" misses natural phrasings: "write a skill", "build a skill" |
| Output Quality | 5/10 | Create has a template; audit, improve, debug have no output contract |
| Eval Loop Readiness | 7/10 | Validation checklist is assertion-ready; success metrics are measurable |
| Iteration Support | 7/10 | Clear metrics (>90% activation, <5% false positive); scoring rubric lives in references, not inline |
| Communication Clarity | 8/10 | Strong tables and diagrams; "shibboleth" appears before it's defined |
| Description Optimization | 6/10 | Lists specific operations; should lead with what users get, not what operations exist |
| Self-Containment | 8/10 | References indexed; one path ambiguity in script commands |
The output quality gap:
"Four operations are implied (create, audit, improve, debug) but only `create` has a defined output format. Audit, improve, and debug produce... something. An eval harness would struggle to grade audit or debug outputs automatically."
And the irony finding:
"The SKILL.md is 503 lines. The skill's own rule says '<500 lines.' This is a do-as-I-say-not-as-I-do violation."
With full folder access, SC ran SA's own validators against SA:
validate_skill.py:

```
[size] SKILL.md is 505 lines (max 500). ERROR.
```

check_self_contained.py:

```
Phantom reference: references/server-components-deep-dive.md (does not exist)
```
SA's SKILL.md violates its own 500-line rule. SA's own checker caught it.
Then SC went into the reference files. skill-lifecycle.md — lifecycle state machine rendered as ASCII box-drawing. Anti-pattern #10 in SA's own catalog says "use Mermaid." SC replaced it with a proper stateDiagram-v2. skill-composition.md — five ASCII art dependency diagrams, all converted to Mermaid flowcharts.
SC also found a real script bug in validate_mermaid.py:
```python
# Line 565 — both branches produce identical empty strings
icon = " " if issue.severity == "error" else " "
```
Error and warning were visually identical in output. SC fixed it.
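The fix only has to make the two branches distinguishable. A plausible reconstruction (the actual icon strings aren't in the transcript excerpt, so the markers below are ours):

```python
# Hypothetical shape of the fix: the two severities must render differently.
# "[ERROR]"/"[WARN]" are our stand-ins, not the strings SC actually used.
def severity_icon(severity: str) -> str:
    return "[ERROR]" if severity == "error" else "[WARN]"
```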
SC's biggest move: Output Contracts — a new section defining what each operation produces (create → SKILL.md file; audit → structured report with dimension scores; improve → complete rewritten SKILL.md with diff; debug → diagnosed root cause and concrete fix). This transforms the skill from instruction-only to assertion-ready.
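An output contract is just data: operation in, required artifact out. Our sketch of the idea, with the structure assumed rather than copied from SC's new section:

```python
# Our sketch of SC's "Output Contracts" idea: each operation declares what
# it must produce, so an eval harness can assert on the output. The dict
# structure is our assumption; the artifact list follows the article.
OUTPUT_CONTRACTS = {
    "create":  {"artifact": "SKILL.md file"},
    "audit":   {"artifact": "structured report", "must_include": ["dimension scores"]},
    "improve": {"artifact": "rewritten SKILL.md", "must_include": ["diff"]},
    "debug":   {"artifact": "diagnosis", "must_include": ["root cause", "concrete fix"]},
}

def assertable(operation: str) -> bool:
    """An operation is eval-ready only if it declares a concrete artifact."""
    contract = OUTPUT_CONTRACTS.get(operation)
    return bool(contract and contract.get("artifact"))
```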
What changed in SA's folder:
| Change | Type |
|---|---|
| SKILL.md: 505 → 476 lines | Compressed |
| Fixed false NOT clause (MCP excluded, but SA teaches MCP via 3 ref files) | Consistency |
validate_mermaid.py: error icon fixed |
Bug fix |
| Phantom reference removed from antipatterns.md | Phantom |
ASCII art → Mermaid in skill-lifecycle.md |
Diagram conversion |
ASCII art → Mermaid x5 in skill-composition.md |
Diagram conversion |
troubleshooting.md |
New file |
SC's honest comparison:
"skill-architect is more comprehensive than skill-creator in raw content — more shibboleths, more reference files, working scripts. skill-creator's advantage is tighter discipline around eval methodology and assertion-based quality measurement. If these two skills were composed, the combined quality would exceed either alone."
Self-Evaluations: Mirrors Turned Inward
We also ran SA on itself and SC on itself with full folder access.
SA on SA: SA ran check_self_contained.py against itself. The script returned 7 "phantom" references. 5 were false positives — the checker was matching reference patterns inside illustrative prose. SA's quality gate was fundamentally broken: it was flagging its own documentation examples as missing files. SA fixed check_self_contained.py with an ILLUSTRATIVE_MARKERS regex, then added activation-debugging.md — a gap it found by noticing the skill listed activation debugging as a use case but shipped no content for it.
Grade: 7.3 → 8.8/10 (B → B+)
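The false-positive fix reduces to one exemption in the checker. A reconstruction under assumptions: the real ILLUSTRATIVE_MARKERS regex isn't shown in the transcripts, so the marker list here is our guess:

```python
# Our reconstruction of SA's check_self_contained.py fix: skip references
# that appear in clearly illustrative prose. Marker list is assumed.
import re

ILLUSTRATIVE_MARKERS = re.compile(
    r"(for example|e\.g\.|such as|false positive|illustrative|hypothetical)",
    re.IGNORECASE,
)
REF_PATTERN = re.compile(r"(?:references|scripts)/[\w.-]+\.(?:md|py)")

def find_phantoms(markdown: str, existing_files: set[str]) -> list[str]:
    phantoms = []
    for line in markdown.splitlines():
        if ILLUSTRATIVE_MARKERS.search(line):
            continue  # documentation example, not a real dependency
        for ref in REF_PATTERN.findall(line):
            if ref not in existing_files:
                phantoms.append(ref)
    return phantoms
```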
SC on SC: SC found a functional path bug. The SKILL.md tells users to save outputs to eval-<ID>/with_skill/outputs/. But aggregate_benchmark.py expects grading.json at eval-<ID>/with_skill/run-*/grading.json. The aggregator skips directories with no run-* subdirs. Result: benchmark.json would always be empty. A silent bug that a user would only discover after running a complete eval cycle. SC fixed it, rewrote the description to follow its own "pushy principle," and added eval-patterns.md.
Score: 8.2/10
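The aggregator bug is a one-level glob mismatch. A minimal illustration (the paths follow the article; the function is our sketch, not aggregate_benchmark.py):

```python
# Sketch of the path mismatch (our reconstruction). SKILL.md told users to
# write to with_skill/outputs/, but the aggregator globs one level deeper,
# under with_skill/run-*/. With no run-* directories, nothing is found.
from pathlib import Path

def find_grading_files(eval_dir: str) -> list[Path]:
    return sorted(Path(eval_dir).glob("with_skill/run-*/grading.json"))
```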
What the Tools Revealed
Each evaluator's tools revealed a different dimension of quality.
SA's tools are architectural probes. They find what's missing: no diagram, no NOT clause, no anti-pattern section. They don't fire on existing content.
SC's tools are consistency probes. They find what's wrong: script bugs, path inconsistencies, self-contradictions. They don't add new architectural layers.
Both revealed something text couldn't. SA's own check_self_contained.py was generating false positives from its own illustrative prose — the skill was failing the very quality gate it ships. SC's aggregate_benchmark.py would always produce empty output because of a missing directory level. Neither failure was visible from reading SKILL.md alone. Both were invisible until you ran the code.
A skill you can't run is less self-contained than it appears.
The Iteration Paths
The Braid: Layer 1 Crossings
Now that we have four artifacts (see the algebra diagram above), there are three paths for iteration:
Path A: Fixed-Base -- SA(SC₁) = SC₂, SA(SC₂) = SC₃... Same evaluator grinding. Measures convergence.
Path B: Self-Reflective -- SA(SA) = SA₁, then SA₁(SC). Does self-improvement make you a better evaluator?
Path C: Cross-Spiral -- Each generation uses the OTHER's latest version. This is the braid. Does the diagram commute? SC(SA(SC)) vs SA(SC(SA))? Almost certainly not. The difference reveals what each evaluator cannot see about itself.
Round 3: The Cross-Spiral
In Round 2, each skill evaluated the other from scratch. In Round 3, we let each skill improve itself first — then sent its improved version to evaluate the other's improved version. SA₁ (SA after self-evaluating its full folder) evaluated SC₁ (SC after self-evaluating its full folder). SC₁ evaluated SA₁. The question: what does a cross-evaluator find after both sides have already cleaned their own houses?
SA₁ Evaluates SC₁
Score: 8.2 → 8.8/10 (A−)
SA₁ ran its validation scripts against SC₁ and found three things SC₁'s self-evaluation missed:
526 lines. SC₁ had grown past 500 lines during self-improvement (iter-2 added a workspace diagram and a grader prompt template without extracting anything). SC's scripts don't enforce the 500-line rule — SA's do. SC₁ couldn't catch itself violating a rule it doesn't measure.
No Mermaid diagram for the core loop. SC₁'s central concept — the create/test/grade/improve eval loop — is described in prose. SA₁'s cross-evaluator spotted this immediately; it's one of SA's seven evaluation dimensions (Visual Artifacts). SC₁ evaluated itself against its own rubric, which doesn't include that dimension. The diagram that was missing was the one SC was most accustomed to not having.
No NOT clause. SC₁'s description still lacked the exclusion clause SA's rubric requires. Again: SC₁'s rubric doesn't mandate NOT clauses. It couldn't flag their absence in itself.
SA₁'s summary of the finding:
"self-evaluators don't notice what they're used to reading."
SA₁'s convergence assessment: the diff from iter-2 to iter-3 is "meaningfully smaller" than iter-1 to iter-2. Convergence confidence: high.
SC₁ Evaluates SA₁
Score: 8.8 → 8.97/10 (A−)
SC₁ ran SA₁'s validation scripts cold and immediately found something the self-evaluation had missed — because the self-evaluation couldn't have seen it.
SA₁'s EVALUATION.md described the phantom reference fixes it made. To explain which paths had been false positives, it cited them inline:
- "`scripts/analyze.py` (false positive — illustrative)"
- "`references/X.md` (false positive — illustrative placeholder)"
- "referenced `references/api-guide.md`"
- "cited `scripts/validate.py`"
SC₁ ran check_self_contained.py on the finished output and it failed — on exactly those four lines. The evaluation document describing the phantom-detection fix was itself triggering phantom detection. SA₁ had declared the checker now passed. The checker was failing on SA₁'s own words about passing.
SC₁'s diagnosis:
"The evaluation was written after the fix, so the author wasn't running the checker against the completed EVALUATION.md. This is a workflow gap: self-evaluation only validates the skill body, not the evaluation artifact itself."
SC₁ also found HTML entities in four reference files — `&gt;`, `&lt;`, `&amp;` rendering as literal text in the agent-loaded markdown. SA₁'s validate_skill.py only checked SKILL.md. SC₁ extended the validator to scan all .md files recursively, catching the entities the tool had always missed.
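The recursive scan is a few lines. Our sketch of the extended check (the entity list and function name are ours, not SC₁'s):

```python
# Our sketch of SC₁'s extension: scan every .md file under a skill folder
# for literal HTML entities that would render as text, not markup.
import re
from pathlib import Path

ENTITY = re.compile(r"&(?:gt|lt|amp|quot|#\d+);")

def find_entity_leaks(skill_dir: str) -> list[tuple[str, int]]:
    leaks = []
    for md in sorted(Path(skill_dir).rglob("*.md")):
        for lineno, line in enumerate(md.read_text(encoding="utf-8").splitlines(), 1):
            if ENTITY.search(line):
                leaks.append((md.name, lineno))
    return leaks
```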
SC₁'s convergence assessment:
"Diminishing returns visible. Each iteration yields smaller gains."
What the Cross-Spiral Reveals
Two kinds of evaluator blindness emerged clearly in Round 3:
You can't see what you don't measure. SA₁ found SC₁'s missing visual artifact and line-count violation because SA's rubric measures those. SC₁ wouldn't have flagged either — they're not in SC's quality framework. Each cross-evaluator finds a different shadow: the shadow of its own values, projected onto the target.
You can't see what you just wrote. SA₁'s EVALUATION.md described the fix in the very language that would trigger the failure. This isn't carelessness — it's structural. You write the evaluation after fixing the files, you reference the paths you fixed, and you don't think to run the checker again on the document you're in the middle of writing. A cross-evaluator reads the finished artifact cold. The self-evaluator never gets that perspective.
Both rounds converged to the same composite score: A−. But they got there by finding different things.
What We Learned
1. Tools inherit their creator's values
Skill-architect was built by someone who values knowledge architecture: layered references, progressive disclosure, shibboleth encoding, Mermaid diagrams. Skill-creator was built by someone who values measurement infrastructure: scripts, benchmarks, automated grading, iteration loops.
When each evaluates the other, they find what's missing from their own perspective. The architect gave skill-creator 3/10 on visual artifacts because diagrams are sacred to the architect. The creator scored skill-architect 5/10 on output quality because assertability is sacred to the creator — if you can't write a test for it, it doesn't exist.
Neither is wrong. They're applying different value systems.
2. Both skills violate their own rules
The architect's SKILL.md is 503 lines. Its own rule says <500.
The creator teaches "pushy descriptions" and "always include a NOT clause." Its own description is neutral with no exclusions.
The creator's quick_validate.py checks for allowed-tools in frontmatter. The creator's own frontmatter doesn't have it.
This isn't a gotcha. This is the fundamental problem of meta-tools: the cobbler's children go barefoot.
3. NOT clauses are contextual, not universal
With 15 skills, NOT clauses are hygiene. With 191 skills, they're architecture. The right answer depends on how many skills are competing for activation in your namespace.
4. The scorecard had a surprise
The architect gave skill-creator 7/10 on shibboleths — its highest score, and higher than skill-architect's own self-evaluation on that dimension (4/10). This is real: skill-creator encodes subtle expertise about when quantitative evals help vs. don't, the overfitting risk in description optimization loops, and the importance of reading transcripts instead of just metrics. The architect recognized expertise it couldn't represent in its own format.
Meanwhile, both evaluators independently found the same description problem: internal vocabulary that users don't type. "Expert-level progressive disclosure" became "my skill doesn't trigger." "Measure skill performance" became "run skill evals." The semantic matching engine doesn't care about your internal vocabulary. Both skills failed this test on themselves.
5. The best meta-skill would be both
An encyclopedia with a factory floor. The architect's knowledge depth (13 reference files, anti-pattern catalogs, shibboleth templates, 23-type Mermaid guide) combined with the creator's measurement infrastructure (9 scripts, 3 evaluation agents, HTML viewer, benchmark aggregation). Neither covers the full space alone.
Both evaluators concluded this independently. SA said SC's tooling is "arguably better" on self-containment. SC said if the two skills "were composed, the combined quality would exceed either alone." Two different philosophies. Same answer.
6. Tools are more honest than text
In Layer 1, each skill described what it valued. In Layer 2, each skill demonstrated what it valued by using its tools.
SA's tools are architectural probes. Running them finds what's structurally absent. SC's tools are consistency probes. Running them finds what's mechanically broken.
Both revealed something text wouldn't. SA's own check_self_contained.py was generating false positives from its own illustrative prose — the skill was failing the very quality gate it ships. SC's aggregate_benchmark.py would always produce empty output because of a missing directory level in the path structure. Neither failure was visible from reading SKILL.md alone. Both were invisible until you ran the code.
The implication: a skill you can't run is less self-contained than it appears.
7. The evaluation artifact is also subject to evaluation
In Round 3, SA₁'s EVALUATION.md — the document that said "check_self_contained.py now passes" — caused check_self_contained.py to fail. The self-evaluator writes the evaluation after fixing the files, references the paths it just fixed, and doesn't think to run the checker against the document it's in the middle of writing.
SC₁, reading the finished output cold, ran the checker and found the failure immediately.
This is a general principle: the act of documenting a fix can reintroduce the problem being fixed. A self-evaluator can't get outside its own output to notice this. A cross-evaluator can.
It also suggests that evaluation infrastructure needs its own testing. SC₁'s fix — adding <!-- phantom-ok --> annotations and extending ILLUSTRATIVE_MARKERS with evaluation-document prose patterns — was itself a form of meta-evaluation: auditing the evaluator's assumptions about its own output.
What Comes Next
Three rounds of cross-evaluation are done. Both evaluators are now at A− and agree they're near convergence. The interesting remaining questions aren't about improving these skills further — they're structural questions about the evaluation process itself.
Does the braid commute? The cross-spiral ran SA₁(SC₁) and SC₁(SA₁) — two directions of the same crossing. Both found different things and both landed at the same grade. But if you computed SA(SC(SA)) vs SC(SA(SC)) from the originals, would they converge to the same point? The two evaluators apply different rubrics and find different failures. The limit might depend on which direction you travel.
What's the fixed point of the composition SA ∘ SC? We have SA(SC₁) = SC₂ and SC(SA₁) = SA₂. But what about (SA ∘ SC)(SA) — applying both evaluators in sequence to the same target? Round 2 and Round 3 applied them independently. What would the composed skill look like if you ran both evaluators together, letting each build on the other's findings?
Can you compose the skills? Both evaluators independently concluded that the ideal meta-skill would have SA's knowledge depth and SC's measurement infrastructure. That's a hypothesis, not a skill. Building it would require merging 13 reference files with 9 evaluation scripts, unifying two different rubrics, and resolving the NOT-clause philosophy difference. That's a design problem, not an evaluation problem.
The experiment continues. Transcripts and diffs are in the eval-data directory. The eval-viewer (shipped as part of skill-creator's tooling) can render the benchmark results as a standalone HTML review page if you want to explore the grading data yourself.
Anthropic's skill-creator is included under Apache 2.0 with full attribution (see PROVENANCE.md).