WinDAGs Blog

Technical deep dives into multi-agent orchestration, skill engineering, and the craft of building systems that learn.

npx -y @workgroup-ai/mcp-server. That's the install.

Graft Skills Into Any Agent: The WinDAGs MCP

One install. 503+ skills. Your agent goes from generic to specialist on demand. The WinDAGs MCP server exposes skill_search, skill_graft, and skill_reference to Claude Code, Claude Desktop, Cursor, Codex, and any MCP client. Zero API keys for the read-only tools.
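
If you want to poke at the server outside an agent first, here's a minimal sketch using the official MCP Python SDK. The tool names come straight from this post; the `query` argument shape is an assumption, so list the tools and read their declared schemas before trusting it.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the WinDAGs MCP server over stdio, exactly as the one-line install runs it.
    params = StdioServerParameters(command="npx", args=["-y", "@workgroup-ai/mcp-server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # expect skill_search, skill_graft, skill_reference
            # "query" is an assumed argument name; check the declared input schema above.
            result = await session.call_tool("skill_search", {"query": "stripe webhook idempotency"})
            print(result.content)

asyncio.run(main())
```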

It's Saturday. You're four hours into a side project. You ask your AI to "set up a Stripe webhook handler with idempotency and a retry policy that doesn't blow up under bursts." It writes you something that looks fine. It misses the signature verification, picks a backoff strategy that'll silently double-charge, and uses an API shape that was renamed in 2024.

mcp · skills · claude · cursor · agents
Read full post
Skill Quality · Part 3

503 skills. Five feedback signals. One side-effect registry. Skills that get sharper as you use them.

Skills That Improve Themselves: Attribution, Lifecycle, and Fleet Agents

Skills aren't static documents. Every graft, every accept/reject, every downstream rating feeds an attribution system that decides which skills to sharpen, deprecate, or split. Here's the SQLite-backed scoring, the side-effect registry, and the fleet of always-on agents that keep 503 skills coherent.
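
To make "SQLite-backed scoring" concrete before you read the full post, here's an illustrative sketch. The signal names, weights, and lifecycle cutoffs below are invented stand-ins for the five signals the post actually uses.

```python
import sqlite3

# Illustrative only: these signal names, weights, and cutoffs are stand-ins,
# not the values the post derives from real usage data.
SIGNAL_WEIGHTS = {"graft": 0.1, "accept": 1.0, "reject": -1.0, "rating": 0.5, "rollback": -2.0}

conn = sqlite3.connect("skills.db")
conn.execute("""CREATE TABLE IF NOT EXISTS feedback (
    skill_id TEXT, signal TEXT, value REAL, ts TEXT DEFAULT CURRENT_TIMESTAMP)""")

def score(skill_id: str) -> float:
    """Weighted sum of every feedback event recorded for one skill."""
    rows = conn.execute(
        "SELECT signal, SUM(value) FROM feedback WHERE skill_id = ? GROUP BY signal",
        (skill_id,),
    ).fetchall()
    return sum(SIGNAL_WEIGHTS.get(signal, 0.0) * total for signal, total in rows)

def lifecycle_action(skill_id: str) -> str:
    """Map a score to the outcomes the post describes: sharpen, deprecate, or split."""
    s = score(skill_id)
    if s < -5.0:   # hypothetical cutoff
        return "deprecate"
    if s > 20.0:   # heavily used and well-rated: worth sharpening or splitting
        return "sharpen-or-split"
    return "keep"
```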

If you've ever rolled out an AI coding tool to your team, you know the half-life problem. The first month, everyone's quoting it in standups. Three months later, someone notices it's confidently giving 2023 React advice on a 2026 codebase, and the trust starts evaporating. Six months later it's the thing nobody admits to using.

skills · attribution · lifecycle · fleet-agents · feedback
Read full post
Skill Quality · Part 2

Five stages. One pipeline. Every caller goes through it.

The Skill Matching Cascade: How WinDAGs Picks the Right Expert

We replaced keyword matching with a five-stage retrieval cascade — BM25, Tool2Vec, RRF fusion, cross-encoder rerank, attribution k-NN. Here's why each stage exists, what it costs, and how to tell which one is doing the work.
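
The fusion stage is the easiest one to show in isolation. Here's a minimal Reciprocal Rank Fusion sketch; k=60 is the conventional default from the original RRF paper, and the post's actual constants and candidate sources may differ.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge ranked lists (say, BM25 and Tool2Vec output)
    without ever comparing their incompatible raw scores."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["stripe-webhooks", "graphql-pagination", "retry-policies"]
tool2vec = ["retry-policies", "stripe-webhooks", "idempotency-keys"]
print(rrf_fuse([bm25, tool2vec])[:3])
```

RRF earns its place in the cascade precisely because BM25 scores and embedding similarities live on incompatible scales, while ranks are always comparable.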

You ask your AI to "design a paginated GraphQL endpoint that won't fall over at scale." It reaches for the textbook answer — LIMIT and OFFSET. Which works fine, right up until your dataset hits a million rows and every deep-page request scans and discards everything before the rows it returns.

skills · retrieval · bm25 · tool2vec · embeddings
Read full post
Skill Quality · Part 2

$0.14 per book. 88% compression. Zero decision-critical information lost.

How to Distill a Book Into a Skill

We turned 77 books and research papers into AI agent skills — at $0.14 per book. Here's the 3-pass pipeline, the 88% compression ratio, the five ways to screw it up, and what you get when it works.
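
As a shape reference only, here's a hypothetical skeleton of a 3-pass loop. The prompts, the chapter splitter, and the `llm` callable are all stand-ins; the post documents the real pipeline.

```python
from typing import Callable

def chunk_by_chapter(text: str) -> list[str]:
    # Naive splitter; real books need front matter handling and a smarter one.
    return [c for c in text.split("\nChapter ") if c.strip()]

def distill_book(text: str, llm: Callable[[str], str]) -> str:
    # Pass 1: extract decision-critical material chapter by chapter.
    extracted = [llm(f"Extract the decisions, gotchas, and failure modes:\n{ch}")
                 for ch in chunk_by_chapter(text)]
    # Pass 2: compress the extractions into one skill document.
    skill = llm("Merge into a single skill; keep procedure, drop exposition:\n\n"
                + "\n\n".join(extracted))
    # Pass 3: verify against the source, which is what backs a "zero
    # decision-critical information lost" claim.
    gaps = llm(f"List decision-critical claims in SOURCE missing from SKILL, or 'none'.\n"
               f"SOURCE:\n{text}\n\nSKILL:\n{skill}")
    return skill if gaps.strip().lower() == "none" else f"{skill}\n\n<!-- review: {gaps} -->"
```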

Every team has the senior engineer who read the book. The one who can tell you, in twenty seconds, why you should not use that pattern, where the gotchas are, and which page the load-bearing argument starts on. They're an institutional asset. They're also a single point of failure — when they leave, the team relearns everything the hard way.

skills · knowledge-engineering · distillation · books · methodology
Read full post
Skill Quality · Part 1

80% concepts. 20% procedure. We measured it.

Why Declarative Knowledge Isn't Enough: The Procedural Gap in AI Agent Skills

We audited 469 AI agent skills and found 80% of the content is declarative knowledge — concepts, definitions, terminology. Only 20% is procedural — the decision trees, failure modes, and quality gates that let agents actually execute. Here's the cognitive science, the data, and the tools to fix it.
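
If you want a rough read on your own library before diving in, here's a crude cue-word heuristic. It is emphatically not the post's audit method, just a cheap approximation of the declarative/procedural split.

```python
import re

# Heuristic only: procedural content tends to carry imperatives and conditionals;
# declarative content tends to define and describe.
PROCEDURAL_CUES = re.compile(r"\b(if|when|then|first|next|avoid|check|verify|instead)\b", re.I)
DECLARATIVE_CUES = re.compile(r"\b(is|are|means|refers to|defined as|consists of)\b", re.I)

def procedural_ratio(skill_text: str) -> float:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", skill_text) if s.strip()]
    procedural = sum(1 for s in sentences
                     if PROCEDURAL_CUES.search(s) and not DECLARATIVE_CUES.search(s))
    return procedural / max(len(sentences), 1)
```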

You roll out an AI tooling guide to the team. It's well-organized, well-written, has examples. Three weeks later, you watch a junior dev follow it step-by-step and produce something that compiles, passes the tests you wrote, and is quietly wrong in a way only a senior would catch on review. The guide explained what things are. It did not say how to decide what to do when the obvious answer is the wrong one.

skills · cognitive-science · knowledge-engineering · quality · evaluation
Read full post

Type /next-move. Get a parallelized execution plan. Accept, modify, or reject.

/next-move: Your AI Already Knows What's Next

We built a slash command that reads your project's git state, conversation context, and skill catalog — then predicts the highest-impact sequence of agents to run. Zero API cost. One command.
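
The zero-API-cost claim follows from where the signals live: they're all local. Here's a sketch of gathering the git half of that state; the conversation-context and skill-catalog readers are left out, and the exact fields /next-move reads are described in the post.

```python
import subprocess

def git(*args: str) -> str:
    """Run a git subcommand in the current repo and return its stdout."""
    return subprocess.run(["git", *args], capture_output=True, text=True).stdout

# Everything below is a local read: no network, no tokens, no API bill.
state = {
    "current_branch": git("rev-parse", "--abbrev-ref", "HEAD").strip(),
    "dirty_files": git("status", "--porcelain").splitlines(),
    "recent_commits": git("log", "--oneline", "-10").splitlines(),
}
print(state)
```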

It's Tuesday at 9:42 AM. You open the laptop. The repo is right there. You stare at it.

skills · meta-dag · planning · next-move · decision-making
Read full post
Skill Compression · Part 2

Less is more — especially when Claude already knows.

WinDAGSZip — Compressing Skills with Embeddings

A 23MB embedding model finds 25% of tokens are self-duplicating in code-heavy skills — for free, no API calls. An LLM judge finds another 20-40% overlaps with training data. We built the tool, measured everything across 10 skills, and shipped it.
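
The self-duplication check needs nothing but a local embedding model. Here's a sketch with sentence-transformers, where the model name is a stand-in (the post's 23MB model isn't named here) and the 0.9 threshold is a guess.

```python
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the post's small model

def near_duplicate_pairs(skill_text: str, threshold: float = 0.9):
    """Flag paragraph pairs whose embeddings nearly coincide: dedup candidates,
    found entirely offline with zero API calls."""
    paras = [p for p in skill_text.split("\n\n") if p.strip()]
    emb = model.encode(paras, normalize_embeddings=True)
    sims = util.cos_sim(emb, emb)
    return [(paras[i][:60], paras[j][:60], float(sims[i][j]))
            for i, j in combinations(range(len(paras)), 2)
            if sims[i][j] >= threshold]
```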

It's late. You've been on a long agent session. Your agent took twelve seconds to think before answering an easy question and you watched the spinner. The thing is loaded with skill prompts — 200-line expert documents, dozens of them, every one of them eating context before the model can even read the question. And you start to wonder how many of those tokens are the skill teaching the model something it doesn't know, and how many are the skill telling the model things it had already seen a thousand times during pretraining.

skills · compression · embeddings · evaluation · rate-distortion
Read full post

What happens when two skill-improvers improve each other?

When Meta-Skills Collide

We built a cross-evaluation agent, pointed Anthropic's skill-creator and our skill-architect at each other, and recorded what happened. Real transcripts. Real diffs. Real analysis of what two different philosophies of skill-building value.

TL;DR: We ran two competing AI skill-evaluation tools against each other — Anthropic's open-source skill-creator and our skill-architect. Both violated their own rules. Neither caught what the other caught. Self-evaluation has a structural blind spot that cross-evaluation fills. The experiment converged to A- from both directions.

skills · meta · anthropic · evaluation · recursion
Read full post
Skill Compression · Part 1

163 skills were missing the same section.

The 191-Skill Quality Pass

We graded every Claude Code skill in our library against a 10-axis rubric. 163 were missing the same section. We fixed all of them — with hand-crafted content, not templates. Here's the rubric, the data, and the one section every skill should have.
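
The audit loop itself is the boring part, which is rather the point. Here's a sketch of the missing-section check, where the heading is a hypothetical stand-in for the section the post actually names.

```python
from pathlib import Path

REQUIRED_SECTION = "## Common Mistakes"  # hypothetical stand-in heading

def skills_missing_section(skill_dir: str) -> list[str]:
    """Return every SKILL.md under skill_dir that lacks the required section."""
    return [str(p) for p in Path(skill_dir).glob("**/SKILL.md")
            if REQUIRED_SECTION not in p.read_text(encoding="utf-8")]

print(len(skills_missing_section("skills/")), "skills missing the section")
```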

We have 191 Claude Code skills. They cover everything from Drizzle migrations to Jungian psychology, from drone inspection to wedding photography, from pixel art to HIPAA compliance.

skills · quality · evaluation · meta · automation
Read full post