empirical-systems-evaluation - Skill Dossier

Rigorous benchmarking of multi-agent coordination systems: experiment design, statistical analysis, human evaluation protocols, and reproducible reporting. NOT FOR ML model evaluation (use llm-evaluation-harness), A/B testing for web products, survey design, or general data science.



Skills use the open SKILL.md standard — the same file works across all platforms.

Install all 463+ skills as a plugin
claude plugin marketplace add curiositech/windags-skills
claude plugin install windags-skills

Claude activates empirical-systems-evaluation automatically when your task matches its description.

"Use empirical-systems-evaluation to help me build a feature system"
"I need expert help with rigorous benchmarking of multi-agent coordination ..."