# agent-skills-eval
A test runner for Agent Skills.
Write a SKILL.md, drop in some evals, and find out — empirically — whether your skill actually makes the model better at the task.
Documentation · Quickstart · SDK · agentskills.io
## Why this exists
Agent Skills — the open standard from Anthropic for giving agents domain knowledge — make it easy to ship a SKILL.md and assume your agent is now better at the task. The hard part is proving it.
agent-skills-eval is the missing piece. It runs your skill against the same prompts twice, once with the skill loaded into context (`with_skill`) and once without it (`without_skill`, the baseline), has a judge model grade both outputs, and gives you a side-by-side report. If the skill doesn't make a measurable difference, you'll see it. If it does, you have receipts.
It's the test framework for the Agent Skills ecosystem, separated from any specific agent runtime so it works wherever your skills do.
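To make the mechanism concrete, here is a minimal TypeScript sketch of that A/B-plus-judge loop against a generic OpenAI-compatible endpoint. It illustrates the idea only; it is not the project's actual implementation, and the endpoint URL, model name, and prompts are placeholder assumptions.

```ts
// Sketch of the core A/B mechanism: run the same prompt with and without
// the SKILL.md in context, then ask a judge model to grade both answers.
// Endpoint and model are placeholders for any OpenAI-compatible server.
const API = "https://api.openai.com/v1/chat/completions";

async function chat(system: string, user: string): Promise<string> {
  const res = await fetch(API, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: system },
        { role: "user", content: user },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

async function abEval(skillMd: string, prompt: string) {
  // Same prompt, two contexts: skill loaded vs. baseline.
  const withSkill = await chat(`You are a helpful agent.\n\n${skillMd}`, prompt);
  const withoutSkill = await chat("You are a helpful agent.", prompt);

  // A judge model compares both outputs against the task.
  const verdict = await chat(
    "You are a strict grader. Compare answer A and answer B and say which better satisfies the task, with reasons.",
    `Task: ${prompt}\n\nA (with skill):\n${withSkill}\n\nB (baseline):\n${withoutSkill}`
  );
  return { withSkill, withoutSkill, verdict };
}
```

The real runner layers timing, tool-call assertions, artifact files, and structured pass/fail grading on top of this loop.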
## Quickstart
```bash
npx agent-skills-eval ./skills \
  --target gpt-4o-mini \
  --judge gpt-4o-mini \
  --baseline \
  --strict
```
That's it. Point it at a folder of skills, give it a target model and a judge model, and it produces a workspace with full artifacts and a static HTML report.
```text
agent-skills-workspace/
└── iteration-1/
    ├── meta.json          # run metadata
    ├── benchmark.json     # rolled-up pass/fail per skill
    ├── eval-basic/
    │   ├── with_skill/    # output, timing, judge grading
    │   └── without_skill/ # ↑ same, with the skill stripped
    └── report/
        └── index.html     # the visual report
```
Open `iteration-1/report/index.html` and you have a real, evidence-backed answer to "is my skill working?"
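Because the artifacts are plain JSON, gating a CI job on the rolled-up results is a short script. The shape of `benchmark.json` below (a `skills` array with a `passed` flag per entry) is a hypothetical assumption for illustration; check the actual file your run produces before relying on these field names.

```ts
import { readFileSync } from "node:fs";

// Hypothetical shape of benchmark.json, assumed here for illustration only.
interface SkillResult {
  skill: string;
  passed: boolean;
}

const benchmark = JSON.parse(
  readFileSync("agent-skills-workspace/iteration-1/benchmark.json", "utf8")
) as { skills: SkillResult[] };

const failed = benchmark.skills.filter((s) => !s.passed);
if (failed.length > 0) {
  console.error(`Failing skills: ${failed.map((s) => s.skill).join(", ")}`);
  process.exit(1); // non-zero exit fails the CI job
}
console.log("All skills passed their evals.");
```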
## What you get
| Feature | What it gives you |
|---|---|
| `with_skill` vs `without_skill` | Every eval runs both ways so you can see the actual lift from the skill, or its absence. |
| Judge-graded outputs | Use any chat model as a judge. Pass/fail with cited assertions, not vibes. |
| TypeScript SDK + CLI | One-liner CLI for CI, full SDK for custom pipelines, custom providers, and dashboards. |
| OpenAI-compatible by default | Works out of the box with OpenAI, Together, Groq, Anthropic via OpenAI-compat layers, local Llama servers — anything that speaks the OpenAI chat API. |
| Tool-call assertions | Deterministic checks for agents that call tools, not just generate text. |
| Portable artifacts | JSON + JSONL all the way down. Run today, diff tomorrow. Plug into your own dashboard. |
| Static HTML reports | A drop-in report site you can publish anywhere — no infrastructure. |
| Fully spec-compliant | Implements the full agentskills.io specification: SKILL.md validation, `evals/evals.json` … |