Minimal runner and eval framework for Claude Skills.
I'm using this to play with multiple versions of the same skill, and to refine them based on evaluation.
This is alpha software written for personal use.
If you burn through all your money or brick your computer, it's your own fault.
Create a folder for experimentation inside your existing project:
npx skillforge create myfolder
This will generate a stub with a skill to test and an example evaluation.
Run benchmarks:
cd myfolder
npx skillforge
You can add more versions of the same skill, or add more skills.
By default, every variant in the matrix runs 10 times. Repeated invocations show cached results unless you pass --reset or delete myfolder/results from disk.
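To make the caching behaviour concrete, here is a minimal Python sketch of the idea, not skillforge's actual code. It assumes the matrix is every variant crossed with a run number, and that a cached cell is simply skipped. The helper `pending_cells`, the variant names, and the shape of the cache are all hypothetical.

```python
from itertools import product

DEFAULT_RUNS = 10  # mirrors the documented default for --runs


def pending_cells(variants, runs=DEFAULT_RUNS, cached=frozenset()):
    """Return the (variant, run_number) cells that are not cached yet.

    Hypothetical helper: the names and the cache shape are assumptions,
    not skillforge internals.
    """
    matrix = product(variants, range(1, runs + 1))
    return [cell for cell in matrix if cell not in cached]


variants = ["my-skill/v1", "my-skill/v2"]  # made-up variant names

# First invocation: nothing is cached, so the full matrix is pending.
first = pending_cells(variants)
print(len(first))  # 2 variants x 10 runs = 20

# Repeated invocation: every cell is cached, so nothing re-runs.
second = pending_cells(variants, cached=frozenset(first))
print(len(second))  # 0

# An empty cache (the effect assumed here for --reset or deleting
# the results folder) makes the whole matrix pending again.
print(len(pending_cells(variants, cached=frozenset())))  # 20
```

Under these assumptions, adding a third variant would add only 10 new pending cells while the 20 cached ones stay untouched.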
I recommend starting a Claude session and asking it to look through myfolder/results. You can then use insights from that conversation to refine your eval.md and SKILL.md.
Options:
--runs <n>     Number of runs per variant (default: 10)
--reset        Archive old results and start fresh
--reeval       Re-run evaluation on existing workspaces
--cache-only   Show cached results only