skillforge

Minimal runner and eval framework for Claude Skills.

I'm using this to play with multiple versions of the same skill, and to refine them based on evaluation.

⚠️ WARNING ⚠️

This is alpha software written for personal use.

It runs Claude in yolo mode, which can and will wipe your data one day.
It can also burn through a ton of tokens, depending on what your skills do.
Also, this is pretty much entirely vibecoded, and I have not read the code.

If you burn all your money to brick your computer, it's your own fault.

Usage

Create a folder for experimentation inside your existing project:

npx skillforge create myfolder

This will generate a stub with a skill to test and an example evaluation.

Run benchmarks:

cd myfolder
npx skillforge

You can add more versions of the same skill, or add more skills.

By default, the entire matrix runs 10 times. Repeated runs show cached results unless you pass --reset or delete myfolder/results from disk.

I recommend starting a Claude session and asking it to check out the /results. You can then use insights from that conversation to refine your eval.md and SKILL.md.

Options

--runs <n>      Number of runs per variant (default: 10)
--reset         Archive old results and start fresh
--reeval        Re-run evaluation on existing workspaces
--cache-only    Show cached results only
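
A few ways these flags might be used day to day. This is only a sketch: each command uses a single flag exactly as listed above, and the comments about when to reach for each one are my guesses from the option descriptions, not verified behavior. Remember that every non-cached run invokes Claude and costs tokens.

```shell
# Run from inside the experiment folder created by `npx skillforge create myfolder`.
cd myfolder

# Quick pass: 3 runs per variant instead of the default 10.
npx skillforge --runs 3

# Re-run evaluation on the existing workspaces
# (assumption: handy after editing only eval.md).
npx skillforge --reeval

# Show cached results only.
npx skillforge --cache-only

# Archive old results and start fresh.
npx skillforge --reset
```

Whether these flags can be combined in one invocation is not documented here, so the sketch keeps them separate.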

0.1.0

@gaearon
