/hub:eval — Evaluate Agent Results
Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
Usage
/hub:eval                    # Eval latest session using configured criteria
/hub:eval 20260317-143022    # Eval specific session
/hub:eval --judge            # Force LLM judge mode (ignore metric config)
What It Does
Metric Mode (eval command configured)
Run the evaluation command in each agent's worktree:
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
Output:
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1

Winner: agent-2 (142ms)
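As a rough illustration of what metric-mode ranking does under the hood, here is a minimal Python sketch. It assumes a hypothetical .agenthub/worktrees/agent-{i} worktree layout and an eval command that prints a single number to stdout; the helper names are illustrative, not the actual result_ranker.py code.

```python
# Minimal sketch of metric-mode ranking. Worktree paths, output format,
# and function names are assumptions for illustration only.
import subprocess
from pathlib import Path

def run_metric(worktree: Path, eval_cmd: str) -> float:
    """Run the eval command inside one agent's worktree and parse its stdout."""
    out = subprocess.run(
        eval_cmd, shell=True, cwd=worktree,
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def rank_agents(worktrees: list[Path], eval_cmd: str, direction: str = "min"):
    """Score every worktree, then sort best-first: ascending for 'min'
    metrics (e.g. latency), descending for 'max' (e.g. accuracy)."""
    scores = {wt.name: run_metric(wt, eval_cmd) for wt in worktrees}
    return sorted(scores.items(), key=lambda kv: kv[1],
                  reverse=(direction == "max"))

# Example (hypothetical paths and benchmark command):
# ranking = rank_agents(
#     [Path(f".agenthub/worktrees/agent-{i}") for i in (1, 2, 3)],
#     "python bench.py --quiet", direction="min",
# )
```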
LLM Judge Mode (no eval command, or --judge flag)
For each agent:
- Get the diff: git diff {base_branch}...{agent_branch}
- Read the agent's result post from .agenthub/board/results/agent-{i}-result.md
- Compare all diffs and rank by:
  - Correctness: does it solve the task?
  - Simplicity: fewer lines changed is better (when correctness is equal)
  - Quality: clean execution, good structure, no regressions

Present rankings with justification.
Example LLM judge output for a content task:
RANK  AGENT    VERDICT                           WORD COUNT
1     agent-1  Strong narrative, clear CTA       1480
2     agent-3  Good data points, weak intro      1520
3     agent-2  Generic tone, no differentiation  1350

Winner: agent-1 (strongest narrative arc and call-to-action)
Hybrid Mode
- Run metric evaluation first
- If the top agents' scores are within 10% of each other, use the LLM judge to break the tie (see the sketch after this list)
- Present both metric and qualitative rankings
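The 10% tie-break rule is simple to state in code. The sketch below is illustrative: the needs_judge name and the relative-gap formula are assumptions, not the skill's actual implementation.

```python
# Sketch: escalate to the LLM judge when the top two metric scores are
# within 10% of each other. Threshold handling is an assumption.
def needs_judge(ranked: list[tuple[str, float]], threshold: float = 0.10) -> bool:
    """ranked is [(agent, score), ...] best-first, e.g. from rank_agents above."""
    if len(ranked) < 2:
        return False
    best, runner_up = ranked[0][1], ranked[1][1]
    # Relative gap between the top two scores, guarding against a zero best.
    gap = abs(runner_up - best) / max(abs(best), 1e-9)
    return gap <= threshold

# Example: 142ms vs 150ms gives a gap of about 5.6%, so the judge is invoked.
# needs_judge([("agent-2", 142.0), ("agent-1", 150.0)])  # -> True
```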
After Eval
- Update session state:
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
- Tell the user:
  - Ranked results with the winner highlighted
  - Next step: /hub:merge to merge the winner, or /hub:merge {session-id} --agent {winner} to be explicit