agent-eval
affaan-m/everything-claude-code · updated Apr 8, 2026
# Agent Eval Skill
A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.
## When to Activate
- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
- Measuring agent performance before adopting a new tool or model
- Running regression checks when an agent updates its model or tooling
- Producing data-backed agent selection decisions for a team
## Installation
Note: Install agent-eval from its repository after reviewing the source.
## Core Concepts
### YAML Task Definitions
Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:
```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to a specific commit for reproducibility
```
### Git Worktree Isolation
Each agent run gets its own git worktree — no Docker required. This keeps runs isolated, so agents cannot interfere with each other or corrupt the base repo, and makes each run reproducible from the pinned commit.
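The isolation step reduces to `git worktree add` against the pinned commit. Here is a self-contained sketch (the `make_worktree` helper is hypothetical, not agent-eval's code) that demonstrates edits in a worktree leaving the base checkout untouched:

```python
import os
import subprocess
import tempfile
from pathlib import Path

def make_worktree(repo: str, commit: str, dest: str) -> None:
    """Create a detached worktree of `repo` at `commit` in `dest`."""
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "--detach", dest, commit],
        check=True, capture_output=True,
    )

# Demo against a throwaway repo so nothing real is touched.
base = tempfile.mkdtemp()
env = {**os.environ,
       "GIT_AUTHOR_NAME": "t", "GIT_AUTHOR_EMAIL": "t@example.com",
       "GIT_COMMITTER_NAME": "t", "GIT_COMMITTER_EMAIL": "t@example.com"}
subprocess.run(["git", "init", "-q", base], check=True, env=env)
Path(base, "a.txt").write_text("v1\n")
subprocess.run(["git", "-C", base, "add", "a.txt"], check=True, env=env)
subprocess.run(["git", "-C", base, "commit", "-qm", "init"], check=True, env=env)

wt = tempfile.mkdtemp() + "/run1"
make_worktree(base, "HEAD", wt)

# The agent edits files only inside its worktree...
Path(wt, "a.txt").write_text("agent-modified\n")
# ...while the base checkout keeps its committed content.
print(Path(base, "a.txt").read_text().strip())
```

Each worktree shares the underlying object store, so per-run setup is cheap compared to a full clone.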
### Metrics Collected
| Metric | What It Measures |
|---|---|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
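The numbers in this table reduce to simple aggregation over per-run records. A sketch (the record fields here are illustrative, not agent-eval's storage format):

```python
def summarize(runs: list[dict]) -> dict:
    """Aggregate per-run records into the metrics reported per agent."""
    passes = sum(1 for r in runs if r["passed"])
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "cost": round(sum(r["cost"] for r in runs), 2),                  # total API spend
        "time": round(sum(r["seconds"] for r in runs) / len(runs), 1),   # mean wall-clock
        "consistency": round(100 * passes / len(runs)),                  # % of runs passing
    }

runs = [
    {"passed": True,  "cost": 0.03, "seconds": 40},
    {"passed": True,  "cost": 0.02, "seconds": 35},
    {"passed": False, "cost": 0.03, "seconds": 41},
]
print(summarize(runs))  # 2/3 pass rate, 67% consistency
```

Note that consistency is just pass rate expressed over repeated identical runs, which is why at least three runs per agent are needed for it to mean anything.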
## Workflow
### 1. Define Tasks
Create a tasks/ directory with YAML files, one per task:
```bash
mkdir tasks
# Write task definitions (see template above)
```
### 2. Run Agents
Execute agents against your tasks:
```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```
Each run:
- Creates a fresh git worktree from the specified commit
- Hands the prompt to the agent
- Runs the judge criteria
- Records pass/fail, cost, and time
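Those four steps amount to a loop like the following. This is a sketch built from stand-in stub functions, not agent-eval's internals; the stub names and signatures are assumptions for illustration:

```python
import time

# Stubs standing in for the real worktree/agent/judge plumbing.
def fresh_worktree(repo: str, commit: str) -> str:
    return f"/tmp/wt-{commit}"          # would call `git worktree add`

def run_agent(agent: str, prompt: str, workdir: str) -> float:
    return 0.05                          # would invoke the agent; returns cost

def run_judges(workdir: str) -> bool:
    return True                          # would evaluate the task's judge list

def evaluate(agent: str, task: dict, runs: int = 3) -> list[dict]:
    records = []
    for _ in range(runs):
        wt = fresh_worktree(task["repo"], task["commit"])   # 1. fresh worktree
        start = time.monotonic()
        cost = run_agent(agent, task["prompt"], wt)         # 2. hand over the prompt
        passed = run_judges(wt)                             # 3. run judge criteria
        records.append({                                    # 4. record the outcome
            "passed": passed,
            "cost": cost,
            "seconds": time.monotonic() - start,
        })
    return records

task = {"repo": ".", "commit": "abc1234", "prompt": "Add retry logic"}
print(len(evaluate("claude-code", task)))  # one record per run
```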
### 3. Compare Results
Generate a comparison report:
```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌─────────────┬───────────┬───────┬──────┬─────────────┐
│ Agent       │ Pass Rate │ Cost  │ Time │ Consistency │
├─────────────┼───────────┼───────┼──────┼─────────────┤
│ claude-code │ 3/3       │ $0.12 │ 45s  │ 100%        │
│ aider       │ 2/3       │ $0.08 │ 38s  │ 67%         │
└─────────────┴───────────┴───────┴──────┴─────────────┘
```
## Judge Types
### Code-Based (deterministic)
```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```
### Pattern-Based
```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```
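A grep judge boils down to a regex search over the files matched by the glob. An illustrative implementation (not agent-eval's code):

```python
import re
import tempfile
from pathlib import Path

def grep_judge(root: str, pattern: str, files: str) -> bool:
    """Pass if any file under `root` matching the glob contains the regex."""
    rx = re.compile(pattern)
    return any(
        rx.search(p.read_text())
        for p in Path(root).glob(files)   # pathlib's ** matches nested dirs too
        if p.is_file()
    )

# Demo against a throwaway tree.
root = tempfile.mkdtemp()
src = Path(root, "src")
src.mkdir()
(src / "client.py").write_text("class HttpRetry:\n    pass\n")
print(grep_judge(root, r"class.*Retry", "src/**/*.py"))  # True
```

Pattern judges are fast but shallow: they confirm the code mentions the right things, not that it works, which is why pairing them with a test-based judge is recommended below.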
### Model-Based (LLM-as-judge)
```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```
## Best Practices
- Start with 3-5 tasks that represent your real workload, not toy examples
- Run at least 3 trials per agent to capture variance — agents are non-deterministic
- Pin the commit in your task YAML so results are reproducible across days/weeks
- Include at least one deterministic judge (tests, build) per task — LLM judges add noise
- Track cost alongside pass rate — a 95% agent at 10x the cost may not be the right choice
- Version your task definitions — they are test fixtures, treat them as code
## Links
- Repository: github.com/joaquinhuigomez/agent-eval