Tag: eval · 7 indexed skills

Skills (7)

agentic-eval

github/awesome-copilot · Productivity

2

Iterative evaluation and refinement patterns for improving AI agent outputs through self-critique loops.

- Provides three core patterns: basic reflection (self-critique loops), evaluator-optimizer (separated generation and evaluation), and code-specific test-driven refinement
- Supports multiple evaluation strategies, including outcome-based assessment, LLM-as-judge comparison, and rubric-based scoring with weighted dimensions
- Includes practical Python implementations with structured JSON
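
A minimal sketch of the evaluator-optimizer pattern described above, assuming a placeholder `llm()` call and hypothetical function names rather than the skill's actual implementation:

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for a model call; swap in your provider's client."""
    raise NotImplementedError

def evaluator_optimizer(task: str, max_rounds: int = 3, threshold: int = 8) -> str:
    """Separate generation from evaluation: draft, score against a rubric, revise."""
    draft = llm(f"Complete the task:\n{task}")
    for _ in range(max_rounds):
        # Evaluator returns structured JSON so the loop can branch on the score.
        review = json.loads(llm(
            "Score this draft from 1 to 10 against the task and list concrete fixes. "
            'Reply only with JSON: {"score": <int>, "fixes": [<str>, ...]}.\n'
            f"Task: {task}\nDraft: {draft}"
        ))
        if review["score"] >= threshold:
            break
        # Optimizer revises the draft using only the evaluator's feedback.
        draft = llm(f"Revise the draft to address these fixes: {review['fixes']}\nDraft: {draft}")
    return draft
```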

eval

alirezarezvani/claude-skills · Productivity

0

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
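
A rough sketch of what the hybrid mode could look like, blending a metric command's exit status with an LLM-judge score over each result's diff; the directory layout, judge stub, and weighting are assumptions, not the skill's interface:

```python
import subprocess

def metric_score(result_dir: str, command: str) -> float:
    """Metric-based evaluation: run a command (e.g. a test suite) and score on exit status."""
    proc = subprocess.run(command, shell=True, cwd=result_dir, capture_output=True)
    return 1.0 if proc.returncode == 0 else 0.0

def judge_score(diff: str) -> float:
    """LLM-judge evaluation: ask a model to rate the diff; stubbed here."""
    raise NotImplementedError

def rank_results(results: dict[str, str], command: str, weight: float = 0.5) -> list[tuple[str, float]]:
    """Hybrid ranking: blend metric and judge scores for each agent, highest first."""
    scored = {}
    for agent, result_dir in results.items():
        diff = subprocess.run(["git", "diff"], cwd=result_dir, capture_output=True, text=True).stdout
        scored[agent] = weight * metric_score(result_dir, command) + (1 - weight) * judge_score(diff)
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```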

agent-eval

affaan-m/everything-claude-code · Productivity

0

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.

eval-driven-dev

github/awesome-copilot · Productivity

0

You're building an automated QA pipeline that tests a Python application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via pixie test.
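
A generic sketch of that shape, not pixie's actual API: run the application as a real user would via a subprocess, then apply evaluator functions to the captured output and reduce to pass/fail. The entry point, case fields, and evaluator are all illustrative assumptions.

```python
import subprocess

def run_app(input_text: str) -> str:
    """Run the application end-to-end with a real input (entry point is assumed)."""
    proc = subprocess.run(["python", "app.py"], input=input_text, capture_output=True, text=True)
    return proc.stdout

def contains_confirmation(output: str) -> bool:
    """Example evaluator: a deterministic check against the captured output."""
    return "order confirmed" in output.lower()

def evaluate(cases: list[dict]) -> bool:
    """Score each case with its evaluators and reduce everything to one pass/fail result."""
    passed = True
    for case in cases:
        output = run_app(case["input"])
        for evaluator in case["evaluators"]:
            ok = evaluator(output)
            print(f"{case['name']}: {evaluator.__name__} -> {'PASS' if ok else 'FAIL'}")
            passed = passed and ok
    return passed
```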

adk-eval-guide

google/adk-docs · Frontend

0

Comprehensive evaluation methodology guide for ADK agents covering metrics, schemas, and iteration workflows.

- Provides eight evaluation criteria (tool trajectory, response matching, rubric-based scoring, hallucination detection, safety) with configurable thresholds and judge model options
- Includes evalset schema documentation with multi-turn conversation support, tool use trajectory specification, and session state initialization patterns
- Outlines the eval-fix loop: start small, run
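
A minimal illustration of tool-trajectory matching with a configurable threshold; the field names and scoring rule here are illustrative, not the ADK evalset schema or its actual metric:

```python
def trajectory_score(expected: list[dict], actual: list[dict]) -> float:
    """Fraction of expected tool calls matched in order by name and arguments."""
    matches = sum(
        1 for exp, act in zip(expected, actual)
        if exp["tool"] == act["tool"] and exp.get("args") == act.get("args")
    )
    return matches / max(len(expected), 1)

# Eval-fix loop in miniature: run one case, compare against the threshold, fix, re-run.
expected = [{"tool": "search_flights", "args": {"dest": "SFO"}},
            {"tool": "book_flight", "args": {"id": "UA42"}}]
actual = [{"tool": "search_flights", "args": {"dest": "SFO"}}]

threshold = 1.0  # configurable per criterion
score = trajectory_score(expected, actual)
print(f"tool_trajectory: {score:.2f} ({'pass' if score >= threshold else 'fail'})")
```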

eval-audit

hamelsmu/evals-skills · Productivity

0

Inspect an LLM eval pipeline and produce a prioritized list of problems with concrete next steps.

eval-harness

affaan-m/everything-claude-code · Productivity

0

Formal evaluation framework for Claude Code sessions implementing eval-driven development principles.

- Defines capability and regression evals with pass/fail criteria before implementation, treating evals as unit tests for AI-assisted workflows
- Supports three grader types: code-based (deterministic checks via bash/grep), model-based (Claude-as-judge), and human review for manual adjudication
- Tracks reliability with pass@k metrics (success within k attempts) and pass^k (all k trials succeed)
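
A quick sketch of the two reliability metrics named above, as plain reductions over k recorded trial outcomes rather than the skill's own harness:

```python
def pass_at_k(trials: list[bool]) -> bool:
    """pass@k: the task counts as solved if any of the k attempts succeeded."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """pass^k: the task counts as reliable only if all k attempts succeeded."""
    return all(trials)

runs = [True, False, True]   # outcomes of k = 3 independent trials
print(pass_at_k(runs))       # True:  succeeded within k attempts
print(pass_hat_k(runs))      # False: not all k trials succeeded
```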