evaluation
9 indexed skills
evaluation
sickn33/antigravity-awesome-skills · Productivity
Build evaluation frameworks for agent systems
llm-evaluation
sickn33/antigravity-awesome-skills · AI/ML
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
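As a sketch of the A/B-testing side this skill mentions: a minimal two-proportion z-test comparing win rates of two prompt variants, using only the standard library. The counts in the usage line are made-up illustration data, not from any real evaluation.

```python
import math

def two_proportion_ztest(wins_a, n_a, wins_b, n_b):
    """Two-sided z-test for a difference in win rates between two variants."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    p_pool = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal-approximation p-value via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Variant A wins 140/200 judged comparisons, variant B wins 110/200
z, p = two_proportion_ztest(140, 200, 110, 200)
```

With these illustrative counts the difference is significant at conventional thresholds; with equal win rates the p-value approaches 1.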
promptfoo-evaluation
daymade/claude-code-skills · Productivity
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
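For orientation, a minimal `promptfooconfig.yaml` in the shape Promptfoo's documentation describes (prompts, providers, and test cases with assertions); the provider IDs and prompt text here are illustrative placeholders, not part of this skill:

```yaml
# promptfooconfig.yaml — compare two prompt variants across two providers
prompts:
  - "Summarize in one sentence: {{text}}"
  - "TL;DR: {{text}}"

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest

tests:
  - vars:
      text: "Promptfoo runs each prompt/provider pair against every test case."
    assert:
      - type: icontains
        value: promptfoo
```

Running `npx promptfoo@latest eval` then evaluates every prompt/provider pair against each test case and reports the pass/fail matrix.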
hugging-face-evaluation
huggingface/skills · Productivity
This skill provides tools to add structured evaluation results to Hugging Face model cards, supporting multiple methods for adding evaluation data.
customaize-agent:agent-evaluation
neolabhq/context-engineering-kit · AI/ML
Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.
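The non-determinism described above is usually handled by evaluating over multiple runs and reporting a distribution rather than a single pass/fail. A minimal sketch, where `run_agent` and the `passed` predicate are hypothetical stand-ins for your own agent and success check:

```python
import statistics
from collections import Counter

def evaluate_agent(run_agent, task, n_runs=10, passed=lambda out: bool(out)):
    """Run a non-deterministic agent several times on one task and report
    the distribution of outcomes instead of a single pass/fail verdict."""
    outcomes = [passed(run_agent(task)) for _ in range(n_runs)]
    rate = sum(outcomes) / n_runs
    return {
        "pass_rate": rate,
        # Spread across runs signals flakiness even when the mean looks fine
        "std_dev": statistics.pstdev([int(o) for o in outcomes]),
        "distribution": Counter(outcomes),
    }
```

Tracking `pass_rate` and its spread per task over time is what makes regression-catching possible for agents whose individual runs disagree.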
agent-evaluation
davila7/claude-code-templates · Productivity
Behavioral testing and reliability metrics for LLM agents, catching production failures benchmarks miss.

- Covers five core evaluation areas: agent testing, benchmark design, capability assessment, reliability metrics, and regression testing
- Emphasizes statistical test evaluation (multiple runs, result distribution analysis) and behavioral contract testing over single-run or string-matching approaches
- Includes adversarial testing patterns to actively probe agent failure modes and ident…
agent-evaluation
sickn33/antigravity-awesome-skills · Productivity
Framework for testing LLM agents across behavioral, capability, and reliability dimensions with production-focused evaluation patterns.

- Covers five core evaluation areas: agent testing, benchmark design, capability assessment, reliability metrics, and regression testing
- Emphasizes statistical test evaluation (multiple runs with distribution analysis) and behavioral contract testing over single-run or string-matching approaches
- Includes adversarial testing patterns and guards against…
agent-evaluation
supercent-io/skills-template · Productivity
Comprehensive evaluation framework for designing, building, and monitoring AI agent performance across coding, conversational, research, and computer-use agents.

- Covers three grader types (code-based, model-based, human) with trade-offs and best practices for each agent category
- Provides an 8-step roadmap from initial task creation through production monitoring, including environment isolation, outcome-focused grading, and saturation detection
- Includes benchmarks for major agent type…
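Of the three grader types listed, the code-based grader is the most mechanical. A minimal sketch of outcome-focused grading for a coding agent, assuming the agent's answer and the unit tests both arrive as Python source strings (the isolation here is only a subprocess, not a full sandbox):

```python
import os
import subprocess
import sys
import tempfile

def code_based_grader(solution: str, tests: str, timeout: int = 10) -> bool:
    """Code-based grader: run the agent's solution plus a unit-test snippet
    in a separate process and grade purely on exit status, not on style."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    finally:
        os.remove(path)
```

Grading on the exit status of the tests, rather than on the solution's text, is what "outcome-focused grading" means in practice: two very different solutions that both pass score identically.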
llm-evaluation
wshobson/agents · AI/ML
Systematic evaluation of LLM applications using automated metrics, human feedback, and statistical testing. \n \n Covers three evaluation approaches: automated metrics (BLEU, ROUGE, BERTScore, accuracy, precision/recall), human evaluation across dimensions like accuracy and coherence, and LLM-as-Judge for pointwise, pairwise, and reference-based scoring \n Includes implementations for text generation, classification, and retrieval (RAG) evaluation with ready-to-use metric functions and custom me