evaluation
9 indexed skills
evaluation
sickn33/antigravity-awesome-skills · Productivity
Build evaluation frameworks for agent systems
llm-evaluation
sickn33/antigravity-awesome-skills · AI/ML
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
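As a sketch of the A/B-testing side this skill mentions: a minimal two-proportion z-test comparing win rates of two prompt variants, using only the standard library. The counts in the usage line are made-up illustration data, not from any real evaluation.

```python
import math

def two_proportion_ztest(wins_a, n_a, wins_b, n_b):
    """Two-sided z-test for a difference in win rates between two variants."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    p_pool = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal-approximation p-value via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Variant A wins 140/200 judged comparisons, variant B wins 110/200
z, p = two_proportion_ztest(140, 200, 110, 200)
```

With these illustrative counts the difference is significant at conventional thresholds; with equal win rates the p-value approaches 1.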
promptfoo-evaluation
daymade/claude-code-skills · Productivity
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
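For orientation, a minimal `promptfooconfig.yaml` in the shape Promptfoo's documentation describes (prompts, providers, and test cases with assertions); the provider IDs and prompt text here are illustrative placeholders, not part of this skill:

```yaml
# promptfooconfig.yaml — compare two prompt variants across two providers
prompts:
  - "Summarize in one sentence: {{text}}"
  - "TL;DR: {{text}}"

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest

tests:
  - vars:
      text: "Promptfoo runs each prompt/provider pair against every test case."
    assert:
      - type: icontains
        value: promptfoo
```

Running `npx promptfoo@latest eval` then evaluates every prompt/provider pair against each test case and reports the pass/fail matrix.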
hugging-face-evaluation
huggingface/skills · Productivity
This skill provides tools to add structured evaluation results to Hugging Face model cards, supporting multiple methods for adding evaluation data.
customaize-agent:agent-evaluation
neolabhq/context-engineering-kit · AI/ML
Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.
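The non-determinism described above is usually handled by evaluating over multiple runs and reporting a distribution rather than a single pass/fail. A minimal sketch, where `run_agent` and the `passed` predicate are hypothetical stand-ins for your own agent and success check:

```python
import statistics
from collections import Counter

def evaluate_agent(run_agent, task, n_runs=10, passed=lambda out: bool(out)):
    """Run a non-deterministic agent several times on one task and report
    the distribution of outcomes instead of a single pass/fail verdict."""
    outcomes = [passed(run_agent(task)) for _ in range(n_runs)]
    rate = sum(outcomes) / n_runs
    return {
        "pass_rate": rate,
        # Spread across runs signals flakiness even when the mean looks fine
        "std_dev": statistics.pstdev([int(o) for o in outcomes]),
        "distribution": Counter(outcomes),
    }
```

Tracking `pass_rate` and its spread per task over time is what makes regression-catching possible for agents whose individual runs disagree.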
agent-evaluation
davila7/claude-code-templates · Productivity
Behavioral testing and reliability metrics for LLM agents, catching production failures benchmarks miss.

- Covers five core evaluation areas: agent testing, benchmark design, capability assessment, reliability metrics, and regression testing
- Emphasizes statistical test evaluation (multiple runs, result distribution analysis) and behavioral contract testing over single-run or string-matching approaches
- Includes adversarial testing patterns to actively probe agent failure modes and ident…
agent-evaluation
sickn33/antigravity-awesome-skills · Productivity
Framework for testing LLM agents across behavioral, capability, and reliability dimensions with production-focused evaluation patterns.

- Covers five core evaluation areas: agent testing, benchmark design, capability assessment, reliability metrics, and regression testing
- Emphasizes statistical test evaluation (multiple runs with distribution analysis) and behavioral contract testing over single-run or string-matching approaches
- Includes adversarial testing patterns and guards against…
agent-evaluation
supercent-io/skills-template · Productivity
Comprehensive evaluation framework for designing, building, and monitoring AI agent performance across coding, conversational, research, and computer-use agents.

- Covers three grader types (code-based, model-based, human) with trade-offs and best practices for each agent category
- Provides an 8-step roadmap from initial task creation through production monitoring, including environment isolation, outcome-focused grading, and saturation detection
- Includes benchmarks for major agent type…
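Of the three grader types listed, the code-based grader is the most mechanical. A minimal sketch of outcome-focused grading for a coding agent, assuming the agent's answer and the unit tests both arrive as Python source strings (the isolation here is only a subprocess, not a full sandbox):

```python
import os
import subprocess
import sys
import tempfile

def code_based_grader(solution: str, tests: str, timeout: int = 10) -> bool:
    """Code-based grader: run the agent's solution plus a unit-test snippet
    in a separate process and grade purely on exit status, not on style."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    finally:
        os.remove(path)
```

Grading on the exit status of the tests, rather than on the solution's text, is what "outcome-focused grading" means in practice: two very different solutions that both pass score identically.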
llm-evaluation
wshobson/agents · AI/ML
Systematic evaluation of LLM applications using automated metrics, human feedback, and statistical testing. \n \n Covers three evaluation approaches: automated metrics (BLEU, ROUGE, BERTScore, accuracy, precision/recall), human evaluation across dimensions like accuracy and coherence, and LLM-as-Judge for pointwise, pairwise, and reference-based scoring \n Includes implementations for text generation, classification, and retrieval (RAG) evaluation with ready-to-use metric functions and custom me