explainx.ai

© 2026 AISOLO Technologies Pvt Ltd

tag: eval

8 indexed skills · max 10 per page

skills (8)

agentic-eval

github/awesome-copilot · Productivity

2

Iterative evaluation and refinement patterns for improving AI agent outputs through self-critique loops.
Provides three core patterns: basic reflection (self-critique loops), evaluator-optimizer (separated generation and evaluation), and code-specific test-driven refinement.
Supports multiple evaluation strategies, including outcome-based assessment, LLM-as-judge comparison, and rubric-based scoring with weighted dimensions.
Includes practical Python implementations with structured JSON
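The evaluator-optimizer pattern with weighted rubric scoring described in this entry can be sketched roughly as follows. This is a minimal illustration, not code from the agentic-eval skill itself: `refine`, `rubric_evaluate`, `stub_generate`, the `WEIGHTS` table, and the two rubric dimensions are all hypothetical names, and the generator and evaluator are deterministic stubs standing in for model calls.

```python
from typing import Callable, Tuple

# Hypothetical rubric: dimension name -> weight (assumed values for illustration).
WEIGHTS = {"has_example": 0.6, "is_concise": 0.4}


def rubric_evaluate(output: str) -> Tuple[float, str]:
    """Score an output on each rubric dimension and return (weighted score, feedback)."""
    scores = {
        "has_example": 1.0 if "for example" in output.lower() else 0.0,
        "is_concise": 1.0 if len(output) <= 120 else 0.0,
    }
    total = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / sum(WEIGHTS.values())
    missing = [d for d, s in scores.items() if s < 1.0]
    feedback = "improve: " + ", ".join(missing) if missing else ""
    return total, feedback


def stub_generate(task: str, feedback: str) -> str:
    """Stand-in for a model call: revises the draft based on evaluator feedback."""
    draft = f"{task} caches results."
    if "has_example" in feedback:
        draft += " For example, repeated lookups return instantly."
    return draft


def refine(
    task: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], Tuple[float, str]],
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Generate, score, feed the critique back, and stop once the score is good enough."""
    feedback = ""
    best_output, best_score = "", float("-inf")
    for _ in range(max_rounds):
        output = generate(task, feedback)
        score, feedback = evaluate(output)
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            break
    return best_output, best_score


best, score = refine("Memoization", stub_generate, rubric_evaluate)
print(best)
print(score)
```

Keeping generation and evaluation as separate callables is what distinguishes this from basic reflection, where one component critiques its own output. In practice either stub could be swapped for an LLM call or a test runner.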

adk-eval-guide

google/adk-docs · Frontend

1

Comprehensive evaluation methodology guide for ADK agents covering metrics, schemas, and iteration workflows.
Provides eight evaluation criteria (including tool trajectory, response matching, rubric-based scoring, hallucination detection, and safety) with configurable thresholds and judge model options.
Includes evalset schema documentation with multi-turn conversation support, tool use trajectory specification, and session state initialization patterns.
Outlines the eval-fix loop: start small, run

digital-health-clinical-asr-eval

nvidia/skills · digital-health

0

Stage 3 of Clinical ASR Flywheel. Score a NeMo manifest, produce the five-section KER leaderboard (by-ipa_source diagnostic). Not for ASR auth (/riva-asr).

eval

alirezarezvani/claude-skills · Productivity

0

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

agent-eval

affaan-m/everything-claude-code · Productivity

0

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.

eval-driven-dev

github/awesome-copilot · Productivity

0

You're building an automated QA pipeline that tests a Python application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via pixie test.

eval-audit

hamelsmu/evals-skills · Productivity

0

Inspect an LLM eval pipeline and produce a prioritized list of problems with concrete next steps.

eval-harness

affaan-m/everything-claude-code · Productivity

0

Formal evaluation framework for Claude Code sessions implementing eval-driven development principles.
Defines capability and regression evals with pass/fail criteria before implementation, treating evals as unit tests for AI-assisted workflows.
Supports three grader types: code-based (deterministic checks via bash/grep), model-based (Claude-as-judge), and human review for manual adjudication.
Tracks reliability with pass@k metrics (success within k attempts) and pass^k (all k trials su
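The two reliability metrics named in this entry can be illustrated with a small sketch. It assumes the plain empirical reading given in the description (pass@k counts a task as passed if any of its first k trials succeeded, and pass^k requires every one of the k trials to succeed). The function names and the sample results below are hypothetical and are not taken from the eval-harness skill.

```python
from typing import Callable, Dict, Sequence


def pass_at_k(trials: Sequence[bool], k: int) -> bool:
    """True if at least one of the first k trials succeeded."""
    if not 1 <= k <= len(trials):
        raise ValueError("k must be between 1 and the number of trials")
    return any(trials[:k])


def pass_all_k(trials: Sequence[bool], k: int) -> bool:
    """True only if every one of the first k trials succeeded (pass^k)."""
    if not 1 <= k <= len(trials):
        raise ValueError("k must be between 1 and the number of trials")
    return all(trials[:k])


def pass_rate(
    results: Dict[str, Sequence[bool]],
    metric: Callable[[Sequence[bool], int], bool],
    k: int,
) -> float:
    """Fraction of tasks that satisfy the given metric at k."""
    return sum(metric(trials, k) for trials in results.values()) / len(results)


# Hypothetical per-task trial outcomes (illustrative data only).
results = {
    "parse-config": [True, True, True],
    "fix-flaky-test": [False, True, False],
    "add-endpoint": [False, False, False],
}

print(f"pass@3: {pass_rate(results, pass_at_k, 3):.2f}")
print(f"pass^3: {pass_rate(results, pass_all_k, 3):.2f}")
```

A gap between the two numbers suggests a task that can succeed but does not do so consistently, which is the kind of flakiness pass^k is meant to expose.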