validate-evaluator

hamelsmu/evals-skills · updated Apr 8, 2026

$ npx skills add https://github.com/hamelsmu/evals-skills --skill validate-evaluator
skill.md

Validate Evaluator

Calibrate an LLM judge against human judgment.

Overview

  1. Split human-labeled data into train (10-20%), dev (40-45%), test (40-45%)
  2. Run judge on dev set and measure TPR/TNR
  3. Iterate on the judge until TPR and TNR > 90% on dev set
  4. Run once on held-out test set for final TPR/TNR
  5. Apply bias correction formula to production data

Prerequisites

  • A built LLM judge prompt (from write-judge-prompt)
  • Human-labeled data: ~100 traces with binary Pass/Fail labels per failure mode
    • Aim for ~50 Pass and ~50 Fail (balanced, even if real distribution is skewed)
    • Labels must come from a domain expert, not outsourced annotators
  • Candidate few-shot examples from your labeled data

Core Instructions

Step 1: Create Data Splits

Split human-labeled data into three disjoint sets:

| Split | Size | Purpose | Rules |
|---|---|---|---|
| Training | 10-20% (~10-20 examples) | Source of few-shot examples for the judge prompt | Only clear-cut Pass and Fail cases. Used directly in the prompt. |
| Dev | 40-45% (~40-45 examples) | Iterative evaluator refinement | Never include in the prompt. Evaluate against repeatedly. |
| Test | 40-45% (~40-45 examples) | Final unbiased accuracy measurement | Do NOT look at during development. Used once at the end. |

Target: 30-50 examples of each class (Pass and Fail) across dev and test combined. Use balanced splits even if real-world prevalence is skewed — you need enough Fail examples to measure TNR reliably.

from sklearn.model_selection import train_test_split

# First split: separate test set
train_dev, test = train_test_split(
    labeled_data, test_size=0.4, stratify=labeled_data['label'], random_state=42
)
# Second split: separate training examples from dev set
train, dev = train_test_split(
    train_dev, test_size=0.75, stratify=train_dev['label'], random_state=42
)
# Result: ~15% train, ~45% dev, ~40% test
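A quick sanity check on the resulting splits can catch leakage before any judge runs. The sketch below is one way to do it in plain Python. It assumes each trace is a dict with a unique `id` and a `label` of 'Pass' or 'Fail'; both field names and the helper name `check_splits` are hypothetical, not part of the skill.

```python
from collections import Counter

def check_splits(train, dev, test):
    """Verify the three splits are disjoint and report class counts.

    Each split is a list of dicts with hypothetical keys 'id' and 'label'.
    """
    named = (("train", train), ("dev", dev), ("test", test))
    ids = {name: {row["id"] for row in rows} for name, rows in named}
    # Any shared id means a trace leaked across splits
    overlaps = {
        "train/dev": ids["train"] & ids["dev"],
        "train/test": ids["train"] & ids["test"],
        "dev/test": ids["dev"] & ids["test"],
    }
    leaked = {pair: shared for pair, shared in overlaps.items() if shared}
    if leaked:
        raise ValueError(f"Splits overlap: {leaked}")
    # Per-split Pass/Fail counts, to confirm each split contains both classes
    return {name: Counter(row["label"] for row in rows) for name, rows in named}

# Toy data for illustration: 100 traces with alternating labels
rows = [{"id": i, "label": "Pass" if i % 2 == 0 else "Fail"} for i in range(100)]
counts = check_splits(rows[:16], rows[16:60], rows[60:])
print(counts["dev"])
```

If a trace appears in more than one split, the helper raises rather than returning counts, so leakage cannot pass unnoticed.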

Step 2: Run Evaluator on Dev Set

Run the judge on every example in the dev set. Compare predictions to human labels.
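One way to organise this step is a small loop that stores the judge's verdict next to the human label. The sketch below assumes a callable `judge(text)` that returns 'Pass' or 'Fail' and wraps your LLM call. That callable, the `text` and `label` field names, and the stand-in `keyword_judge` are all hypothetical, used only so the example runs without an API.

```python
def run_judge(judge, examples):
    """Run `judge` on each example and pair its verdict with the human label."""
    results = []
    for ex in examples:
        verdict = judge(ex["text"])
        if verdict not in ("Pass", "Fail"):
            # Surface malformed judge output instead of silently miscounting it
            raise ValueError(f"Unexpected verdict {verdict!r}")
        results.append({"text": ex["text"], "human": ex["label"], "judge": verdict})
    return results

# Stand-in judge for illustration only; a real judge would call an LLM
def keyword_judge(text):
    return "Fail" if "error" in text.lower() else "Pass"

dev = [
    {"text": "Booked the tour as requested.", "label": "Pass"},
    {"text": "Error: no availability found.", "label": "Fail"},
    {"text": "Sent the wrong listing.", "label": "Fail"},
]
dev_results = run_judge(keyword_judge, dev)
agreements = sum(r["human"] == r["judge"] for r in dev_results)
print(f"{agreements}/{len(dev_results)} verdicts match the human label")
```

Keeping the raw verdicts alongside the human labels makes the later TPR/TNR calculation and the disagreement review work from the same records.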

Step 3: Measure TPR and TNR

TPR (True Positive Rate): When a human says Pass, how often does the judge also say Pass?

TPR = (judge says Pass AND human says Pass) / (human says Pass)

TNR (True Negative Rate): When a human says Fail, how often does the judge also say Fail?

TNR = (judge says Fail AND human says Fail) / (human says Fail)

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(human_labels, evaluator_labels,
                                   labels=['Fail', 'Pass']).ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)

Use TPR/TNR, not Precision/Recall or raw accuracy. These two metrics directly map to the bias correction formula. Use Cohen's Kappa only for measuring agreement between two human annotators, not for judge-vs-ground-truth.
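To see why raw accuracy can hide problems, consider a skewed set where most traces pass. The counts below are invented for illustration: a judge that says Pass on nearly everything still scores high accuracy, while TNR shows that it misses most failures.

```python
def rates(tp, fn, tn, fp):
    """Return (accuracy, TPR, TNR) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return accuracy, tpr, tnr

# Hypothetical counts: 90 human-Pass traces, 10 human-Fail traces
accuracy, tpr, tnr = rates(tp=89, fn=1, tn=2, fp=8)
print(f"accuracy={accuracy:.2f}  TPR={tpr:.2f}  TNR={tnr:.2f}")
# accuracy=0.91  TPR=0.99  TNR=0.20
```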

Step 4: Inspect Disagreements

Examine every case where the judge disagrees with human labels:

| Disagreement Type | Judge | Human | Fix |
|---|---|---|---|
| False Pass | Pass | Fail | Judge is too lenient. Strengthen Fail definitions or add edge-case examples. |
| False Fail | Fail | Pass | Judge is too strict. Clarify Pass definitions or adjust examples. |

For each disagreement, determine whether to:

  • Clarify wording in the judge prompt
  • Swap or add few-shot examples from the training set
  • Add explicit rules for the edge case
  • Split the criterion into more specific sub-checks
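The triage above can start from a simple grouping of disagreements by type. A minimal sketch, assuming each result is a dict with an `id` plus `human` and `judge` verdicts (hypothetical field names):

```python
def group_disagreements(results):
    """Split judge/human disagreements into False Pass and False Fail lists."""
    groups = {"False Pass": [], "False Fail": []}
    for r in results:
        if r["judge"] == "Pass" and r["human"] == "Fail":
            groups["False Pass"].append(r)  # judge too lenient
        elif r["judge"] == "Fail" and r["human"] == "Pass":
            groups["False Fail"].append(r)  # judge too strict
    return groups

# Toy results for illustration
results = [
    {"id": "t1", "human": "Pass", "judge": "Pass"},
    {"id": "t2", "human": "Fail", "judge": "Pass"},
    {"id": "t3", "human": "Pass", "judge": "Fail"},
    {"id": "t4", "human": "Fail", "judge": "Fail"},
    {"id": "t5", "human": "Fail", "judge": "Pass"},
]
groups = group_disagreements(results)
for kind, rows in groups.items():
    print(kind, [r["id"] for r in rows])
```

Reviewing each group separately makes it easier to tell whether the prompt needs stricter Fail definitions, clearer Pass definitions, or both.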

Step 5: Iterate

Refine the judge prompt and re-run on the dev set. Repeat until TPR and TNR stabilize.

Stopping criteria:

  • Target: TPR > 90% AND TNR > 90%
  • Minimum acceptable: TPR > 80% AND TNR > 80%
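These thresholds are easy to encode as a check that runs after each dev-set pass. The helper name `alignment_status` and its return strings are hypothetical; the thresholds are the ones listed above.

```python
def alignment_status(tpr, tnr, target=0.90, minimum=0.80):
    """Classify dev-set alignment against the stopping criteria."""
    if tpr > target and tnr > target:
        return "target met"
    if tpr > minimum and tnr > minimum:
        return "minimum acceptable"
    return "keep iterating"

print(alignment_status(0.93, 0.91))  # target met
print(alignment_status(0.93, 0.84))  # minimum acceptable
print(alignment_status(0.93, 0.75))  # keep iterating
```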

If alignment stalls:

| Problem | Solution |
|---|---|
| TPR and TNR both low | Use a more capable LLM for the judge |
| One metric low, one acceptable | Inspect disagreements for the low metric specifically |
| Both plateau below target | Decompose the criterion into smaller, more atomic checks |
| Consistently wrong on certain input types | Add targeted few-shot examples from training set |
| Labels themselves seem inconsistent | Re-examine human labels; the rubric may need refinement |

Step 6: Final Measurement on Test Set

Run the judge exactly once on the held-out test set. Record final TPR and TNR.

Do not iterate after seeing test set results. If the judge falls short, go back to Step 4 with new dev data.

Step 7 (Optional): Estimate True Success Rate (Rogan-Gladen Correction)

Raw judge scores on unlabeled production data are biased. If you need an accurate aggregate pass rate, correct for known judge errors:

theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)

Where:

  • p_obs = fraction of unlabeled traces the judge scored as Pass
  • TPR, TNR = from test set measurement
  • theta_hat = corrected estimate of true success rate

Clip to [0, 1]. Invalid when TPR + TNR - 1 is near 0 (judge is no better than random).

Example:

  • Judge TPR = 0.92, TNR = 0.88
  • 500 production traces: 400 scored Pass -> p_obs = 0.80
  • theta_hat = (0.80 + 0.88 - 1) / (0.92 + 0.88 - 1) = 0.68 / 0.80 = 0.85
  • True success rate is ~85%, not the raw 80%
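The correction, the clipping rule and the near-random guard can be wrapped in one small function. This is a sketch, not the skill's reference implementation: the name `corrected_success_rate` and the `min_denominator=0.05` cutoff are assumptions, since the text only says the formula is invalid when TPR + TNR - 1 is "near 0".

```python
def corrected_success_rate(p_obs, tpr, tnr, min_denominator=0.05):
    """Rogan-Gladen corrected estimate of the true success rate."""
    denom = tpr + tnr - 1
    if abs(denom) < min_denominator:
        # Judge is close to random; the correction is unstable here.
        # The 0.05 cutoff is an assumed threshold, not from the source.
        raise ValueError("TPR + TNR - 1 is too close to 0 for a reliable correction")
    theta_hat = (p_obs + tnr - 1) / denom
    # Clip to the valid probability range [0, 1]
    return min(1.0, max(0.0, theta_hat))

# Reproduces the worked example: 400 of 500 traces scored Pass
theta_hat = corrected_success_rate(p_obs=400 / 500, tpr=0.92, tnr=0.88)
print(f"{theta_hat:.2f}")  # 0.85
```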

Step 8: Confidence Interval

Compute a bootstrap confidence interval. A point estimate alone is not enough.

import numpy as np

def bootstrap_ci(human_labels, eval_labels, p_obs, n_bootstrap=2000):
    """Bootstrap 95% CI for corrected success rate."""
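    # Convert once, outside the loop, so each resample is a cheap index lookup
    h_all = np.asarray(human_labels)
    e_all = np.asarray(eval_labels)
    n = len(h_all)
    estimates = []
    for _ in range(n_bootstrap):
        # Resample test-set indices with replacement
        idx = np.random.choice(n, size=n, replace=True)
        h = h_all[idx]
        e = e_all[idx]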

        tp = ((h == 'Pass') & (e == 'Pass')).sum()
        fn = ((h == 'Pass') & (e == 'Fail')).sum()
        tn = ((h == 'Fail') & (e == 'Fail')).sum()
        fp = ((h == 'Fail') & (e == 'Pass')).sum()

        tpr_b = tp / (tp + fn) if (tp + fn) > 0 else 0
        tnr_b = tn / (tn + fp) if (tn + fp) > 0 else 0
        denom = tpr_b + tnr_b - 1

        if abs(denom) < 1e-6:
            continue
        theta = (p_obs + tnr_b - 1) / denom
        estimates.append(np.clip(theta, 0, 1))

    return np.percentile(estimates, 2.5), np.percentile(estimates, 97.5)

lower, upper = bootstrap_ci(test_human, test_eval, p_obs=0.80)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")

Or use judgy (pip install judgy):

from judgy import estimate_success_rate

result = estimate_success_rate(
    human_labels=test_human_labels,
    evaluator_labels=test_eval_labels,
    unlabeled_labels=prod_eval_labels
)
print(f"Corrected rate: {result.estimate:.2f}")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")

Practical Guidance

  • Pin exact model versions for LLM judges (e.g., gpt-4o-2024-05-13, not gpt-4o). Providers update models without notice, causing silent drift.
  • Re-validate after changing the judge prompt, switching models, or when production confidence intervals widen unexpectedly.
  • Use ~100 labeled examples (50 Pass, 50 Fail). Below 60, confidence intervals become wide.
  • One trusted domain expert is the most efficient labeling path. If not feasible, have two annotators label 20-50 traces independently and resolve disagreements before proceeding.
  • Improving TPR usually narrows the confidence interval more than improving TNR when the true pass rate is high. The correction formula divides by TPR + TNR - 1, so a judge whose TPR and TNR sum to little more than 1 amplifies estimation errors into wide CIs.
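How strongly the corrected estimate reacts to an error in each rate can be read off by differentiating the Step 7 formula. This is a short derivation offered as a reasoning aid, not part of the original skill; D is shorthand introduced here for the denominator.

```latex
\hat{\theta} = \frac{p_{\text{obs}} + \mathrm{TNR} - 1}{D},
\qquad D = \mathrm{TPR} + \mathrm{TNR} - 1

\frac{\partial \hat{\theta}}{\partial\,\mathrm{TPR}} = -\frac{\hat{\theta}}{D},
\qquad
\frac{\partial \hat{\theta}}{\partial\,\mathrm{TNR}} = \frac{1 - \hat{\theta}}{D}
```

Both sensitivities grow as D shrinks. When the corrected pass rate is above 0.5, an error in TPR moves the estimate more than the same error in TNR; below 0.5 the reverse holds. For the Step 7 example (theta_hat = 0.85, D = 0.80), the magnitudes are about 1.06 for TPR and 0.19 for TNR.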

Anti-Patterns

  • Assuming judges "just work" without validation. A judge may consistently miss failures or flag passing traces.
  • Using raw accuracy or percent agreement. Use TPR and TNR. With class imbalance, raw accuracy is misleading.
  • Using dev or test examples as few-shot examples. This is data leakage: the judge is then scored on cases it has already seen in its prompt.
  • Reporting dev set performance as final accuracy. Dev numbers are optimistic. The test set gives the unbiased estimate.
  • Raw judge scores without bias correction. If you report an aggregate pass rate, apply the Rogan-Gladen formula (Step 7).
  • Point estimates without confidence intervals. A corrected rate of 85% could easily be 78-92% with small test sets. Report the range so stakeholders know how much to trust the number.

How to use validate-evaluator on Cursor

AI-first code editor with Composer

1. Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your development machine
  • Node.js version 16.0+ with npm package manager (verify with node --version)
  • Active project directory or workspace where you want to add validate-evaluator

2. Execute installation command

Execute the skills CLI command in your project's root directory to begin installation:

$ npx skills add https://github.com/hamelsmu/evals-skills --skill validate-evaluator

The skills CLI fetches validate-evaluator from GitHub repository hamelsmu/evals-skills and configures it for Cursor.

3. Select Cursor when prompted

The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ── always included ────
│ • Amp
│ • Antigravity
│ • Cline
│ • Codex
│ ● Cursor (selected)
│ • Windsurf

4. Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/validate-evaluator

Reload or restart Cursor to activate validate-evaluator. Access the skill through slash commands (e.g., /validate-evaluator) or your agent's skill management interface.

Security & Verification Notice

We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.

Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.

general reviews

Ratings

4.6 · 69 reviews
  • Kiara Desai· Dec 28, 2024

    I recommend validate-evaluator for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Kiara Farah· Dec 16, 2024

    Registry listing for validate-evaluator matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Alexander Kapoor· Dec 8, 2024

    Keeps context tight: validate-evaluator is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Kaira Wang· Dec 8, 2024

    validate-evaluator has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Chaitanya Patil· Dec 4, 2024

    I recommend validate-evaluator for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Alexander Okafor· Dec 4, 2024

    Useful defaults in validate-evaluator — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.

  • Amelia Anderson· Dec 4, 2024

    validate-evaluator fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • Piyush G· Nov 23, 2024

    Useful defaults in validate-evaluator — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.

  • Alexander Ndlovu· Nov 23, 2024

    I recommend validate-evaluator for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Amelia Smith· Nov 23, 2024

    We added validate-evaluator from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.
