phoenix-evals

arize-ai/phoenix · updated Apr 8, 2026

$ npx skills add https://github.com/arize-ai/phoenix --skill phoenix-evals
Summary

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

skill.md

Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

Quick Reference

| Task | Files |
| --- | --- |
| Setup | setup-python, setup-typescript |
| Decide what to evaluate | evaluators-overview |
| Choose a judge model | fundamentals-model-selection |
| Use pre-built evaluators | evaluators-pre-built |
| Build code evaluator | evaluators-code-python, evaluators-code-typescript |
| Build LLM evaluator | evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates |
| Batch evaluate DataFrame | evaluate-dataframe-python |
| Run experiment | experiments-running-python, experiments-running-typescript |
| Create dataset | experiments-datasets-python, experiments-datasets-typescript |
| Generate synthetic data | experiments-synthetic-python, experiments-synthetic-typescript |
| Validate evaluator accuracy | validation, validation-evaluators-python, validation-evaluators-typescript |
| Sample traces for review | observe-sampling-python, observe-sampling-typescript |
| Analyze errors | error-analysis, error-analysis-multi-turn, axial-coding |
| RAG evals | evaluators-rag |
| Avoid common mistakes | common-mistakes-python, fundamentals-anti-patterns |
| Production | production-overview, production-guardrails, production-continuous |
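
The "Build code evaluator" task above can be illustrated with a minimal sketch in plain Python. The function name and return shape here are illustrative assumptions, not a specific phoenix-evals API; the point is the pattern: a deterministic check that emits a binary pass/fail score with an explanation.

```python
import json


def json_validity_evaluator(output: str) -> dict:
    """Deterministic code evaluator: does the model output parse as JSON?

    Returns a binary pass/fail result plus an explanation, following the
    "code first" and "binary > Likert" principles. The return shape is a
    hypothetical example, not a phoenix-evals contract.
    """
    try:
        json.loads(output)
        return {"score": 1, "label": "pass", "explanation": "valid JSON"}
    except json.JSONDecodeError as exc:
        return {"score": 0, "label": "fail", "explanation": f"invalid JSON: {exc}"}


print(json_validity_evaluator('{"a": 1}')["label"])  # pass
print(json_validity_evaluator("not json")["label"])  # fail
```

Because the check is deterministic, it costs nothing to run at scale and never needs validation against an LLM judge.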

Workflows

Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview

Building Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)

Production: production-overview → production-guardrails → production-continuous
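
For the RAG workflow, the "evaluators-code-* (retrieval)" step can be sketched as a plain-Python recall@k metric. The function name and inputs are hypothetical, chosen only to show why retrieval quality is a code evaluator rather than an LLM one: given known-relevant document IDs, the score is fully deterministic.

```python
def retrieval_recall_at_k(retrieved_ids: list[str],
                          relevant_ids: set[str],
                          k: int = 5) -> float:
    """Fraction of known-relevant documents that appear in the top-k
    retrieved results. Deterministic, so no LLM judge is needed for
    the retrieval half of a RAG pipeline."""
    if not relevant_ids:
        return 1.0  # nothing to find counts as full recall
    top_k = set(retrieved_ids[:k])
    hits = len(top_k & relevant_ids)
    return hits / len(relevant_ids)


# One relevant doc in the top 3, one missed entirely -> recall 0.5
print(retrieval_recall_at_k(["d1", "d2", "d3"], {"d1", "d4"}, k=3))
```

The "evaluators-llm-* (faithfulness)" step that follows is the complement: whether the generated answer is grounded in the retrieved text is a nuance judgment, which is where an LLM judge earns its cost.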

Reference Categories

| Prefix | Description |
| --- | --- |
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |
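
The production-* category (CI/CD, monitoring) typically reduces to a gate: run the evaluator suite over a dataset and fail the pipeline if the pass rate regresses. A minimal sketch, with the threshold and result shape as illustrative assumptions rather than anything phoenix-evals prescribes:

```python
def ci_gate(binary_scores: list[int], min_pass_rate: float = 0.9) -> tuple[bool, float]:
    """Continuous-eval gate for CI: given per-example binary scores
    (1 = pass, 0 = fail) from an evaluator run, report whether the
    overall pass rate clears the threshold. Threshold is a stand-in;
    pick it from your own baseline runs."""
    if not binary_scores:
        raise ValueError("no evaluation results to gate on")
    pass_rate = sum(binary_scores) / len(binary_scores)
    return pass_rate >= min_pass_rate, pass_rate


ok, rate = ci_gate([1, 1, 1, 0])   # 75% pass rate, below the 90% bar
print(ok, rate)                    # False 0.75
```

In a real pipeline the boolean would drive an exit code (e.g. `sys.exit(0 if ok else 1)`) so the build fails visibly when quality drops.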

Key Principles

| Principle | Action |
| --- | --- |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
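
The "Validate judges" principle can be made concrete with a small agreement check: compare the LLM judge's binary labels against human labels and compute the true positive rate and true negative rate. This is plain Python, not a phoenix-evals API, and the 0.8 threshold simply restates the >80% figure from the table above.

```python
def judge_agreement(judge_labels: list[int], human_labels: list[int]) -> dict:
    """Validate an LLM judge against human-labeled examples.

    TPR = fraction of human 'pass' examples the judge also passed;
    TNR = fraction of human 'fail' examples the judge also failed.
    A judge below 80% on either is not trustworthy enough to automate.
    """
    pairs = list(zip(judge_labels, human_labels))
    tp = sum(1 for j, h in pairs if j and h)
    tn = sum(1 for j, h in pairs if not j and not h)
    pos = sum(human_labels)
    neg = len(human_labels) - pos
    tpr = tp / pos if pos else float("nan")
    tnr = tn / neg if neg else float("nan")
    return {"tpr": tpr, "tnr": tnr, "valid": tpr >= 0.8 and tnr >= 0.8}


# Judge agrees on all positives but flips one negative:
result = judge_agreement([1, 1, 0, 0, 1], [1, 1, 0, 0, 0])
print(result)  # tpr 1.0, tnr ~0.67 -> not yet valid
```

Reporting TPR and TNR separately matters: a judge that passes everything scores a perfect TPR while being useless, which a single accuracy number would hide.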

Discussion

  • No comments yet.

Ratings

4.7 · 43 reviews
  • Advait Lopez · Dec 28, 2024

    Registry listing for phoenix-evals matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Noah Ghosh · Dec 16, 2024

    Keeps context tight: phoenix-evals is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Valentina Abebe · Dec 12, 2024

    phoenix-evals has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Pratham Ware · Dec 8, 2024

    Solid pick for teams standardizing on skills: phoenix-evals is focused, and the summary matches what you get after install.

  • Sakshi Patil · Nov 27, 2024

    We added phoenix-evals from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Soo Patel · Nov 11, 2024

    Useful defaults in phoenix-evals — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.

  • Kiara Jain · Nov 3, 2024

    phoenix-evals fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • Valentina Khan · Oct 26, 2024

    phoenix-evals reduced setup friction for our internal harness; good balance of opinion and flexibility.
