phoenix-evals
arize-ai/phoenix · updated Apr 8, 2026
Phoenix Evals
Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
Quick Reference
Workflows
- Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview
- Building an Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript} (see the code-first sketch after this list)
- RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
- Production: production-overview → production-guardrails → production-continuous
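The "code first" step in these workflows means writing deterministic checks before reaching for an LLM judge. A minimal sketch of such an evaluator for the RAG retrieval case is below; it is plain Python, and `EvalResult` and `retrieval_hit` are hypothetical names for illustration, not part of the phoenix-evals API.

```python
# Hypothetical "code first" evaluator: a deterministic retrieval check that runs
# before any LLM judge. EvalResult and retrieval_hit are illustrative names only.
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    passed: bool        # binary pass/fail, not a 1-5 Likert score
    explanation: str


def retrieval_hit(retrieved_ids: list[str], gold_id: str) -> EvalResult:
    """Pass if the gold document appears anywhere in the retrieved set."""
    hit = gold_id in retrieved_ids
    return EvalResult(
        name="retrieval_hit",
        passed=hit,
        explanation=f"gold doc {gold_id} {'found' if hit else 'missing'} "
                    f"among {len(retrieved_ids)} retrieved docs",
    )


# Evaluate a single trace
print(retrieval_hit(["doc_12", "doc_7"], gold_id="doc_7"))
```

Checks like this are cheap enough to run on every trace; only nuanced criteria such as faithfulness need an LLM judge.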
Reference Categories
| Prefix | Description |
|---|---|
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |
Key Principles
| Principle | Action |
|---|---|
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
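To make the last two principles concrete: "validate judges" means comparing the LLM judge's binary labels against a hand-labeled set and requiring both the true-positive rate and the true-negative rate to clear 0.8 before trusting it. A minimal sketch follows; the 0.8 threshold comes from the table above, and the helper name is a hypothetical illustration rather than a phoenix-evals function.

```python
# Hypothetical validation sketch: compare an LLM judge's binary labels against
# human labels and require TPR and TNR >= 0.8 (the rule of thumb above).
def judge_agreement(human: list[bool], judge: list[bool], threshold: float = 0.8) -> dict:
    tp = sum(h and j for h, j in zip(human, judge))          # judge agrees on "pass"
    tn = sum((not h) and (not j) for h, j in zip(human, judge))  # judge agrees on "fail"
    pos = sum(human) or 1                                     # guard against empty classes
    neg = (len(human) - sum(human)) or 1
    tpr, tnr = tp / pos, tn / neg
    return {"tpr": tpr, "tnr": tnr, "trustworthy": tpr >= threshold and tnr >= threshold}


# Example: 10 hand-labeled traces vs. the judge's verdicts
human = [True, True, True, False, False, True, False, True, False, True]
judge = [True, True, False, False, False, True, False, True, True, True]
print(judge_agreement(human, judge))
# {'tpr': 0.833..., 'tnr': 0.75, 'trustworthy': False} -> keep iterating on the judge prompt
```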
Discussion
Product Hunt–style comments (not star reviews). No comments yet; start the thread.
Ratings
4.7 ★★★★★ · 43 reviews
- ★★★★★ Advait Lopez · Dec 28, 2024
  Registry listing for phoenix-evals matched our evaluation — installs cleanly and behaves as described in the markdown.
- ★★★★★ Noah Ghosh · Dec 16, 2024
  Keeps context tight: phoenix-evals is the kind of skill you can hand to a new teammate without a long onboarding doc.
- ★★★★★ Valentina Abebe · Dec 12, 2024
  phoenix-evals has been reliable in day-to-day use. Documentation quality is above average for community skills.
- ★★★★★ Pratham Ware · Dec 8, 2024
  Solid pick for teams standardizing on skills: phoenix-evals is focused, and the summary matches what you get after install.
- ★★★★★ Sakshi Patil · Nov 27, 2024
  We added phoenix-evals from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.
- ★★★★★ Lucas Desai · Nov 19, 2024
  Keeps context tight: phoenix-evals is the kind of skill you can hand to a new teammate without a long onboarding doc.
- ★★★★★ Soo Patel · Nov 11, 2024
  Useful defaults in phoenix-evals — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.
- ★★★★★ Kiara Kapoor · Nov 7, 2024
  Registry listing for phoenix-evals matched our evaluation — installs cleanly and behaves as described in the markdown.
- ★★★★★ Kiara Jain · Nov 3, 2024
  phoenix-evals fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.
- ★★★★★ Valentina Khan · Oct 26, 2024
  phoenix-evals reduced setup friction for our internal harness; good balance of opinion and flexibility.