Calibrate an LLM judge against human judgment.
hamelsmu
creator profile · 7 approved skills · page 1 of 1
Generate diverse, realistic test inputs that cover the failure space of an LLM pipeline.
Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.
Complete error analysis on RAG pipeline traces before selecting metrics. Inspect what was retrieved vs. what the model needed. Determine whether the problem is retrieval, generati…
Build an HTML page that loads traces from a data source (JSON/CSV file), displays one trace at a time with Pass/Fail buttons, a free-text notes field, and Next/Previous navigation…
Inspect an LLM eval pipeline and produce a prioritized list of problems with concrete next steps.
Guide the user through reading LLM pipeline traces and building a catalog of how the system fails.
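As a rough illustration of what calibrating a binary Pass/Fail judge against human judgment can involve, the sketch below compares judge verdicts with human labels on the same traces. It is not taken from any of the listed skills: the function name `judge_agreement`, the choice of metrics (raw agreement, true positive rate, true negative rate), and the sample labels are all assumptions made for this example.

```python
from typing import Sequence


def judge_agreement(human: Sequence[bool], judge: Sequence[bool]) -> dict[str, float]:
    """Compare binary labels (True = Pass) from a human and an LLM judge.

    Hypothetical helper for illustration; human labels are treated as ground truth.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")

    # Count traces where the judge matches the human on Pass and on Fail.
    true_pass = sum(h and j for h, j in zip(human, judge))
    true_fail = sum((not h) and (not j) for h, j in zip(human, judge))
    human_pass = sum(human)
    human_fail = len(human) - human_pass

    return {
        "agreement": (true_pass + true_fail) / len(human),
        # Share of human-Pass traces the judge also passed.
        "tpr": true_pass / human_pass if human_pass else float("nan"),
        # Share of human-Fail traces the judge also failed.
        "tnr": true_fail / human_fail if human_fail else float("nan"),
    }


# Made-up labels for five traces, purely for demonstration.
human_labels = [True, True, False, False, True]
judge_labels = [True, False, False, True, True]
result = judge_agreement(human_labels, judge_labels)
print(result)
```

Reporting the two rates separately, rather than agreement alone, helps when Pass and Fail are imbalanced, since a judge that passes everything can still show high raw agreement.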