This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
Confirm successful installation by checking the skill directory location:
.cursor/skills/promptfoo-evaluation
Restart Cursor to activate promptfoo-evaluation. Access via /promptfoo-evaluation in your agent's command palette.
Quick Start
# Initialize a new evaluation project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view
Configuration Structure
A typical Promptfoo project structure:
project/
├── promptfooconfig.yaml    # Main configuration
├── prompts/
│   ├── system.md           # System prompt
│   └── chat.json           # Chat format prompt
├── tests/
│   └── cases.yaml          # Test cases
└── scripts/
    └── metrics.py          # Custom Python assertions
Core Configuration (promptfooconfig.yaml)
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Concurrency control (MUST be under commandLineOptions, NOT top-level)
commandLineOptions:
  maxConcurrency: 2

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
Prompt Formats
Text Prompt (system.md)
You are a helpful assistant.
Task: {{task}}
Context: {{context}}
Specify function with file://path.py:function_name
Return bool, float (score), or dict with pass/score/reason
Access variables via context['vars']
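The three points above can be sketched as a single assertion function. This is a minimal illustration, assuming the function is referenced as file://scripts/metrics.py:custom_assert; the `expected_keywords` test variable and the 0.5 pass cutoff are hypothetical choices made for this example, not Promptfoo built-ins.

```python
from typing import Any, Dict


def custom_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Score the output by the fraction of expected keywords it mentions."""
    # `expected_keywords` is a hypothetical test variable defined by the test author.
    test_vars = context.get("vars", {})
    keywords = test_vars.get("expected_keywords", [])
    if not keywords:
        return {"pass": True, "score": 1.0, "reason": "No keywords to check"}

    text = output.lower()
    hits = [kw for kw in keywords if kw.lower() in text]
    score = len(hits) / len(keywords)
    return {
        "pass": score >= 0.5,  # illustrative cutoff, chosen for this sketch
        "score": score,
        "reason": f"Matched {len(hits)} of {len(keywords)} keywords",
    }
```

Returning the dict form is usually the most useful of the three options, since the reason string shows up next to the score when a test fails.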
LLM-as-Judge (llm-rubric)
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness
      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model
When using a relay/proxy API, each llm-rubric assertion needs its own provider config with apiBaseUrl. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:
assert:
  - type: llm-rubric
    value: |
      Evaluate quality on a 0-1 scale.
    threshold: 0.7
    provider:
      id: anthropic:messages:claude-sonnet-4-6
      config:
        apiBaseUrl: https://your-relay.example.com/api
Best practices:
Provide clear scoring criteria
Use threshold to set minimum passing score
Default grader uses available API keys (OpenAI → Anthropic → Google)
When using relay/proxy: every llm-rubric must have its own provider with apiBaseUrl; the main provider's apiBaseUrl is NOT inherited
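Because the relay rule above is easy to miss, a small pre-flight check can help. The following is a rough sketch, not part of Promptfoo: it assumes the config has already been loaded into a plain dict (for example with a YAML parser), and the function name `find_rubrics_missing_base_url` is hypothetical.

```python
from typing import Any, Dict, List


def find_rubrics_missing_base_url(config: Dict[str, Any]) -> List[str]:
    """Return locations of llm-rubric assertions lacking provider.config.apiBaseUrl."""
    problems: List[str] = []

    def check(assertions: Any, where: str) -> None:
        for index, assertion in enumerate(assertions or []):
            if assertion.get("type") != "llm-rubric":
                continue
            provider = assertion.get("provider")
            # A bare string such as "openai:gpt-4.1" cannot carry an apiBaseUrl.
            has_url = isinstance(provider, dict) and bool(
                (provider.get("config") or {}).get("apiBaseUrl")
            )
            if not has_url:
                problems.append(f"{where}.assert[{index}]")

    check((config.get("defaultTest") or {}).get("assert"), "defaultTest")
    tests = config.get("tests")
    # `tests` may be a file:// string; this sketch only inspects inline test lists.
    if isinstance(tests, list):
        for number, test in enumerate(tests):
            if isinstance(test, dict):
                check(test.get("assert"), f"tests[{number}]")
    return problems
```

An empty result means every inline llm-rubric assertion carries its own apiBaseUrl. Test files referenced through file:// would need to be loaded and checked separately.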
Common Assertion Types
| Type | Usage | Example |
|------|-------|---------|
| contains | Check substring | value: "hello" |
| icontains | Case-insensitive substring | value: "HELLO" |
| equals | Exact match | value: "42" |
| regex | Pattern match | value: "\\d{4}" |
| python | Custom logic | value: file://script.py |
| llm-rubric | LLM grading | value: "Is professional" |
| latency | Response time (ms) | threshold: 1000 |
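To make the string-based assertion types listed above concrete, here is a rough Python approximation of what each check tests. It is an illustration of the semantics only, not Promptfoo's implementation; the `CHECKS` table and `run_check` helper are hypothetical names, and the regex check is assumed to search anywhere in the output.

```python
import re
from typing import Callable, Dict

# Approximate semantics only; Promptfoo's own behavior may differ in details.
CHECKS: Dict[str, Callable[[str, str], bool]] = {
    "contains": lambda output, value: value in output,
    "icontains": lambda output, value: value.lower() in output.lower(),
    "equals": lambda output, value: output == value,
    # Assumed to match anywhere in the output, not only the full string.
    "regex": lambda output, value: re.search(value, output) is not None,
}


def run_check(assert_type: str, output: str, value: str) -> bool:
    """Apply one of the approximated string checks to a model output."""
    return CHECKS[assert_type](output, value)
```

For example, `run_check("contains", "hello there", "HELLO")` fails while the same call with `"icontains"` passes, which is the usual reason to prefer icontains for free-form model output.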
File References
All file:// paths are resolved relative to the location of promptfooconfig.yaml (NOT the YAML file containing the reference). This is a common gotcha when tests: references a separate YAML file: the file:// paths inside that test file still resolve from the config root.
# Load file content as variable
vars:
  content: file://data/input.txt
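The resolution rule described above can be sketched in a few lines. This is an illustration of the rule as stated here, not Promptfoo's actual resolver; `resolve_file_ref` is a hypothetical helper. It deliberately takes no argument for the file that contains the reference, because under this rule that file does not affect the result.

```python
from pathlib import PurePosixPath


def resolve_file_ref(ref: str, config_path: str) -> str:
    """Resolve a file:// reference against the directory of the main config file."""
    prefix = "file://"
    if not ref.startswith(prefix):
        raise ValueError(f"not a file:// reference: {ref}")
    relative = ref[len(prefix):]
    # Only the main config's directory matters, even when the reference
    # is written inside another file such as tests/cases.yaml.
    config_dir = PurePosixPath(config_path).parent
    return str(config_dir / relative)
```

With the config at project/promptfooconfig.yaml, a reference to file://data/input.txt written inside project/tests/cases.yaml resolves to project/data/input.txt, not project/tests/data/input.txt.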