Provides eight evaluation criteria (tool trajectory, response matching, rubric-based scoring, hallucination detection, safety) with configurable thresholds and judge model options
Includes evalset schema documentation with multi-turn conversation support, tool use trajectory specification, and session state initialization patterns
Confirm successful installation by checking the skill directory location:
.cursor/skills/adk-eval-guide
Restart Cursor to activate adk-eval-guide. Access via /adk-eval-guide in your agent's command palette.
Security Notice
We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.
Skills execute code in your environment. Always review source, verify the publisher, and test in isolation before production.
Scaffolded project? If you used /adk-scaffold, you already have make eval, tests/eval/evalsets/, and tests/eval/eval_config.json. Start with make eval and iterate from there.
User simulation + hallucinations_v1 / safety_v1 (see references/user-simulation.md)
Multimodal input (image, audio, file): tool_trajectory_avg_score + custom metric for response quality (see references/multimodal-eval.md)
For the complete metrics reference with config examples, match types, and custom metrics, see references/criteria-guide.md.
Running Evaluations
# Scaffolded projects:
make eval EVALSET=tests/eval/evalsets/my_evalset.json

# Or directly via ADK CLI:
adk eval ./app <path_to_evalset.json> --config_file_path=<path_to_config.json> --print_detailed_results

# Run specific eval cases from a set:
adk eval ./app my_evalset.json:eval_1,eval_2

# With GCS storage:
adk eval ./app my_evalset.json --eval_storage_uri gs://my-bucket/evals
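When evals are driven from a script or CI job, the same invocations can be assembled programmatically. The sketch below is a minimal illustration that uses only the flags shown above. The helper name build_adk_eval_command and its parameters are hypothetical and not part of ADK.

```python
from typing import List, Optional, Sequence


def build_adk_eval_command(
    agent_dir: str,
    evalset_path: str,
    case_ids: Sequence[str] = (),
    config_path: Optional[str] = None,
    detailed: bool = False,
) -> List[str]:
    """Assemble an `adk eval` argv list (hypothetical helper, not an ADK API)."""
    # Specific eval cases are appended to the evalset path as `path:id1,id2`.
    target = evalset_path
    if case_ids:
        target = f"{evalset_path}:{','.join(case_ids)}"

    command = ["adk", "eval", agent_dir, target]
    if config_path:
        command.append(f"--config_file_path={config_path}")
    if detailed:
        command.append("--print_detailed_results")
    return command


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(...) to execute it.
    print(" ".join(build_adk_eval_command("./app", "my_evalset.json", ["eval_1", "eval_2"])))
```

Returning a list rather than a single string keeps the arguments safe to hand to subprocess.run without shell quoting.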
Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.
Full example
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      }
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.8,
      "rubrics": [
        {
          "rubric_id": "professionalism",
          "rubric_content": {
            "text_property": "The response must be professional and helpful."
          }
        },
        {
          "rubric_id": "safety",
          "rubric_content": {
            "text_property": "The agent must NEVER book without asking for confirmation."
          }
        }
      ]
    }
  }
}
Simple threshold shorthand is also valid: "response_match_score": 0.8
For custom metrics, judge_model_options details, and user_simulator_config, see references/criteria-guide.md.
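Because a criterion can be written either as a bare threshold or as an object, tooling that inspects eval_config.json may find it convenient to normalize both forms first. The sketch below is an illustration only: normalize_criteria is a hypothetical helper, and it assumes the long form stores the number under "threshold" as in the full example above. It does not reproduce how ADK itself parses the file.

```python
from typing import Any, Dict, Union

CriterionSpec = Union[int, float, Dict[str, Any]]


def normalize_criteria(criteria: Dict[str, CriterionSpec]) -> Dict[str, Dict[str, Any]]:
    """Expand shorthand thresholds into {"threshold": ...} objects (illustrative)."""
    normalized: Dict[str, Dict[str, Any]] = {}
    for name, spec in criteria.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(spec, bool):
            raise TypeError(f"{name}: expected a number or an object, got bool")
        if isinstance(spec, (int, float)):
            normalized[name] = {"threshold": float(spec)}
        elif isinstance(spec, dict):
            # Long form: copy so the caller's dict is not mutated later.
            normalized[name] = dict(spec)
        else:
            raise TypeError(f"{name}: unsupported criterion type {type(spec).__name__}")
    return normalized


if __name__ == "__main__":
    print(normalize_criteria({"response_match_score": 0.8}))
```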
EvalSet Schema (evalset.json)
{
  "eval_set_id": "my_eval_set",
  "name": "My Eval Set",
  "description": "Tests core capabilities",
  "eval_cases": [
    {
      "eval_id": "search_test",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": {"parts": [{"text": "Find a flight to NYC"}]},
          "final_response": {
            "role": "model",
            "parts": [{"text": "I found a flight for $500. Want to book?"}]
          },
          "intermediate_data": {
            "tool_uses": [
              {"name": "search_flights", "args": {"destination": "NYC"}}
            ],
            "intermediate_responses": [
              ["sub_agent_name", [{"text": "Found 3 flights to NYC."}]]
            ]
          }
        }
      ],
      "session_input": {"app_name": "my_app", "user_id": "user_1", "state": {}}
    }
  ]
}
session_input.state: initial session state (overrides Python-level initialization)
conversation_scenario: alternative to conversation for user simulation (see references/user-simulation.md)
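Writing evalset JSON by hand gets repetitive once there are several cases. As a rough sketch, the fields from the schema above can be generated from small builder functions. The names make_invocation and make_eval_case are hypothetical helpers for illustration, not ADK APIs, and they cover only the fields shown in this section.

```python
import json
from typing import Any, Dict, List, Optional


def make_invocation(
    invocation_id: str,
    user_text: str,
    tool_uses: List[Dict[str, Any]],
    final_text: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one conversation turn in the snake_case evalset layout."""
    invocation: Dict[str, Any] = {
        "invocation_id": invocation_id,
        "user_content": {"parts": [{"text": user_text}]},
        "intermediate_data": {"tool_uses": tool_uses},
    }
    if final_text is not None:
        invocation["final_response"] = {"role": "model", "parts": [{"text": final_text}]}
    return invocation


def make_eval_case(
    eval_id: str,
    conversation: List[Dict[str, Any]],
    app_name: str,
    user_id: str,
    state: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Wrap turns in an eval case with its session_input block."""
    return {
        "eval_id": eval_id,
        "conversation": conversation,
        "session_input": {"app_name": app_name, "user_id": user_id, "state": state or {}},
    }


if __name__ == "__main__":
    turn = make_invocation(
        "inv_1",
        "Find a flight to NYC",
        [{"name": "search_flights", "args": {"destination": "NYC"}}],
        "I found a flight for $500. Want to book?",
    )
    eval_set = {
        "eval_set_id": "my_eval_set",
        "name": "My Eval Set",
        "eval_cases": [make_eval_case("search_test", [turn], "my_app", "user_1")],
    }
    print(json.dumps(eval_set, indent=2))
```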
Common Gotchas
The Proactivity Trajectory Gap
LLMs often perform extra actions not asked for (e.g., google_search after save_preferences). This causes tool_trajectory_avg_score failures with EXACT match. Solutions:
Use IN_ORDER or ANY_ORDER match type, which tolerates extra tool calls between expected ones
Include ALL tools the agent might call in your expected trajectory
Use rubric_based_tool_use_quality_v1 instead of trajectory matching
Add strict stop instructions: "Stop after calling save_preferences. Do NOT search."
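To see why an extra call breaks EXACT but not the looser match types, it can help to model the three comparisons directly. The functions below are an illustration inferred from the descriptions above, not ADK's implementation. They compare tool calls as plain dicts and assume that extra calls are tolerated anywhere in the actual trajectory.

```python
from typing import Any, Dict, List

ToolCall = Dict[str, Any]


def matches_exact(expected: List[ToolCall], actual: List[ToolCall]) -> bool:
    """Same calls, same order, nothing extra."""
    return expected == actual


def matches_in_order(expected: List[ToolCall], actual: List[ToolCall]) -> bool:
    """Expected calls appear in the same relative order; extras are allowed."""
    remaining = iter(actual)
    # Each expected call must be found in what is left of the iterator.
    return all(any(call == candidate for candidate in remaining) for call in expected)


def matches_any_order(expected: List[ToolCall], actual: List[ToolCall]) -> bool:
    """Every expected call appears somewhere; order and extras are ignored."""
    unmatched = list(actual)
    for call in expected:
        if call not in unmatched:
            return False
        unmatched.remove(call)
    return True


if __name__ == "__main__":
    expected = [{"name": "save_preferences", "args": {}}]
    actual = [
        {"name": "save_preferences", "args": {}},
        {"name": "google_search", "args": {"query": "hotels"}},  # proactive extra call
    ]
    print(matches_exact(expected, actual))
    print(matches_in_order(expected, actual))
    print(matches_any_order(expected, actual))
```

In this model the proactive google_search call fails only the EXACT comparison, which is the gap the solutions above work around.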
Multi-turn conversations require tool_uses for ALL turns
The tool_trajectory_avg_score evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.
{
  "conversation": [
    {
      "invocation_id": "inv_1",
      "user_content": {"parts": [{"text": "Find me a flight from NYC to London"}]},
      "intermediate_data": {
        "tool_uses": [
          {"name": "search_flights", "args": {"origin": "NYC", "destination": "LON"}}
        ]
      }
    },
    {
      "invocation_id": "inv_2",
      "user_content": {"parts": [{"text": "Book the first option"}]},
      "final_response": {"role":
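A quick pre-flight check can catch turns that omit tool_uses before an eval run fails on them. The sketch below is a hypothetical lint helper, not part of ADK. It flags any invocation without an intermediate_data.tool_uses key. Whether an explicit empty list is the right expectation for a turn with no tool calls is an assumption to verify against the ADK docs.

```python
from typing import Any, Dict, List


def invocations_missing_tool_uses(eval_case: Dict[str, Any]) -> List[str]:
    """Return the invocation_ids that do not declare `intermediate_data.tool_uses`."""
    missing: List[str] = []
    for invocation in eval_case.get("conversation", []):
        intermediate = invocation.get("intermediate_data") or {}
        if "tool_uses" not in intermediate:
            missing.append(invocation.get("invocation_id", "<unknown>"))
    return missing


if __name__ == "__main__":
    case = {
        "conversation": [
            {
                "invocation_id": "inv_1",
                "intermediate_data": {"tool_uses": [{"name": "search_flights", "args": {}}]},
            },
            {"invocation_id": "inv_2"},  # no expected tool calls declared
        ]
    }
    print(invocations_missing_tool_uses(case))
```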
Implementation Guide
Prerequisites
- Claude Desktop or compatible AI client with skill support
- Clear understanding of task or problem to solve
- Willingness to iterate and refine outputs
Time Estimate
15-45 minutes depending on use case complexity
Steps
1. Install skill using provided installation command
2. Test with simple use case relevant to your work
3. Evaluate output quality and relevance
4. Iterate on prompts to improve results
5. Integrate into regular workflow if valuable
Common Pitfalls
- Expecting perfect results without iteration
- Not providing enough context in prompts
- Using skill for tasks outside its intended scope
- Accepting outputs without review and validation
Best Practices
Do
- Start with clear, specific prompts
- Provide relevant context and constraints
- Review and refine all outputs before using
- Iterate to improve output quality
- Document successful prompt patterns
Don't
- Don't use without understanding skill limitations
- Don't skip validation of outputs
- Don't share sensitive information in prompts
- Don't expect skill to replace human judgment
Pro Tips
- Be specific about desired format and style
- Ask for multiple options to choose from
- Request explanations to understand reasoning
- Combine AI efficiency with human expertise
When to Use This
Use when
Use when the skill's capabilities match your task, the time saved gives a clear ROI, and you can validate the outputs. Best for repetitive tasks, learning, and quality improvement.
Avoid when
Avoid when the task requires deep expertise you can't validate, involves sensitive decisions, or when the learning process is more valuable than speed of completion.
Learning Path
1. Familiarize yourself with skill capabilities and limitations
2. Start with low-risk, non-critical tasks
3. Progress to more complex and valuable use cases
4. Build expertise through regular use and experimentation