Tool definitions are context. They occupy tokens in the model's context window, they influence attention, and they shape the model's decision about what to do and how to do it. Most teams spend weeks iterating on system prompts and retrieval pipelines while leaving their tool definitions in whatever state they were in at initial implementation.
This is a significant mistake. In production agentic systems, malformed tool definitions can cause more failures than bad retrieval or bad prompts. And unlike a failure that affects a single response, a tool definition failure is systematic: it recurs on every task that requires the affected tool.
This guide covers how to design tool definitions that produce reliable tool calls in production.
What a tool definition contains
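Every major model API (Anthropic Claude, OpenAI, Google Gemini) and the MCP standard use the same conceptual structure for tool definitions, even if the exact field names differ slightly. The example below uses Anthropic's field names; OpenAI and Gemini call the schema field parameters, and MCP calls it inputSchema: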
{
  "name": "search_product_catalog",
  "description": "Search the product catalog by name, category, or SKU. Use when the user asks about product availability, pricing, or specifications. Do NOT use for order status; use get_order_status instead.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The search query. Can be a product name, partial name, category, or SKU. Example: 'wireless headphones', 'SKU-4892', 'audio'."
      },
      "category": {
        "type": "string",
        "enum": ["audio", "video", "accessories", "software"],
        "description": "Optional category filter. Omit to search all categories."
      },
      "max_results": {
        "type": "integer",
        "description": "Maximum number of results to return. Default 5, maximum 20.",
        "default": 5
      }
    },
    "required": ["query"]
  }
}
Three components drive tool performance: the name, the description, and the parameter schema. All three are context engineering decisions.
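Because the three components are the same everywhere, one canonical definition can be kept and reshaped per provider. The following is a minimal sketch, assuming the OpenAI Chat Completions function-tool wrapper and the MCP inputSchema key. The helper names to_openai_tool and to_mcp_tool are hypothetical, and the target shapes should be checked against current provider documentation.

```python
import copy

# Canonical definition, using the Anthropic-style field names from the example above.
CANONICAL_TOOL = {
    "name": "search_product_catalog",
    "description": "Search the product catalog by name, category, or SKU.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query."}
        },
        "required": ["query"],
    },
}


def to_openai_tool(tool: dict) -> dict:
    """Wrap the definition in a function-tool envelope (assumed OpenAI shape)."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": copy.deepcopy(tool["input_schema"]),
        },
    }


def to_mcp_tool(tool: dict) -> dict:
    """Rename the schema field to the camelCase key (assumed MCP shape)."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "inputSchema": copy.deepcopy(tool["input_schema"]),
    }
```

Keeping a single source of truth means a description fix lands in every provider format at once.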
Designing the tool name
The tool name is the primary signal the model uses for tool selection. It must be:
Unambiguous. The model should be able to infer from the name alone what the tool does and roughly when to use it. search is worse than search_product_catalog. get_data is worse than get_customer_order_history.
Verb-first. Name tools by the action they perform: search_, create_, update_, delete_, get_, list_, send_, calculate_. This lets the model map its own reasoning about the task ("I need to search for X") directly onto a tool name ("search_product_catalog does this").
Namespace-consistent. If you have multiple tools operating on the same resource, use consistent naming: create_ticket, get_ticket, update_ticket, list_tickets — not create_ticket, fetch_support_issue, change_ticket_status, show_all_tickets. Inconsistent naming forces the model to infer relationships that should be explicit.
Distinct from similar tools. If you have search_product_catalog and search_knowledge_base, the names already signal that these are different resources. If you have search_v1 and search_v2, the model has no way to know which to use.
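These naming rules are mechanical enough to lint. The sketch below is one possible check, not a standard tool: the function name lint_tool_names, the set of generic names, and the issue messages are all illustrative assumptions. The verb prefixes come from the list above.

```python
import re

# Verb prefixes from the list above; extend for your own domain.
VERB_PREFIXES = (
    "search_", "create_", "update_", "delete_",
    "get_", "list_", "send_", "calculate_",
)
# Names that say nothing about the resource (illustrative, not exhaustive).
GENERIC_NAMES = {"search", "get_data", "run", "query", "process"}
VERSION_SUFFIX = re.compile(r"_v\d+$")


def lint_tool_names(names: list[str]) -> dict[str, list[str]]:
    """Return a mapping of tool name to the naming problems found for it."""
    problems: dict[str, list[str]] = {}
    for name in names:
        issues = []
        if name in GENERIC_NAMES:
            issues.append("too generic: name the resource")
        elif not name.startswith(VERB_PREFIXES):
            issues.append("not verb-first")
        if VERSION_SUFFIX.search(name):
            issues.append("version suffix does not tell the model which to use")
        if issues:
            problems[name] = issues
    return problems


report = lint_tool_names(
    ["search_product_catalog", "get_data", "search_v2", "fetch_support_issue"]
)
for name, issues in report.items():
    print(f"{name}: {'; '.join(issues)}")
```

Here fetch_support_issue is flagged only because fetch_ is not in the approved verb list, which is the point: a fixed vocabulary keeps tools on the same resource consistently named.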
Writing tool descriptions that work
The description is the highest-leverage field in a tool definition. The model uses it to decide:
- Whether this is the right tool for the current task
- What the tool does and doesn't do
- When to prefer this tool over semantically similar alternatives
A good tool description answers three questions:
What does it do? State the action and the resource. "Searches the product catalog by name, category, or SKU."
When should I use it? Specify the conditions that trigger correct usage. "Use when the user asks about product availability, pricing, or specifications."
What does it NOT do? Explicitly rule out confusable cases. "Do NOT use for order status — use get_order_status instead." This prevents the most common category of tool misuse: calling the right-sounding tool for the wrong purpose.
The before/after test
Before (weak description):
"description": "Gets information about products."
After (strong description):
"description": "Search the product catalog by name, SKU, or category. Use when the user asks about product availability, pricing, features, or specifications. Returns up to 20 results matching the query. Do NOT use for checking order status or shipping; use get_order_status or get_shipment_tracking instead."
The weak description is technically accurate but gives the model nothing to reason about. The strong description specifies the data source, use conditions, result limit, and anti-cases.
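The three questions can be turned into a rough automated check. This is a heuristic sketch only: description_gaps is a hypothetical helper, and the trigger phrases and the eight-word threshold are assumptions chosen to match the wording used in this guide, not a reliable measure of description quality.

```python
WEAK = "Gets information about products."
STRONG = (
    "Search the product catalog by name, SKU, or category. Use when the user "
    "asks about product availability, pricing, features, or specifications. "
    "Returns up to 20 results matching the query. Do NOT use for checking order "
    "status or shipping; use get_order_status or get_shipment_tracking instead."
)


def description_gaps(description: str) -> list[str]:
    """Heuristically flag which of the three questions a description skips."""
    text = description.lower()
    gaps = []
    # "What does it do?": very short text rarely names both action and resource.
    if len(text.split()) < 8:
        gaps.append("what: too short to state the action and resource")
    # "When should I use it?": look for an explicit trigger phrase.
    if "use when" not in text and "use this" not in text:
        gaps.append("when: no usage condition")
    # "What does it NOT do?": look for an explicit exclusion or redirect.
    if "do not use" not in text and "instead" not in text:
        gaps.append("not: no anti-case")
    return gaps


print(description_gaps(WEAK))
print(description_gaps(STRONG))
```

A check like this catches missing sections, not bad ones. It is a floor for review, and the selection tests described later remain the real measure.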
Include a concrete example
For complex tools, add an example to the description:
"description": "...(all of the above)... Example: to find wireless headphones under $100, set query='wireless headphones' and use the price filter. Do not set query='headphones under $100': the price filter handles pricing, the query handles name matching."
This is especially valuable for tools where parameter interaction is non-obvious.
Designing the parameter schema
The parameter schema specifies what the model must provide when calling the tool. Good schema design minimizes parameter construction errors.
Name parameters clearly
Parameters follow the same naming principles as tools: be specific and be consistent. Name the thing the value represents (a noun such as customer_id) or the behavior a flag switches on (such as include_archived).
| Weak | Strong |
|---|---|
| q | search_query |
| id | customer_id or order_id (which id?) |
| type | notification_type |
| data | customer_profile_json |
| flag | include_archived |
id is the most common parameter naming mistake. When you have multiple resource types in a system, id is always ambiguous. Use customer_id, order_id, ticket_id — the model will almost never confuse these.
Describe each parameter
The description field on each parameter is not optional. The model constructs parameter values from this description. If the description is missing or vague, the model guesses.
"query": {
  "type": "string",
  "description": "The search query. Accepts partial names, full names, or SKUs. Example: 'wireless', 'Bose QuietComfort', 'SKU-4892'. Avoid long phrases: shorter, specific queries return better results."
}
Include: what values are valid, what format they should be in, and an example. For parameters with a specific format (dates, IDs, etc.), show the format explicitly:
"start_date": {
  "type": "string",
  "description": "Start date for the report range. Format: YYYY-MM-DD. Example: '2026-01-15'."
}
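Missing parameter descriptions are easy to detect before deployment. A minimal sketch follows, assuming the input_schema layout used in this guide. The helper undescribed_parameters and the get_sales_report tool are hypothetical.

```python
def undescribed_parameters(tool: dict) -> list[str]:
    """List parameter names whose description is missing or blank."""
    properties = tool.get("input_schema", {}).get("properties", {})
    return [
        name
        for name, spec in properties.items()
        if not spec.get("description", "").strip()
    ]


# Hypothetical report tool: start_date is documented, end_date is not.
report_tool = {
    "name": "get_sales_report",
    "input_schema": {
        "type": "object",
        "properties": {
            "start_date": {
                "type": "string",
                "description": "Start date for the report range. "
                               "Format: YYYY-MM-DD. Example: '2026-01-15'.",
            },
            "end_date": {"type": "string"},
        },
    },
}

print(undescribed_parameters(report_tool))  # ['end_date']
```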
Use enums for constrained values
When a parameter has a fixed set of valid values, use an enum rather than a free-form string. This eliminates a category of parameter construction errors:
"status": {
  "type": "string",
  "enum": ["open", "in_progress", "resolved", "closed"],
  "description": "Filter tickets by status. Omit to return tickets of all statuses."
}
Without the enum, the model might pass "active" or "pending" — values that are semantically reasonable but wrong for your system.
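The enum guides the model, but whether the API rejects out-of-set values depends on the provider and its settings, so it is worth validating in the tool implementation as well. The sketch below assumes that setup. check_enum is a hypothetical helper, and the error wording is only a suggestion for a message the model can act on.

```python
from typing import Optional

STATUS_SCHEMA = {
    "type": "string",
    "enum": ["open", "in_progress", "resolved", "closed"],
}


def check_enum(param_name: str, value: str, schema: dict) -> Optional[str]:
    """Return an error message listing valid values, or None if the value is valid."""
    allowed = schema.get("enum")
    if allowed is not None and value not in allowed:
        return f"Invalid {param_name} '{value}'. Valid values: {', '.join(allowed)}."
    return None


print(check_enum("status", "open", STATUS_SCHEMA))
print(check_enum("status", "active", STATUS_SCHEMA))
```

Returning the list of valid values in the error gives the model what it needs to retry with a correct call.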
Set sensible defaults
For optional parameters with common defaults, specify the default in the description:
"max_results": {
  "type": "integer",
  "description": "Maximum number of results to return. Default 5, maximum 50. Increase to 20 for broad searches, keep at 5 for specific lookups.",
  "default": 5
}
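Without a stated default, the model either omits the parameter (hoping for a backend default) or guesses a value. State the default explicitly. Note that the JSON Schema default keyword is only an annotation: nothing applies it automatically, so the tool implementation still has to fill in the value when the parameter is omitted.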
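On the backend side, the stated default has to be applied by your own code. A minimal sketch, assuming the schema layout used in this guide: apply_defaults and clamp_max_results are hypothetical helpers, and the lower bound of 1 is an assumption not stated in the schema.

```python
SEARCH_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "max_results": {"type": "integer", "default": 5},
    },
    "required": ["query"],
}


def apply_defaults(schema: dict, arguments: dict) -> dict:
    """Fill omitted parameters from each property's `default`, if one is declared."""
    resolved = dict(arguments)
    for name, spec in schema.get("properties", {}).items():
        if name not in resolved and "default" in spec:
            resolved[name] = spec["default"]
    return resolved


def clamp_max_results(arguments: dict, upper: int = 50) -> dict:
    """Enforce the documented ceiling in case the model overshoots it."""
    clamped = dict(arguments)
    # Lower bound of 1 is an assumption; the schema only documents the maximum.
    clamped["max_results"] = max(1, min(int(clamped["max_results"]), upper))
    return clamped


call = apply_defaults(SEARCH_SCHEMA, {"query": "wireless"})
print(clamp_max_results(call))
```

Applying defaults from the schema itself keeps the value the model was told about and the value the backend uses from drifting apart.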
Mark required vs optional clearly
Only mark parameters as required if the tool genuinely cannot run without them. Optional parameters must have defaults or meaningful behavior when omitted. If your tool errors when an "optional" parameter is missing in practice, make it required — or fix the tool.
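Two common inconsistencies can be checked statically: a name listed as required that the schema never defines, and a required parameter that also declares a default. The following sketch assumes the schema layout used in this guide. required_optional_issues is a hypothetical helper.

```python
def required_optional_issues(schema: dict) -> list[str]:
    """Flag contradictions between the `required` list and the property specs."""
    properties = schema.get("properties", {})
    issues = []
    for name in schema.get("required", []):
        if name not in properties:
            issues.append(f"{name}: listed as required but not defined")
        elif "default" in properties[name]:
            issues.append(f"{name}: required parameters should not carry a default")
    return issues


# Hypothetical schema with both problems.
inconsistent_schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "max_results": {"type": "integer", "default": 5},
    },
    "required": ["query", "max_results", "region"],
}

for issue in required_optional_issues(inconsistent_schema):
    print(issue)
```

This cannot tell whether the tool really errors when an optional parameter is missing. That still needs a runtime test.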
Tool surface minimization
Tool surface is the set of tools exposed to the model in a given context. Minimizing tool surface is one of the highest-leverage context engineering decisions for agentic systems.
Why surface minimization matters
Tool selection accuracy tends to degrade as the number of tools increases, especially when tools are semantically similar. With around 5 tools, selection errors are rare. With 20, confusion between similar tools becomes noticeably more common. With 50 or more, tool selection itself competes with the task for the model's attention. Treat these thresholds as rules of thumb and measure on your own tool set.
Token cost is also a factor. A comprehensive tool schema can run 200-500 tokens per tool, so 20 tools consume 4,000-10,000 tokens before the model has seen the user message. For a 16k context window, that is roughly 25% to 60% of the budget. For a 200k context window, it is only 2% to 5%, but those tokens still compete with retrieved content and conversation history.
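The arithmetic is simple enough to script for your own tool count. This sketch treats 16k as 16,000 tokens, which is an assumption (some models define it as 16,384). tool_budget is a hypothetical helper.

```python
def tool_budget(num_tools: int, tokens_per_tool: int, context_window: int) -> tuple[int, float]:
    """Return total tokens used by tool definitions and their share of the window."""
    total = num_tools * tokens_per_tool
    return total, total / context_window


for window in (16_000, 200_000):
    for per_tool in (200, 500):
        total, share = tool_budget(20, per_tool, window)
        print(f"{window:>7} window, {per_tool} tokens/tool: {total} tokens ({share:.1%})")
```

With these inputs, 20 tools take between 25% and 62.5% of a 16,000-token window and between 2% and 5% of a 200,000-token window.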
Dynamic tool exposure
For task-specific agents, expose only the tools relevant to the current task:
Email summarization task: expose get_emails, summarize_thread, mark_read. Don't expose create_order, search_product_catalog, get_shipment_status.
Order lookup task: expose get_order, get_order_history, get_shipment_status. Don't expose email tools.
This requires the orchestration layer to maintain a task-type-to-tool-subset mapping. The investment pays off immediately in reduced tool selection errors.
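The mapping can be as simple as a dictionary in the orchestration layer. A minimal sketch, using the tool names from the two examples above: the task type keys, the REGISTRY stub, and the function tools_for_task are hypothetical, and real registry entries would hold full tool definitions.

```python
TOOLS_BY_TASK = {
    "email_summarization": ["get_emails", "summarize_thread", "mark_read"],
    "order_lookup": ["get_order", "get_order_history", "get_shipment_status"],
}

# Stub registry: in practice each value is a complete tool definition.
REGISTRY = {
    name: {"name": name}
    for name in (
        "get_emails", "summarize_thread", "mark_read",
        "get_order", "get_order_history", "get_shipment_status",
        "create_order", "search_product_catalog",
    )
}


def tools_for_task(task_type: str, registry: dict[str, dict]) -> list[dict]:
    """Return only the tool definitions mapped to this task type."""
    try:
        names = TOOLS_BY_TASK[task_type]
    except KeyError:
        raise ValueError(f"No tool subset configured for task type '{task_type}'") from None
    return [registry[name] for name in names]


print([tool["name"] for tool in tools_for_task("order_lookup", REGISTRY)])
```

Failing loudly on an unknown task type is a deliberate choice here: silently falling back to the full tool set would hide the very selection errors the mapping is meant to prevent.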
Group related tools under a single schema
For operations on the same resource, consider whether they should be separate tools or parameter variations of a single tool:
// Two tools — model must choose between them
"get_customer", "get_customer_by_email"
// One tool with a parameter variant
"lookup_customer" with parameter "lookup_field": enum["id", "email", "phone"]
Fewer tools = fewer selection decisions = fewer errors. The tradeoff is that single-tool-with-parameters becomes unwieldy when the tools have genuinely different parameter shapes. Use judgment.
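Written out in full, the single-tool variant might look like the sketch below. The lookup_value parameter, the description text, and the in-memory customer list are assumptions added for illustration. Only the tool name and the lookup_field enum come from the example above.

```python
from typing import Optional

LOOKUP_CUSTOMER_TOOL = {
    "name": "lookup_customer",
    "description": "Look up a single customer by ID, email, or phone number.",
    "input_schema": {
        "type": "object",
        "properties": {
            "lookup_field": {
                "type": "string",
                "enum": ["id", "email", "phone"],
                "description": "Which field to match on.",
            },
            # Hypothetical companion parameter carrying the value to match.
            "lookup_value": {
                "type": "string",
                "description": "The value to match. Example: 'a@example.com'.",
            },
        },
        "required": ["lookup_field", "lookup_value"],
    },
}


def lookup_customer(lookup_field: str, lookup_value: str, customers: list[dict]) -> Optional[dict]:
    """Dispatch on lookup_field; the list stands in for a real data store."""
    allowed = LOOKUP_CUSTOMER_TOOL["input_schema"]["properties"]["lookup_field"]["enum"]
    if lookup_field not in allowed:
        raise ValueError(f"lookup_field must be one of {allowed}")
    for customer in customers:
        if customer.get(lookup_field) == lookup_value:
            return customer
    return None


customers = [{"id": "C-1", "email": "a@example.com", "phone": "555-0100"}]
print(lookup_customer("email", "a@example.com", customers))
```

This works because all three lookups share one parameter shape. If lookup by phone needed a country code and the others did not, separate tools would likely be clearer.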
Token budget for tool definitions
Tool definitions are static context: within a session they rarely change between calls, which makes them natural candidates for prompt caching. Structure your tool definitions to maximize cache hit rates:
- Place tool definitions in the stable prefix of the request, ahead of retrieved content and conversation history
- Keep tool definitions stable across calls: same tools, same order, same text. If you use dynamic tool exposure, keep each task type's subset fixed rather than generating it per call
- Cache the tool definition block as a stable prefix
For Anthropic's API, this means placing the tool definitions in the cacheable prefix, where tools are cached ahead of the system prompt and messages. For OpenAI, prompt caching is applied automatically to repeated prompt prefixes, so stable tool definitions help there as well. The same tools used across many calls in a session should be processed in full once and then read from cache at a discounted rate while the cache entry is live, not billed at the full rate on every call.
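One way to keep the tool block byte-stable is to order it deterministically and mark its end as a cache breakpoint. The sketch below only builds the tools list and makes no API call. It assumes the Anthropic-style cache_control field with type "ephemeral" on the last tool definition. cacheable_tools is a hypothetical helper, and the exact caching behavior should be confirmed against the provider's documentation.

```python
import copy


def cacheable_tools(tools: list[dict]) -> list[dict]:
    """Sort tools by name and mark the last one as the cache breakpoint."""
    # Deep copy so the caller's definitions are never mutated.
    ordered = sorted(copy.deepcopy(tools), key=lambda tool: tool["name"])
    if ordered:
        # Assumed Anthropic-style marker: caches the prefix up to and including this tool.
        ordered[-1]["cache_control"] = {"type": "ephemeral"}
    return ordered


tools = [{"name": "get_order"}, {"name": "create_order"}]
for tool in cacheable_tools(tools):
    print(tool)
```

Sorting matters because a cache keyed on the request prefix misses if the same tools arrive in a different order, for example when they are assembled from an unordered set.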
Verbose vs concise schemas
The tradeoff: verbose schemas with full descriptions are more expensive but produce fewer errors. Concise schemas with terse descriptions are cheaper but degrade model behavior.
The practical balance: be verbose in the description field (this drives tool selection and parameter generation), and keep parameter names and type annotations compact. Names should be specific enough to be unambiguous (customer_id, not id) but free of embedded documentation, since their job is machine-readable structure, not model guidance. Don't put documentation in comments around the schema JSON; put it in the description fields, where the model can actually see it.
Validation and testing
Unlike prompts (which are hard to validate systematically), tool definitions can be tested with repeatable scenarios:
Tool selection tests. Given a set of representative user queries, does the model select the correct tool consistently? Run 20-50 test cases and measure selection accuracy. Anything below 95% on straightforward queries is a signal to improve the tool name or description.
Parameter construction tests. Given a tool call trigger, does the model construct valid parameter values? Test with edge cases: optional parameters omitted, special characters in strings, boundary values for integers.
Anti-case tests. For each tool, identify the most likely confusable case — the query that should call a different tool. Does the model correctly route these away from the current tool? The do NOT use for... text in your description is your primary mitigation; these tests verify it's working.
Error recovery tests. When the tool call fails (bad parameter, missing required field), does the model recover correctly? Agent resilience to tool errors is a system-level property, but you can test whether the model recognizes and responds to error payloads from tools.
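A selection test harness needs little more than a list of (query, expected tool) pairs and a function that asks the model which tool it would call. In the sketch below, keyword_stub is a stand-in for that model call so the example runs offline. The harness function selection_accuracy and the test cases are hypothetical.

```python
from typing import Callable


def selection_accuracy(
    cases: list[tuple[str, str]],
    select_tool: Callable[[str], str],
) -> tuple[float, list[tuple[str, str, str]]]:
    """Return accuracy plus (query, expected, actual) for every miss."""
    if not cases:
        raise ValueError("at least one test case is required")
    failures = []
    for query, expected in cases:
        actual = select_tool(query)
        if actual != expected:
            failures.append((query, expected, actual))
    return 1 - len(failures) / len(cases), failures


def keyword_stub(query: str) -> str:
    """Stand-in for a real model call: routes on a single keyword."""
    return "get_order_status" if "order" in query.lower() else "search_product_catalog"


CASES = [
    ("Do you have wireless headphones in stock?", "search_product_catalog"),
    ("How much is SKU-4892?", "search_product_catalog"),
    ("Where is my order?", "get_order_status"),  # anti-case for the catalog tool
    ("Can I order the Bose QuietComfort in black?", "search_product_catalog"),
]

accuracy, failures = selection_accuracy(CASES, keyword_stub)
print(f"accuracy: {accuracy:.0%}")
for query, expected, actual in failures:
    print(f"MISS: {query!r} expected {expected}, got {actual}")
```

The stub misroutes the last query because it contains the word "order", which is exactly the kind of confusable case the anti-case tests are meant to surface. With a real model behind select_tool, run each case several times, since selection can vary between calls.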
Summary: tool definition checklist
For every tool in your agent's context:
- Name is specific, verb-first, and distinct from similar tools
- Description answers: what it does, when to use it, what it does NOT do
- Description includes at least one concrete example for complex tools
- All parameters have descriptions (not just type annotations)
- Constrained parameters use enums
- Defaults are stated explicitly in descriptions
- Required vs optional is correct (no required parameters that have defaults)
- Tool surface is minimized to the tools needed for the current task type
- Tool definitions are placed before dynamic content for cache efficiency
- Selection accuracy validated with a test set before deployment
Tool definitions are context. Design them like context — deliberately, specifically, and with the model's attention in mind.
One final note: tool definitions are not a "set and forget" artifact. As your system evolves — new tools added, old ones deprecated, task types refined — revisit your definitions and re-run your selection and parameter construction tests. A tool definition that was clear for 5 tools becomes ambiguous when the 6th and 7th similar tools are added. Treat tool definitions as living documentation that reflects the current state of your agent's capabilities, not a one-time implementation detail.