Skill Creator
A skill for creating new skills and iteratively improving them.
At a high level, the process of creating a skill goes like this:
- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and run claude-with-access-to-the-skill on them
- Help the user evaluate the results both qualitatively and quantitatively
- While the runs happen in the background, draft some quantitative evals if there aren't any (if there are some, you can either use as is or modify if you feel something needs to change about them). Then explain them to the user (or if they already existed, explain the ones that already exist)
- Use the
eval-viewer/generate_review.py script to show the user the results for them to look at, and also let them look at the quantitative metrics
- Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks)
- Repeat until you're satisfied
- Expand the test set and try again at larger scale
Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.
On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.
Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
Then after the skill is done (but again, the order is flexible), you can also run the skill description improver, which we have a whole separate script for, to optimize the triggering of the skill.
Cool? Cool.
Communicating with the user
The skill creator is liable to be used by people across a wide range of familiarity with coding jargon. If you haven't heard (and how could you, it's only very recently that it started), there's a trend now where the power of Claude is inspiring plumbers to open up their terminals, parents and grandparents to google "how to install npm". On the other hand, the bulk of users are probably fairly computer-literate.
So please pay attention to context cues to understand how to phrase your communication! In the default case, just to give you some idea:
- "evaluation" and "benchmark" are borderline, but OK
- for "JSON" and "assertion" you want to see serious cues from the user that they know what those things are before using them without explaining them
It's OK to briefly explain terms if you're in doubt, and feel free to clarify terms with a short definition if you're unsure if the user will get it.
Using AskUserQuestion (Critical β Read This)
Use the AskUserQuestion tool aggressively at every decision point. Do not ask open-ended text questions in conversation when structured choices exist. This is the single biggest UX improvement you can make β users juggle multiple windows and may not have looked at this conversation in 20 minutes.
Every AskUserQuestion MUST follow this structure:
- Re-ground: State the skill name, current phase, and what just happened (1-2 sentences). The user may have context-switched away.
- Simplify: Explain the decision in plain language. No function names or internal jargon. Say what it DOES, not what it's called.
- Recommend: Lead with your recommendation and a one-line reason why. If options involve effort, show both scales:
(human: ~X min / Claude: ~Y min).
- Options: Provide 2-4 concrete, lettered choices. Each option should be a clear action, not an abstract concept.
Rules:
- One decision per question β never batch unrelated choices
- Provide an escape hatch ("Other" is always implicit in AskUserQuestion)
- Accept the user's choice β nudge on tradeoffs but never refuse to proceed
- Skip the question if there's an obvious answer with no tradeoffs (just state what you'll do)
Creating a skill
Capture Intent
Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first β the tools used, the sequence of steps, corrections the user made, input/output formats observed. The user may need to fill the gaps, and should confirm before proceeding to the next step.
- What should this skill enable Claude to do?
- When should this skill trigger? (what user phrases/contexts)
- What's the expected output format?
- Should we set up test cases to verify the skill works? Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases. Skills with subjective outputs (writing style, art) often don't need them. Suggest the appropriate default based on the skill type, but let the user decide.
After extracting answers from conversation history (or asking questions 1-3), use AskUserQuestion to confirm the skill type and testing strategy:
Creating skill "[name]" β here's what I understand so far:
- Purpose: [1-sentence summary]
- Triggers on: [key phrases]
- Output: [format]
RECOMMENDATION: [Objective/Subjective/Hybrid] skill β [suggested testing approach]
Options:
A) Objective output (files, code, data) β set up automated test cases (Recommended if output is verifiable)
B) Subjective output (writing, design) β qualitative human review only
C) Hybrid β automated checks for structure, human review for quality
D) Skip testing for now β just build the skill and iterate by feel
This upfront classification drives the entire evaluation strategy downstream. Get it right here to avoid wasted effort later.
Prior Art Research (Do Not Skip)
The user's private methodology β their domain rules, workflow decisions, competitive edge β is what makes a skill valuable. No public repo can provide that. But the user shouldn't waste time reinventing infrastructure (API clients, auth flows, rate limiting) when mature tools exist. Prior art research finds building blocks for the infrastructure layer so the skill can focus on encoding the user's unique methodology.
Search these channels in order (use subagents for 4-8 in parallel):
| Priority |
Channel |
What to search |
How |
| 1 |
Conversation history |
User's proven workflows, verified API patterns, corrections made during debugging |
Grep recent conversations for the service/API name |
| 2 |
Local documents & SOPs |
User's private methodology, runbooks, existing skills |
Search project directory, ~/.claude/CLAUDE.md, ~/.claude/references/ |
| 3 |
Installed plugins & MCPs |
Already-integrated tools |
Check ~/.claude/plugins/, parse installed_plugins.json; check ~/.claude.json for configured MCP servers |
| 4 |
skills.sh |
Community skills |
WebFetch https://skills.sh/?q=<keyword> |
| 5 |
Anthropic official plugins |
Official/partner plugins |
WebFetch https://github.com/anthropics/claude-plugins-official/tree/main/plugins and external_plugins directory |
| 6 |
MCP servers on GitHub |
Existing MCP servers for the same API |
WebSearch "<service-name> MCP server site:github.com" |
| 7 |
Official API docs |
The target service's own documentation |
WebSearch "<service-name> API documentation" or WebFetch the docs URL |
| 8 |
npm / PyPI |
SDK or CLI packages |
npm search <keyword> or curl https://pypi.org/pypi/<name>/json |
Channels 1-3 surface the user's own proven patterns and existing integrations. Channels 4-8 find public infrastructure. The user's private SOP always takes precedence β public tools are building blocks, not replacements. In competitive domains (finance, trading, proprietary operations), the valuable methodology will never be public.
If a public MCP server or skill is found, clone it and verify β don't trust the README:
- Read the actual source code β many projects have polished READMEs on hollow codebases
- Verify auth method β does it match how the API actually authenticates? (X-Api-Key headers vs Bearer vs OAuth β many get this wrong)
- Check test coverage β zero tests = prototype, not production-grade
- Check maintenance β last commit date, open issue count, response to bug reports
- Check environment compatibility β proxy/network assumptions, hardcoded DNS/IPs, region locks
- Check license β MIT/Apache is fine; GPL/SSPL may conflict with proprietary use
- Check dependency weight β huge dependency trees create conflict and security surface
Decision matrix:
| Finding |
Action |
| Mature MCP/SDK handles the infrastructure |
Adopt it, build on top β install the MCP, then build the skill as a workflow layer encoding the user's methodology |
| Partial MCP or SDK exists |
Extend β use for infrastructure, fill gaps in the skill |
| Public skill covers the same domain |
Use for structural inspiration only β public skills in competitive domains are generic by definition. The user's edge is their private SOP |
| Nothing public exists |
Build from scratch β validate API access patterns work (auth, endpoints, proxy) before writing the full skill |
| Integration cost > build cost |
Build it β a 2-hour custom implementation you own beats a "mature" tool with integration friction and upstream risk |
After research completes, present findings via AskUserQuestion:
Research complete for "[skill-name]". Here's what I found:
[1-2 sentence summary of what exists publicly]
RECOMMENDATION: [ADOPT / EXTEND / BUILD] because [one-line reason]
Options:
A) Adopt [tool/MCP X] for infrastructure, build methodology layer on top (Recommended)
B) Extend [partial tool Y] β use what works, fill gaps in the skill
C) Build from scratch β nothing found matches well enough
D) Show me the detailed findings before I decide
When in doubt, bias toward adopting mature infrastructure for the plumbing layer and building custom logic for the methodology layer β that's where the value lives.
Interview and Research
Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out.
Check available MCPs - if useful for research (searching docs, finding similar skills, looking up best practices), research in parallel via subagents if available, otherwise inline. Come prepared with context to reduce burden on the user.
Write the SKILL.md
Based on the user interview, fill in these components:
- name: Skill identifier
- description: When to trigger, what it does. This is the primary triggering mechanism - include both what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Note: currently Claude has a tendency to "undertrigger" skills -- to not use them when they'd be useful. To combat this, please make the skill descriptions a little bit "pushy". So for instance, instead of "How to build a simple fast dashboard to display internal Anthropic data.", you might write "How to build a simple fast dashboard to display internal Anthropic data. Make sure to use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a 'dashboard.'"
- compatibility: Required tools, dependencies (optional, rarely needed)
- the rest of the skill :)
Skill Writing Guide
Anatomy of a Skill
skill-name/
βββ SKILL.md (required)
β βββ YAML frontmatter (name, description required)
β βββ Markdown instructions
βββ Bundled Resources (optional)
βββ scripts/ - Executable code for deterministic/repetitive tasks
βββ references/ - Docs loaded into context as needed
βββ assets/ - Files used in output (templates, icons, fonts)
YAML Frontmatter Reference
All frontmatter fields except description are optional. Configure skill behavior using these fields between --- markers:
---
name: my-skill
description: What this skill does and when to use it. Use when...
context: fork
agent: general-purpose
argument-hint: [topic]
---
| Field |
Required |
Description |
name |
No |
Display name for the skill. If omitted, uses the directory name. Lowercase letters, numbers, and hyphens only (max 64 characters). |
description |
Recommended |
What the skill does and when to use it. Claude uses this to decide when to apply the skill. If omitted, uses the first paragraph of markdown content. |
context |
No |
Set to fork to run in a forked subagent context. See "Inline vs Fork: Critical Decision" below β choosing wrong breaks your skill. |
agent |
No |
Which subagent type to use when context: fork is set. Options: Explore, Plan, general-purpose, or custom agents from .claude/agents/. Default: general-purpose. |
disable-model-invocation |
No |
Set to true to prevent Claude from automatically loading this skill. Use for workflows you want to trigger manually with /name. Default: false. |
user-invocable |
No |
Set to false to hide from the / menu. Use for background knowledge users shouldn't invoke directly. Default: true. |
allowed-tools |
No |
Pre-approved tools list. Recommendation: Do NOT set this field. Omitting it gives the skill full tool access governed by the user's permission settings. Setting it restricts the skill's capabilities unnecessarily. |
model |
No |
Model to use when this skill is active. |
argument-hint |
No |
Hint shown during autocomplete to indicate expected arguments. Example: [issue-number] or [filename] [format]. |
hooks |
No |
Hooks scoped to this skill's lifecycle. Example: hooks: { pre-invoke: [{ command: "echo Starting" }] }. See Claude Code Hooks documentation. |
Special placeholder: $ARGUMENTS in skill content is replaced with text the user provides after the skill name. For example, /deep-research quantum computing replaces $ARGUMENTS with quantum computing.
Inline vs Fork: Critical Decision
This is the most important architectural decision when designing a skill. Choosing wrong will silently break your skill's core capabilities.
CRITICAL CONSTRAINT: Subagents cannot spawn other subagents. A skill running with context: fork (as a subagent) CANNOT:
- Use the Task tool to spawn parallel exploration agents
- Use the Skill tool to invoke other skills
- Orchestrate any multi-agent workflow
Decision guide:
| Your skill needs to... |
Use |
Why |
| Orchestrate parallel agents (Task tool) |
Inline (no context) |
Subagents can't spawn subagents |
| Call other skills (Skill tool) |
Inline (no context) |
Subagents can't invoke skills |
| Run Bash commands for external CLIs |
Inline (no context) |
Full tool access in main context |
| Perform a single focused task (research, analysis) |
Fork (context: fork) |
Isolated context, clean execution |
| Provide reference knowledge (coding conventions) |
Inline (no context) |
Guidelines enrich main conversation |
| Be callable BY other skills |
Fork (context: fork) |
Must be a subagent to be spawned |
Example: Orchestrator skill (MUST be inline):
---
name: product-analysis
description: Multi-path parallel product analysis with cross-model synthesis
---
1. Auto-detect available tools (which codex, etc.)
2. Launch 3-5 Task agents in parallel (Explore subagents)
3. Optionally invoke /competitors-analysis via Skill tool
4. Synthesize all results
Example: Specialist skill (fork is correct):
---
name: deep-research
description: Research a topic thoroughly using multiple sources
context: fork
agent: Explore
---
Research $ARGUMENTS thoroughly:
1. Find relevant files using Glob and Grep
2. Read and analyze the code
3. Summarize findings with specific file references
Example: Reference skill (inline, no task):
---
name: api-conventions
description: API design patterns for this codebase
---