You wrote a skill file. It works on your machine, with your phrasing, for the one task you tested. Then a teammate tries it with slightly different wording, and the skill never triggers. Or it triggers when it shouldn't. Or it loads, runs, and produces output that looks plausible but breaks the downstream workflow.
This is the norm, not the exception. The gap between a skill that works in a demo and one that holds up across teams, models, and months of real usage is enormous. And the official documentation — while solid on structure — doesn't spend much time on the engineering decisions that close that gap.[1]
This guide covers the practical design patterns behind production-grade skill files. We'll work through metadata engineering for reliable triggering, progressive disclosure architecture for token efficiency, strategies for handling ambiguous input, output format design for both humans and machines, and a testing protocol you can run before shipping.
Your Description Field Is the Product
If the skill never triggers, nothing else matters.
Here's something the docs mention but don't emphasize enough: the description field is the only thing the agent reads when deciding whether to load your skill.[1] The entire markdown body, your carefully crafted instructions, your templates, your validation scripts — none of it exists in the agent's world until after the description has already done its job.
At startup, every skill contributes roughly 100 tokens of metadata to the system prompt. That metadata competes with potentially hundreds of other skills. The agent scans these descriptions and picks the best match for the current task. If your description is vague, generic, or missing key trigger terms, your skill is invisible.[2]
Think of it like a library card catalog. Nobody reads the book first and then checks the catalog. The catalog entry determines whether the book gets pulled from the shelf.
"Helps with documents" — too vague, matches everything and nothing
"I can process PDFs for you" — first person breaks discovery since descriptions are injected into system prompts
"Processes data" — no trigger terms, no specificity on when to activate
"Useful tool for developers" — describes the audience, not the capability
"Extracts text and tables from PDF files, fills forms, merges documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction."
"Generates descriptive commit messages by analyzing git diffs. Use when the user asks for help writing commit messages or reviewing staged changes."
"Analyzes Excel spreadsheets, creates pivot tables, generates charts. Use when analyzing Excel files, spreadsheets, tabular data, or .xlsx files."
"Deploys the application to production via the CI pipeline. Use when the user says deploy, ship, release, or push to prod."
Description field engineering rules
- **Include both WHAT the skill does and WHEN to use it.** "Generates commit messages" tells the agent what. "Use when reviewing staged changes or the user asks for commit help" tells it when. You need both.
- **Use the vocabulary your users actually use.** If users say "deploy" but your description says "provision infrastructure," the skill won't trigger. Mirror natural language.
- **Stay under 200 characters for Claude Code, under 1024 for the API.** Longer descriptions consume more of the metadata budget and may get truncated. Pack maximum signal into minimum space.
- **Never put trigger conditions in the markdown body.** The body is only loaded after the skill triggers. Any "when to use" guidance in the body cannot influence the triggering decision.
- **Test with exact user phrases, not your own vocabulary.** You know the skill exists. Your users don't. Test with the words they'd naturally use, not the words you'd use as the author.
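Most of these rules are mechanical enough to lint before shipping. A minimal sketch (the character limits are the ones stated above; the heuristics, such as treating a missing "Use when" clause as a problem, are assumptions of this sketch, not official checks):

```python
import re

def lint_description(description: str, target: str = "claude-code") -> list[str]:
    """Flag common description-field problems before shipping a skill."""
    problems = []
    limit = 200 if target == "claude-code" else 1024  # Claude Code vs. API
    if len(description) > limit:
        problems.append(f"description is {len(description)} chars (limit {limit})")
    # First-person phrasing breaks discovery: descriptions are injected into
    # the system prompt, where "I" no longer refers to the skill.
    if re.search(r"\b(I|me|my)\b", description):
        problems.append("first-person phrasing; rewrite in third person")
    # A triggering clause covers the WHEN half of the what-plus-when rule.
    if "use when" not in description.lower():
        problems.append("no 'Use when ...' trigger clause")
    return problems
```

Run it against the bad examples above and each one fails at least one check; the good examples pass all three.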
Progressive Disclosure Architecture
Three tiers that keep your context window lean.
The context window is a shared resource. Your skill competes with the system prompt, conversation history, other skills' metadata, and the user's actual request. Every token you spend on skill content is a token unavailable for reasoning.[6]
Production skills use a three-tier architecture that loads information only when needed:
Tier 1 — Metadata (always loaded, ~100 tokens): The name and description from your YAML frontmatter. This is what the agent uses for skill selection. It's always in memory.[1]
Tier 2 — Entry point (loaded on trigger, target < 500 lines): The markdown body of SKILL.md. Contains the core instructions, workflow steps, and pointers to reference files. Loaded only when the skill is activated.
Tier 3 — References (loaded on demand): Separate files like templates, examples, API docs, or scripts. The agent reads these only when the current task requires them. A skill with 2,000 lines of API documentation in a reference file costs zero tokens for tasks that don't need the API details.
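The tier budget can be audited mechanically. A minimal sketch, assuming the directory layout shown below with SKILL.md at the skill root (the 500-line figure is the Tier 2 target above; the "linked one level deep" check is a filename-mention heuristic, not a real markdown link parser):

```python
from pathlib import Path

def audit_skill(skill_dir: str) -> list[str]:
    """Check a skill directory against the three-tier budget."""
    root = Path(skill_dir)
    skill_md = root / "SKILL.md"
    if not skill_md.exists():
        return ["no SKILL.md entry point"]
    body = skill_md.read_text()
    warnings = []
    # Tier 2: the entry point should stay under ~500 lines.
    n_lines = len(body.splitlines())
    if n_lines > 500:
        warnings.append(f"SKILL.md is {n_lines} lines (target < 500)")
    # Tier 3: reference files cost nothing until read, but each should be
    # reachable one level deep, i.e. mentioned directly in SKILL.md.
    for ref in root.rglob("*.md"):
        if ref != skill_md and ref.name not in body:
            warnings.append(f"{ref.relative_to(root)} is never linked from SKILL.md")
    return warnings
```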
Production skill file structure
```
my-skill/
├── SKILL.md
│   ├── # YAML frontmatter (Tier 1: metadata)
│   └── # Markdown body (Tier 2: instructions, < 500 lines)
├── reference/
│   ├── api-docs.md
│   ├── schema-definitions.md
│   └── troubleshooting.md
├── templates/
│   ├── output-template.md
│   └── report-format.json
├── examples/
│   ├── simple-case.md
│   └── complex-case.md
└── scripts/
    ├── validate.py
    └── transform.sh
```

Handling Ambiguous Input: Clarify vs. Assume
When the user's request is unclear, your skill needs a strategy.
Real users don't speak in precise specifications. They say "fix the tests" when they mean "fix the three failing integration tests in the auth module." They say "make a report" when they mean "generate a quarterly revenue breakdown by region in markdown."
Your skill needs an explicit strategy for ambiguity. The wrong default — always asking for clarification or always assuming — creates different failure modes. Asking too many questions makes the skill feel slow and annoying. Assuming too aggressively produces wrong output that takes longer to fix than starting over.
| Situation | Strategy | Rationale |
|---|---|---|
| Destructive operations (deploy, delete, overwrite) | Always clarify | The cost of a wrong assumption is high and potentially irreversible |
| Output format not specified | Assume a sensible default, state it | Users correct format preferences quickly; blocking on format choice wastes time |
| Scope is ambiguous ("fix the tests") | Assume the narrowest reasonable scope | Fixing three tests and reporting back is better than asking which tests when the answer is obvious from context |
| Multiple valid approaches exist | Pick the conventional one, explain why | Users care about results more than methodology debates |
| Missing required parameters | Clarify with a specific suggestion | "Did you mean the staging environment?" is better than "Which environment?" |
| Domain-specific terminology | Use the project's glossary, don't ask | If the codebase calls it a "widget," use that term even if "component" is more standard |
SKILL.md:

```markdown
## Ambiguity handling

When the user's request is underspecified:

1. **Destructive actions**: Always confirm before deploying, deleting,
   or overwriting. Show exactly what will change.
2. **Scope**: Default to the narrowest reasonable interpretation.
   If "fix the tests" could mean 3 tests or 30, fix the 3 failing
   ones and report what you did.
3. **Format**: Default to markdown. State your assumption:
   "Generating in markdown — let me know if you need a different format."
4. **Missing params**: Suggest the most likely value.
   "Deploying to staging (the most recent target). Say 'production'
   if you meant prod."
```

Output Format Design for Humans and Machines
Your skill's output serves two audiences simultaneously.
A skill's output is rarely the final destination. In practice, output flows into two channels: a human reads it to decide what to do next, and a downstream process consumes it for workflow chaining. Designing for only one channel creates friction in the other.
Human-readable output that can't be parsed by the next step in a pipeline forces manual copy-paste. Machine-readable output that humans can't scan forces them to run a formatter before they can make decisions. The best skill output serves both without compromise.
Human readability principles
- Lead with the answer, not the reasoning. Put the conclusion or result in the first line.
- Use structured formatting (headers, lists, tables) so the eye can scan without reading every word.
- Include context markers: what changed, what was skipped, what needs attention.
- Keep status messages consistent: always "3 files updated, 1 skipped," not sometimes "Updated 3 files" and sometimes "Done."
Machine chainability principles
- Emit structured data (JSON, YAML) for any output that feeds another tool.
- Use the plan-validate-execute pattern: write a changes.json, validate it, then apply.
- Keep stdout clean. Diagnostic messages go to stderr or a log file, not mixed into the output stream.
- Return explicit exit codes: 0 for success, 1 for handled errors, 2 for unrecoverable failures.

1. Analyze and generate a plan file

   ```bash
   # The skill generates a structured plan
   python scripts/analyze.py input.pdf > changes.json
   ```

2. Validate the plan before executing

   ```bash
   # A validation script checks the plan for errors
   python scripts/validate.py changes.json
   # Returns specific error messages like:
   # "Field 'signature_date' not found. Available: customer_name, order_total"
   ```

3. Execute only after validation passes

   ```bash
   # Apply changes only when the plan is verified
   python scripts/execute.py changes.json --output result.pdf
   ```

4. Verify the output independently

   ```bash
   # A separate verification step confirms correctness
   python scripts/verify.py result.pdf
   # Output: "OK: 12/12 fields populated, 0 validation errors"
   ```
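The validation step is where specific error messages pay off. A sketch of what a validator like the `scripts/validate.py` above might contain — the plan-file shape (field names mapped to values) and the exact messages are assumptions of this sketch, chosen to match the example error text:

```python
import json
import sys

def validate_plan(plan_path: str, available_fields: set[str]) -> int:
    """Validate a changes.json plan. Shell-style exit codes:
    0 = valid, 1 = handled validation errors, 2 = unrecoverable."""
    try:
        with open(plan_path) as f:
            plan = json.load(f)
    except (OSError, json.JSONDecodeError) as e:
        print(f"Cannot read plan: {e}", file=sys.stderr)
        return 2
    errors = []
    for field in plan:
        if field not in available_fields:
            # Verbose, specific messages let the agent self-correct
            # instead of guessing what went wrong.
            errors.append(
                f"Field '{field}' not found. "
                f"Available: {', '.join(sorted(available_fields))}"
            )
    for err in errors:
        print(err, file=sys.stderr)  # diagnostics on stderr; stdout stays clean
    return 1 if errors else 0
```

Note that diagnostics go to stderr and the return value is an exit code, so the script slots into the pipeline above without polluting the output stream.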
Setting the Right Degrees of Freedom
Not every instruction needs to be a rigid command.
A common mistake is writing every instruction as an absolute rule. In practice, different parts of a skill need different levels of constraint. Database migrations need exact commands. Code reviews need general guidelines. Mixing these up — over-constraining flexible tasks or under-constraining fragile ones — is how skills break.[6]
The useful mental model is a landscape analogy. Some tasks are a narrow bridge with cliffs on both sides: there's one safe path, and deviation means failure. Other tasks are an open field: many paths reach the goal, and the agent's judgment about which route to take is usually better than a hardcoded choice.
Frontmatter Field Guide
Every field serves a purpose — here's when to use each one.
SKILL.md:

```yaml
---
name: deploy-staging
description: >-
  Deploys the application to the staging environment.
  Use when the user says deploy, ship, push to staging,
  or asks to test in a staging-like environment.
disable-model-invocation: true
allowed-tools: Bash(kubectl *), Bash(helm *), Read
context: fork
agent: general-purpose
---
```

| Field | When to use | Common mistake |
|---|---|---|
| name | Always. Becomes the /slash-command. Lowercase, hyphens only. | Using spaces or uppercase. Using vague names like "helper." |
| description | Always. The triggering mechanism. Both what and when. | Writing in first person. Omitting trigger phrases. |
| disable-model-invocation | For skills with side effects: deploy, delete, send messages. | Leaving it off for destructive operations the agent should never auto-trigger. |
| user-invocable | Set to false for background knowledge skills. | Hiding skills that users should be able to invoke directly. |
| allowed-tools | When the skill needs specific tools without per-use approval. | Granting overly broad tool access like Bash(*) when Bash(git *) suffices. |
| context | Set to fork when the skill should run in isolation. | Forking a skill that contains only reference content and no actionable task. |
| agent | When a specific subagent type (Explore, Plan) fits better than general-purpose. | Not specifying an agent for research tasks that benefit from Explore's read-only toolset. |
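The name rule in particular is easy to enforce mechanically. A minimal sketch using simplified line-based parsing (a real linter would use a YAML library; multi-line values like `>-` blocks are not handled here):

```python
import re

def lint_frontmatter(text: str) -> list[str]:
    """Check a SKILL.md's frontmatter against the field rules above."""
    problems = []
    match = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
    if not match:
        return ["no YAML frontmatter block"]
    fields = {}
    for line in match.group(1).splitlines():
        if ":" in line and not line.startswith(" "):  # top-level keys only
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    name = fields.get("name", "")
    # name becomes the /slash-command: lowercase and hyphens only.
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        problems.append(f"name '{name}' must be lowercase with hyphens only")
    if "description" not in fields:
        problems.append("missing description field")
    return problems
```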
The Testing Protocol
Ship with confidence using this five-point verification process.
Writing evaluations before writing extensive documentation is one of the highest-leverage practices in skill engineering.[4] It keeps you honest about whether the skill actually solves real problems rather than imagined ones.
The testing protocol below combines automated checks with observational testing. Automated checks catch structural problems. Observational testing catches behavioral problems — the skill triggers at the wrong time, ignores a reference file, or over-relies on one section of the instructions.
1. **Build 3+ evaluation scenarios before writing docs.** Identify concrete tasks the skill should handle. Run the agent on those tasks without the skill to document specific failures. These become your baseline. If the agent already handles the task well without the skill, you don't need the skill.
2. **Test triggering across phrasings.** The description field is your biggest failure point. Test with at least 5 different phrasings that a real user might use. Include phrasings from people who don't know the skill exists — they'll say "make a report," not "invoke the report-generator skill."
3. **Test across model tiers.** What works perfectly for Opus might need more detail for Haiku. If your skill will be used across models, test with all of them. Haiku may need more explicit guidance. Opus may over-interpret verbose instructions.
4. **Observe navigation patterns.** Watch how the agent actually uses the skill in practice. Does it read files in the order you expected? Does it skip reference files it should read? Does it re-read the same section repeatedly — a sign that content should be promoted to SKILL.md?
5. **Run the feedback loop with a second agent.** Use one Claude instance (Claude A) to author and refine the skill. Use a separate instance (Claude B) to test it on real tasks. Claude A understands what agents need. Claude B reveals what's actually missing. Iterate between them.
Pre-ship skill file checklist
- Description includes both WHAT the skill does and WHEN to use it
- Description is written in third person
- SKILL.md body is under 500 lines
- Reference files are linked one level deep from SKILL.md
- Destructive operations require explicit confirmation
- Ambiguity strategy is documented (clarify vs. assume for each case)
- Output format works for both human readers and downstream tools
- 3+ evaluation scenarios written and passing
- Triggering tested with 5+ natural phrasings
- Tested across Haiku, Sonnet, and Opus
- No time-sensitive information in instructions
- Consistent terminology throughout (no synonym drift)
- All file paths use forward slashes
- Validation scripts have verbose, specific error messages
Anti-Patterns That Kill Production Skills
Patterns that look reasonable but cause real failures.
After reviewing hundreds of skill files across open-source repositories and production codebases, certain failure patterns show up repeatedly.[6] These aren't hypothetical risks — they're the actual reasons skills break when they leave the author's machine.
Stuffing the markdown body with trigger conditions
The markdown body is only loaded after the skill triggers. Any "when to use this skill" guidance in the body is too late — the triggering decision was already made based on the description alone. Move all trigger conditions to the description field.
Offering too many options without a default
"You can use pypdf, or pdfplumber, or PyMuPDF, or pdf2image…" forces the agent to make an arbitrary choice that may not match your preference. Pick one default and mention alternatives only for specific edge cases: "Use pdfplumber for text extraction. For scanned PDFs requiring OCR, use pdf2image with pytesseract instead."
Deep reference nesting (file links to file links to file)
When the agent follows a chain of references, it may use partial reads (like head -100) on deeply nested files. Keep all references one level deep from SKILL.md. If reference-a.md needs to reference something in reference-b.md, link both directly from SKILL.md instead.
Voodoo constants in scripts
A TIMEOUT = 47 or MAX_RETRIES = 5 without explanation is a maintenance trap. The agent can't reason about whether these values are appropriate for the current situation. Document the reasoning: "Three retries balances reliability vs. speed — most intermittent failures resolve by the second retry."
Writing for one model and deploying across all
Instructions that work perfectly for Opus may leave Haiku confused. Instructions detailed enough for Haiku may cause Opus to over-interpret or follow them too literally. Test with every model you plan to deploy on, and adjust the level of detail accordingly.
Putting It All Together
A complete production skill example.
SKILL.md:

```yaml
---
name: generate-changelog
description: >-
  Generates a formatted changelog from git history between two refs.
  Use when the user asks for a changelog, release notes, what changed,
  or a summary of recent commits.
allowed-tools: Bash(git *), Read, Write
---
```
# Generate Changelog
## Workflow
1. Determine the ref range:
   - If two refs provided, use them directly
   - If one ref provided, use `ref..HEAD`
   - If none provided, use the last tag to HEAD
   - State your assumption: "Generating changelog from v2.1.0 to HEAD"
2. Gather commits:
   ```bash
   git log --oneline --no-merges $FROM..$TO
   ```
3. Categorize by conventional commit prefix:
   - feat: → Features
   - fix: → Bug Fixes
   - perf: → Performance
   - docs: → Documentation
   - Other → Maintenance
4. Generate output using the template in [templates/changelog.md](templates/changelog.md)
5. Write to CHANGELOG.md (append, don't overwrite)
## Ambiguity handling
- **No refs specified**: Default to last tag..HEAD. State the assumption.
- **Unclear scope**: Generate for the current branch only.
- **Format not specified**: Default to markdown grouped by category.

This example demonstrates every principle from this guide working together. The description includes both what (generates a changelog) and when (user asks for changelog, release notes, what changed). The workflow uses the narrowest reasonable assumption for ambiguous inputs. The output format is designed for both human reading (grouped by category) and machine use (structured markdown). And the skill references a template file that loads only when needed.[3]
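Step 3 of the workflow is simple enough to sketch directly; the prefix-to-section mapping mirrors the list in the workflow, and the input format is the `git log --oneline` output from step 2:

```python
from collections import defaultdict

# Conventional-commit prefix -> changelog section, mirroring step 3.
SECTIONS = {"feat": "Features", "fix": "Bug Fixes",
            "perf": "Performance", "docs": "Documentation"}

def categorize(oneline_log: str) -> dict[str, list[str]]:
    """Group `git log --oneline` output into changelog sections."""
    grouped = defaultdict(list)
    for line in oneline_log.strip().splitlines():
        sha, _, subject = line.partition(" ")
        # "feat(scope): msg" -> "feat"; anything unrecognized -> Maintenance
        prefix = subject.split(":", 1)[0].split("(")[0]
        grouped[SECTIONS.get(prefix, "Maintenance")].append(subject)
    return dict(grouped)
```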
The real test isn't whether the skill works when you invoke it with `/generate-changelog v2.0.0 v2.1.0`. It's whether it triggers when a teammate types "what shipped this week" into the chat. Design for that phrasing, test for that phrasing, and you'll have a skill that actually gets used.
If your skill does not trigger, it is almost never the instructions. It is the description.
[1] Anthropic — Claude Code Skills Documentation (code.claude.com)
[2] Anthropic — Agent Skills Best Practices (platform.claude.com)
[3] agentskills.io — Agent Skills Open Standard Specification (agentskills.io)
[4] Towards Data Science — How to Build a Production-Ready Claude Code Skill (towardsdatascience.com)
[5] Benjamin Abt — Agent Skills Standard: GitHub Copilot (benjamin-abt.com)
[6] Bibek Poudel — The SKILL.md Pattern: How to Write AI Agent Skills That Actually Work (bibek-poudel.medium.com)
[7] DeepWiki — 8.1 SKILL.md Format Specification (deepwiki.com)