Introduction
The system message shapes what an LLM is and how it thinks. But modern agents also need to act—to read files, search codebases, execute commands, and modify the world. This tutorial explores how to teach an LLM to use tools effectively.
Tool use represents a phase transition in LLM interaction. Without tools, the LLM is a pure reasoning engine, transforming input tokens to output tokens. With tools, it becomes an agent—capable of perception (reading), planning (deciding which tools), and action (invoking tools). This shift requires new architectural thinking in our system prompts.
The Anatomy of Tool-Use Prompts
What Changes with Tools
A system prompt for a tool-using agent must address three additional concerns beyond identity and behavior:
- Tool Recognition: How does the LLM recognize when a tool applies?
- Tool Selection: How does it choose which tool among alternatives?
- Tool Composition: How do tools combine for multi-step work?
These are decision-theoretic problems, not just behavioral constraints. The prompt must encode a decision procedure, not merely a persona.
The Decision Tree Pattern
Effective tool-use prompts embed decision trees. Before any action, the agent runs a mental checklist:
Before ANY action:
1. Is this multi-step? → Plan first (use TodoWrite)
2. Should I delegate? → Use specialized agent
3. Do I need information? → Search/Read first
4. Am I ready to act? → Proceed with appropriate tool
This “pre-flight checklist” pattern appears in both gptel-agent and opencode. It forces deliberation before action, reducing impulsive tool misuse.
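To make the pattern concrete, here is a minimal Python sketch of the same checklist as harness-side logic. In practice the checklist lives in the prompt and the LLM runs it mentally; the `Task` fields here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Task:
    # Hypothetical flags; a real harness would derive these from the request.
    is_multi_step: bool = False
    better_delegated: bool = False
    needs_information: bool = False

def pre_flight(task: Task) -> str:
    """Run the checklist in order and return the first action that applies."""
    if task.is_multi_step:
        return "plan"       # 1. plan first (e.g. a todo list)
    if task.better_delegated:
        return "delegate"   # 2. hand off to a specialized agent
    if task.needs_information:
        return "gather"     # 3. search/read before acting
    return "act"            # 4. proceed with the appropriate tool

assert pre_flight(Task(is_multi_step=True)) == "plan"
assert pre_flight(Task()) == "act"
```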
The Tool Hierarchy Pattern
When multiple tools could accomplish a task, which should the agent prefer? Effective prompts establish explicit hierarchies:
Specialized tool > General tool > Shell escape
Specifically:
Read > cat/head/tail
Grep > grep/rg (shell)
Glob > find/ls
Edit > sed/awk
Write > echo/heredocs
The rationale: specialized tools provide structured output, better error handling, and integrate with the agent’s planning. Shell commands are an escape hatch, not a default.
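As an illustration of "shell as escape hatch", a dispatcher might resolve each task kind to the most specialized tool available and fall back to shell only when nothing else fits. The mapping below is a sketch, not any framework's actual API:

```python
# Preference order per task kind; "Bash" is always the last resort.
PREFERRED = {
    "read_file":      ["Read", "Bash"],   # Read > cat/head/tail
    "search_content": ["Grep", "Bash"],   # Grep > grep/rg (shell)
    "find_files":     ["Glob", "Bash"],   # Glob > find/ls
    "edit_file":      ["Edit", "Bash"],   # Edit > sed/awk
    "write_file":     ["Write", "Bash"],  # Write > echo/heredocs
}

def choose_tool(task_kind: str, available: set[str]) -> str:
    """Pick the most specialized tool that is actually available."""
    for tool in PREFERRED.get(task_kind, ["Bash"]):
        if tool in available:
            return tool
    raise ValueError(f"no tool available for {task_kind!r}")

assert choose_tool("read_file", {"Read", "Bash"}) == "Read"
assert choose_tool("read_file", {"Bash"}) == "Bash"  # escape hatch
```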
Documenting Individual Tools
The Consistent Schema
Each tool needs documentation that answers the same questions. Inconsistent documentation forces the LLM to infer structure, increasing errors.
<tool name="ToolName">
<purpose>
What the tool does in one sentence.
</purpose>
<when_to_use>
- Condition A
- Condition B
- Pattern: "user says X" → use this tool
</when_to_use>
<when_not_to_use>
- Condition C → use Y instead
- Condition D → delegate to Z
- Anti-pattern: never use for X
</when_not_to_use>
<how_to_use>
- Required parameters
- Optional parameters
- Common patterns
- Constraints (e.g., "must Read before Edit")
</how_to_use>
<examples>
- Example invocation 1
- Example invocation 2
</examples>
</tool>
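Because the schema is regular, it is easy to lint. A small validator like the following sketch (standard library only, assuming tool docs are stored as well-formed XML) can catch missing sections before they ever reach the model:

```python
import xml.etree.ElementTree as ET

REQUIRED = ["purpose", "when_to_use", "when_not_to_use", "how_to_use", "examples"]

def check_tool_doc(xml_text: str) -> list[str]:
    """Return a list of schema violations for one <tool> block."""
    tool = ET.fromstring(xml_text)
    problems = []
    if tool.tag != "tool" or "name" not in tool.attrib:
        problems.append("root element must be <tool name=...>")
    present = {child.tag for child in tool}
    problems += [f"missing <{s}>" for s in REQUIRED if s not in present]
    return problems

print(check_tool_doc('<tool name="Grep"><purpose>Search.</purpose></tool>'))
# ['missing <when_to_use>', 'missing <when_not_to_use>', ...]
```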
The “When NOT” Principle
The <when_not_to_use> section is often more valuable than <when_to_use>. LLMs tend to over-apply tools; explicit prohibitions correct this bias.
Compare:
# Weak
Use Grep to search file contents.
# Strong
Use Grep for ONE specific, well-defined pattern when you know what you're
looking for. Do NOT use Grep for exploratory searches (delegate to researcher),
when you expect 20+ matches (delegate), or to find files by name (use Glob).
The strong version encodes decision boundaries, not just capabilities.
Case Study: Two Approaches
gptel-agent: Structured XML
gptel-agent uses hierarchical XML with explicit tags for each concern:
<role_and_behavior>
<response_tone>...</response_tone>
<critical_thinking>...</critical_thinking>
</role_and_behavior>
<task_execution_protocol>
Before starting ANY task, run this mental checklist:
1. Is this multi-step work? → CREATE A TODO LIST
2. Does this task need delegation? → ...
</task_execution_protocol>
<tool_usage_policy>
<tool name="Grep">
When to use: ...
When NOT to use: ...
How to use: ...
</tool>
</tool_usage_policy>
Overall Architecture: This prompt follows a hierarchical instruction pattern with three major conceptual layers:
- Identity & Behavioral Constraints (<role_and_behavior>)
- Decision Framework (<task_execution_protocol>)
- Tool Catalog (<tool_usage_policy>)
The architectural thinking here is defensive programming for LLMs — anticipating failure modes and explicitly blocking them.
Strengths
Negative examples are explicit: Each tool has “When NOT to use” — this is crucial. LLMs tend to over-apply tools; explicit prohibitions help.
Decision trees for delegation: The protocol doesn’t just list tools but provides pattern matching heuristics: “if you’re about to grep and aren’t sure what you’ll find -> delegate.”
Hierarchy enforcement: “NEVER use Bash for file operations” — absolute rules prevent drift.
Consistency in tool blocks: Every <tool> block has the same structure: When to use / When NOT to use / How to use. This regularity aids comprehension.
Weaknesses
Redundancy: The delegation rules appear in both <task_execution_protocol> and again inside <tool name="Agent">. This inflates context and risks inconsistency.
No explicit error handling: What should the agent do when a tool fails? The prompt says "if errors occur, keep as in_progress" but doesn't give recovery strategies.
Magic numbers: “3+ files -> delegate”, “5+ tool calls -> delegate” — these thresholds are arbitrary and unexplained. Why not 2? Why not 4?
Missing conceptual model: The prompt tells what to do but not why the tool hierarchy exists. A brief explanation (“specialized tools provide structured output, Bash is a fallback”) would improve generalization.
Template placeholder: {{AGENTS}} suggests dynamic injection, but the prompt doesn't explain what agents exist; the LLM may hallucinate agent names.
Architectural Principles We Can Extract
The "When NOT" Pattern: For every capability, explicitly state anti-patterns. This is more valuable than positive instructions.
When to use X:
- condition A
- condition B
When NOT to use X:
- condition C → use Y instead
- condition D → delegate to Z
Pre-Action Protocol: Force a decision checkpoint before tool invocation:
Before ANY action:
1. Is this multi-step? → plan first
2. Is this delegatable? → delegate
3. Is this within scope? → proceed
Tool Hierarchy with Fallback: Establish a preference order:
Specialized tool > General tool > Shell escape
Consistent Documentation Schema: Every tool should be documented with identical sections. This aids both human maintenance and LLM comprehension.
Suggested XML Schema for Your Project: For developing your own system prompt with tools, consider this structure:
<collaborator_identity>
<relationship> — how you relate to the user
<epistemics> — your stance on knowledge/uncertainty
<expression> — tone, style constraints
</collaborator_identity>
<deliberation_protocol>
<before_action> — checklist before doing anything
<on_uncertainty> — what to do when unsure
<on_failure> — recovery strategies
</deliberation_protocol>
<capabilities>
<capability name="X">
<purpose> — what it's for
<anti-patterns> — when NOT to use
<usage> — how to invoke correctly
<examples> — concrete cases
</capability>
</capabilities>
What Can Be Generalized
The core insight: tool-use prompts are really decision-tree specifications. The tools themselves are secondary; what matters is:
- Recognition patterns: How does the LLM recognize which tool applies?
- Exclusion rules: How does it avoid misapplication?
- Composition rules: How do tools combine for multi-step work?
For any tool set, you need to answer these three questions explicitly.
opencode: Prose with Headers
opencode uses markdown headers with flowing prose:
# Tone and style
You should be concise, direct, and to the point...
# Following conventions
When making changes to files, first understand the file's code conventions...
# Tool usage policy
When doing file search, prefer to use the Task tool...
Structural Comparison with gptel-agent
| Aspect | gptel-agent | opencode |
|---|---|---|
| Format | YAML frontmatter + XML tags | Pure prose with markdown headers |
| Organization | Hierarchical XML nesting | Flat sections with # headers |
| Tool docs | Per-tool <tool name="X"> blocks | Brief policy paragraph |
| Tone | Neutral, technical | Anthropomorphized (“You are opencode”) |
| Length | ~400 lines | ~150 lines |
Architectural Analysis
What opencode Does Differently
- Security-first preamble: Opens with malicious code detection — this is absent from gptel-agent
“If it seems malicious, refuse to work on it”
- Output token minimization: Explicit instruction to minimize tokens — a cost/latency concern
“You should minimize output tokens as much as possible”
- Anti-pattern repetition: Key rules are repeated with “IMPORTANT:” markers — brute-force emphasis
IMPORTANT: You should NOT answer with unnecessary preamble or postamble
- Proactiveness spectrum: Explicit philosophy about agent autonomy
“Strike a balance between doing the right thing… and not surprising the user”
Sections Breakdown
| Section | Purpose |
|---|---|
| Identity | “You are opencode, an assistant running within Emacs” |
| Security | Malicious code detection, URL restrictions |
| Tone and style | Conciseness, markdown, no emojis |
| Proactiveness | Autonomy boundaries |
| Following conventions | Code style mimicry, library verification |
| Code style | “DO NOT ADD COMMENTS” |
| Task Management | TodoWrite emphasis |
| Doing tasks | Workflow: search -> implement -> verify -> lint |
| Tool usage policy | Batching, parallelism, Task delegation |
| Code References | Output format for file references |
Strengths
Workflow-oriented: Describes a complete task lifecycle (plan -> search -> implement -> verify -> lint)
Convention awareness: “First look at existing components” — teaches the LLM to learn from context
Commit discipline: “NEVER commit unless explicitly asked” — prevents a common footgun
Multiple preset tiers: opencode, opencode-coding, opencode-minimal, opencode-general provide different tool bundles for different contexts.
Weaknesses
No tool documentation: Unlike gptel-agent, individual tools have no When/When-NOT/How sections. The LLM must infer usage.
Repetition as emphasis: “IMPORTANT:” appears 6 times — this inflates the prompt without structured information.
Implicit tool hierarchy: “prefer to use the Task tool to reduce context” — but no explicit hierarchy like gptel-agent’s “Specialized > General > Shell”.
Prose over structure: Harder to parse programmatically; harder to maintain; harder for the LLM to reference specific rules.
Synthesis
The ideal approach combines:
- XML structure from gptel-agent (hard boundaries, consistent schema)
- Workflow orientation from opencode (task lifecycle, convention awareness)
- Decision trees that encode when and when-not for each capability
Architectural Principles
1. Decision Trees Over Capability Lists
Don’t just list what tools can do. Encode when to use them:
Pattern matching for delegation:
- "how does...", "where is...", "find all..." → researcher agent
- "create/modify these files..." → executor agent
- "I need to understand..." about Emacs → introspector agent
2. Negative Specification
For every capability, specify anti-patterns:
<when_not_to_use>
- Building code understanding → delegate to researcher
- Expected 20+ matches → delegate to researcher
- Will need follow-up searches → delegate to researcher
- Searching for files by name → use Glob instead
</when_not_to_use>
3. Pre-Action Protocols
Force deliberation before action:
<task_execution_protocol>
Before ANY action:
1. Is this multi-step? → Plan first
2. Do I need to read first? → Read before edit
3. Should I delegate? → Use sub-agent
4. Am I certain? → Proceed
</task_execution_protocol>
4. Tool Hierarchies with Fallbacks
Establish explicit preferences:
<tool_hierarchy>
File search by name → Glob (NOT find/ls)
Content search → Grep (NOT grep/rg shell)
Read files → Read (NOT cat/head/tail)
Edit files → Edit (NOT sed/awk)
Write files → Write (NOT echo/heredocs)
System operations → Bash (for git, npm, docker only)
</tool_hierarchy>
5. Cost and Context Awareness
Modern prompts increasingly address resource constraints:
<context_management>
- Delegate to reduce context usage when exploring
- Batch independent tool calls in single response
- Use specialized tools (structured output) over shell (text parsing)
- Consider: will this bloat context? → delegate to executor
</context_management>
6. Error Recovery
What happens when tools fail? Most prompts ignore this:
<error_recovery>
When a tool fails:
1. Read the error message carefully
2. Diagnose: wrong parameters? precondition violated? system issue?
3. If precondition (e.g., file doesn't exist): address precondition first
4. If parameters: correct and retry
5. If system issue: report to user, suggest alternatives
Do NOT retry the same failing call repeatedly.
</error_recovery>
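Part of this protocol can also be enforced outside the prompt. The sketch below is one way a harness might stop the "retry the same failing call" loop; diagnosing the error (steps 1-3) remains the model's job, and every name here is hypothetical.

```python
def guarded_call(tool, args: dict, failure_log: list[str]):
    """Invoke a tool; refuse to repeat a call that already failed identically."""
    try:
        return tool(**args)
    except Exception as err:
        signature = f"{tool.__name__}({sorted(args)}): {err}"
        if signature in failure_log:
            raise RuntimeError(
                "same call failed twice; stop, report to user, suggest alternatives"
            ) from err
        failure_log.append(signature)
        raise  # surface the error so the agent can diagnose it
```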
An XML Schema for Tool-Use Prompts
Combining the principles above, here is a schema for tool-use system prompts:
<!-- Identity layer (from system-prompt tutorial) -->
<role>...</role>
<collaboration_stance>...</collaboration_stance>
<behavioral_attractors>...</behavioral_attractors>
<epistemic_hygiene>...</epistemic_hygiene>
<priority_rules>...</priority_rules>
<!-- Tool-use layer (new) -->
<task_execution_protocol>
<pre_action_checklist>
Before ANY action:
1. Multi-step? → Plan with TodoWrite
2. Need information? → Search/Read first
3. Delegate? → Use appropriate agent
4. Ready? → Proceed
</pre_action_checklist>
<delegation_rules>
- Pattern X → agent Y
- Pattern Z → handle inline
</delegation_rules>
</task_execution_protocol>
<tool_hierarchy>
Specialized > General > Shell escape
[specific mappings]
</tool_hierarchy>
<tool_catalog>
<tool name="ToolA">
<purpose>...</purpose>
<when_to_use>...</when_to_use>
<when_not_to_use>...</when_not_to_use>
<how_to_use>...</how_to_use>
</tool>
<!-- repeat for each tool -->
</tool_catalog>
<error_recovery>
[recovery protocol]
</error_recovery>
<context_management>
[cost/context awareness rules]
</context_management>
The Ecosystem: Patterns from the Community
The AGENTS.md / CLAUDE.md Pattern
A convention emerging from claude-code: place a file in the repository root (CLAUDE.md, AGENTS.md, .cursorrules) containing project-specific instructions. This separates:
- Generic agent behavior -> in system prompt
- Project-specific conventions -> in repo file
The agent is instructed to read this file at session start and follow its directives.
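A repo file of this kind might look like the following; the project rules are hypothetical and only illustrate the split between generic behavior (system prompt) and project conventions (repo file):

```
# CLAUDE.md (project conventions for agents)

- Source lives in src/; tests mirror it under tests/.
- Run `make test` before declaring any change complete.
- Match the existing logging helpers; do not add print statements.
- Never commit or push; the user handles version control.
```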
The Diff-Based Editing Pattern
Tools like aider instruct the LLM to produce unified diffs rather than full file contents:
When editing files, output a unified diff:
--- a/path/to/file.py
+++ b/path/to/file.py
@@ -10,2 +10,3 @@
existing line
+new line
existing line
This reduces token usage and makes changes reviewable.
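Python's standard difflib can produce this format, which makes the token saving easy to see: only the changed hunk is emitted, regardless of file size. The file contents below are illustrative.

```python
import difflib

old = ["def greet():\n", "    print('hi')\n", "    return None\n"]
new = ["def greet():\n", "    print('hello')\n", "    return None\n"]

# unified_diff yields the headers plus only the changed hunk with context.
diff = difflib.unified_diff(old, new, fromfile="a/greet.py", tofile="b/greet.py")
print("".join(diff))
```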
The Read-Before-Edit Pattern
Nearly universal: require reading a file before editing it.
MUST Read the file before using Edit.
The Edit tool will error if you haven't read the file first.
This prevents edits based on stale or hallucinated content.
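The precondition is typically enforced by the harness rather than trusted to the model. A minimal sketch of such a guard (the session-state set and error text are illustrative):

```python
_read_this_session: set[str] = set()

def tool_read(path: str) -> str:
    """Read a file and record that its current contents were seen."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    _read_this_session.add(path)
    return text

def tool_edit(path: str, old: str, new: str) -> None:
    """Refuse to edit any file that has not been Read this session."""
    if path not in _read_this_session:
        raise RuntimeError(f"must Read {path} before Edit")
    text = tool_read(path)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace(old, new, 1))
```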
The Verify-After-Change Pattern
From opencode:
After implementing changes:
1. Run lint command if available
2. Run typecheck if available
3. Run relevant tests
4. Do NOT commit unless explicitly asked
This closes the loop: act -> verify -> report.
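A harness can back this with a hook that runs after every change; the command list below is a placeholder for whatever the project actually uses.

```python
import subprocess

# The command names are placeholders; substitute the project's real checks.
def verify(commands=("make lint", "make typecheck", "make test")) -> str:
    """Run each check; report the first failure instead of committing anything."""
    for cmd in commands:
        result = subprocess.run(cmd.split(), capture_output=True, text=True)
        if result.returncode != 0:
            return f"verification failed at `{cmd}`:\n{result.stdout}{result.stderr}"
    return "all checks passed; awaiting explicit instruction before any commit"
```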
Failure Modes in Tool Use
Over-Application
The LLM uses a tool when it shouldn’t.
Cause: Tool documentation emphasizes capability without boundaries.
Fix: Strong <when_not_to_use> sections.
Under-Application
The LLM describes what it would do instead of doing it.
Cause: Insufficient emphasis on action; too much "assistant" framing.
Fix: "You have tools. Use them. Don't describe what you would do—do it."
Wrong Tool Selection
The LLM uses Bash when Grep would work; uses Grep when delegation is appropriate.
Cause: No tool hierarchy; no decision procedure.
Fix: Explicit hierarchy and pattern-matching rules.
Context Explosion
The LLM fills context with search results instead of delegating.
Cause: No cost awareness; no delegation rules.
Fix: "If you expect 20+ results, delegate to researcher."
Edit Failures
The LLM tries to edit files it hasn’t read, or provides non-unique match strings.
Cause: Preconditions not enforced in prompt.
Fix: "MUST Read before Edit. Edit will fail if match string is not unique."
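Both failures can be caught before any write happens. A sketch of the uniqueness precondition:

```python
def unique_replace(text: str, old: str, new: str) -> str:
    """Apply an edit only if the match string identifies exactly one site."""
    count = text.count(old)
    if count == 0:
        raise ValueError("match string not found; re-Read the file")
    if count > 1:
        raise ValueError(f"match string appears {count} times; add surrounding context")
    return text.replace(old, new)
```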
Exercises
Audit an existing prompt: Take a tool-use prompt you use. Does it have <when_not_to_use> for each tool? Add them.
Design a tool hierarchy: For your specific toolset, write an explicit preference ordering with rationale.
Write a delegation protocol: What patterns should trigger delegation vs. inline handling? Encode as pattern-matching rules.
Test failure modes: Deliberately trigger each failure mode above. Does your prompt prevent them?
Further Reading
| Resource | Focus |
|---|---|
| anthropic-cookbook | Official Claude agent patterns |
| aider source | Diff-based editing prompts |
| continue.dev | Editor integration patterns |
| gptel source | gptel-agent prompt in full |
| claude-code CLAUDE.md examples | Project-specific instructions |
Conclusion
Tool use transforms an LLM from a reasoning engine into an agent. This transformation requires new architectural elements in our system prompts:
- Decision trees that encode when to use each tool
- Negative specifications that prevent over-application
- Tool hierarchies that guide selection among alternatives
- Pre-action protocols that force deliberation
- Error recovery procedures for when things fail
The key insight: tool-use prompts are decision-tree specifications. The tools themselves are secondary; what matters is the decision procedure that governs their use.