Introduction

The system message shapes what an LLM is and how it thinks. But modern agents also need to act: to read files, search codebases, execute commands, and modify the world. This tutorial explores how to teach an LLM to use tools effectively.

Tool use represents a phase transition in LLM interaction. Without tools, the LLM is a pure reasoning engine, transforming input tokens into output tokens. With tools, it becomes an agent: capable of perception (reading), planning (deciding which tools to use), and action (invoking them). This shift requires new architectural thinking in our system prompts.

The Anatomy of Tool-Use Prompts

What Changes with Tools

A system prompt for a tool-using agent must address three additional concerns beyond identity and behavior:

  1. Tool Recognition: How does the LLM recognize when a tool applies?
  2. Tool Selection: How does it choose which tool among alternatives?
  3. Tool Composition: How do tools combine for multi-step work?

These are decision-theoretic problems, not just behavioral constraints. The prompt must encode a decision procedure, not merely a persona.

The Decision Tree Pattern

Effective tool-use prompts embed decision trees. Before any action, the agent runs a mental checklist:

Before ANY action:
1. Is this multi-step? → Plan first (use TodoWrite)
2. Should I delegate? → Use specialized agent
3. Do I need information? → Search/Read first
4. Am I ready to act? → Proceed with appropriate tool

This “pre-flight checklist” pattern appears in both gptel-agent and opencode. It forces deliberation before action, reducing impulsive tool misuse.
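The same checklist can also be enforced in the agent loop itself, not only in the prompt. A minimal sketch in Python, with a hypothetical `Task` structure standing in for whatever state your agent tracks:

```python
from dataclasses import dataclass

@dataclass
class Task:
    steps: int            # how many distinct actions this will take
    has_plan: bool        # has a todo list been written yet?
    open_ended: bool      # is the scope uncertain enough to delegate?
    needs_context: bool   # must we Search/Read before acting?

def preflight(task: Task) -> str:
    """The four checklist questions, asked in order before any tool call."""
    if task.steps > 1 and not task.has_plan:
        return "plan"      # 1. Multi-step? Plan first (a TodoWrite-style tool)
    if task.open_ended:
        return "delegate"  # 2. Uncertain scope? Hand off to a specialized agent
    if task.needs_context:
        return "gather"    # 3. Missing information? Search/Read first
    return "act"           # 4. Ready: proceed with the appropriate tool

# e.g. preflight(Task(steps=3, has_plan=False, open_ended=False,
#                     needs_context=True)) returns "plan"
```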

The Tool Hierarchy Pattern

When multiple tools could accomplish a task, which should the agent prefer? Effective prompts establish explicit hierarchies:

Specialized tool > General tool > Shell escape

Specifically:
  Read    > cat/head/tail
  Grep    > grep/rg (shell)
  Glob    > find/ls
  Edit    > sed/awk
  Write   > echo/heredocs

The rationale: specialized tools provide structured output and better error handling, and they integrate with the agent's planning. Shell commands are an escape hatch, not a default.
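One way to make the hierarchy concrete in an agent harness is to store the preference order as data and resolve it at call time. A sketch using the tool names above; the `bash:` fallback naming is purely illustrative:

```python
# Preference order per operation; shell is the explicit last resort.
PREFERRED = {
    "read_file":      ["Read",  "bash:cat"],
    "search_content": ["Grep",  "bash:grep"],
    "find_files":     ["Glob",  "bash:find"],
    "edit_file":      ["Edit",  "bash:sed"],
    "write_file":     ["Write", "bash:echo"],
}

def resolve_tool(operation: str, available: set) -> str:
    """Return the highest-preference tool available for an operation."""
    for tool in PREFERRED.get(operation, []):
        if tool in available:
            return tool
    raise LookupError(f"no tool available for {operation!r}")

# e.g. resolve_tool("read_file", {"Grep", "bash:cat"}) falls back to "bash:cat"
```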

Documenting Individual Tools

The Consistent Schema

Each tool needs documentation that answers the same questions. Inconsistent documentation forces the LLM to infer structure, increasing errors.

<tool name="ToolName">
  <purpose>
    What the tool does in one sentence.
  </purpose>

  <when_to_use>
    - Condition A
    - Condition B
    - Pattern: "user says X" → use this tool
  </when_to_use>

  <when_not_to_use>
    - Condition C → use Y instead
    - Condition D → delegate to Z
    - Anti-pattern: never use for X
  </when_not_to_use>

  <how_to_use>
    - Required parameters
    - Optional parameters
    - Common patterns
    - Constraints (e.g., "must Read before Edit")
  </how_to_use>

  <examples>
    - Example invocation 1
    - Example invocation 2
  </examples>
</tool>
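
Consistency is easiest to guarantee when the catalog is generated rather than hand-written. A sketch (the `ToolDoc` structure is hypothetical, not from any cited project) that renders every tool through one template, so no section can be silently omitted:

```python
from dataclasses import dataclass, field

@dataclass
class ToolDoc:
    name: str
    purpose: str
    when_to_use: list = field(default_factory=list)
    when_not_to_use: list = field(default_factory=list)
    how_to_use: list = field(default_factory=list)

def render(doc: ToolDoc) -> str:
    """Render one tool through the fixed schema; every section always appears."""
    def items(lines):
        return "\n".join(f"    - {line}" for line in lines) or "    - (none)"
    return (
        f'<tool name="{doc.name}">\n'
        f"  <purpose>{doc.purpose}</purpose>\n"
        f"  <when_to_use>\n{items(doc.when_to_use)}\n  </when_to_use>\n"
        f"  <when_not_to_use>\n{items(doc.when_not_to_use)}\n  </when_not_to_use>\n"
        f"  <how_to_use>\n{items(doc.how_to_use)}\n  </how_to_use>\n"
        f"</tool>"
    )
```

Rendering an empty section as "(none)" makes the omission visible to reviewers instead of hiding it.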

The “When NOT” Principle

The <when_not_to_use> section is often more valuable than <when_to_use>. LLMs tend to over-apply tools; explicit prohibitions correct this bias.

Compare:

# Weak
Use Grep to search file contents.

# Strong
Use Grep for ONE specific, well-defined pattern when you know what you're
looking for. Do NOT use Grep for exploratory searches (delegate to researcher),
when you expect 20+ matches (delegate), or to find files by name (use Glob).

The strong version encodes decision boundaries, not just capabilities.

Case Study: Two Approaches

gptel-agent: Structured XML

gptel-agent uses hierarchical XML with explicit tags for each concern:

<role_and_behavior>
  <response_tone>...</response_tone>
  <critical_thinking>...</critical_thinking>
</role_and_behavior>

<task_execution_protocol>
  Before starting ANY task, run this mental checklist:
  1. Is this multi-step work? → CREATE A TODO LIST
  2. Does this task need delegation? → ...
</task_execution_protocol>

<tool_usage_policy>
  <tool name="Grep">
    When to use: ...
    When NOT to use: ...
    How to use: ...
  </tool>
</tool_usage_policy>

Overall Architecture: This prompt follows a hierarchical instruction pattern with three major conceptual layers:

  1. Identity & Behavioral Constraints (<role_and_behavior>)
  2. Decision Framework (<task_execution_protocol>)
  3. Tool Catalog (<tool_usage_policy>)

The architectural thinking here is defensive programming for LLMs — anticipating failure modes and explicitly blocking them.

Strengths

  1. Negative examples are explicit: Each tool has “When NOT to use” — this is crucial. LLMs tend to over-apply tools; explicit prohibitions help.

  2. Decision trees for delegation: The protocol doesn’t just list tools but provides pattern-matching heuristics: “if you’re about to grep and aren’t sure what you’ll find → delegate.”

  3. Hierarchy enforcement: “NEVER use Bash for file operations” — absolute rules prevent drift.

  4. Consistency in tool blocks: Every <tool> has the same structure: When to use / When NOT to use / How to use. This regularity aids comprehension.

Weaknesses

  1. Redundancy: The delegation rules appear in both <task_execution_protocol> and repeated inside <tool name="Agent">. This inflates context and risks inconsistency.

  2. No explicit error handling: What should the agent do when a tool fails? The prompt says “if errors occur, keep as in_progress” but doesn’t give recovery strategies.

  3. Magic numbers: “3+ files → delegate”, “5+ tool calls → delegate”. These thresholds are arbitrary and unexplained. Why not 2? Why not 4?

  4. Missing conceptual model: The prompt tells what to do but not why the tool hierarchy exists. A brief explanation (“specialized tools provide structured output, Bash is a fallback”) would improve generalization.

  5. Template placeholder: {{AGENTS}} suggests dynamic injection but the prompt doesn’t explain what agents exist — the LLM may hallucinate agent names.

Architectural Principles We Can Extract

  1. The “When NOT” Pattern: For every capability, explicitly state anti-patterns. This is more valuable than positive instructions.

     When to use X:
       • condition A
       • condition B

     When NOT to use X:
       • condition C → use Y instead
       • condition D → delegate to Z

  2. Pre-Action Protocol: Force a decision checkpoint before tool invocation:

     Before ANY action:
     1. Is this multi-step? → plan first
     2. Is this delegatable? → delegate
     3. Is this within scope? → proceed

  3. Tool Hierarchy with Fallback: Establish a preference order:

     Specialized tool > General tool > Shell escape

  4. Consistent Documentation Schema: Every tool should be documented with identical sections. This aids both human maintenance and LLM comprehension.

Suggested XML Schema for Your Project

For developing your own system prompt with tools, consider this structure:

<collaborator_identity>
  <relationship>     — how you relate to the user
  <epistemics>       — your stance on knowledge/uncertainty
  <expression>       — tone, style constraints
</collaborator_identity>

<deliberation_protocol>
  <before_action>    — checklist before doing anything
  <on_uncertainty>   — what to do when unsure
  <on_failure>       — recovery strategies
</deliberation_protocol>

<capabilities>
  <capability name="X">
    <purpose>        — what it's for
    <anti-patterns>  — when NOT to use
    <usage>          — how to invoke correctly
    <examples>       — concrete cases
  </capability>
</capabilities>

What Can Be Generalized

The core insight: tool-use prompts are really decision-tree specifications. The tools themselves are secondary; what matters is:

  1. Recognition patterns: How does the LLM recognize which tool applies?
  2. Exclusion rules: How does it avoid misapplication?
  3. Composition rules: How do tools combine for multi-step work?

For any tool set, you need to answer these three questions explicitly.

opencode: Prose with Headers

opencode uses markdown headers with flowing prose:

# Tone and style
You should be concise, direct, and to the point...

# Following conventions
When making changes to files, first understand the file's code conventions...

# Tool usage policy
When doing file search, prefer to use the Task tool...

Structural Comparison with gptel-agent

| Aspect       | gptel-agent                     | opencode                               |
|--------------|---------------------------------|----------------------------------------|
| Format       | YAML frontmatter + XML tags     | Pure prose with markdown headers       |
| Organization | Hierarchical XML nesting        | Flat sections with # headers           |
| Tool docs    | Per-tool <tool name="X"> blocks | Brief policy paragraph                 |
| Tone         | Neutral, technical              | Anthropomorphized (“You are opencode”) |
| Length       | ~400 lines                      | ~150 lines                             |

Architectural Analysis

What opencode Does Differently

  1. Security-first preamble: Opens with malicious code detection, which is absent from gptel-agent:

     “If it seems malicious, refuse to work on it”

  2. Output token minimization: An explicit instruction to minimize tokens, a cost/latency concern:

     “You should minimize output tokens as much as possible”

  3. Anti-pattern repetition: Key rules are repeated with “IMPORTANT:” markers, brute-force emphasis:

     IMPORTANT: You should NOT answer with unnecessary preamble or postamble

  4. Proactiveness spectrum: An explicit philosophy about agent autonomy:

     “Strike a balance between doing the right thing… and not surprising the user”

Sections Breakdown

| Section               | Purpose                                                |
|-----------------------|--------------------------------------------------------|
| Identity              | “You are opencode, an assistant running within Emacs”  |
| Security              | Malicious code detection, URL restrictions             |
| Tone and style        | Conciseness, markdown, no emojis                       |
| Proactiveness         | Autonomy boundaries                                    |
| Following conventions | Code style mimicry, library verification               |
| Code style            | “DO NOT ADD COMMENTS”                                  |
| Task Management       | TodoWrite emphasis                                     |
| Doing tasks           | Workflow: search → implement → verify → lint           |
| Tool usage policy     | Batching, parallelism, Task delegation                 |
| Code References       | Output format for file references                      |

Strengths

  1. Workflow-oriented: Describes a complete task lifecycle (plan → search → implement → verify → lint)

  2. Convention awareness: “First look at existing components” — teaches the LLM to learn from context

  3. Commit discipline: “NEVER commit unless explicitly asked” — prevents a common footgun

  4. Multiple preset tiers: opencode, opencode-coding, opencode-minimal, opencode-general — different tool bundles for different contexts

Weaknesses

  1. No tool documentation: Unlike gptel-agent, individual tools have no When/When-NOT/How sections. The LLM must infer usage.

  2. Repetition as emphasis: “IMPORTANT:” appears 6 times — this inflates the prompt without structured information.

  3. Implicit tool hierarchy: “prefer to use the Task tool to reduce context” — but no explicit hierarchy like gptel-agent’s “Specialized > General > Shell”.

  4. Prose over structure: Harder to parse programmatically; harder to maintain; harder for the LLM to reference specific rules.

Synthesis

The ideal approach combines:

  • XML structure from gptel-agent (hard boundaries, consistent schema)
  • Workflow orientation from opencode (task lifecycle, convention awareness)
  • Decision trees that encode when and when-not for each capability

Architectural Principles

1. Decision Trees Over Capability Lists

Don’t just list what tools can do. Encode when to use them:

Pattern matching for delegation:
- "how does...", "where is...", "find all..." → researcher agent
- "create/modify these files..." → executor agent
- "I need to understand..." about Emacs → introspector agent

2. Negative Specification

For every capability, specify anti-patterns:

<when_not_to_use>
  - Building code understanding → delegate to researcher
  - Expected 20+ matches → delegate to researcher
  - Will need follow-up searches → delegate to researcher
  - Searching for files by name → use Glob instead
</when_not_to_use>

3. Pre-Action Protocols

Force deliberation before action:

<task_execution_protocol>
  Before ANY action:
  1. Is this multi-step? → Plan first
  2. Do I need to read first? → Read before edit
  3. Should I delegate? → Use sub-agent
  4. Am I certain? → Proceed
</task_execution_protocol>

4. Tool Hierarchies with Fallbacks

Establish explicit preferences:

<tool_hierarchy>
  File search by name → Glob (NOT find/ls)
  Content search → Grep (NOT grep/rg shell)
  Read files → Read (NOT cat/head/tail)
  Edit files → Edit (NOT sed/awk)
  Write files → Write (NOT echo/heredocs)
  System operations → Bash (for git, npm, docker only)
</tool_hierarchy>

5. Cost and Context Awareness

Modern prompts increasingly address resource constraints:

<context_management>
  - Delegate to reduce context usage when exploring
  - Batch independent tool calls in single response
  - Use specialized tools (structured output) over shell (text parsing)
  - Consider: will this bloat context? → delegate to executor
</context_management>
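
A fragment of this can be checked mechanically before a search is issued. A trivial sketch, with the threshold mirroring the “20+ matches” rule used elsewhere in this tutorial (the number is as arbitrary here as it is there):

```python
def should_delegate_search(expected_matches: int, followups_likely: bool,
                           threshold: int = 20) -> bool:
    """Large or open-ended searches go to a sub-agent, not into main context."""
    return expected_matches >= threshold or followups_likely
```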

6. Error Recovery

What happens when tools fail? Most prompts ignore this:

<error_recovery>
  When a tool fails:
  1. Read the error message carefully
  2. Diagnose: wrong parameters? precondition violated? system issue?
  3. If precondition (e.g., file doesn't exist): address precondition first
  4. If parameters: correct and retry
  5. If system issue: report to user, suggest alternatives
  Do NOT retry the same failing call repeatedly.
</error_recovery>
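
The same protocol can be expressed as a wrapper around tool invocation. The `correct_args` hook is hypothetical; the point is the bounded loop and the rule against repeating an identical failing call:

```python
def invoke_with_recovery(tool, args: dict, correct_args, max_attempts: int = 3):
    """Sketch of the recovery protocol. `correct_args(args, err)` is a
    caller-supplied hook that diagnoses the error and returns corrected
    parameters, or None when the failure is unrecoverable."""
    last_message = None
    for _ in range(max_attempts):
        try:
            return tool(**args)
        except Exception as err:
            message = str(err)
            if message == last_message:
                break                           # identical failure: do NOT retry
            last_message = message
            new_args = correct_args(args, err)  # diagnose: params? precondition?
            if new_args is None:
                break                           # system issue: report to the user
            args = new_args
    raise RuntimeError(f"tool call failed: {last_message}")
```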

An XML Schema for Tool-Use Prompts

Combining the principles above, here is a schema for tool-use system prompts:

<!-- Identity layer (from system-prompt tutorial) -->
<role>...</role>
<collaboration_stance>...</collaboration_stance>
<behavioral_attractors>...</behavioral_attractors>
<epistemic_hygiene>...</epistemic_hygiene>
<priority_rules>...</priority_rules>

<!-- Tool-use layer (new) -->
<task_execution_protocol>
  <pre_action_checklist>
    Before ANY action:
    1. Multi-step? → Plan with TodoWrite
    2. Need information? → Search/Read first
    3. Delegate? → Use appropriate agent
    4. Ready? → Proceed
  </pre_action_checklist>

  <delegation_rules>
    - Pattern X → agent Y
    - Pattern Z → handle inline
  </delegation_rules>
</task_execution_protocol>

<tool_hierarchy>
  Specialized > General > Shell escape
  [specific mappings]
</tool_hierarchy>

<tool_catalog>
  <tool name="ToolA">
    <purpose>...</purpose>
    <when_to_use>...</when_to_use>
    <when_not_to_use>...</when_not_to_use>
    <how_to_use>...</how_to_use>
  </tool>
  <!-- repeat for each tool -->
</tool_catalog>

<error_recovery>
  [recovery protocol]
</error_recovery>

<context_management>
  [cost/context awareness rules]
</context_management>

The Ecosystem: Patterns from the Community

The AGENTS.md / CLAUDE.md Pattern

A convention emerging from claude-code: place a file in the repository root (CLAUDE.md, AGENTS.md, .cursorrules) containing project-specific instructions. This separates:

  • Generic agent behavior → in system prompt
  • Project-specific conventions → in repo file

The agent is instructed to read this file at session start and follow its directives.
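A sketch of that session-start step: look for an instructions file under the conventional names and splice it into the system prompt. The file names come from the convention above; the prompt assembly itself is illustrative:

```python
from pathlib import Path

CONVENTION_FILES = ["CLAUDE.md", "AGENTS.md", ".cursorrules"]

def project_instructions(repo_root: str) -> str:
    """Return the first project instructions file found, or an empty string."""
    for name in CONVENTION_FILES:
        path = Path(repo_root) / name
        if path.is_file():
            return path.read_text(encoding="utf-8")
    return ""

def build_system_prompt(generic: str, repo_root: str) -> str:
    extra = project_instructions(repo_root)
    if not extra:
        return generic
    return f"{generic}\n\n<project_instructions>\n{extra}\n</project_instructions>"
```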

The Diff-Based Editing Pattern

Tools like aider instruct the LLM to produce unified diffs rather than full file contents:

When editing files, output a unified diff:
--- a/path/to/file.py
+++ b/path/to/file.py
@@ -10,3 +10,4 @@
 existing line
+new line
 existing line

This reduces token usage and makes changes reviewable.
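
Python's standard library emits the same format, which is handy for building few-shot examples or for checking what a diff should look like against a known edit. A minimal sketch:

```python
import difflib

old = ["existing line\n", "another line\n"]
new = ["existing line\n", "new line\n", "another line\n"]

diff = difflib.unified_diff(old, new,
                            fromfile="a/path/to/file.py",
                            tofile="b/path/to/file.py")
print("".join(diff))
# Only the changed hunk is emitted, not the whole file: that is the token saving.
```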

The Read-Before-Edit Pattern

Nearly universal: require reading a file before editing it.

MUST Read the file before using Edit.
The Edit tool will error if you haven't read the file first.

This prevents edits based on stale or hallucinated content.
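
The precondition can also be enforced mechanically in the tool layer, rather than trusted to the prompt alone. A hypothetical sketch (not the claude-code implementation):

```python
class ToolLayer:
    """Track which files have been Read; refuse Edit on anything else."""

    def __init__(self):
        self._read_files = set()

    def read(self, path: str) -> str:
        with open(path, encoding="utf-8") as f:
            content = f.read()
        self._read_files.add(path)
        return content

    def edit(self, path: str, old: str, new: str) -> None:
        if path not in self._read_files:
            raise PermissionError(f"must Read {path} before Edit")
        content = self.read(path)              # re-read: never edit stale text
        if content.count(old) != 1:
            raise ValueError("match string must be unique in the file")
        with open(path, "w", encoding="utf-8") as f:
            f.write(content.replace(old, new))
```

The uniqueness check doubles as a guard against the non-unique match strings discussed under Edit Failures below.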

The Verify-After-Change Pattern

From opencode:

After implementing changes:
1. Run lint command if available
2. Run typecheck if available
3. Run relevant tests
4. Do NOT commit unless explicitly asked

This closes the loop: act → verify → report.
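
As code, the loop might look like the sketch below. The commands are hypothetical placeholders; a real agent would discover them from the project (package.json scripts, a Makefile, etc.):

```python
import subprocess

# Hypothetical project commands; discover these from the repo in practice.
VERIFY_STEPS = [
    ("lint",      ["npm", "run", "lint"]),
    ("typecheck", ["npm", "run", "typecheck"]),
    ("tests",     ["npm", "test"]),
]

def verify_changes() -> list:
    """Run each verification step and collect failures. Never commits."""
    failures = []
    for name, cmd in VERIFY_STEPS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((name, result.stderr.strip()))
    return failures
```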

Failure Modes in Tool Use

Over-Application

The LLM uses a tool when it shouldn’t.

Cause: Tool documentation emphasizes capability without boundaries. Fix: Strong <when_not_to_use> sections.

Under-Application

The LLM describes what it would do instead of doing it.

Cause: Insufficient emphasis on action; too much “assistant” framing. Fix: “You have tools. Use them. Don’t describe what you would do—do it.”

Wrong Tool Selection

The LLM uses Bash when Grep would work; uses Grep when delegation is appropriate.

Cause: No tool hierarchy; no decision procedure. Fix: Explicit hierarchy and pattern-matching rules.

Context Explosion

The LLM fills context with search results instead of delegating.

Cause: No cost awareness; no delegation rules. Fix: “If you expect 20+ results, delegate to researcher.”

Edit Failures

The LLM tries to edit files it hasn’t read, or provides non-unique match strings.

Cause: Preconditions not enforced in prompt. Fix: “MUST Read before Edit. Edit will fail if match string is not unique.”

Exercises

  1. Audit an existing prompt: Take a tool-use prompt you use. Does it have <when_not_to_use> for each tool? Add them.

  2. Design a tool hierarchy: For your specific toolset, write an explicit preference ordering with rationale.

  3. Write a delegation protocol: What patterns should trigger delegation vs. inline handling? Encode as pattern-matching rules.

  4. Test failure modes: Deliberately trigger each failure mode above. Does your prompt prevent them?

Further Reading

| Resource                        | Focus                          |
|---------------------------------|--------------------------------|
| anthropic-cookbook              | Official Claude agent patterns |
| aider source                    | Diff-based editing prompts     |
| continue.dev                    | Editor integration patterns    |
| gptel source                    | gptel-agent prompt in full     |
| claude-code CLAUDE.md examples  | Project-specific instructions  |

Conclusion

Tool use transforms an LLM from a reasoning engine into an agent. This transformation requires new architectural elements in our system prompts:

  • Decision trees that encode when to use each tool
  • Negative specifications that prevent over-application
  • Tool hierarchies that guide selection among alternatives
  • Pre-action protocols that force deliberation
  • Error recovery procedures for when things fail

The key insight: tool-use prompts are decision-tree specifications. The tools themselves are secondary; what matters is the decision procedure that governs their use.