Remediation Guide

When ToolWitness flags a failure, you need to know three things: what happened, why it happened, and how to fix it. This page walks through the complete remediation workflow for each classification type.


The Remediation Workflow

See the problem → Understand it → Fix it → Verify the fix

1. See the Problem

ToolWitness surfaces failures through multiple channels:

  • Dashboard (toolwitness dashboard) — live overview with classification breakdown and recent failures
  • CLI (toolwitness check --last 10) — quick terminal check
  • HTML Report (toolwitness report --format html) — shareable artifact with full evidence
  • Alerts — webhook or Slack notifications when a failure matches your rules. Alerts contain classification metadata only (tool name, confidence) — never code, file contents, or prompts. Privacy details →

2. Understand It

Every failure includes:

  • Classification — FABRICATED, SKIPPED, or EMBELLISHED
  • Confidence score — how certain ToolWitness is about the classification (0.0 to 1.0)
  • Evidence — which values matched, which were mismatched, and what extra claims the agent made. All evidence stays in your local SQLite database — nothing is transmitted.
  • Chain context — if the failure involves data flowing between tools, the chain break is highlighted

3. Fix It

Each classification type has specific, actionable fixes. ToolWitness shows these as remediation cards in the dashboard and HTML report.

4. Verify the Fix

After applying a fix, re-run your agent (or repeat the action in your MCP host) and check:

toolwitness check --last 5

For CI pipelines, use the gate:

toolwitness check --fail-if "failure_rate > 0.05"

SKIPPED — Tool Was Never Called

The agent claimed it called a tool, but ToolWitness has no execution receipt. The tool function never ran.

Why It Happens

Root cause | Frequency
Model "knows" the answer: training data contains plausible responses, so the model skips the tool call | Common
Framework bug: tool calls silently dropped (known issues in CrewAI, LangGraph, AutoGen) | Occasional
Weak prompting: system prompt doesn't require tool use for this query type | Common
Cost optimization: some frameworks skip tool calls when the model seems "confident enough" | Rare

Fixes

Fix 1: Force tool calling (highest confidence)

Set tool_choice to require the specific tool. The model must call it.

# OpenAI Chat Completions: force a call to get_weather
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)

# Anthropic Messages API: same idea; max_tokens is a required parameter
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=messages,
    tools=tools,
    tool_choice={"type": "tool", "name": "get_weather"},
)

Effort: 1 line of code | Effectiveness: Guaranteed

Fix 2: Strengthen system prompt

You MUST call get_weather for any weather question.
Never estimate or answer from memory.

Effort: 2 minutes | Effectiveness: High for prompt-caused skips

Fix 3: Add retry logic

If the tool call is missing in the agent's response, re-prompt:

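# has_tool_call is a helper you supply; it should return True when the
# response contains a call to the named tool.
if not has_tool_call(response, "get_weather"):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[*messages, {"role": "user", "content":
            "You didn't call get_weather. Please call it now."}],
        tools=tools,
    )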

Effort: Small code change | Effectiveness: High
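
The retry snippet above relies on has_tool_call, which this page does not define. A minimal sketch of such a helper follows, assuming the OpenAI Chat Completions response shape (response.choices[0].message.tool_calls). The fake_response function is a hypothetical stand-in so the example runs without an API key; it is not part of ToolWitness or any SDK.

```python
from types import SimpleNamespace


def has_tool_call(response, tool_name):
    """Return True if the first choice of a chat completion requests `tool_name`.

    Assumes the OpenAI Chat Completions response shape:
    response.choices[0].message.tool_calls[i].function.name
    """
    message = response.choices[0].message
    # tool_calls is None (not an empty list) when the model answered in plain text.
    tool_calls = getattr(message, "tool_calls", None) or []
    return any(call.function.name == tool_name for call in tool_calls)


def fake_response(*tool_names):
    """Hypothetical stand-in that mimics the response shape for local testing."""
    calls = [SimpleNamespace(function=SimpleNamespace(name=n)) for n in tool_names]
    message = SimpleNamespace(content=None, tool_calls=calls or None)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


called = has_tool_call(fake_response("get_weather"), "get_weather")
skipped = has_tool_call(fake_response(), "get_weather")
print(called, skipped)
```

For the Anthropic Messages API the check would instead look for content blocks of type tool_use, so the helper needs a provider-specific variant.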

MCP Proxy users

If you're using the MCP Proxy (Cursor, Claude Desktop), you don't control the agent code directly. Your options:

  • Check host model settings — some MCP hosts let you select the model or adjust temperature. A lower temperature may reduce skipping.
  • Verify the MCP server is healthy — run the server command directly to confirm it responds. A non-responsive server can look like a skip.
  • Reduce the number of exposed tools — hosts are more likely to skip tool calls when many tools are available and the model decides it already "knows" the answer.
  • Report with evidence — use toolwitness check output to file a bug report with the host application, showing the SKIPPED classification and missing receipt.

FABRICATED — Agent Misrepresented Tool Output

The tool was called and returned data, but the agent's claims about the result don't match what came back. This is the most dangerous failure because the execution trace looks clean.

Why It Happens

Root cause | Frequency
Prior knowledge conflict: tool returns data that conflicts with training, model "corrects" it | Common
Context rot: as sessions grow longer, attention dilutes and the model loses track of earlier tool outputs (see Understanding Context Rot below) | Common
Lossy summarization: model summarizes complex output and introduces errors | Occasional
Multi-turn drift: data gets corrupted as it flows through the chain | Occasional

Fixes

Fix 1: Use structured output

Force JSON responses that reference specific tool fields, not free-text:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,  # JSON mode requires the word "JSON" to appear in the messages
    response_format={"type": "json_object"},
)

Effort: Moderate refactor | Effectiveness: High — constrains the model
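
JSON mode constrains the shape of the reply, not its values, so it helps most when the reply is compared field by field against the raw tool output. The sketch below shows one way to do that comparison in your own code. The function name ungrounded_fields and the sample weather data are illustrative assumptions, not ToolWitness APIs.

```python
import json


def ungrounded_fields(tool_output: dict, model_json: str) -> dict:
    """Return fields in the model's JSON reply that differ from the tool output.

    Each entry maps field name -> (actual value from the tool, value the model claimed).
    """
    claimed = json.loads(model_json)
    mismatches = {}
    for key, value in claimed.items():
        if key not in tool_output:
            # The model reported a field the tool never returned.
            mismatches[key] = (None, value)
        elif tool_output[key] != value:
            # The model changed a value the tool did return.
            mismatches[key] = (tool_output[key], value)
    return mismatches


# Made-up sample data for illustration.
tool_output = {"city": "London", "temp_c": 14.2}
faithful = ungrounded_fields(tool_output, '{"city": "London", "temp_c": 14.2}')
drifted = ungrounded_fields(
    tool_output, '{"city": "London", "temp_c": 15, "humidity": 60}'
)
print(faithful)
print(drifted)
```

An empty result means every reported field matches the tool output exactly; anything else is a candidate for the rounding or "correction" behavior described above.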

Fix 2: Add faithfulness instruction

Report the EXACT values from tool outputs.
Do not round, convert, or interpret. Quote numbers precisely.

Effort: 2 minutes | Effectiveness: Medium — models don't always follow

Fix 3: Reduce context window

Trim conversation history so only the current tool output is visible:

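# Assumes messages[0] is the system prompt and the history is longer than three messages.
# A tool result must follow the assistant message that requested it, so keep that message too.
recent_messages = [messages[0], *messages[-3:]]  # system + user + assistant tool call + tool result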

Effort: Small code change | Effectiveness: High for context-confusion cases
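
A fixed slice like the one above can cut a history in the wrong place, for example by starting on a tool result whose requesting assistant message was dropped. A slightly more defensive sketch follows, assuming OpenAI-style message dicts with a "role" key. The trim_history name and the sample conversation are hypothetical.

```python
def trim_history(messages, keep_last=3):
    """Keep all system messages plus roughly the last `keep_last` other messages.

    The window is widened backwards if it would otherwise start on a tool
    result, so every tool message stays paired with the assistant message
    that requested it.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    start = max(0, len(rest) - keep_last)
    while 0 < start < len(rest) and rest[start]["role"] == "tool":
        start -= 1
    return system + rest[start:]


# Made-up conversation: two tool-using turns.
history = [
    {"role": "system", "content": "You MUST call get_weather for weather questions."},
    {"role": "user", "content": "Weather in Paris?"},
    {"role": "assistant", "content": None, "tool_calls": ["call_1"]},
    {"role": "tool", "content": '{"temp_c": 18.0}'},
    {"role": "assistant", "content": "It is 18.0 C in Paris."},
    {"role": "user", "content": "Weather in London?"},
    {"role": "assistant", "content": None, "tool_calls": ["call_2"]},
    {"role": "tool", "content": '{"temp_c": 14.2}'},
]

trimmed = trim_history(history, keep_last=3)
print([m["role"] for m in trimmed])
```

Only the London turn survives the trim, so the Paris reading can no longer be confused with it.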

Fix 4: Break up complex tasks

Instead of one agent doing 5 tool calls, chain smaller agents:

customer_agent = Agent(tools=[get_customer])
orders_agent = Agent(tools=[get_orders])

Effort: Architecture change | Effectiveness: High but more effort

MCP Proxy users

If you're using the MCP Proxy, you have limited control over how the host agent processes tool results. Your options:

  • Adjust host system prompt — some MCP hosts (e.g., Cursor rules, Claude Desktop system prompts) let you add faithfulness instructions. Add: "Report exact values from tool outputs."
  • Reduce tools per session — expose fewer MCP tools to reduce context pressure. Fabrication increases when the model juggles many tool results.
  • Use the evidence to evaluate hosts — if one host consistently fabricates while another doesn't, that's a meaningful signal for host selection. ToolWitness gives you the data to compare.
  • Report with evidence — use toolwitness check or toolwitness report --format html to document fabrication patterns and share them with the host application team.

EMBELLISHED — Agent Added Ungrounded Claims

The agent accurately reported the tool output but added claims that didn't come from any tool. Example: tool returned temperature data, agent added "It's a lovely day for a walk in Hyde Park."

Why It Happens

The model is doing what LLMs do — generating contextually plausible text. This isn't always wrong.

Fixes (Domain-Dependent)

Domain | Action | Config
High-stakes (financial, medical, legal) | Tighten prompt: require strict faithfulness | embellishment_alert: true
Conversational (chatbot, assistant) | Accept it; users prefer natural responses | embellishment_alert: false
Mixed | Alert but don't count as failure | embellishment_alert: true, embellishment_severity: info

For high-stakes domains:

Only report data that came directly from tool outputs.
Do not add context, opinions, or suggestions unless
explicitly asked.

MCP Proxy users

Embellishment guidance is the same regardless of integration path — it depends on your domain, not your tooling. If your MCP host lets you configure system prompts or rules (e.g., Cursor rules files), add faithfulness instructions there. If not, evaluate whether the embellishment is acceptable for your use case.


Action Buttons

The dashboard failure detail page includes action buttons:

Button | What It Does
Mark False Positive | Flags this verification as incorrect. Feeds the false-positive corpus used to improve classification accuracy.
Create Issue | Opens a pre-filled GitHub issue with the failure details, classification, and evidence.
Add to Test Suite (planned) | Will save the failure as a replay fixture for regression testing. Coming in a future release.

Understanding Context Rot

Context rot is the silent degradation of LLM accuracy as the context window fills up. It's the single most common root cause of FABRICATED classifications in long-running agent sessions. The information isn't wrong or missing — the model just pays less attention to it.

Why it causes fabrication

Transformer attention is distributed unevenly. Tokens near the beginning and end of the context window receive more focus; information in the middle gets less. The "Lost in the Middle" study (arXiv:2307.03172) found that the same facts placed at position 1 in retrieved context yield about 75% accuracy, but at position 10 accuracy drops to about 55%. The difference comes entirely from position, not content quality.

In practice, this means: after 5–10 tool calls in a session, earlier tool outputs get pushed into the low-attention middle zone. The agent can see the tool was called (the message is there), but can't effectively attend to the actual data. So it fills in from training knowledge — or confuses data across different tool calls.

This is why ToolWitness testing found 100% fabrication rate when agents were tested with 5 sequential tool calls (overloaded context), but 0% fabrication with a single tool call (clean, short context). Same tools, same data, same agent — the only variable was context length.

Three causes of context rot

Cause | What happens
Attention dilution | At 100K tokens the model tracks ~10 billion pairwise relationships. Attention spreads thinner as context grows.
Noise scaling | Redundancy, loose associations, and subtle contradictions compound faster than useful signal.
Positional bias | The "lost-in-the-middle" problem: models perform best when relevant data sits at the very start or very end.
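
The ~10 billion figure in the table follows from self-attention scoring every token against every other token, so the number of pairs grows with the square of the context length. A quick check of that arithmetic, as a rough illustration only (causal masking roughly halves the count, and real models add further optimizations):

```python
def pairwise_relationships(context_tokens: int) -> int:
    """Number of (query, key) token pairs in full self-attention: n squared."""
    return context_tokens ** 2


for n in (1_000, 10_000, 100_000):
    pairs = pairwise_relationships(n)
    print(f"{n:>7,} tokens -> {pairs:>15,} pairs")
```

A 10x longer context means 100x more pairs competing for the same attention budget, which is why dilution gets worse faster than the token count suggests.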

What you can do about it

These fixes complement the FABRICATED remediation steps above:

Strategy | How it helps | Effort
Trim conversation history | Keep only the current tool result visible; archive earlier turns | Low
Break into sub-agents | Each sub-agent gets a clean context with only its tools | Medium
Chunk long tool outputs | Process results in smaller pieces instead of one massive return | Medium
Monitor context length | Track token count per session; correlate with failure rate | Low
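
The "Monitor context length" strategy can be approximated with a small script in your own code. The sketch below groups sessions into token-count buckets and reports the failure rate per bucket. It assumes you already record a token count and a pass/fail flag per session; the function name and the sample data are made up for illustration and are not ToolWitness output.

```python
from collections import defaultdict


def failure_rate_by_bucket(sessions, bucket_size=10_000):
    """Group (token_count, failed) pairs into buckets and return failure rate per bucket.

    Keys are the lower bound of each bucket, e.g. 10000 covers 10,000 to 19,999 tokens.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [session count, failure count]
    for tokens, failed in sessions:
        bucket = (tokens // bucket_size) * bucket_size
        totals[bucket][0] += 1
        totals[bucket][1] += int(failed)
    return {bucket: fails / count for bucket, (count, fails) in sorted(totals.items())}


# Made-up sample data: (tokens in context, did the session contain a failure?)
sessions = [
    (2_000, False), (4_500, False),
    (18_000, False), (19_000, True),
    (41_000, True), (47_000, True),
]

rates = failure_rate_by_bucket(sessions)
for bucket, rate in rates.items():
    print(f"{bucket:>6,}+ tokens: {rate:.0%} failure rate")
```

If the rate climbs with the bucket, context length is a likely contributor and the trimming or sub-agent strategies above are worth trying first.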

The pattern: fabrication is not random misbehavior. It's a predictable consequence of how attention works in transformers. Shorter contexts produce more faithful responses. ToolWitness detects the symptoms; managing context length prevents the cause.


CI Integration

Add ToolWitness as a CI gate to prevent regressions:

# GitHub Actions example
- name: Check for fabrications
  run: |
    toolwitness check --fail-if "fabricated_count > 0"

- name: Check failure rate
  run: |
    toolwitness check --fail-if "failure_rate > 0.05"

The check command exits with code 1 when the condition is met, failing the build.
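
To make the gate semantics concrete, here is a minimal sketch of how a condition like failure_rate > 0.05 maps to an exit code. It illustrates the behavior described above and is not ToolWitness's implementation. The function name is hypothetical, and treating a run with zero checks as passing is an assumption.

```python
def gate_exit_code(total_checks: int, failures: int, max_failure_rate: float = 0.05) -> int:
    """Return 1 (fail the build) when the failure rate exceeds the threshold, else 0."""
    if total_checks == 0:
        return 0  # assumption: nothing to verify counts as passing
    failure_rate = failures / total_checks
    # Strictly greater than: a rate exactly at the threshold still passes.
    return 1 if failure_rate > max_failure_rate else 0


print(gate_exit_code(100, 5))   # rate is exactly 0.05
print(gate_exit_code(100, 6))   # rate is 0.06
```

Because the comparison is strict, 5 failures in 100 checks passes and 6 fails, which matters when choosing a threshold for small test suites.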


Next Steps

  • Gallery — see the dashboard and reports in action
  • Testing Results — how we validated ToolWitness catches real fabrication
  • How It Works — understand the verification engine