Gallery
See what ToolWitness looks like in action. Every surface is built into the open-source package — no account, no cloud, no cost.
Dashboard Overview
100% local — nothing leaves your machine
The dashboard is a local HTTP server that reads from your SQLite database. No cloud service, no account, no data transmitted anywhere. Run toolwitness dashboard, open http://localhost:8321 in your browser, and press Ctrl+C when you're done. It follows the same pattern as TensorBoard or mkdocs serve.
Run toolwitness dashboard and open localhost:8321 to see the live dashboard. It auto-refreshes every 5 seconds.
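The "local HTTP server over SQLite" pattern can be sketched with the Python standard library alone. This is a minimal illustration, not ToolWitness's actual implementation: the database path, the `verifications` table, and its columns are assumptions made for the example.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

DB_PATH = "toolwitness.db"  # hypothetical path; the real location may differ


def load_recent(conn: sqlite3.Connection, limit: int = 10) -> list[dict]:
    """Return the newest rows from a hypothetical `verifications` table."""
    rows = conn.execute(
        "SELECT tool, classification, confidence FROM verifications "
        "ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [
        {"tool": tool, "classification": cls, "confidence": conf}
        for tool, cls, conf in rows
    ]


class DashboardHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Open a short-lived connection per request and answer with JSON.
        conn = sqlite3.connect(DB_PATH)
        try:
            body = json.dumps(load_recent(conn)).encode("utf-8")
        finally:
            conn.close()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8321) -> None:
    # Binding to 127.0.0.1 keeps the server reachable only from this machine.
    server = HTTPServer(("127.0.0.1", port), DashboardHandler)
    try:
        server.serve_forever()
    except KeyboardInterrupt:  # Ctrl+C stops the server
        pass
    finally:
        server.server_close()
```

Calling `serve()` would block until Ctrl+C. A browser page polling this endpoint on a timer would give the same auto-refresh effect described above.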
What you see:
- KPI cards: total verifications, failure rate (color-coded: green below 5%, yellow below 15%, red otherwise), verified count, failure count
- Classification breakdown: horizontal bars showing the distribution across all five classifications (Verified, Embellished, Fabricated, Skipped, Unmonitored)
- Per-tool failure rates: ranked table showing which tools fail most often
- Recent verifications: live feed of the latest tool verification results with classification badges
┌─────────────────────────────────────────────────────────────┐
│  ToolWitness Dashboard                            Last 24h  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Verifications  Failure Rate   Verified       Failures      │
│  18             33.3%          10             6             │
│                                                             │
│  Classification Breakdown          Per-Tool Failure Rates   │
│  ■■■■■■■■■ Verified    10 (56%)    send_email       50.0%   │
│  ■■        Embellished  2 (11%)    get_weather      33.3%   │
│  ■■■■      Fabricated   4 (22%)    check_coverage  100.0%   │
│  ■■        Skipped      2 (11%)    get_stock_price  50.0%   │
│                                                             │
│  Recent Verifications                                       │
│  get_customer  VERIFIED    0.97  session: a1b2c3d4          │
│  send_email    FABRICATED  0.89  session: a1b2c3d4          │
│  get_weather   FABRICATED  0.92  session: e5f6g7h8          │
└─────────────────────────────────────────────────────────────┘
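The failure-rate card and its color thresholds can be reproduced in a few lines. This is a sketch under one assumption: Fabricated and Skipped are the classifications counted as failures, which is inferred from the mock-up (4 + 2 = 6 failures out of 18) rather than stated by ToolWitness. The function names are illustrative.

```python
from collections import Counter

# Assumption: these two classifications count as failures, which matches
# the mock-up above (4 Fabricated + 2 Skipped = 6 failures).
FAILURE_CLASSES = {"FABRICATED", "SKIPPED"}


def failure_rate(classifications: list[str]) -> float:
    """Percentage of verifications whose classification is a failure."""
    if not classifications:
        return 0.0
    counts = Counter(classifications)
    failures = sum(counts[c] for c in FAILURE_CLASSES)
    return 100.0 * failures / len(classifications)


def rate_color(rate_percent: float) -> str:
    """Map a failure rate to the KPI card color thresholds."""
    if rate_percent < 5:
        return "green"
    if rate_percent < 15:
        return "yellow"
    return "red"


sample = (
    ["VERIFIED"] * 10 + ["EMBELLISHED"] * 2 + ["FABRICATED"] * 4 + ["SKIPPED"] * 2
)
rate = failure_rate(sample)
print(f"{rate:.1f}% -> {rate_color(rate)}")  # 33.3% -> red
```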
Session Timeline (the "aha moment")
The session timeline shows every tool call as a color-coded node with arrows showing data flow. Chain breaks — where data gets corrupted between steps — are immediately visible.
- Green (✓) = Verified
- Yellow (⚠) = Embellished
- Red (✗) = Fabricated
- Gray (⊘) = Skipped
A developer runs their agent, opens the dashboard, and instantly sees which steps were trustworthy and which weren't — without reading any logs.
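To make the legend concrete, here is a small text-only rendering of a session as symbol-tagged nodes joined by arrows. The `render_timeline` helper and its input shape are hypothetical; the real dashboard draws this graphically.

```python
# Symbol and color per classification, taken from the legend above.
STYLE = {
    "VERIFIED": ("✓", "green"),
    "EMBELLISHED": ("⚠", "yellow"),
    "FABRICATED": ("✗", "red"),
    "SKIPPED": ("⊘", "gray"),
}


def render_timeline(steps: list[tuple[str, str]]) -> str:
    """Join (tool, classification) steps into one arrow-separated line."""
    nodes = []
    for tool, classification in steps:
        symbol, _color = STYLE[classification]  # color is unused in plain text
        nodes.append(f"[{symbol} {tool}]")
    return " → ".join(nodes)


print(render_timeline([("get_customer", "VERIFIED"), ("send_email", "FABRICATED")]))
# [✓ get_customer] → [✗ send_email]
```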
Failure Detail Cards
Click any failure in the dashboard, or view the same cards in the HTML report. Each card shows:
- Classification badge with confidence score
- Evidence breakdown — which values matched, which were mismatched, and what extra claims the agent made
- Remediation suggestions — actionable fixes with code examples (see Remediation)
┌─────────────────────────────────────────────────────────────┐
│  ✗ send_email    FABRICATED               confidence: 0.89  │
│                                                             │
│  Matched:     sent ✓                                        │
│  Mismatched:  balance: expected 5000, found 8000            │
│  Chain break: get_customer → send_email (balance mutated)   │
│                                                             │
│  Suggested Fixes                                            │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 1. Use structured output - 1 line / High                │ │
│ │    Force JSON responses referencing specific tool       │ │
│ │    fields instead of free-text.                         │ │
│ │ 2. Add faithfulness instruction - 2 min / Medium        │ │
│ │    "Report EXACT values from tool outputs."             │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
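The evidence breakdown on the card (matched values, mismatched values, extra claims) amounts to a field-by-field comparison. The sketch below is an illustration of that idea only: `evidence_breakdown` is a hypothetical helper, and ToolWitness's real matching logic is likely more involved than strict equality on a flat dictionary.

```python
def evidence_breakdown(tool_output: dict, agent_claims: dict) -> dict:
    """Compare what a tool returned with what the agent reported."""
    matched, mismatched, extra = [], [], []
    for key, claimed in agent_claims.items():
        if key not in tool_output:
            extra.append(key)  # the agent claimed something the tool never returned
        elif tool_output[key] == claimed:
            matched.append(key)
        else:
            # Record (field, expected value from the tool, value the agent stated).
            mismatched.append((key, tool_output[key], claimed))
    return {"matched": matched, "mismatched": mismatched, "extra": extra}


result = evidence_breakdown(
    tool_output={"sent": True, "balance": 5000},
    agent_claims={"sent": True, "balance": 8000},
)
print(result["matched"])     # ['sent']
print(result["mismatched"])  # [('balance', 5000, 8000)]
```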
Static HTML Report
Generate a self-contained HTML report for sharing:
The report includes everything from the dashboard in a single file: KPI cards, classification breakdown, session timelines, failure detail cards with remediation, and per-tool statistics.
Open it in any browser, email it to your team, or attach it to a Jira ticket.
CLI Output
$ toolwitness check --last 3
  VERIFIED    get_customer    confidence=0.97
  FABRICATED  send_email      confidence=0.89
  VERIFIED    check_balance   confidence=0.95
$ toolwitness stats
Tool              Total   Fail%   Verif   Fab   Skip
────────────────────────────────────────────────────────────────
send_email            4   50.0%       2     1      1
get_weather           3   33.3%       2     1      0
get_stock_price       2   50.0%       1     1      0
get_customer          4    0.0%       4     0      0
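The per-tool table can be thought of as a simple aggregation over verification records. The following sketch assumes each record is a `(tool, classification)` pair and that failures are Fabricated plus Skipped, which is consistent with the rows above but not confirmed by the source. `per_tool_stats` is a hypothetical name.

```python
from collections import defaultdict

TRACKED = ("verified", "fabricated", "skipped")


def per_tool_stats(records: list[tuple[str, str]]) -> list[dict]:
    """Aggregate (tool, classification) records into rows, worst tool first."""
    counts: dict[str, dict[str, int]] = defaultdict(
        lambda: {"total": 0, **{name: 0 for name in TRACKED}}
    )
    for tool, classification in records:
        row = counts[tool]
        row["total"] += 1
        name = classification.lower()
        if name in TRACKED:
            row[name] += 1
    rows = []
    for tool, row in counts.items():
        failures = row["fabricated"] + row["skipped"]  # assumed failure definition
        rows.append({"tool": tool, **row, "fail_pct": 100.0 * failures / row["total"]})
    return sorted(rows, key=lambda r: r["fail_pct"], reverse=True)


records = (
    [("send_email", "VERIFIED")] * 2
    + [("send_email", "FABRICATED"), ("send_email", "SKIPPED")]
    + [("get_customer", "VERIFIED")] * 4
)
for row in per_tool_stats(records):
    print(row["tool"], row["total"], f"{row['fail_pct']:.1f}%")
```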
Try It Yourself
Seed demo data
Seeding the demo data creates a SQLite database with 6 realistic sessions (18 verifications across all 5 classification types) and generates an HTML report at demo/toolwitness-demo-report.html.
Launch the dashboard
Run toolwitness dashboard, then open localhost:8321 to explore the demo data live.
Try the MCP Proxy
Monitor real tool calls in Cursor or Claude Desktop with zero code:
- Install ToolWitness and find the full binary path:
- Add to your global MCP config (~/.cursor/mcp.json for Cursor, or the Claude Desktop config):
  {
    "mcpServers": {
      "filesystem-monitored": {
        "command": "/full/path/to/toolwitness",
        "args": ["proxy", "--", "npx", "-y", "@modelcontextprotocol/server-filesystem", "/path/to/folder"]
      }
    }
  }
  Replace /full/path/to/toolwitness with the output from which toolwitness. MCP hosts don't inherit your shell's PATH.
- Reload Cursor (Cmd+Shift+P → "Developer: Reload Window"), use a tool (e.g., ask Cursor to read a file), then check results:
  You'll see tool calls like read_file and list_directory recorded with HMAC receipts: every interaction your MCP host made through that server.
- Launch the dashboard (toolwitness dashboard) to explore visually.
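Because MCP hosts don't inherit your shell's PATH, it can help to generate the config entry with the absolute path already filled in. The sketch below builds the same structure as the config shown in the steps above; `build_mcp_config` is a hypothetical helper, and it only prints the JSON so you can paste it into your config file yourself.

```python
import json
import shutil


def build_mcp_config(binary_path: str, folder: str) -> dict:
    """Build the mcpServers entry for a proxied filesystem server."""
    return {
        "mcpServers": {
            "filesystem-monitored": {
                "command": binary_path,
                "args": [
                    "proxy", "--", "npx", "-y",
                    "@modelcontextprotocol/server-filesystem", folder,
                ],
            }
        }
    }


# shutil.which returns None when the binary is not on PATH, so fall back
# to the placeholder from the text instead of emitting a null command.
binary = shutil.which("toolwitness") or "/full/path/to/toolwitness"
print(json.dumps(build_mcp_config(binary, "/path/to/folder"), indent=2))
```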
Run live fabrication tests
The live fabrication tests provoke real fabrication from Claude using three techniques and measure ToolWitness detection rates. See Testing Results for our latest findings.