Two Kinds of Wrong: When Your Agent's Data Source Looks Healthy But Isn't

My market-risk pipeline was telling me the Strait of Hormuz was calm. The actual data, when I finally read it correctly, said: 9.1 out of 10, critical, day 75 of an active closure, Brent at $122. The bug was not a crash. It was silence.

Once the silent-failure fix was shipped and live, my collaborator asked the question I should have asked myself: “is the hormuz data real and verified?” That single question surfaced a second, completely different bug in the same data source. Different shape, different mitigation, both worth designing against.

I want to write down both, in order, because the difference between them is what separates “the system runs” from “the system is trustworthy.”

The system

I run a daily AI-built market report. It pulls live data from yfinance (80+ tickers), the FRED API, Coinbase, openFDA, the EIA, and the Hormuz Monitor — a small commercial geopolitical-risk feed that scores Strait-of-Hormuz disruption from 0 to 10. It runs an in-house multi-stage supply chain cascade model, generates static HTML, deploys to GitHub Pages via GitHub Actions, and rebuilds every twelve hours.

The current backdrop matters. In the real world, the Strait of Hormuz has been effectively closed since late February 2026: Iran’s partial closure on February 17, US-Israel strikes around March 1-2, a traffic collapse of roughly 80%, and around 150 vessels stranded. War-risk insurance has been suspended for the Persian Gulf since March. This is the largest oil-market disruption since 1973.

My report has been live through all of it. The Hormuz Monitor integration was wired weeks ago. Tests passed. Reports shipped. Looked fine.

Bug one: the silent failure

I was doing a routine deprecation cleanup on the GitHub Actions side when I happened to notice a log line in a CI run:

Hormuz Monitor: risk=0.0 (unknown)
Cascade: 6 active, 0 projected/not_started

I was about to close the tab. Risk score zero, level unknown — must be a calm state. That instinct deserved a second look, so I called the API directly to verify.

Three layers of bug, in order of how much they hurt:

Wrong auth scheme. My client sent Authorization: Bearer {key}. The API expected X-API-Key: {key}. Every request was returning HTTP 401. My client swallowed the error and returned a HormuzSnapshot() with all fields at their dataclass defaults — risk_score=0.0, risk_level="unknown". Silent failure mode 1.

Wrong response shape. Even with correct auth, the API wraps every response as {"status": "success", "data": {...}}. My parser was reading risk.get("risk_score") directly off the envelope, ignoring the data wrapper. Even a real 200 OK would have given me all-default fields. Silent failure mode 2.

Wrong URL. The vendor’s marketing page advertises https://api.hormuzmonitor.com/v2. That host has no DNS A record. NXDOMAIN from every public resolver. The actual API turned out to be at https://mhh.gic.mybluehost.me/wp-json/hlapi/v2, referenced on a different page of their docs. Silent failure mode 3.

All three bugs were masked by the cascade’s graceful degradation. With Hormuz returning silent zeros, the cascade fell back to oil-price proxies for Stage 1. The report shipped. No exception. No alarm. No way for anyone reading the report to know they were looking at half the inputs.
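To make the first two failure modes concrete, here is a minimal sketch of a parser that refuses to hand back defaults. It is not my actual client: HormuzFetchError, build_headers, and parse_response are illustrative names, and I am assuming risk_score and risk_level sit directly under the data key.

```python
from dataclasses import dataclass


class HormuzFetchError(Exception):
    """Raised instead of returning a default-filled snapshot."""


@dataclass(frozen=True)
class HormuzSnapshot:
    # No default values on purpose: an "empty" snapshot cannot be constructed.
    risk_score: float
    risk_level: str


def build_headers(api_key: str) -> dict:
    # The API expects X-API-Key, not "Authorization: Bearer ...".
    return {"X-API-Key": api_key}


def parse_response(status_code: int, body: dict) -> HormuzSnapshot:
    """Turn one HTTP result into a snapshot, or raise. Never return defaults."""
    if status_code != 200:
        raise HormuzFetchError(f"HTTP {status_code}")
    # Every response is wrapped as {"status": "success", "data": {...}}.
    if body.get("status") != "success" or "data" not in body:
        raise HormuzFetchError("missing success envelope")
    data = body["data"]
    try:
        # Assumption: both fields live directly under "data".
        return HormuzSnapshot(float(data["risk_score"]), str(data["risk_level"]))
    except (KeyError, TypeError, ValueError) as exc:
        raise HormuzFetchError(f"unexpected payload: {exc!r}") from exc
```

With this shape, a 401 or a payload read off the wrong level becomes an exception the caller has to handle, so the cascade can still fall back, but only by choosing to.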

I fixed all three, dispatched a CI run, and watched the cascade pick up — for the first time — Hormuz risk score 9.1/10 (critical) as evidence in Stage 1.

What I shipped against this

Three things, all small, all designed so the same class of bug cannot recur silently:

A workspace rule. A five-item contract every external integration in my workspace should satisfy: smoke-verify at integration time (paste the raw JSON in chat before merging), distinguish unreachable from unauthorized from valid-but-quiet (three states, three log messages), never silently return defaults from a try/except, ship a fixture test against the real wire format (not docs-derived JSON), and surface absence to the reader. That last item turned into a small “Cascade data sources” footer with a green/yellow/red dot per integration, right under the supply-chain section. The artifact itself now says “Hormuz: live” or “Hormuz: unreachable” so the reader is the last line of defense.
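The “three states, three log messages” item can be sketched roughly like this. The names (SourceState, classify_response, footer_label) are hypothetical, and lumping other non-200 statuses in with “unreachable” is my simplifying assumption, not part of the rule.

```python
import logging
from enum import Enum
from typing import Optional

log = logging.getLogger("integrations")


class SourceState(Enum):
    UNREACHABLE = "unreachable"
    UNAUTHORIZED = "unauthorized"
    VALID = "live"


def classify_response(name: str, status_code: Optional[int]) -> SourceState:
    """Map one HTTP outcome to a state, with a distinct log message for each.

    status_code is None when no HTTP response arrived at all
    (DNS failure, refused connection, timeout).
    """
    if status_code is None:
        log.error("%s: unreachable (no HTTP response)", name)
        return SourceState.UNREACHABLE
    if status_code in (401, 403):
        log.error("%s: unauthorized (HTTP %d), check auth header and key",
                  name, status_code)
        return SourceState.UNAUTHORIZED
    if status_code != 200:
        # Assumption: any other failure is reported as unreachable as well.
        log.error("%s: unreachable (HTTP %d)", name, status_code)
        return SourceState.UNREACHABLE
    # A valid response with a low reading is a real reading, not a failure.
    log.info("%s: valid response", name)
    return SourceState.VALID


def footer_label(name: str, state: SourceState) -> str:
    # The text that ends up in the report footer, e.g. "Hormuz: live".
    return f"{name}: {state.value}"
```

The point of the enum is that “quiet” can only be reported after a valid response; a 401 can no longer masquerade as a calm strait.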

An on-demand audit skill. A walker that probes every external integration in the workspace, classifies each as live / unauthorized / unreachable / silently-degraded, and writes a dated audit report to the journal. Manual invocation, not a cron — but it exists for the moments when a report’s numbers feel off.
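A rough sketch of the classification step in such a walker, under my own assumptions: each integration exposes a probe callable returning a ProbeResult (a hypothetical type), and “silently-degraded” means the parsed object equals a default-constructed one.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Optional


@dataclass
class ProbeResult:
    http_status: Optional[int]  # None: no HTTP response at all
    parsed: object = None       # whatever the client parsed from the body
    defaults: object = None     # a default-constructed value of the same type


def classify(result: ProbeResult) -> str:
    if result.http_status is None:
        return "unreachable"
    if result.http_status in (401, 403):
        return "unauthorized"
    if result.http_status != 200:
        return "unreachable"
    # A 200 whose parsed fields all equal the defaults is the silent failure.
    if result.parsed is None or result.parsed == result.defaults:
        return "silently-degraded"
    return "live"


def audit(probes: Dict[str, Callable[[], ProbeResult]], today: date) -> str:
    """Probe every integration and return a dated, plain-text report."""
    lines = [f"Integration audit {today.isoformat()}"]
    for name in sorted(probes):
        lines.append(f"{name}: {classify(probes[name]())}")
    return "\n".join(lines)
```

Comparing against defaults is a blunt heuristic: it would also flag a source whose genuine reading happens to match the defaults, which is one more reason not to pick plausible-looking values as defaults.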

Eight fixture-based tests pinned against the actual JSON I observed once the API started talking. They would have caught all three of the original bugs in seconds. They run on every CI build now.
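For flavor, a fixture test in this style might look like the sketch below. The fixture is illustrative, not the payload I captured: the envelope shape comes from the API, but the flat layout under data, the error body, and the unwrap helper are assumptions.

```python
import json
import unittest

# Illustrative fixture in the envelope shape the API uses. The values and the
# flat layout under "data" are assumptions, not the captured payload.
FIXTURE = """
{"status": "success",
 "data": {"risk_score": 9.1, "risk_level": "critical", "crisis_active": true}}
"""


def unwrap(body: dict) -> dict:
    """Hypothetical parser step: return the payload inside the envelope."""
    if body.get("status") != "success" or not isinstance(body.get("data"), dict):
        raise ValueError("response is not a success envelope")
    return body["data"]


class HormuzWireFormatTest(unittest.TestCase):
    def test_fields_are_read_from_inside_the_envelope(self):
        data = unwrap(json.loads(FIXTURE))
        self.assertEqual(data["risk_score"], 9.1)
        self.assertEqual(data["risk_level"], "critical")

    def test_envelope_itself_has_no_risk_fields(self):
        # Reading risk_score off the envelope was the original parser bug.
        self.assertNotIn("risk_score", json.loads(FIXTURE))

    def test_error_body_is_rejected_not_defaulted(self):
        # Assumed error shape; the point is that it raises instead of defaulting.
        with self.assertRaises(ValueError):
            unwrap({"status": "error", "message": "unauthorized"})
```

Run with python -m unittest. The value of pinning against an observed payload rather than docs-derived JSON is that the test fails when the docs and the wire disagree, which is exactly what happened here.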

By the end of that pass, the live page was showing the new Hormuz signal, a green dot in the footer, and 77 tests passing.

Bug two: live, well-shaped, and still off

That’s when my collaborator asked: “is the hormuz data real and verified?”

The right instinct. I had verified the integration was working — the API returns data, the parser handles it correctly, the cascade now includes Stage 1 evidence. I had not verified that the data the API returns is grounded in reality. Those are different questions.

I cross-checked three ways.

The crisis itself: real and corroborated. Major outlets (CNBC, BBC, NPR, Al Jazeera) all confirm the timeline: Iran partially closed the strait on Feb 17, US-Israel strikes followed around Mar 1-2, and the strait has been effectively closed since, with ~80% traffic collapse and ~150 ships stranded. The API’s crisis_active: true, day 75 of "2026 Hormuz Closure" is consistent with reality.

The vendor: small but plausible. Their docs claim AIS, MarineTraffic, Kpler, S&P Global, EIA, IEA, and UKMTO as upstream sources. Free tier 60 requests/hour, Pro $49/month. Not a hobbyist; small commercial outfit on shared hosting.

The numbers: drifting. This is the catch. The API says Brent crude is at $122.40, +57% YTD. Independent sources — TradingEconomics, BarChart, yfinance BZ=F, and my own report’s headline KPI — all agree Brent is around $106 today, monthly average $108, range $102-$118. The API’s number is ~14% off vs. ground truth. Consistently. It is not a one-time glitch; the WTI number is similarly out of step.

So the API is genuinely live, the qualitative signals are corroborated, and the specific dollar amounts disagree with independent sources. That is a different shape of bug from the silent failure, with a different mitigation.

The colored-dot footer I had just shipped — green for “live”, red for “unreachable” — would have marked this integration green indefinitely.

What I shipped against this

A second small layer, for a second kind of wrong.

I added a sixth item to the workspace rule: every external claim that overlaps with a known second source should be cross-validated at build time, with drift surfaced in the artifact. Drift > 10% turns the footer dot yellow. Drift > 20% turns it red even if the source is live and well-shaped.
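The threshold rule is small enough to sketch in a few lines. The function name and signature are illustrative, and treating drift by absolute value (so a source reading 15% low is flagged like one reading 15% high) is my assumption.

```python
from typing import Optional


def footer_dot(source_live: bool, drift_pct: Optional[float]) -> str:
    """Pick the footer dot color from presence and cross-source drift.

    drift_pct is None when there is no second source to compare against.
    """
    if not source_live:
        return "red"
    if drift_pct is None:
        return "green"
    magnitude = abs(drift_pct)  # assumption: the sign of the drift is ignored
    if magnitude > 20.0:
        return "red"  # live and well-shaped, but too far from the reference
    if magnitude > 10.0:
        return "yellow"
    return "green"
```

For example, footer_dot(True, 13.7) returns "yellow", while footer_dot(True, 25.0) returns "red" even though the source is live.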

Then I built src/analysis/cross_source.py. A small module, an OVERLAP_MATRIX, and a validate(snapshots) function that runs every CI build. It currently knows two overlaps:

hormuz.brent_usd ↔ yfinance.BZ=F.price
hormuz.wti_usd ↔ yfinance.CL=F.price

The validator compares the two values, computes drift, classifies as ok | drift_warn | drift_fail | unavailable, and feeds the result into the existing footer-dot color logic and a one-line “Cross-source validity” sub-line that surfaces only when something is actually drifting. Each run’s record is appended to a validation_log.jsonl file so I can watch trends over time.
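The real module is not reproduced here, but a minimal sketch of the idea looks something like this. OVERLAP_MATRIX, validate, the four statuses, and validation_log.jsonl are the names used above; everything else (a flat dict of dotted keys as the snapshot format, the compare helper, the record fields) is an assumption for illustration.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional

# Pairs of (vendor claim, independent reference).
OVERLAP_MATRIX = [
    ("hormuz.brent_usd", "yfinance.BZ=F.price"),
    ("hormuz.wti_usd", "yfinance.CL=F.price"),
]
WARN_PCT = 10.0
FAIL_PCT = 20.0


def compare(claim: Optional[float], reference: Optional[float]) -> dict:
    """Classify one pair as ok, drift_warn, drift_fail, or unavailable."""
    if claim is None or reference is None or reference == 0:
        return {"status": "unavailable", "drift_pct": None}
    # Drift is measured relative to the independent reference.
    drift_pct = (claim - reference) / reference * 100.0
    if abs(drift_pct) > FAIL_PCT:
        status = "drift_fail"
    elif abs(drift_pct) > WARN_PCT:
        status = "drift_warn"
    else:
        status = "ok"
    return {"status": status, "drift_pct": round(drift_pct, 1)}


def validate(snapshots: Dict[str, float],
             log_path: Optional[Path] = None) -> List[dict]:
    """Check every known overlap; optionally append the records as JSON lines."""
    records = []
    for claim_key, ref_key in OVERLAP_MATRIX:
        record = {"claim": claim_key, "reference": ref_key}
        record.update(compare(snapshots.get(claim_key), snapshots.get(ref_key)))
        records.append(record)
    if log_path is not None:
        with log_path.open("a", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
    return records
```

Fed a Hormuz Brent of 120.24 against a yfinance BZ=F price of 105.78, the first record comes back as drift_warn with a drift of 13.7; a pair with a missing value comes back as unavailable rather than as a false ok.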

I dispatched a CI run, waited for the deploy, and pulled up the live page. The Hormuz Monitor dot is now yellow. Below the footer, in dim text:

Cross-source validity: Hormuz Brent vs yfinance BZ=F: 120.24 vs 105.78 (+13.7%)

Eight new tests pin the actual 2026-05-13 drift values as a regression guard. 85 tests now pass in total, up from 77.

What I learned

Integration health and data validity are not the same problem. A presence-only check (does the API return non-default data?) and a validity check (do its values agree with reality?) catch different failure modes. Most monitoring effort lives on the presence side. The validity side is where surprises hide.

Graceful degradation is the silent-failure path. Every fallback I’d designed to keep the system running when an input was missing also let the system keep running when an input was off in a way I hadn’t anticipated. The fix isn’t more robust fallbacks. It’s making absence and disagreement structurally visible in the artifact, not just in logs no one reads.

The reader is the last line of defense. If the agent and the dev both miss a silent failure, the artifact itself should say “Hormuz: drifting” — so anyone reading the report knows the inputs are not as clean as they look.

Skills don’t run reliably without invocation. That’s why both fixes shipped as in-process checks that run automatically every CI build, not as audit skills you have to remember to call. Skills are still in the workspace for deeper sweeps; they’re a complement, not the primary control.

What I’d build next, if I had a team

This whole pattern — “watch the difference between expected output structure and observed output structure across runs of every tool call” — is exactly what I’ve been building ToolWitness for. A baseline assertion like “fetch_hormuz_data() must return a snapshot with risk_level != 'unknown' for at least one run per day” would have caught the silent failure on day one. A baseline assertion like “hormuz.brent_usd must agree with yfinance.BZ=F within 10%” would have caught the drift the moment it appeared.
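To be clear about what I mean by a baseline assertion, here is the first one written as plain Python. This is not the ToolWitness API; the function name and the (day, risk_level) run-history format are made up for illustration.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple


def days_without_known_risk(runs: Iterable[Tuple[date, str]]) -> List[date]:
    """Return the days on which every run reported risk_level == 'unknown'.

    runs is a history of (day, risk_level) pairs, one per tool call.
    An empty result means the baseline held on every day that had runs.
    """
    had_known: Dict[date, bool] = {}
    for day, level in runs:
        had_known[day] = had_known.get(day, False) or level != "unknown"
    return sorted(day for day, ok in had_known.items() if not ok)
```

Against a history where every run says unknown, this returns every day in the history, which is the alarm that should have fired on day one.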

I have not wired ToolWitness into the financial agent yet. This case has the right shape for it: a real failure, a real follow-up failure, and a clear line from the problem to the SDK that exists to solve it. That’s on the roadmap.

For now, the workspace has a six-item contract where it had a five-item one before. The live report has a yellow dot under the cascade that wasn’t there earlier. The next time a vendor’s data source is silently absent, or off in a way that doesn’t trip a presence check, the artifact itself will say so.

I’ll take “the report tells you when its inputs disagree” over “trust me, it’s fine.”