A recent two-bug incident in a personal project — a market-risk pipeline talking to a third-party Strait-of-Hormuz data feed — taught me that “the API is responding” and “the data is right” are different questions, and I’d been treating them as one. That’s the same incident I wrote up as “Two Kinds of Wrong”. This post is the practical companion: the contract I now expect every external integration in my workspace to satisfy, the supporting infrastructure that makes the contract enforceable, and what the artifact itself shows the reader when something is off.
It’s narrow on purpose. This is about deterministic code calling a third-party API and keeping that integration honest. Agent-to-tool faithfulness — whether an LLM’s natural-language response actually represents what its tools returned — is a different problem with different solutions, and not what this contract covers.
Two surfaces
Most of the monitoring effort I see in personal and small-team systems lives on presence: did the API respond? Was the response well-shaped? Did the parser populate the dataclass without crashing?
Presence checks are necessary, and they’re most of what people ship. They’re also not enough. The Hormuz integration passed every presence check I had and still returned silently defaulted values for an extended period, because a chain of three bugs (wrong auth header, wrong response shape, wrong URL) combined to look like a calm, empty state. And once those bugs were fixed and the API was answering correctly, the values themselves turned out to be ~14% off from independent ground truth.
So there are two surfaces, and I now treat them as separate:
- Presence — is the integration reachable and well-shaped right now?
- Validity — do the values agree with reality?
The contract below covers both, but it’s important to see them as two different jobs.
The six-item contract
This is the rule file I keep at .cursor/rules/external-data-source-integrity.mdc in my workspace. It’s marked alwaysApply: true so the agent picks it up on every relevant edit. The full text of the contract is the six items below.
1. Smoke-verify at integration time
Before merging the wiring, run the integration against the real API with the real key, and paste the raw JSON response in chat. Not a summary. Not “I called it and it returned a snapshot.” The actual bytes.
This single step would have caught two of the three Hormuz bugs in seconds. A wrong auth header returns a JSON error you can see. A wrong response shape is obvious when the actual envelope is in front of you and your parser is reading the wrong key.
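A minimal sketch of that smoke check, assuming a JSON-over-HTTP API. The `smoke_verify` helper, the injectable `opener` parameter, and any URL or header name you pass in are hypothetical placeholders, not the Hormuz feed's real interface. The only point is that the status code and the undecoded body come back untouched, including on 4xx and 5xx responses.

```python
import urllib.error
import urllib.request


def smoke_verify(url, headers, opener=urllib.request.urlopen, timeout=10):
    """Call the endpoint once and return (status, raw_body_bytes).

    HTTP error responses are returned rather than raised, because the error
    body (for example a JSON auth error) is exactly what we want to see.
    Network-level failures such as DNS errors or timeouts still raise.
    """
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # Hypothetical example: a wrong auth header lands here with a 401.
        return err.code, err.read()
```

Print `body.decode("utf-8", errors="replace")` and paste that output, not a description of it. The `opener` argument exists only so the function can be exercised without a network.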
2. Distinguish three states explicitly
Every external client should treat these as three different things, with three different log messages and a snapshot-level flag the consumer can read:
- Unreachable — DNS fails, 5xx, timeout, connection refused
- Unauthorized / misconfigured — 401, 403, malformed key
- Valid response, quiet state — 200 with legitimately zero or empty data
Two snapshots with risk_score=0.0 are not the same snapshot if one came from a real 200 OK and the other from a silent 401. Conflating them is what makes a silent failure silent.
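One way to make the three states explicit, sketched under assumptions: `SourceStatus`, `classify`, and this cut-down `Snapshot` are illustrative names, and mapping the remaining non-2xx codes to the "misconfigured" bucket is my guess at a sensible default rather than part of the rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceStatus(Enum):
    OK = "ok"                      # 200, including a legitimately quiet payload
    UNREACHABLE = "unreachable"    # DNS failure, timeout, refused, 5xx
    UNAUTHORIZED = "unauthorized"  # 401, 403, malformed key, misconfiguration


def classify(http_status: Optional[int]) -> SourceStatus:
    """Map an HTTP status (None means no response at all) to a source status."""
    if http_status is None or http_status >= 500:
        return SourceStatus.UNREACHABLE
    if http_status in (401, 403):
        return SourceStatus.UNAUTHORIZED
    if 200 <= http_status < 300:
        return SourceStatus.OK
    # Assumption: any other code (e.g. 404 from a wrong URL) is misconfiguration.
    return SourceStatus.UNAUTHORIZED


@dataclass
class Snapshot:
    status: SourceStatus               # the snapshot-level flag the consumer reads
    risk_score: Optional[float] = None
```

With the flag on the snapshot, `Snapshot(SourceStatus.OK, 0.0)` and `Snapshot(SourceStatus.UNAUTHORIZED, 0.0)` no longer compare equal, which is the distinction the paragraph above asks for.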
3. Never silently return defaults
A try/except around an API call must not return a populated dataclass with all fields at their schema defaults. That hides outages.
What’s acceptable: return None and let the caller handle “no snapshot this run”, return a snapshot with an explicit status: "ok" | "unreachable" | "unauthorized" field, or raise a typed exception. What’s not:
    except Exception:
        return SnapshotDataclass()
That’s the exact pattern that hid the Hormuz failure.
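For contrast, a sketch of an acceptable shape, combining the typed-exception and explicit-status options. The exception classes, `safe_fetch`, and the single `risk_score` field are hypothetical stand-ins for whatever your client actually raises and returns.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class SourceUnreachable(Exception):
    """Raised by the client when no usable response came back."""


class SourceUnauthorized(Exception):
    """Raised by the client on 401/403 or a malformed key."""


@dataclass
class Snapshot:
    status: str                         # "ok" | "unreachable" | "unauthorized"
    risk_score: Optional[float] = None  # None means "no value", never 0.0


def safe_fetch(fetch: Callable[[], float]) -> Snapshot:
    """Run fetch() and report failure in the snapshot instead of hiding it."""
    try:
        return Snapshot(status="ok", risk_score=fetch())
    except SourceUnauthorized:
        return Snapshot(status="unauthorized")
    except SourceUnreachable:
        return Snapshot(status="unreachable")
    # Anything else propagates: an unknown error should not look like data.
```

The design choice that matters is catching only the failures you can name. A bare `except Exception` is what lets a parser bug masquerade as an outage, or worse, as a quiet day.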
4. Fixture test against the real wire format
A unit test using a recorded JSON response in the actual response shape is mandatory. Not invented JSON. Not docs-derived JSON. The actual bytes captured in step 1.
Save under tests/fixtures/<source>_<endpoint>.json. Assert that the parser populates every field the snapshot dataclass declares. If the upstream response shape changes, the test fails loudly on the next run instead of silently returning defaults.
For the Hormuz fix, this turned into eight tests pinned against the real JSON. Each takes milliseconds in CI. They would have caught all three of the original bugs in seconds.
5. Surface absence to the reader
If the artifact (report, dashboard, output file) consumes data from this integration, the artifact itself must show whether the integration is currently live. A small “data sources” footer with green / yellow / red dots per integration is enough.
The reader is the last line of defense. If both the developer and the agent miss a silent failure, the report itself should say “Hormuz: unreachable” so anyone making decisions on it knows the inputs are incomplete before they act.
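A plain-text sketch of such a footer, assuming each integration reports one of the status strings from item 3. `render_footer` and the bracketed color tags are illustrative; a real report would draw colored dots instead.

```python
# Presence-only mapping; yellow is reserved for drift (item 6).
STATUS_COLOR = {"ok": "green", "unauthorized": "red", "unreachable": "red"}


def render_footer(statuses: dict) -> str:
    """statuses maps a display name to "ok" | "unauthorized" | "unreachable"."""
    parts = []
    for name, status in statuses.items():
        color = STATUS_COLOR.get(status, "red")  # unknown status is not good news
        label = name if status == "ok" else f"{name}: {status}"
        parts.append(f"[{color}] {label}")
    return "Data sources: " + " · ".join(parts)


print(render_footer({"market data": "ok", "Hormuz": "unreachable"}))
# prints: Data sources: [green] market data · [red] Hormuz: unreachable
```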
6. Cross-validate against an overlapping source
This is the item I added to the first five once I realized that presence-only validation isn’t enough.
For every external claim that overlaps with a known second source, the integration must cross-validate at build time and flag drift in the artifact. Two thresholds: drift > 10% turns the source’s footer dot yellow; drift > 20% turns it red, even if the source is otherwise live and well-shaped.
In my codebase, this lives in src/analysis/cross_source.py. It’s a small module with an OVERLAP_MATRIX of registered checks, a validate(snapshots) function, and a JSONL log per run. Two checks today:
- hormuz.brent_usd ↔ yfinance.BZ=F.price
- hormuz.wti_usd ↔ yfinance.CL=F.price
The result feeds the colored-dot footer from item 5, plus a one-line “Cross-source validity” sub-line that surfaces only when something is actually drifting.
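A from-scratch sketch of the shape such a module can take, not a copy of cross_source.py. `OVERLAP_MATRIX` and `validate(snapshots)` are the names from the text; the flat `{field_name: value}` input, `DriftResult`, and measuring drift relative to the second source are my assumptions. The last one is at least consistent with the Brent figures shown in the report footer.

```python
from dataclasses import dataclass

YELLOW_PCT = 10.0  # drift above this turns the dot yellow
RED_PCT = 20.0     # drift above this turns the dot red

# (primary field, reference field) pairs, as named in the text.
OVERLAP_MATRIX = [
    ("hormuz.brent_usd", "yfinance.BZ=F.price"),
    ("hormuz.wti_usd", "yfinance.CL=F.price"),
]


@dataclass
class DriftResult:
    primary: str
    reference: str
    drift_pct: float
    color: str


def drift_pct(value: float, reference: float) -> float:
    """Signed drift of value relative to the reference, in percent.

    Assumes reference is non-zero, which holds for price fields.
    """
    return (value - reference) / reference * 100.0


def color_for(drift: float) -> str:
    magnitude = abs(drift)
    if magnitude > RED_PCT:
        return "red"
    if magnitude > YELLOW_PCT:
        return "yellow"
    return "green"


def validate(snapshots: dict) -> list:
    """snapshots is assumed to be a flat {field_name: value} mapping."""
    results = []
    for primary, reference in OVERLAP_MATRIX:
        if primary not in snapshots or reference not in snapshots:
            continue  # absence is reported by the presence checks, not here
        d = drift_pct(snapshots[primary], snapshots[reference])
        results.append(DriftResult(primary, reference, d, color_for(d)))
    return results
```

Fed 120.24 for `hormuz.brent_usd` and 105.78 for `yfinance.BZ=F.price`, this yields a drift of about +13.7% and a yellow dot.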
If a source has no overlap available, document it. An integration shipped with no second source isn’t a free pass — record the gap so it’s clear that field is trusted on faith, and budget for a second source over time.
Two pieces of infrastructure that make the contract enforceable
A rule that nobody runs is not a rule. Two complementary pieces of supporting infrastructure carry the load:
An on-demand audit skill — data-source-health-audit — walks every integration in the workspace, classifies each as live / unauthorized / unreachable / silently-degraded / drifting, and writes a dated report to my journal. Manual invocation. I run it when a number feels wrong, or when I’m adding a new integration and want a baseline.
An in-process cross-source validator runs as part of every CI build, not on a schedule. It uses the same logic the audit skill calls, just always on for the production path. The result drives the footer dot colors and is appended to data/validation_log.jsonl for trend tracking.
Two cadences, on purpose. The in-process check fires reliably and is reflected directly in the published artifact. The skill exists for the deeper sweeps, the moments when something feels off but the in-process check came back green, and the act of adding a new integration.
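The append-only log is the simplest part to sketch. The record fields below (`ts`, `check`, `drift_pct`, `color`) are an assumed schema for illustration; only the JSONL format and the data/validation_log.jsonl path come from the setup described above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_validation_record(log_path, check: str, drift_pct: float, color: str) -> dict:
    """Append one JSON object per line so runs can be compared over time."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "check": check,
        "drift_pct": round(drift_pct, 2),
        "color": color,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
```

Opening in append mode keeps earlier runs intact, so a trend is just a line-by-line read of the file.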
What the reader sees
The end-user-visible output of all of this is unremarkable, which is the point. Under the supply-chain section of the report, a small footer:
Data sources: ● market data · ● macro · ● Hormuz · ● FDA · ● EIA
Cross-source validity: Hormuz Brent vs yfinance BZ=F: 120.24 vs 105.78 (+13.7%)
Green dot: the source is live and within tolerance. Yellow dot: drifting by more than 10%. Red dot: the integration is down or more than 20% off. The cross-source line appears only when there’s something to surface.
If you’re reading the report and the Hormuz dot is yellow, you have all the context you need to discount that input without hunting through logs.
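The "only when something is drifting" behavior of the sub-line can be sketched in a few lines. `validity_line` and its tuple input are hypothetical; the 10% threshold is the yellow threshold from item 6.

```python
def validity_line(checks, threshold_pct=10.0):
    """checks: iterable of (label, value, reference) tuples.

    Returns the sub-line text, or None when every check is within tolerance.
    """
    flagged = []
    for label, value, reference in checks:
        drift = (value - reference) / reference * 100.0
        if abs(drift) > threshold_pct:
            flagged.append(f"{label}: {value:.2f} vs {reference:.2f} ({drift:+.1f}%)")
    if not flagged:
        return None
    return "Cross-source validity: " + "; ".join(flagged)


print(validity_line([("Hormuz Brent vs yfinance BZ=F", 120.24, 105.78)]))
# prints: Cross-source validity: Hormuz Brent vs yfinance BZ=F: 120.24 vs 105.78 (+13.7%)
```

Returning `None` rather than an empty string makes the "nothing to show" case explicit for whatever renders the footer.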
What this contract does and doesn’t cover
Worth being explicit:
- This is for deterministic code reading third-party APIs. If you’re building an LLM agent that calls tools and produces natural-language claims, you also need to watch agent–tool faithfulness — whether the agent’s response represents what the tool actually returned. Different shape of bug, different mitigations, not what this contract is for.
- This is a floor, not a ceiling. Real production observability adds retries, circuit breakers, on-call, and anomaly detection on top. The six items are about not shipping silent failure into a personal or small-team system.
- Item 6 only fires for fields with a known second source. For exotic data with no overlap, the only honest stance is to document the gap and budget for a second source over time.
If you want to adopt this
The six items above are the full rule — copy and adapt to your stack. The on-demand audit skill is the same logic running in inspect mode rather than enforcement mode; it’s a thin wrapper around a directory walk plus per-source probes. The cross-source validator is small enough to write from scratch — start with one overlap pair, then grow an OVERLAP_MATRIX as you add sources. The version that matters is the one that lives where you can actually run it.