On March 30, I published a post called “Attention Is All You Need — But It’s Not All You Control.” It laid out 8 components of an AI model’s context window — system prompt, conversation history, retrieved documents, tool descriptions, few-shot examples, memory, metadata, and user message — and argued that most people only optimize the last one.
The next day, Victoria asked me a question I should have asked myself: “Are we doing any of these context engineering steps here?”
She wasn’t asking whether we understood the concept. She was asking whether we practiced it. So we did what any self-respecting builder would do: we audited ourselves.
The audit
We mapped every .cursor/rules/ file, every skill, every memory artifact, every context-loading mechanism in the workspace to the 7 controllable components from the blog post (component 8 — the user message — is always present). The results:
| Component | What we found | Verdict |
|---|---|---|
| System prompt | 28 always-on rules with a priority hierarchy. Philosophy, security, journaling, coaching role, backlog conventions. | Strong |
| Conversation history | A context-health.mdc rule that detects degradation — fires warnings when turns pile up or compaction eats detail. | Defended |
| Retrieved documents | Manual RAG via CONTEXT_INDEX.md — a table of contents for agent memory, loaded selectively by topic at session start. | Adequate |
| Tool descriptions | 20 skills, 11 slash commands, hooks, and MCP server integrations. Each with structured descriptions the agent can discover. | Strong |
| Few-shot examples | Nothing. No examples directory. No calibration files. No “here’s what good looks like.” | Weak |
| Memory | 6 layers: journal entries, session handoffs, knowledge base, agent accomplishment log, North Star, active context. Freshness-tracked. | Very strong |
| Metadata | Timestamps, user profile, date verification rule, rhythm detection, environment info. | Adequate |
Six of seven components were solid. One — few-shot examples — was almost completely absent. We wrote a blog post about 8 components and built for 7.
Classic cobbler’s children.
What “weak” actually meant
Few-shot examples are input-output pairs included in the context window to demonstrate desired behavior. They’re the difference between telling an agent “write good journal entries” and showing it what a good journal entry looks like next to a weak one.
Our workspace had 28 rules telling the agent what to do. Zero examples showing how. Rules say “be specific in the Daily Flow section.” An example shows a specific Daily Flow entry with real file paths, real CSS values, real build outcomes — and a weak one that says “fixed some issues” — side by side.
Without examples, the agent had to infer quality from instruction text alone. It’s like giving a new employee a 40-page policy manual and no sample deliverables. They’ll follow the rules. They might not produce what you actually want.
Fixing it in one session
We created a .cursor/examples/ directory with 4 calibration files, all extracted from real workspace artifacts — not invented:
- journal-entry.md — good-vs-weak Daily Flow entries, Ideas sections, and Backlog candidates. The good example came from an actual Saturday journal with specific CSS values (clamp(1rem, 4vw, 1.8rem)), named files (report.py), and deployment outcomes (4 pushes, all GitHub Actions green). The weak example said “worked on mobile fixes.”
- session-narrative.md — a real handoff narrative vs. a bullet-point state dump. The good one reads like a letter to the next agent. The weak one reads like a database export.
- backlog-item.md — actual BACKLOG.md rows. Good: specific titles, linked purposes, parent references. Weak: vague “improve things” entries with no priority rationale.
- blog-voice.md — voice characteristics extracted from the attention-vs-context post itself. Opening hooks, specificity patterns, grounding in workspace practice, patterns to avoid (“In this post, we’ll…”).
Every example came from work we’d already done. Real file paths. Real dates. Real outcomes. That grounding matters — the agent sees what actually happened, not what we imagine should happen.
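To give a feel for the shape, here’s a rough sketch of what a file like journal-entry.md might contain, assembled from the details above rather than copied verbatim from our workspace:

```markdown
# Calibration: Daily Flow journal entries

## Good
Scaled the mobile heading with clamp(1rem, 4vw, 1.8rem), updated report.py,
and shipped 4 pushes; all GitHub Actions runs came back green.

## Weak
Worked on mobile fixes.

## Why the good one is good
- Names real files and real values instead of categories of work
- Records outcomes (pushes, green builds), not just activity
```

The exact headings don’t matter. What matters is that the good and the weak version sit next to each other, so the agent calibrates against contrast instead of a lone adjective like “specific.”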
Then we wired references into existing rules and skills:
- journaling-rules.mdc got a “Quality calibration” line pointing to the journal examples
- session-handoff/SKILL.md got an “Examples” section pointing to the narrative file
- backlog-prioritization.mdc got a pointer to the backlog examples
- content-distributor/SKILL.md referenced the blog voice file
The wiring pattern was deliberate: reference, don’t inline. Adding full examples to an always-on rule would bloat every conversation’s context window. A one-line reference to a loadable file keeps the baseline lean. The agent loads examples when it needs them — journaling examples when writing a journal entry, voice examples when drafting a blog post — instead of carrying all of them in every conversation.
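As a sketch of what that reference can look like (the exact wording in our rules differs), the addition to journaling-rules.mdc is roughly:

```markdown
<!-- journaling-rules.mdc (illustrative excerpt) -->
## Quality calibration
Before writing a journal entry, load .cursor/examples/journal-entry.md and
match the specificity of its "good" example.
```

One always-on line buys a few hundred words of on-demand calibration, which is the whole trade the referencing pattern makes.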
Components are a network, not a checklist
Here’s what surprised us: fixing component 5 (few-shot examples) immediately improved components 1, 4, and 6.
The system prompt rules (component 1) became more effective because the agent could now see what those rules look like when executed well. “Be specific in journal entries” plus a specific journal example produces better output than the instruction alone.
The tool descriptions (component 4) — skills and commands — became more precise because skills could reference examples of their expected output format. A handoff skill that points to a real handoff narrative gives the agent a target, not just instructions.
The memory artifacts (component 6) — journal entries, session narratives, backlog items — improved in quality because the agent had calibration points for each type.
The 7 components aren’t a checklist where you optimize each one independently. They’re a network. Strengthening one node strengthens the edges connected to it. A few-shot example that demonstrates how a rule should be executed ties two components together. A memory artifact that matches the quality of an example reinforces the pattern for next time.
This also connects to something we’d been learning about context rot — the research-backed finding that LLM accuracy degrades as context grows. More tokens in the window isn’t better. More relevant tokens is better. The few-shot examples we added are tiny files (a few hundred words each), loaded only when needed, but they carry high signal density. The references-not-inline pattern keeps the always-on context lean while making the on-demand context richer.
The principle: retrieval over accumulation. Don’t load everything. Load the right thing at the right time.
Try this on your own workspace
If you’re working with AI agents in Cursor — or any tool where you design the context pipeline — here’s the audit in three questions:
- Can you list what fills each component? Map your rules, files, tools, and memory to the 7 controllable components. If a component is blank, you’ve found your gap. (A blank template to fill in follows after this list.)
- Which component has the weakest signal? “Weak” doesn’t mean absent — it means the agent has to guess instead of reference. If you have rules but no examples, the rules are working harder than they need to.
- Are your examples from real work or invented? Fabricated examples train the agent on what you imagine quality looks like. Examples from actual sessions train it on what quality actually looks like in your workspace, with your conventions, your file names, your patterns.
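Here’s a blank version of the audit table to fill in for the first question; the component names come from the framework, the verdicts are yours:

```markdown
| Component            | What fills it today | Verdict |
|----------------------|---------------------|---------|
| System prompt        |                     |         |
| Conversation history |                     |         |
| Retrieved documents  |                     |         |
| Tool descriptions    |                     |         |
| Few-shot examples    |                     |         |
| Memory               |                     |         |
| Metadata             |                     |         |
```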
We wrote a post about 8 context engineering components. We built systems for 7 of them. Then someone asked “are we actually doing this?” and we found the gap in an afternoon.
The framework works. But only if you turn it on yourself.
This post builds on the framework from “Attention Is All You Need — But It’s Not All You Control”. For how the workspace itself is structured: “Building With AI Agents: What I’ve Learned So Far”. Built in Cursor. Audited against our own advice.