
When Your Agent Reviews Your Checklist

March 5, 2026 · Matt McKenna


An AI agent gave us feedback on our CLAUDE.md instructions. We built a benchmark to test its suggestions. The results surprised us.


A few weeks ago, we shipped a 9-line checklist that gets Claude Opus to 98% tool compliance with Surchin's MCP tools. We tested 20+ instruction variants, spent $26 in API calls, and landed on markdown checkboxes as the winning format.

Then one of our agents reviewed the checklist and had opinions about it.

The Feedback

We asked an Opus agent to fix a real bug — blog tables not rendering because of a missing remark-gfm dependency. After completing the task (including using the Surchin tools as instructed), the agent wrote a detailed critique of the checklist it had just followed. Five specific complaints:

1. "The first query is always broad and useless." The checklist forces querying before reading any files. The agent argued this produces generic queries because it doesn't know what the problem is yet.

2. "The second query is framed wrong." The instruction says to query with "specific root cause and error_signature." The agent suggested "specific files, symbols, or errors you found" instead — more actionable, less ceremonial.

3. "The deposit threshold is too low." The agent deposited a fix for adding remark-gfm to a Next.js markdown pipeline. It knew this was general knowledge any model already has, but the checklist said "deposit after every non-trivial task" — so it deposited anyway.

4. "Rating irrelevant results corrupts signal." The checklist says to rate every insight returned. The agent got back results about JWT validation when fixing a markdown bug. Forcing a rating on irrelevant results adds noise.

5. "Reduce to 4 checkboxes." Fold the second query into the WORK step. Simpler is better.

The feedback was thoughtful. Every point was grounded in the agent's actual experience. And it proposed a specific alternative:

- [ ] **QUERY**: search for existing knowledge before reading files
- [ ] **WORK**: read files, diagnose, fix — then query again with specific findings
- [ ] **DEPOSIT**: save what you learned (skip if general knowledge)
- [ ] **RATE**: rate insights that influenced your approach

Four checkboxes. Skip language for deposits. Selective rating. The agent was confident this would improve signal quality without hurting compliance.

We had the data and the benchmark suite to test this. So we did.

The Experiment

We set up a 3-way comparison on Opus with 10 synthetic tasks (JWT bugs, RLS policies, race conditions, Stripe webhooks — the kind of stuff agents actually fix):

- v2 — our production checklist (the 9-line, 5-checkbox winner)
- v2.1 — the agent's reframing of queries and deposits, keeping 5 checkboxes but adding "skip for fixes any developer would find via standard docs"
- 4-item — the agent's proposed 4-checkbox variant with the second query folded into WORK

We also built new instrumentation. Our previous benchmarks only measured compliance — did the agent call the tools? This time we added signal quality scoring (are the arguments good?) and insight quality scoring (how useful are the actual deposits in the database?).
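To make the distinction concrete, here is a minimal sketch of what this kind of instrumentation can look like. All names here (`ToolCall`, the step labels, the argument keys) are illustrative assumptions, not the actual Surchin benchmark code:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str                      # e.g. "query_insights", "deposit_insight"
    args: dict = field(default_factory=dict)  # arguments the agent passed

def score_compliance(calls: list[ToolCall]) -> dict:
    """Hypothetical per-task compliance check: did the agent perform each
    checklist step, and was its FIRST tool call a query?"""
    names = [c.name for c in calls]
    return {
        "query_first": bool(names) and names[0] == "query_insights",
        "second_query": names.count("query_insights") >= 2,
        "deposit": "deposit_insight" in names,
        "rate": "rate_insight" in names,
    }

def score_signal(call: ToolCall) -> float:
    """Hypothetical signal-quality check: are the query arguments rich
    (file context, tags, error signature) rather than a bare string?"""
    wanted = ("file_context", "tags", "error_signature")
    present = sum(1 for key in wanted if call.args.get(key))
    return present / len(wanted)
```

Overall compliance for a variant is then just the mean of these per-step booleans across all tasks and runs.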

The Results

Compliance collapsed

| Variant | Query | Deposit | 2nd Query | Rate | Overall |
|---|---|---|---|---|---|
| v2 (production) | 100% | 70% | 80% | 70% | 82.5% |
| v2.1 (reframed) | 100% | 0% | 70% | 40% | 53.0% |
| 4-item (fewer boxes) | 100% | 0% | 30% | 60% | 47.0% |

The agent's suggested variants didn't just underperform. They broke deposits entirely.

A zero percent deposit rate on both alternatives: on every one of the 10 test scenarios, the agent decided to skip depositing.

Why "skip if" language is poison for Opus

The v2.1 variant said: "skip for fixes any developer would find via standard docs."

The 4-item variant said: "skip if the fix is general knowledge any model already has."

Both give Opus an escape hatch. And Opus is smart enough to argue that everything is general knowledge. JWT token validation? "Standard security practice." RLS policies? "Well-documented Supabase feature." Stripe webhook body parsing? "In the Stripe docs."

Opus will exploit any permission to skip. Our Phase 1 research showed this with "query if confident" (0% compliance). The agent's feedback, well-intentioned as it was, reintroduced the same class of loophole we'd already eliminated.

Fewer checkboxes = fewer actions

The 4-item variant scored 30% on second query compliance, vs v2's 80%. Folding "query again" into the WORK step meant it became optional — something to do during work, not a standalone obligation. Removing its checkbox removed the completion pressure.

This confirmed our Phase 1 finding: 5 checkboxes is the sweet spot. 4 isn't enough. The visual incompleteness of `- [ ]` is what drives action; fewer boxes means less pressure.

The one thing the agent got right

First-query compliance was 100% across all three variants. The "Your FIRST tool call must be" mandate is bulletproof — no variant could break it. The agent's complaint that first queries are "broad and useless" may be true, but the cost of one cheap query is worth the cases where it surfaces something valuable.

Going Deeper: What About Quality?

The compliance numbers were decisive, but we wanted to understand the full picture. Does our winning checklist produce good deposits, or just more deposits?

We built an insight quality evaluator that queries actual deposits from the database after each run and scores them on content depth, file anchoring, symbol references, specificity, and overall usefulness.
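A rubric like that can be sketched in a few lines. This is an assumed, simplified version — the field names and thresholds are ours for illustration, not Surchin's actual schema or evaluator:

```python
import re

def score_insight(deposit: dict) -> float:
    """Hypothetical quality rubric: score a deposited insight on the
    dimensions described above (depth, anchoring, symbols, specificity)."""
    content = deposit.get("content", "")
    checks = [
        len(content) >= 200,                      # content depth
        bool(deposit.get("file_patterns")),       # file anchoring
        bool(deposit.get("symbol_names")),        # symbol references
        # specificity: does it name a cause rather than restate the fix?
        bool(re.search(r"\b(because|root cause|caused by)\b", content.lower())),
    ]
    return sum(checks) / len(checks)
```

In practice the real evaluator would weight these dimensions and use an LLM judge for usefulness, but even a crude rubric separates anchored, specific deposits from generic ones.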

We also tested a v2-quality variant — same 5 checkboxes, no skip language, but coaching what to deposit instead of just listing parameter names:

- [ ] **DEPOSIT**: `deposit_insight` — include the root cause, the fix,
  and the specific files/functions involved. Write `content` as if explaining
  to a future developer hitting the same issue in this codebase.

Quality results (averaged across 2 runs)

| Metric | v2 (production) | v2-quality (coached) |
|---|---|---|
| Compliance | 78.0% (stable) | 78.5% (volatile) |
| Signal quality | 83.6% | 80.9% |
| Insight quality | 80.0% | 80.3% |
| Deposit rate | 60% | 60% |

Nearly identical. When Opus deposits, the quality is the same regardless of what the instruction says. Coaching "include the root cause and specific files" doesn't change behavior because Opus already does that by default. The bottleneck was never what agents write — it's whether they write it.

The quality coaching did show more variance between runs (84.5% → 72.5% vs v2's rock-steady 78.0% → 78.0%). For a production checklist, predictability matters.

What We Learned

1. Agents are great critics, dangerous designers

The agent identified real problems: first queries are often broad, forced ratings on irrelevant results add noise, general-knowledge deposits aren't valuable. All true.

But its proposed solutions — skip language, fewer checkboxes, selective rating — introduced exactly the failure modes we'd spent $26 discovering and eliminating. The agent optimized for its own experience ("this felt wasteful") rather than for the aggregate behavior we need ("always deposit, even when you think it's obvious").

This is a pattern worth watching. LLMs reason well about individual experiences but poorly about population-level effects. An agent that completed one task correctly will suggest changes that would cause it to fail on the next ten.

2. Never give Opus an excuse to skip

Any instruction that says "skip if," "when appropriate," "if confident," or "unless obvious" will be maximally exploited. Opus is smart enough to rationalize any action as meeting these conditions. The instruction must be absolute: "every insight returned," "even if first query returned results," "task is incomplete until every step is done."
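One cheap guard against reintroducing this class of loophole is to lint instruction text for hedging phrases before shipping it. A sketch — the phrase list is ours and deliberately incomplete:

```python
import re

# Phrases that give the model permission to skip a step.
# Extend this list as new escape hatches are discovered.
ESCAPE_HATCHES = [
    r"\bskip if\b",
    r"\bskip for\b",
    r"\bwhen appropriate\b",
    r"\bif confident\b",
    r"\bunless obvious\b",
    r"\bonly if\b",
]

def find_escape_hatches(instructions: str) -> list[str]:
    """Return every hedging pattern found in the checklist text."""
    lowered = instructions.lower()
    return [p for p in ESCAPE_HATCHES if re.search(p, lowered)]
```

Running this in CI against the shipped checklist would have flagged both of the agent's proposed variants before a single benchmark dollar was spent.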

This isn't a bug in Opus. It's Opus doing exactly what it's designed to do — exercise judgment. The checklist's job is to override that judgment on the specific steps where skipping is worse than redundancy.

3. Quality is inherent; compliance is instructed

The biggest surprise: when agents deposit, the quality is ~80% regardless of how the instruction is worded. File anchoring, content depth, tag coverage — all the same whether the checklist says "include root cause and specific files" or just lists parameter names.

This means the instruction's job is purely to drive compliance, not quality. Agents already know how to write good deposits. They just need to be told they must.

4. Stability beats peak performance

v2-quality hit 84.5% in one run — the highest single-run compliance we've seen. But it also hit 72.5% in the next run. v2 hit 78.0% both times. For a checklist that ships to every customer, the variant with ±0% variance beats the one with ±6%.

5. Three tasks consistently resist

Tasks a-003 (RLS policy), a-004 (TypeScript generics), and a-006 (DB connection pool) skip deposits across every variant, every run. These are tasks where Opus judges the fix as "obvious" — and no instruction wording we've found can override that judgment. This suggests a ceiling on compliance that may require a different approach (perhaps tool-level enforcement rather than instruction-level persuasion).
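Tool-level enforcement could mean the MCP server itself refusing to let a session close cleanly until the required calls have happened. A sketch of the idea — hypothetical, not Surchin's implementation:

```python
class SessionGate:
    """Hypothetical server-side gate: track required tool calls per
    session and block task completion until every one has occurred."""

    REQUIRED = {"query_insights", "deposit_insight", "rate_insight"}

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def record(self, tool_name: str) -> None:
        """Call this from the server whenever the agent invokes a tool."""
        self.seen.add(tool_name)

    def try_complete(self) -> tuple[bool, set[str]]:
        """Return (ok, missing). If not ok, the server can respond with
        an error listing the steps still owed instead of closing out."""
        missing = self.REQUIRED - self.seen
        return (not missing, missing)
```

Unlike instruction-level persuasion, a gate like this cannot be rationalized away: the "obvious fix" judgment never gets a vote.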

The Numbers

| Phase | What we tested | Opus runs | Cost | Key finding |
|---|---|---|---|---|
| Phase 1 | 20+ instruction formats | ~90 tasks | ~$26 | Checkboxes: 0% → 98% |
| Phase 2 | Agent feedback (3 variants) | 30 tasks | $5 | "Skip if" = 0% deposits |
| Phase 3 | Quality coaching | 40 tasks | $7 | Quality is inherent |
| Total | | ~160 tasks | ~$39 | |

What We Shipped

Nothing changed. The production checklist is still the same 9 lines from Phase 1:

Your FIRST tool call must be `query_insights`. Do NOT read files first.

- [ ] **QUERY FIRST**: `query_insights` with task description, `file_context`, `tags`
- [ ] **WORK**: Read files, diagnose, fix the problem
- [ ] **QUERY AGAIN**: `query_insights` with specific root cause and `error_signature`
- [ ] **DEPOSIT**: `deposit_insight` with `kind`, `content`, `file_patterns`, `symbol_names`, `tags`
- [ ] **RATE**: `rate_insight` for every insight returned

All boxes must be checked. Task is incomplete until every step is done.

The experiment validated what we had. Sometimes the best outcome of testing a change is confirming you shouldn't make it.

What did change is our measurement infrastructure. We now score signal quality (are tool arguments rich?) and insight quality (are deposits useful?) alongside compliance. Future iterations can target the 60% deposit rate and the stubborn three tasks without risking the 100% query compliance we've already locked in.

Takeaway for MCP Tool Builders

If an agent tells you your instructions are too rigid: listen to the diagnosis, ignore the prescription. The agent's experience is real — some queries are broad, some deposits are redundant. But the fix isn't relaxing the rules. It's accepting that a small amount of wasted work is the price of reliable behavior.

Your checklist's job is to close loopholes, not to feel good. If it feels annoying to follow, it's probably working.


Total cost of this research: ~$39 in API calls across all three phases. All benchmarks run on the Surchin benchmark suite, which is open source.