
How We Got Opus to 98% Tool Compliance With a 9-Line Checklist

March 4, 2026 · Matt McKenna


What we learned benchmarking 20+ CLAUDE.md instruction variants across Claude Opus, Sonnet, and Haiku


Surchin is a shared knowledge substrate for coding agents. It gives AI agents MCP tools to search for existing knowledge, save what they learn, and rate quality — creating a feedback loop where every agent session makes the next one smarter.

The challenge: getting agents to actually use the tools. Especially Claude Opus — the most capable model, and the one that most aggressively decides which instructions to follow.

We built a benchmark suite to systematically test instruction variants, ran 3 iteration cycles with 90+ Opus calls, and discovered something surprising about what makes LLMs follow instructions. Here's what we found.
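The harness itself reduces to a per-run compliance check. Here's a minimal sketch, assuming each run's transcript has been flattened to an ordered list of tool-call names. The `read_file` and `edit_file` names are illustrative stand-ins for ordinary coding-agent tools, and `deposit_insight`/`rate_insight` are assumed names for Surchin's deposit and rating tools (only `query_insights` appears verbatim later in this post):

```python
# Per-run compliance check: the required Surchin calls must appear in order,
# and the very first tool call of the run must be a query.
REQUIRED_SEQUENCE = ["query_insights", "deposit_insight", "rate_insight"]

def run_complies(tool_calls: list[str]) -> bool:
    """True if the run starts with a query and contains the required
    calls as an ordered subsequence."""
    if not tool_calls or tool_calls[0] != "query_insights":
        return False
    it = iter(tool_calls)  # shared iterator enforces in-order matching
    return all(name in it for name in REQUIRED_SEQUENCE)

def compliance_rate(runs: list[list[str]]) -> float:
    """Fraction of runs that comply."""
    return sum(run_complies(r) for r in runs) / len(runs)

runs = [
    ["query_insights", "read_file", "edit_file",
     "query_insights", "deposit_insight", "rate_insight"],
    ["read_file", "edit_file"],  # jumped straight to code: non-compliant
]
print(compliance_rate(runs))  # 0.5
```

The shared-iterator trick in `run_complies` is what makes the check order-sensitive: a deposit that happens before the query doesn't count.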

The Problem: Post-Work Steps Get Skipped

Our initial instruction format was a numbered 5-step sequence:

Step 1: Query the knowledge base (FIRST ACTION)
Step 2: Work on the problem
Step 3: Query again with specific findings
Step 4: Deposit what you learned
Step 5: Rate any knowledge you used

Opus followed Step 1 reliably — "FIRST ACTION" creates a strong anchor. But Steps 4 and 5? Zero percent compliance. Once Opus fixes the bug, it considers the task done. Post-work steps evaporate.

This isn't unique to Surchin. Any tool that requires agents to do something after their primary task — logging, documentation, cleanup, feedback — faces the same problem.

What We Tried (and What Failed)

More Detail = Less Compliance

Our first instinct: add explicit requirements.

"Your task is NOT complete until you have made: At least 2 queries, at least 1 deposit..."

This made compliance worse — dropping from 36% to 3.5%. Adding 7 lines of instruction gave Opus more text to process and, apparently, more reasons to decide the instructions were excessive.

| Instruction length | Compliance |
| --- | --- |
| 9 lines | 98% |
| 27 lines | 36% |
| 34 lines | 3.5% |
| 163 lines | ~20% (est.) |

Brevity isn't a style preference for Opus. It's the primary determinant of whether instructions get followed.

Abstract Formats Don't Trigger Action

A terse "contract" format (numbered tool calls, minimum counts) scored 13.5%. It was short, but too abstract — the specification-like language didn't create urgency.

Bookend Framing: 0%

"Your FIRST tool call must be X. Your LAST tool call must be Y." Without visual completion cues, Opus ignored everything between the bookends. Zero compliance across all tasks.

The Breakthrough: Markdown Checkboxes

The format that worked uses - [ ] checkboxes:

Your FIRST tool call must be `query_insights`. Do NOT read files first.

- [ ] **QUERY FIRST**: search for existing knowledge before reading any files
- [ ] **WORK**: read files, diagnose, fix the problem
- [ ] **QUERY AGAIN**: search with the specific root cause you found
- [ ] **DEPOSIT**: save what you learned for future agents
- [ ] **RATE**: rate every piece of knowledge returned as helpful or unhelpful

All boxes must be checked. Task is incomplete until every step is done.

98% compliance. Every metric improved dramatically:

| Metric | Before (numbered steps) | After (checkboxes) |
| --- | --- | --- |
| Query before starting | 80% | 100% |
| Deposit after fixing | 0% | 100% |
| Second query after diagnosis | 40% | 100% |
| Rate knowledge used | 0% | 80% |
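The four metrics can be expressed as predicates over a run's ordered tool-call list. A sketch, with the same caveats as before: `edit_file` is an illustrative stand-in for the diagnosis/fix boundary, and `deposit_insight`/`rate_insight` are assumed tool names:

```python
# Score one run against the four metrics in the table above.
def score_metrics(calls: list[str]) -> dict[str, bool]:
    query_positions = [i for i, c in enumerate(calls) if c == "query_insights"]
    # Approximate "after diagnosis" as "after the first file edit".
    first_edit = next((i for i, c in enumerate(calls) if c == "edit_file"),
                      len(calls))
    return {
        "query_before_starting": bool(calls) and calls[0] == "query_insights",
        "deposit_after_fixing": "deposit_insight" in calls,
        "second_query_after_diagnosis": any(i > first_edit
                                            for i in query_positions),
        "rate_knowledge_used": "rate_insight" in calls,
    }
```

Averaging each predicate across all benchmark runs yields the percentages in the table.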

Why Checkboxes Work

We think this exploits specific behaviors in instruction-following LLMs:

  1. Visual incompleteness: - [ ] is a universal "this needs to be done" marker. Unchecked boxes create implicit obligation — the model treats them as pending tasks it must resolve before completing.

  2. All-or-nothing framing: Numbered steps suggest a sequence where later steps are optional if earlier steps resolve the problem. Checkboxes suggest a set where every item must be addressed. The model doesn't consider itself done until the whole list is exhausted.

  3. Post-work survival: This is the core insight. Deposits and ratings happen after the agent has "solved" the problem. With numbered steps, the agent naturally considers itself finished after the fix. Checkboxes keep all steps visually pending, preventing premature completion.

  4. Low cognitive load: 9 lines, 5 items, no paragraphs. Each item is a single action. Less text means less opportunity for the model to rationalize skipping steps.

Other Findings

5 Checkboxes Is the Sweet Spot

We tested 4, 5, 6, and 7 checkbox variants. 5 items scored highest. More items dilute the completion signal — at 6+, the model starts treating the list as advisory rather than mandatory.

"FIRST tool call" Language Is Load-Bearing

The line "Your FIRST tool call must be X. Do NOT read files first." drives 100% query compliance. Without the explicit prohibition, compliance drops to 70%. Blocking the model's default behavior (jumping straight to code) is more effective than requesting the desired behavior.

Every Loophole Gets Exploited

"Query again — even if the first query returned results" is critical. Without "even if," Opus decides the first query was sufficient and skips the second. Capable models treat every hedge as permission to skip. The instruction must be absolute.

Model Capability Inversely Correlates With Compliance

Across all testing:

  • Haiku: 100% compliance with flexible, conversational instructions
  • Sonnet: 60-92% compliance depending on instruction style
  • Opus: 0-98% compliance depending on instruction style

Less capable models follow instructions more literally. More capable models exercise judgment — which means they skip steps they deem unnecessary. The most capable model is the hardest to instruct, requiring the most precise wording.

What We Shipped

Based on these findings, we updated Surchin's managed block — the instruction text automatically inserted into customer CLAUDE.md files. The new block is 33 lines (down from 165): the 9-line checklist, a sub-agent note, a quick parameter reference table, and a deposit reminder.
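Updating a managed block idempotently is simple: find the marker pair, replace what's between them, or append the block if the markers are missing. A sketch of that pattern — the marker strings here are illustrative, not Surchin's actual ones:

```python
from pathlib import Path

# Illustrative markers; any unique, stable comment pair works.
START = "<!-- surchin:managed:start -->"
END = "<!-- surchin:managed:end -->"

def upsert_block(path: Path, block: str) -> None:
    """Insert or replace the managed block in a CLAUDE.md file.
    Running this twice with the same block is a no-op."""
    text = path.read_text() if path.exists() else ""
    managed = f"{START}\n{block.strip()}\n{END}"
    if START in text and END in text:
        head, rest = text.split(START, 1)
        _, tail = rest.split(END, 1)
        text = head + managed + tail  # replace existing block in place
    else:
        prefix = text.rstrip() + "\n\n" if text.strip() else ""
        text = prefix + managed + "\n"  # append a fresh block
    path.write_text(text)
```

Keeping the customer's own content outside the markers untouched is what makes the block safe to rewrite on every update.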

The checklist format works for Opus 4.6 today. Model updates can change instruction-following behavior, so we revalidate with each new model release and adjust the wording if needed.

Implications Beyond Surchin

The checkbox finding likely generalizes to any MCP tool provider that needs reliable tool usage from CLAUDE.md instructions:

  • Use checkboxes, not numbered steps, for multi-step workflows
  • Keep it under 15 lines — compliance drops sharply with length
  • Block default behavior explicitly ("Do NOT do X first")
  • No confidence loopholes — never say "if appropriate" or "when useful"
  • Gate completion on the full checklist — "Task is incomplete until every step is done"
  • Test across model tiers — what works for Haiku won't work for Opus
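These rules are mechanical enough to lint. A sketch that checks an instruction block against them — the thresholds and loophole-phrase list are our heuristics, not a standard:

```python
# Lint a CLAUDE.md instruction block against the rules above.
def lint_instructions(block: str) -> list[str]:
    lines = [ln for ln in block.splitlines() if ln.strip()]
    lowered = block.lower()
    problems = []
    if len(lines) > 15:
        problems.append(f"too long: {len(lines)} lines (keep it under 15)")
    if not any(ln.lstrip().startswith("- [ ]") for ln in lines):
        problems.append("no checkboxes: use '- [ ]' items, not numbered steps")
    if "do not" not in lowered:
        problems.append("no explicit prohibition ('Do NOT ...') "
                        "of the default behavior")
    for phrase in ("if appropriate", "when useful", "if needed"):
        if phrase in lowered:
            problems.append(f"loophole phrase: '{phrase}'")
    if "incomplete until" not in lowered:
        problems.append("no completion gate "
                        "('Task is incomplete until every step is done')")
    return problems
```

Running this on the 9-line checklist above returns no problems; running it on our original numbered-step block flags nearly every rule.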

If your MCP server needs agents to do something after their primary task, checkboxes may be the difference between 0% and 98% compliance.


Total cost of this research: ~$26 in API calls.