
Agents that remember outperform agents that don't

Measured across 324 agent runs on 3 real-world ecosystems, validated with Wilcoxon signed-rank tests. Here's what happens when agents share knowledge through Surchin.

Full Methodology & Reproducibility Details

Key Metrics

Early Results
23.5% faster
Time to Resolution
Pre-Seeded vs Control, Android/NIA ecosystem
100%
Repeat-Fix Success Rate
Android & TypeScript ecosystems, all passes
76%
Knowledge Hit Rate
Week 5 accumulated knowledge base
21.3%
Cost Reduction
Pre-Seeded KB vs Control (p<0.001, Wilcoxon signed-rank)

Three-Ecosystem Pattern-Repetition Study

324 agent runs across Android, TypeScript, and Python ecosystems. Three passes per task: Control (no Surchin), Cold-Start (empty KB), Pre-Seeded (populated KB). Each ecosystem repeated 3x. Total spend: $131.
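For concreteness, here is a minimal sketch of that protocol in Python. The run_agent stub, the task count, and the result fields are illustrative assumptions (324 runs works out to 12 tasks per ecosystem given 3 ecosystems, 3 passes, and 3 repetitions); the real harness caps each run at 25 turns and 40 minutes.

```python
# Sketch of the three-pass protocol. run_agent is a stand-in for the real
# harness; costs and times here are random placeholders, not study data.
import random
from dataclasses import dataclass

ECOSYSTEMS = ["android/nowinandroid", "typescript/cal.com", "python/fastapi"]
PASSES = ["control", "cold-start", "pre-seeded"]  # no Surchin / empty KB / populated KB
REPETITIONS = 3
TASKS_PER_ECOSYSTEM = 12  # implied: 324 runs / (3 ecosystems * 3 passes * 3 repetitions)

@dataclass
class RunResult:
    ecosystem: str
    pass_type: str
    task: int
    resolved: bool
    cost_usd: float
    wall_time_s: float

def run_agent(ecosystem: str, pass_type: str, task: int) -> RunResult:
    """Placeholder: launch the agent on one task (max 25 turns, 40-minute timeout)
    and record whether it resolved the issue, what it cost, and how long it took."""
    return RunResult(ecosystem, pass_type, task, resolved=True,
                     cost_usd=round(random.uniform(0.15, 0.65), 4),
                     wall_time_s=random.uniform(50, 150))

results = [
    run_agent(eco, pass_type, task)
    for eco in ECOSYSTEMS
    for _rep in range(REPETITIONS)
    for pass_type in PASSES
    for task in range(TASKS_PER_ECOSYSTEM)
]
assert len(results) == 324
```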

7-21%
cost reduction
Pre-seeded KB vs control
100%
resolution rate
Android & TypeScript ecosystems
3x
lower cost variance
More predictable agent behavior
324
agent runs
Across 3 ecosystems, 3x repeated
Ecosystem | Resolution | Cost Reduction | Time Reduction | p-value (cost) | Total Spend
Android / NowInAndroid | 100% | 21.3% | 23.5% | p<0.001 | $16.25
TypeScript / Cal.com | 100% | 12.8% | 1.1% | p=0.079 | $34.75
Python / FastAPI | 66.7% | 7.1% | 13.6% | p=0.052 | $48.80

Android / NowInAndroid

Cleanest results: 100% resolution across all passes. Kotlin/Gradle ecosystem.

Control: $0.2307 avg cost, 67s
Cold-Start: $0.1896 avg cost, 60s
Pre-Seeded: $0.1815 avg cost, 51s

TypeScript / Cal.com

100% resolution. Cost trend present but not yet significant (p=0.079).

Control: $0.4659 avg cost, 63s
Cold-Start: $0.4150 avg cost, 65s
Pre-Seeded: $0.4063 avg cost, 62s

Python / FastAPI

Most complex tasks. Cost savings present but just outside the significance threshold (p=0.052).

Control: $0.6464 avg cost, 149s
Cold-Start: $0.5608 avg cost, 122s
Pre-Seeded: $0.6004 avg cost, 129s

Configuration: Opus 4.6, max 25 turns, 40 min timeout, 3 repetitions per ecosystem. Statistical significance via Wilcoxon signed-rank test.
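Because the test is paired, each task's cost under control is matched with the same task's cost under the pre-seeded KB. Below is a minimal sketch of that analysis using scipy; the cost arrays are placeholders standing in for the real per-task data, and the variance ratio is how a claim like "3x lower cost variance" would be computed.

```python
# Paired comparison of per-task costs: control vs pre-seeded KB.
# The arrays below are placeholders; in the study each entry is one task's
# cost under the two conditions, paired by task and repetition.
import numpy as np
from scipy.stats import wilcoxon

control_costs   = np.array([0.21, 0.25, 0.19, 0.28, 0.22, 0.24, 0.20, 0.26])
preseeded_costs = np.array([0.17, 0.19, 0.16, 0.20, 0.18, 0.19, 0.17, 0.19])

# Wilcoxon signed-rank test on the paired differences (non-parametric,
# so no normality assumption on per-task costs).
stat, p_value = wilcoxon(control_costs, preseeded_costs)

cost_reduction = 1 - preseeded_costs.mean() / control_costs.mean()
variance_ratio = control_costs.var(ddof=1) / preseeded_costs.var(ddof=1)

print(f"cost reduction: {cost_reduction:.1%}, p={p_value:.4f}")
print(f"control / pre-seeded cost variance ratio: {variance_ratio:.1f}x")
```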

21.3%
cost reduction (p<0.001)
Pre-seeded KB vs control (Android)
100%
resolution rate
Android & TypeScript ecosystems
7-21%
cost savings across ecosystems
Pre-seeded KB vs control
3x
lower cost variance
More predictable agent behavior
324
agent runs across 3 ecosystems
Wilcoxon signed-rank validated

Values from the three-ecosystem benchmark study — live DB results will overlay when available.

Agent Performance Comparison

Side-by-side metrics from the Android/NowInAndroid ecosystem (100% resolution rate, both passes)

Control (No Surchin)
Avg cost per task: $0.2307
Avg resolution time: 67s
Resolution rate: 100%
Cost variance: High
Ecosystem: Android/NIA
Pass type: Control
Pre-Seeded KB (With Surchin)
Avg cost per task: $0.1815
Avg resolution time: 51s
Resolution rate: 100%
Cost variance: Low
Ecosystem: Android/NIA
Pass type: Pre-Seeded

Key Findings

What we learned from 324 agent runs

21.3%

Significant cost reduction (Android)

On the Android/NowInAndroid ecosystem, a pre-populated knowledge base cut agent costs by 21.3% with high statistical significance (p<0.001 via Wilcoxon signed-rank test).

3x

Lower cost variance = predictable agents

Pre-seeded runs show 3x lower cost variance than control runs. Agents with shared knowledge behave more consistently and predictably across repeated tasks.

100%

Resolution across two ecosystems

Both Android and TypeScript achieve 100% task resolution across all passes. Surchin reduces cost without sacrificing correctness.

$131

Full study cost

324 agent runs across 3 ecosystems, 3x repeated, using Opus 4.6 with max 25 turns and 40-minute timeout. Affordable, reproducible evaluation.

Knowledge Compounds Over Time

Watch the metrics improve as your knowledge base grows week over week

Knowledge Base Size

Week 1
42 entries
Week 2
127 entries
Week 3
318 entries
Week 4
612 entries
Week 5
847 entries

Query Hit Rate

Week 1
12%
Week 2
34%
Week 3
58%
Week 4
68%
Week 5
76%

Avg Resolution Time

Week 1
13.8 min
Week 2
10.2 min
Week 3
6.4 min
Week 4
4.4 min
Week 5
3.6 min

Knowledge Retention & Reuse

How knowledge flows through the lifecycle

4.7x
Reuse rate
Avg times a solution is reused before decaying
31%
Promotion rate
Entries promoted from draft within 30 days
44%
Natural decay
Entries that decay unreferenced — keeping the base lean

Knowledge Lifecycle

Deposited
100%
Queried
56%
Helpful
38%
Promoted
31%
Reused
4.7x avg
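
A sketch of what that lifecycle could look like as a data structure. The stage names follow the draft, promoted, and decay language above, and the 30-day window mirrors the promotion metric; the field names and exact rules are illustrative assumptions, not Surchin's actual schema.

```python
# Illustrative knowledge-entry lifecycle (hypothetical schema, not Surchin's real one).
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    DRAFT = "deposited"      # agent deposits a solution after fixing an issue
    PROMOTED = "promoted"    # entry proved helpful and graduates out of draft
    DECAYED = "decayed"      # never referenced; pruned to keep the base lean

@dataclass
class KnowledgeEntry:
    title: str
    body: str
    stage: Stage = Stage.DRAFT
    query_hits: int = 0       # times returned by a query
    helpful_votes: int = 0    # times an agent marked it as actually helpful
    reuse_count: int = 0      # times its fix was applied again

    def record_query_hit(self, helpful: bool) -> None:
        self.query_hits += 1
        if helpful:
            self.helpful_votes += 1
            self.reuse_count += 1

    def review(self, age_days: int) -> None:
        """Hypothetical promotion/decay rule: promote drafts that proved helpful
        within 30 days; decay drafts that were never referenced."""
        if self.stage is not Stage.DRAFT:
            return
        if self.helpful_votes > 0 and age_days <= 30:
            self.stage = Stage.PROMOTED
        elif self.query_hits == 0 and age_days > 30:
            self.stage = Stage.DECAYED
```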

Instruction Compliance Research

We spent $39 and 160 agent runs to find the CLAUDE.md format that maximizes tool compliance without sacrificing quality.

~160
agent runs
Across 3 phases on Claude Opus
78-82%
overall compliance
Production checklist (v2), stable across runs
5
checkboxes optimal
Dropping to 4 cuts targeted second-query compliance from 80% to 30%
~$39
total research cost
Phases 1–3: instruction format, agent feedback, quality coaching
Variant | Query | Deposit | 2nd Query | Rate | Overall
v2 (production) | 100% | 70% | 80% | 70% | 82.5%
v2.1 (reframed, “skip if” language) | 100% | 0% | 70% | 40% | 53.0%
4-item (fewer checkboxes) | 100% | 0% | 30% | 60% | 47.0%
v2-quality (coached deposits) | 100% | 60% | 80% | 75% | 78.5%
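
A sketch of how the Query, Deposit, and 2nd Query columns could be scored from run transcripts. The transcript shape and the kb_query tool name are hypothetical, and the real study's scoring (including how Overall is aggregated) may differ.

```python
# Scoring tool-compliance dimensions from run transcripts (hypothetical
# transcript format and tool name; the study's exact rubric may differ).
from dataclasses import dataclass

@dataclass
class RunTranscript:
    tool_calls: list[str]          # ordered tool names observed in the run
    deposited_entry: str | None    # body of any knowledge deposit, if one was made

def score_run(run: RunTranscript) -> dict[str, bool]:
    calls = run.tool_calls
    return {
        # Was the very first tool call a knowledge-base query?
        "query": bool(calls) and calls[0] == "kb_query",
        # Did the agent deposit a solution after resolving the task?
        "deposit": run.deposited_entry is not None,
        # Did it issue a targeted follow-up query later in the run?
        "second_query": calls.count("kb_query") >= 2,
    }

def compliance_rates(runs: list[RunTranscript]) -> dict[str, float]:
    scores = [score_run(r) for r in runs]
    return {dim: sum(s[dim] for s in scores) / len(scores) for dim in scores[0]}
```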
Critical pitfall

“Skip if” language kills deposits

Any instruction with “skip if obvious” or “skip for general knowledge” produces 0% deposit compliance. Opus treats everything as general knowledge. Instructions must be unconditional.

Format finding

5 checkboxes is the sweet spot

Reducing from 5 to 4 checkboxes dropped targeted second-query compliance from 80% to 30%. The visual incompleteness of unchecked boxes drives action — fewer boxes means less pressure.

Quality finding

Quality is inherent, compliance is instructed

When agents deposit, quality is ~80% regardless of instruction wording. Coaching “include root cause and specific files” doesn’t change behavior — Opus already does it. The instruction’s job is to drive compliance, not quality.

Stop solving the same bugs twice

Give every AI agent access to your organization's institutional knowledge.