
Agents that remember outperform agents that don't

Measured across 324 agent runs on 3 real-world ecosystems, validated with Wilcoxon signed-rank tests. Here's what happens when agents share knowledge through Surchin.

Full Methodology & Reproducibility Details

Key Metrics

Early Results
23.5% faster
Time to Resolution
Pre-Seeded vs Control, Android/NIA ecosystem
100%
Repeat-Fix Success Rate
Android & TypeScript ecosystems, all passes
76%
Knowledge Hit Rate
Week 5 accumulated knowledge base
21.3%
Cost Reduction
Pre-Seeded KB vs Control (p<0.001, Wilcoxon signed-rank)

Three-Ecosystem Pattern-Repetition Study

324 agent runs across Android, TypeScript, and Python ecosystems. Three passes per task: Control (no Surchin), Cold-Start (empty KB), Pre-Seeded (populated KB). Each ecosystem repeated 3x. Total spend: $131.
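For concreteness, here is a minimal sketch of that protocol in Python. The run_agent stub, the task count, and the result fields are illustrative assumptions (324 runs works out to 12 tasks per ecosystem given 3 ecosystems, 3 passes, and 3 repetitions); the real harness caps each run at 25 turns and 40 minutes.

```python
# Sketch of the three-pass protocol. run_agent is a stand-in for the real
# harness; costs and times here are random placeholders, not study data.
import random
from dataclasses import dataclass

ECOSYSTEMS = ["android/nowinandroid", "typescript/cal.com", "python/fastapi"]
PASSES = ["control", "cold-start", "pre-seeded"]  # no Surchin / empty KB / populated KB
REPETITIONS = 3
TASKS_PER_ECOSYSTEM = 12  # implied: 324 runs / (3 ecosystems * 3 passes * 3 repetitions)

@dataclass
class RunResult:
    ecosystem: str
    pass_type: str
    task: int
    resolved: bool
    cost_usd: float
    wall_time_s: float

def run_agent(ecosystem: str, pass_type: str, task: int) -> RunResult:
    """Placeholder: launch the agent on one task (max 25 turns, 40-minute timeout)
    and record whether it resolved the issue, what it cost, and how long it took."""
    return RunResult(ecosystem, pass_type, task, resolved=True,
                     cost_usd=round(random.uniform(0.15, 0.65), 4),
                     wall_time_s=random.uniform(50, 150))

results = [
    run_agent(eco, pass_type, task)
    for eco in ECOSYSTEMS
    for _rep in range(REPETITIONS)
    for pass_type in PASSES
    for task in range(TASKS_PER_ECOSYSTEM)
]
assert len(results) == 324
```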

7-21%
cost reduction
Pre-seeded KB vs control
100%
resolution rate
Android & TypeScript ecosystems
3x
lower cost variance
More predictable agent behavior
324
agent runs
Across 3 ecosystems, 3x repeated
Ecosystem | Resolution | Cost Reduction | Time Reduction | p-value (cost) | Total Spend
Android / NowInAndroid | 100% | 21.3% | 23.5% | p<0.001 | $16.25
TypeScript / Cal.com | 100% | 12.8% | 1.1% | p=0.079 | $34.75
Python / FastAPI | 66.7% | 7.1% | 13.6% | p=0.052 | $48.80

Android / NowInAndroid

Cleanest results: 100% resolution across all passes. Kotlin/Gradle ecosystem.

Control: $0.2307 avg cost, 67s
Cold-Start: $0.1896 avg cost, 60s
Pre-Seeded: $0.1815 avg cost, 51s

TypeScript / Cal.com

100% resolution. Cost trend present but not yet significant (p=0.079).

Control: $0.4659 avg cost, 63s
Cold-Start: $0.4150 avg cost, 65s
Pre-Seeded: $0.4063 avg cost, 62s

Python / FastAPI

Most complex tasks. Cost savings present but just outside the significance threshold (p=0.052).

Control: $0.6464 avg cost, 149s
Cold-Start: $0.5608 avg cost, 122s
Pre-Seeded: $0.6004 avg cost, 129s

Configuration: Opus 4.6, max 25 turns, 40 min timeout, 3 repetitions per ecosystem. Statistical significance via Wilcoxon signed-rank test.
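Because the test is paired, each task's cost under control is matched with the same task's cost under the pre-seeded KB. Below is a minimal sketch of that analysis using scipy; the cost arrays are placeholders standing in for the real per-task data, and the variance ratio is how a claim like "3x lower cost variance" would be computed.

```python
# Paired comparison of per-task costs: control vs pre-seeded KB.
# The arrays below are placeholders; in the study each entry is one task's
# cost under the two conditions, paired by task and repetition.
import numpy as np
from scipy.stats import wilcoxon

control_costs   = np.array([0.21, 0.25, 0.19, 0.28, 0.22, 0.24, 0.20, 0.26])
preseeded_costs = np.array([0.17, 0.19, 0.16, 0.20, 0.18, 0.19, 0.17, 0.19])

# Wilcoxon signed-rank test on the paired differences (non-parametric,
# so no normality assumption on per-task costs).
stat, p_value = wilcoxon(control_costs, preseeded_costs)

cost_reduction = 1 - preseeded_costs.mean() / control_costs.mean()
variance_ratio = control_costs.var(ddof=1) / preseeded_costs.var(ddof=1)

print(f"cost reduction: {cost_reduction:.1%}, p={p_value:.4f}")
print(f"control / pre-seeded cost variance ratio: {variance_ratio:.1f}x")
```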

21.3%
cost reduction (p<0.001)
Pre-seeded KB vs control (Android)
100%
resolution rate
Android & TypeScript ecosystems
7-21%
cost savings across ecosystems
Pre-seeded KB vs control
3x
lower cost variance
More predictable agent behavior
324
agent runs across 3 ecosystems
Wilcoxon signed-rank validated

Values from the three-ecosystem benchmark study — live DB results will overlay when available.

Agent Performance Comparison

Side-by-side metrics from the Android/NowInAndroid ecosystem (100% resolution rate, both passes)

Control (No Surchin)
Avg cost per task: $0.2307
Avg resolution time: 67s
Resolution rate: 100%
Cost variance: High
Ecosystem: Android/NIA
Pass type: Control
Pre-Seeded KB (With Surchin)
Avg cost per task: $0.1815
Avg resolution time: 51s
Resolution rate: 100%
Cost variance: Low
Ecosystem: Android/NIA
Pass type: Pre-Seeded

Key Findings

What we learned from 324 agent runs

21.3%

Significant cost reduction (Android)

On the Android/NowInAndroid ecosystem, a pre-populated knowledge base cut agent costs by 21.3% with high statistical significance (p<0.001 via Wilcoxon signed-rank test).

3x

Lower cost variance = predictable agents

Pre-seeded runs show 3x lower cost variance than control runs. Agents with shared knowledge behave more consistently and predictably across repeated tasks.

100%

Resolution across two ecosystems

Both Android and TypeScript achieve 100% task resolution across all passes. Surchin reduces cost without sacrificing correctness.

$131

Full study cost

324 agent runs across 3 ecosystems, 3x repeated, using Opus 4.6 with max 25 turns and 40-minute timeout. Affordable, reproducible evaluation.

Knowledge Compounds Over Time

Watch the metrics improve as your knowledge base grows week over week

Knowledge Base Size

Week 1
42 entries
Week 2
127 entries
Week 3
318 entries
Week 4
612 entries
Week 5
847 entries

Query Hit Rate

Week 1
12%
Week 2
34%
Week 3
58%
Week 4
68%
Week 5
76%

Avg Resolution Time

Week 1
13.8 min
Week 2
10.2 min
Week 3
6.4 min
Week 4
4.4 min
Week 5
3.6 min

Knowledge Retention & Reuse

How knowledge flows through the lifecycle

4.7x
Reuse rate
Avg times a solution is reused before decaying
31%
Promotion rate
Entries promoted from draft within 30 days
44%
Natural decay
Entries that decay unreferenced — keeping the base lean

Knowledge Lifecycle

Deposited
100%
Queried
56%
Helpful
38%
Promoted
31%
Reused
4.7x avg
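
A sketch of what that lifecycle could look like as a data structure. The stage names follow the draft, promoted, and decay language above, and the 30-day window mirrors the promotion metric; the field names and exact rules are illustrative assumptions, not Surchin's actual schema.

```python
# Illustrative knowledge-entry lifecycle (hypothetical schema, not Surchin's real one).
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    DRAFT = "deposited"      # agent deposits a solution after fixing an issue
    PROMOTED = "promoted"    # entry proved helpful and graduates out of draft
    DECAYED = "decayed"      # never referenced; pruned to keep the base lean

@dataclass
class KnowledgeEntry:
    title: str
    body: str
    stage: Stage = Stage.DRAFT
    query_hits: int = 0       # times returned by a query
    helpful_votes: int = 0    # times an agent marked it as actually helpful
    reuse_count: int = 0      # times its fix was applied again

    def record_query_hit(self, helpful: bool) -> None:
        self.query_hits += 1
        if helpful:
            self.helpful_votes += 1
            self.reuse_count += 1

    def review(self, age_days: int) -> None:
        """Hypothetical promotion/decay rule: promote drafts that proved helpful
        within 30 days; decay drafts that were never referenced."""
        if self.stage is not Stage.DRAFT:
            return
        if self.helpful_votes > 0 and age_days <= 30:
            self.stage = Stage.PROMOTED
        elif self.query_hits == 0 and age_days > 30:
            self.stage = Stage.DECAYED
```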

Instruction Compliance Research

We spent $39 and 160 agent runs to find the CLAUDE.md format that maximizes tool compliance without sacrificing quality.

~160
agent runs
Across 3 phases on Claude Opus
78-82%
overall compliance
Production checklist (v2), stable across runs
5
checkboxes optimal
Dropping to 4 cuts targeted second-query compliance from 80% to 30%
~$39
total research cost
Phases 1–3: instruction format, agent feedback, quality coaching
Variant | Query | Deposit | 2nd Query | Rate | Overall
v2 (production) | 100% | 70% | 80% | 70% | 82.5%
v2.1 (reframed, “skip if” language) | 100% | 0% | 70% | 40% | 53.0%
4-item (fewer checkboxes) | 100% | 0% | 30% | 60% | 47.0%
v2-quality (coached deposits) | 100% | 60% | 80% | 75% | 78.5%
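
A sketch of how the Query, Deposit, and 2nd Query columns could be scored from run transcripts. The transcript shape and the kb_query tool name are hypothetical, and the real study's scoring (including how Overall is aggregated) may differ.

```python
# Scoring tool-compliance dimensions from run transcripts (hypothetical
# transcript format and tool name; the study's exact rubric may differ).
from dataclasses import dataclass

@dataclass
class RunTranscript:
    tool_calls: list[str]          # ordered tool names observed in the run
    deposited_entry: str | None    # body of any knowledge deposit, if one was made

def score_run(run: RunTranscript) -> dict[str, bool]:
    calls = run.tool_calls
    return {
        # Was the very first tool call a knowledge-base query?
        "query": bool(calls) and calls[0] == "kb_query",
        # Did the agent deposit a solution after resolving the task?
        "deposit": run.deposited_entry is not None,
        # Did it issue a targeted follow-up query later in the run?
        "second_query": calls.count("kb_query") >= 2,
    }

def compliance_rates(runs: list[RunTranscript]) -> dict[str, float]:
    scores = [score_run(r) for r in runs]
    return {dim: sum(s[dim] for s in scores) / len(scores) for dim in scores[0]}
```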
Critical pitfall

“Skip if” language kills deposits

Any instruction with “skip if obvious” or “skip for general knowledge” produces 0% deposit compliance. Opus treats everything as general knowledge. Instructions must be unconditional.

Format finding

5 checkboxes is the sweet spot

Reducing from 5 to 4 checkboxes dropped targeted second-query compliance from 80% to 30%. The visual incompleteness of unchecked boxes drives action — fewer boxes means less pressure.

Quality finding

Quality is inherent, compliance is instructed

When agents deposit, quality is ~80% regardless of instruction wording. Coaching “include root cause and specific files” doesn’t change behavior — Opus already does it. The instruction’s job is to drive compliance, not quality.

Stop solving the same bugs twice

Give every AI agent access to your organization's institutional knowledge.