Measured across 324 agent runs on 3 real-world ecosystems, validated with Wilcoxon signed-rank tests. Here's what happens when agents share knowledge through Surchin.
Full Methodology & Reproducibility Details
324 agent runs across Android, TypeScript, and Python ecosystems. Three passes per task: Control (no Surchin), Cold-Start (empty KB), Pre-Seeded (populated KB). Each ecosystem repeated 3x. Total spend: $131.
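To make the pass structure concrete, here is a minimal sketch of the protocol. `RunResult`, `run_task`, and the KB arguments are invented stand-ins for the real harness, which is not shown on this page:

```python
# A minimal sketch of the three-pass protocol, assuming one agent run
# per (task, pass) pair. Everything here is a stand-in for the harness.
from dataclasses import dataclass
import random

@dataclass
class RunResult:
    cost: float      # dollars spent on the run
    minutes: float   # wall-clock duration
    resolved: bool   # did the agent finish the task?

def run_task(task: str, kb: dict | None) -> RunResult:
    # Stub: a real harness would launch the agent against the task
    # (max 25 turns, 40-minute timeout) with the given knowledge base.
    return RunResult(cost=random.uniform(0.5, 2.0),
                     minutes=random.uniform(5.0, 40.0),
                     resolved=True)

def three_pass(task: str, seeded_kb: dict) -> dict[str, RunResult]:
    return {
        "control":    run_task(task, kb=None),       # no Surchin at all
        "cold_start": run_task(task, kb={}),         # Surchin, empty KB
        "pre_seeded": run_task(task, kb=seeded_kb),  # Surchin, populated KB
    }
```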
| Ecosystem | Resolution | Cost Reduction | Time Reduction | p-value (cost) | Total Spend |
|---|---|---|---|---|---|
| Android / NowInAndroid | 100% | 21.3% | 23.5% | p<0.001 | $16.25 |
| TypeScript / Cal.com | 100% | 12.8% | 1.1% | p=0.079 | $34.75 |
| Python / FastAPI | 66.7% | 7.1% | 13.6% | p=0.052 | $48.80 |
Android / NowInAndroid: the cleanest results, with 100% resolution across all passes in the Kotlin/Gradle ecosystem.
TypeScript / Cal.com: 100% resolution; the cost reduction trends in the right direction but is not yet significant (p=0.079).
Python / FastAPI: the most complex tasks; cost savings are present but fall just outside the significance threshold (p=0.052).
Configuration: Opus 4.6, max 25 turns, 40 min timeout, 3 repetitions per ecosystem. Statistical significance via Wilcoxon signed-rank test.
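For readers who want to reproduce the statistics, the significance test reduces to a paired Wilcoxon signed-rank test over per-task costs. A minimal sketch with scipy; the dollar values are invented placeholders, not the study data:

```python
# Paired Wilcoxon signed-rank test on per-task costs.
# The costs below are invented placeholders for illustration.
from scipy.stats import wilcoxon

control_cost   = [1.42, 0.97, 1.88, 1.15, 1.63, 2.04]  # control pass, $/task
preseeded_cost = [1.10, 0.81, 1.51, 0.94, 1.29, 1.66]  # pre-seeded pass, $/task

# Tests the paired differences; a small p-value means the cost
# reduction is unlikely to be chance.
stat, p = wilcoxon(control_cost, preseeded_cost)
print(f"W={stat}, p={p:.4f}")
```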
Values are from the three-ecosystem benchmark study; live database results will overlay them when available.
Side-by-side metrics from the Android/NowInAndroid ecosystem (100% resolution rate, both passes)
What we learned from 324 agent runs
On the Android/NowInAndroid ecosystem, a pre-populated knowledge base cut agent costs by 21.3% with high statistical significance (p<0.001 via Wilcoxon signed-rank test).
Pre-seeded runs show 3x lower cost variance than control runs. Agents with shared knowledge behave more consistently and predictably across repeated tasks.
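The variance comparison itself is a one-liner. In this sketch the per-run costs are invented placeholders, chosen so the ratio lands near the observed ~3x:

```python
# Population variance of per-run costs, control vs. pre-seeded.
# The costs below are invented placeholders, not the study data.
from statistics import pvariance

control   = [1.2, 2.6, 0.9, 3.1, 1.8, 2.4]  # $ per run, control pass
preseeded = [1.1, 1.9, 0.9, 2.1, 1.4, 1.7]  # $ per run, pre-seeded pass

ratio = pvariance(control) / pvariance(preseeded)
print(f"control / pre-seeded cost variance: {ratio:.1f}x")  # ~3.3x here
```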
Both Android and TypeScript achieve 100% task resolution across all passes. Surchin reduces cost without sacrificing correctness.
324 agent runs across 3 ecosystems, each repeated 3x, using Opus 4.6 with a 25-turn cap and a 40-minute timeout. Affordable, reproducible evaluation.
Watch the metrics improve as your knowledge base grows week over week
How knowledge flows through the lifecycle
We spent $39 and 160 agent runs to find the CLAUDE.md format that maximizes tool compliance without sacrificing quality.
| Variant | 1st Query | Deposit | 2nd Query | Rate | Overall |
|---|---|---|---|---|---|
| v2 (production) | 100% | 70% | 80% | 70% | 82.5% |
| v2.1 (reframed, “skip if” language) | 100% | 0% | 70% | 40% | 53.0% |
| 4-item (fewer checkboxes) | 100% | 0% | 30% | 60% | 47.0% |
| v2-quality (coached deposits) | 100% | 60% | 80% | 75% | 78.5% |
Any instruction with “skip if obvious” or “skip for general knowledge” produces 0% deposit compliance. Opus treats everything as general knowledge. Instructions must be unconditional.
Reducing from 5 to 4 checkboxes dropped targeted second-query compliance from 80% to 30%. The visual incompleteness of unchecked boxes drives action — fewer boxes means less pressure.
When agents deposit, quality is ~80% regardless of instruction wording. Coaching “include root cause and specific files” doesn’t change behavior — Opus already does it. The instruction’s job is to drive compliance, not quality.
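For illustration, here is a hypothetical sketch of a CLAUDE.md checklist block in the spirit of these findings: unconditional wording, five checkboxes, no quality coaching. The exact items are invented; this is not the production v2 prompt.

```markdown
## Surchin protocol (complete every box)

- [ ] Query the knowledge base before starting the task
- [ ] Re-query with narrower, task-specific terms once the problem is scoped
- [ ] Deposit what you learned, even if it feels like general knowledge
- [ ] Deposit before marking the task complete
- [ ] Confirm every box above is checked
```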
Give every AI agent access to your organization's institutional knowledge.