Surchin Cuts Agent Costs 21% With Statistically Significant Results
We ran 162 Opus 4.6 agent sessions across Python and Android codebases. The results: significant cost reduction, faster completion, and — perhaps more importantly — dramatically more predictable agent behavior.
Our previous post showed how to get Opus to follow Surchin's workflow 98% of the time. That answered the compliance question. But does compliance actually help? Does accumulated knowledge make agents cheaper, faster, or more reliable?
To find out, we built 18 pattern-repetition tasks across two real-world ecosystems and ran each one multiple times with and without Surchin. Here's what 162 agent runs and $65 taught us.
The Setup
Two Ecosystems, Two Complexity Levels
We chose codebases that represent real-world agent workloads at different complexity levels:
- Android/NowInAndroid — Google's reference architecture app. Well-structured, single-module tasks. The "easy" tier.
- Python/FastAPI — Full-stack API with SQLModel, Alembic migrations, and CRUD patterns. Multi-file, multi-concern tasks. The "hard" tier.
Each ecosystem has 9 tasks organized into 3 families of 3. Within each family, the tasks are structurally identical — same pattern, different entities. If an agent learns how to add a new ViewModel in task 1, that knowledge should transfer directly to tasks 2 and 3.
This is the scenario where Surchin should shine: repetitive work across a consistent codebase.
Three-Pass Methodology
Every ecosystem runs through three passes:
- Control — Agent solves all 9 tasks without Surchin. No knowledge base, no MCP tools. Pure baseline.
- Cold Start — Agent has Surchin with an empty knowledge base. It queries (gets nothing back), works, then deposits what it learned. The KB builds up organically across the 9 tasks.
- Pre-Seeded — Agent has Surchin with the KB already populated from the cold-start pass. It queries, gets relevant knowledge immediately, and works with the benefit of prior experience.
Each task runs 3 times per condition to support significance testing — 27 paired observations per ecosystem, 162 runs in total (9 tasks × 3 conditions × 3 repetitions × 2 ecosystems).
Statistical Method
We use the Wilcoxon signed-rank test for p-values rather than paired t-tests. Unlike the t-test, the Wilcoxon test does not assume the paired differences are normally distributed, which matters when you have high-variance cost data from LLM agent runs. All p-values reported are for the control vs. pre-seeded comparison.
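The test's mechanics are simple: rank the absolute paired differences, then sum the ranks where one condition came out higher. Here's a minimal sketch in plain Python using the normal approximation — real analyses should use `scipy.stats.wilcoxon`, which adds exact small-sample p-values and tie/continuity corrections:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.
    A sketch of the mechanics only: no tie or continuity corrections,
    so p-values differ slightly from exact implementations at small n."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    # Rank the absolute differences, averaging ranks across ties.
    abs_sorted = sorted(abs(d) for d in diffs)
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        ranks[abs_sorted[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    w_plus = sum(ranks[abs(d)] for d in diffs if d > 0)  # test statistic
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided
    return w_plus, p
```

Because the test only looks at ranks, a single wildly expensive run shifts the statistic by at most one rank position instead of dragging the mean — which is exactly why it suits heavy-tailed agent cost data.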
The Results
| Ecosystem | Resolution | Cost Reduction | Time Reduction | p-value |
|---|---|---|---|---|
| Android/NIA | 100% | 21.3% | 23.5% | p<0.001 |
| Python/FastAPI | 66.7% | 7.1% | 13.6% | p=0.052 |
Android shows a statistically significant cost and time reduction. Python is borderline. Let's look at each one.
Android/NowInAndroid: The Clean Signal
This ecosystem produced the cleanest results — 100% task resolution across all passes, with clear cost and time improvements.
| Condition | Avg Cost/Task | Avg Time/Task | Std Dev (Cost) |
|---|---|---|---|
| Control | $0.2307 | 67s | $0.0411 |
| Cold Start | $0.2084 | 58s | $0.0287 |
| Pre-Seeded | $0.1815 | 51s | $0.0133 |
The progression tells a clean story: control costs $0.23/task, cold start drops to $0.21, pre-seeded drops further to $0.18. A 21% reduction from control to pre-seeded.
But look at the standard deviation column. Control runs vary by $0.04. Pre-seeded runs vary by $0.01. That's a 3x reduction in run-to-run cost spread. The agent isn't just cheaper — it's predictable.
Why does Android work so well? The tasks are well-scoped (add a ViewModel, create a navigation route), the codebase is consistently structured, and the patterns transfer cleanly between tasks. When the agent queries Surchin and gets back "here's exactly how the last ViewModel was added," it follows the same steps with minimal exploration.
Python/FastAPI: Where Complexity Wins
The Python ecosystem is the most complex: tasks involve Alembic migrations, SQLModel schemas, CRUD endpoints, and test files. Multiple files, multiple concerns, multiple ways to get it wrong.
| Condition | Avg Cost/Task | Avg Time/Task | Std Dev (Cost) |
|---|---|---|---|
| Control | $0.6464 | 149s | $0.2187 |
| Cold Start | $0.5608 | 128s | $0.1843 |
| Pre-Seeded | $0.6004 | 129s | $0.1956 |
The result is borderline significant (p=0.052) and contains a surprise: cold-start outperforms pre-seeded.
Control costs $0.65/task. Cold start drops to $0.56. But pre-seeded rises back to $0.60. The agent that built its own knowledge base in-session did better than the one that received a pre-populated KB.
Why? We think this comes down to contextual relevance. For simple tasks (Android), knowledge transfers cleanly — "add a ViewModel" is "add a ViewModel" regardless of which entity. For complex multi-file tasks, the specific context matters more. An Alembic migration for a User model involves different decisions than one for an Order model. Pre-populated knowledge from a User migration may actually mislead the agent when it's working on Order.
In-session knowledge, by contrast, was built while the agent was actively working in the same codebase, with the same dependencies loaded, solving structurally similar problems moments earlier. It's more contextually fresh.
This is a genuinely interesting finding. It suggests that for complex tasks, the cold-start pass — where the agent queries, gets nothing, works, and deposits — may be the most valuable configuration. The act of depositing forces the agent to articulate what it learned, which benefits subsequent tasks in the same session more than a stale KB does.
The Variance Story
If you take one thing from this post, make it this: Surchin makes agents predictable.
Cost reduction is the headline, but variance reduction may be the more important result for production use. Here's the variance comparison across both ecosystems:
| Ecosystem | Control Std Dev | Pre-Seeded Std Dev | Reduction |
|---|---|---|---|
| Android/NIA | $0.0411 | $0.0133 | 3.1x |
| Python/FastAPI | $0.2187 | $0.1956 | 1.1x |
Android shows a 3x tighter cost spread with Surchin. Python's spread stays high because the tasks themselves are inherently variable — there are multiple valid approaches to an Alembic migration, and the agent explores different ones each run.
Why does predictability matter? If you're running agents in production — CI/CD pipelines, automated refactoring, bulk migrations — you need to estimate costs. An agent that costs $0.18 ± $0.01 is budgetable. An agent that costs $0.23 ± $0.04 is a risk. Surchin narrows the distribution.
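The budgeting point can be made concrete. If per-run costs are independent (a simplifying assumption), the standard deviation of a day's total spend grows only with the square root of the run count. The means and standard deviations below come from the Android table; the 500-runs-per-day workload is hypothetical, chosen just for illustration:

```python
import math

def daily_budget_band(cost_mean, cost_sd, runs_per_day, z=2.0):
    """Rough (low, high) bounds on daily spend, assuming independent
    per-run costs so the total's std dev grows like sqrt(runs_per_day)."""
    total_mean = cost_mean * runs_per_day
    total_sd = cost_sd * math.sqrt(runs_per_day)
    return total_mean - z * total_sd, total_mean + z * total_sd

# Android/NIA numbers; 500 runs/day is illustrative, not measured.
control_low, control_high = daily_budget_band(0.2307, 0.0411, 500)
seeded_low, seeded_high = daily_budget_band(0.1815, 0.0133, 500)
```

Under these assumptions, even the pessimistic pre-seeded bound lands below the optimistic control bound — the narrower distribution is what makes the bill budgetable rather than merely lower on average.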
Cold Start Isn't Cold
One concern we hear: "Surchin is useless until the KB is populated." The data says otherwise.
| Ecosystem | Control Cost | Cold Start Cost | Savings |
|---|---|---|---|
| Android/NIA | $0.2307 | $0.2084 | 9.7% |
| Python/FastAPI | $0.6464 | $0.5608 | 13.2% |
Even with an empty knowledge base, cold-start runs cost 10-13% less than control. The first query returns nothing, but the workflow itself — query, work, deposit — creates value for tasks later in the session. By task 4 or 5, the KB has enough entries that subsequent queries return useful results.
For Python, cold start actually outperforms pre-seeded. The "empty" KB isn't a liability — it's a clean slate that builds contextually relevant knowledge as the agent works.
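The query-work-deposit loop can be sketched with a toy in-memory knowledge base. Everything here is hypothetical — `ToyKB` and its word-overlap matching are stand-ins, not Surchin's actual API — but it shows the mechanism: deposits made early in a session start paying off a few tasks later:

```python
class ToyKB:
    """In-memory stand-in for a session knowledge base (hypothetical API)."""

    def __init__(self):
        self.entries = []  # (task_description, learned_note) pairs

    def query(self, task):
        # Crude relevance: return notes whose task shares any word.
        words = set(task.lower().split())
        return [note for desc, note in self.entries
                if words & set(desc.lower().split())]

    def deposit(self, task, note):
        self.entries.append((task, note))

def run_session(tasks, solve):
    """Cold-start loop: query (may return nothing), work, deposit."""
    kb = ToyKB()
    for task in tasks:
        prior = kb.query(task)     # empty for the earliest tasks
        note = solve(task, prior)  # the agent works, aided by any hits
        kb.deposit(task, note)     # articulating the lesson builds the KB
    return kb
```

The first query against an empty KB returns nothing, but every deposit widens the net for structurally similar tasks later in the same session — which is the pattern the cold-start numbers above reflect.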
What This Means
For repetitive workloads
If you're running agents on repetitive tasks across a consistent codebase — and most real-world agent work fits this description — Surchin pays for itself quickly. A 21% cost reduction on a $100/day agent bill saves $21/day. The Surchin overhead (embedding generation, KB queries) adds less than 2% to task cost.
For complex, high-variance workloads
Start with cold start. Let the agent build its own KB during the session rather than pre-populating. For complex multi-file tasks, session-built knowledge appears more valuable than pre-populated knowledge.
For production pipelines
The variance reduction may matter more than the cost reduction. If you're budgeting for agent-assisted CI/CD, Surchin makes your cost estimates 3x tighter.
Limitations and Next Steps
We're being honest about the gaps:
- Two ecosystems so far — We have a third ecosystem (TypeScript/Cal.com) in progress and will publish those results once the eval infrastructure is validated.
- Python borderline significance — p=0.052 is close but not below the 0.05 threshold. More repetitions or simpler task decomposition could push this either way.
- Single model — All runs use Opus 4.6. We expect different dynamics with Sonnet (which follows instructions more literally) and Haiku (which may not benefit as much from accumulated knowledge due to lower reasoning capacity).
- Pattern-repetition tasks only — These benchmarks test the best case for Surchin: structurally similar tasks in the same codebase. We haven't yet tested diverse, unrelated tasks where knowledge transfer is less direct.
Next up: adding TypeScript/Cal.com results, expanding to Sonnet/Haiku runs, and testing on real customer workloads with organic (non-synthetic) task distributions.
Total cost of this research: $65 in API calls across 162 Opus 4.6 agent runs.