Benchmarking Methodology
Transparency matters. Here's exactly how we measure Surchin's impact on agent performance: what we track, how we measure it, and the safeguards that keep the numbers honest.
What We Measure
Our benchmarks evaluate three complementary dimensions. Together they paint a complete picture of how shared knowledge affects agent behaviour.
Retrieval Quality
NDCG@5, Precision@3, MRR — how accurately Surchin surfaces the right knowledge when an agent asks.
Agent Performance
Completion rate, cost per task, time per task — the real-world impact on coding agents.
Knowledge Compound Effects
Learning curves, reuse rates — how the value of shared knowledge grows over time.
How We Measure
Corpus Design
We maintain three independent task sets to prevent overfitting and ensure our metrics generalise:
- Set A (Training) — 12 YAML tasks used to tune scoring weights and thresholds.
- Set B (Validation) — 12 independent tasks that verify training results hold on unseen data.
- Set C (Transfer) — 12 cross-domain tasks (e.g., Python/Django when training used TypeScript) to test real-world generalisability.
Each set spans four retrieval categories: error recovery, locality-based, semantic, and mixed. Tasks are defined in YAML with explicit relevance grades (essential=3, helpful=2, marginal=1) for graded relevance scoring.
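As an illustration, a task definition might look like the following. The field names and IDs here are hypothetical, not Surchin's actual schema, but they reflect the four categories and the essential=3 / helpful=2 / marginal=1 grading described above:

```yaml
# Illustrative task definition — field names are examples, not Surchin's schema
id: err-recovery-07
category: error-recovery
query: "TypeError: Cannot read properties of undefined (reading 'map')"
relevant_insights:
  - id: ins-042    # guard against undefined arrays before .map()
    grade: 3       # essential
  - id: ins-118    # optional chaining in render paths
    grade: 2       # helpful
  - id: ins-203    # general null-safety conventions
    grade: 1       # marginal
```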
Retrieval Evaluation
For each task we deposit a set of ideal insights plus distractors into a clean knowledge base, then issue the task's query and compare the returned results against ground truth. Metrics are computed per-task and averaged across the set:
- NDCG@5 — Measures ranking quality, giving higher weight to relevant results that appear earlier.
- Precision@3 — What fraction of the top-3 returned results are actually relevant.
- MRR — Mean Reciprocal Rank: the average of 1/rank of the first relevant result, so higher means the first relevant result appears earlier in the list.
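These are standard information-retrieval metrics and can be computed in a few lines. A minimal Python sketch using the graded relevance scores (3/2/1) described above; function names are illustrative, not Surchin's internals:

```python
import math

def ndcg_at_k(grades, ideal_grades, k=5):
    """NDCG@k: discounted cumulative gain of the returned ranking,
    normalised by the best possible ordering of the ground-truth grades."""
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(ideal_grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

def precision_at_k(grades, k=3):
    """Fraction of the top-k results with any relevance (grade > 0)."""
    return sum(1 for g in grades[:k] if g > 0) / k

def reciprocal_rank(grades):
    """1/rank of the first relevant result; 0 if none is returned."""
    for rank, g in enumerate(grades, start=1):
        if g > 0:
            return 1.0 / rank
    return 0.0
```

Here `grades` is the list of ground-truth grades for the results a query actually returned, in returned order, and `ideal_grades` is the full set of graded-relevant insights for the task. Per-task scores are then averaged across the set.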
Agent Evaluation
We run a three-pass comparison on each SWE-bench task to isolate Surchin's impact:
- Control — Agent runs without Surchin. Establishes the baseline.
- Cold-start — Agent runs with Surchin but an empty knowledge base. Measures the overhead of the integration itself.
- Pre-seeded — Agent runs with a knowledge base populated from prior sessions. Measures the benefit of accumulated knowledge.
We use SWE-bench tasks for reproducibility — each task has a well-defined pass/fail criterion, a known patch, and a test suite that verifies correctness.
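The three-pass protocol reduces to straightforward aggregation. A hedged sketch, assuming each pass records a pass/fail flag, cost, and wall-clock time per task (the field names are ours, not Surchin's):

```python
from dataclasses import dataclass

@dataclass
class PassResult:
    passed: bool       # did the task's test-suite / exit-code check pass?
    cost_usd: float    # LLM spend for this task
    seconds: float     # wall-clock time for this task

def summarise(results):
    """Aggregate per-pass runs into the three headline metrics.
    `results` maps a pass name ("control", "cold-start", "pre-seeded")
    to a list of PassResult, one per SWE-bench task."""
    summary = {}
    for name, runs in results.items():
        n = len(runs)
        summary[name] = {
            "completion_rate": sum(r.passed for r in runs) / n,
            "cost_per_task": sum(r.cost_usd for r in runs) / n,
            "time_per_task": sum(r.seconds for r in runs) / n,
        }
    return summary
```

Comparing the control and pre-seeded rows isolates the benefit of accumulated knowledge; comparing control and cold-start isolates integration overhead.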
Parameter Tuning
Surchin's scoring function combines multiple signals (embedding similarity, recency, reinforcement, locality, error fingerprint match). We tune these weights via grid search:
- 180 combinations of scoring weights and thresholds tested per sweep.
- Training on Set A, validation on Set B. The winning configuration must perform well on both.
- Overfitting threshold: 10% NDCG gap — if training NDCG exceeds validation NDCG by more than 10 percentage points, the configuration is rejected.
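A minimal sketch of this sweep; `eval_ndcg` is a stand-in for whatever function scores a weight/threshold configuration on a given task set, and the acceptance rule follows the gap check above:

```python
from itertools import product

def grid_search(weight_grid, threshold_grid, eval_ndcg, max_gap=0.10):
    """Sweep all weight/threshold combinations; score each on the
    training set (Set A) and validation set (Set B), rejecting any
    configuration whose train/validation NDCG gap exceeds max_gap."""
    best, best_score = None, -1.0
    for weights, threshold in product(weight_grid, threshold_grid):
        train = eval_ndcg(weights, threshold, split="A")
        val = eval_ndcg(weights, threshold, split="B")
        if train - val > max_gap:    # overfit: tuned to Set A, fails Set B
            continue
        score = min(train, val)      # must perform well on BOTH sets
        if score > best_score:
            best, best_score = (weights, threshold), score
    return best, best_score
```

Scoring a candidate by `min(train, val)` is one simple way to encode "the winning configuration must perform well on both"; the real selection rule may differ.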
Embedding Systems
Surchin has evolved through two embedding approaches. Both are benchmarked to quantify the improvement.
Hash-Based Embeddings
Deterministic SHA-256 hashing. No external dependencies, works fully offline. Fast and reproducible, but limited to exact and near-exact matching — no semantic similarity.
Limitation: NDCG@5 caps around 0.21 due to the absence of semantic understanding.
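To make that limitation concrete, here is a toy version of a hash-based embedding — illustrative only, not Surchin's implementation. Identical tokens always hash to the same bucket, so exact and near-exact wording matches, but synonyms land in unrelated buckets, which is why ranking quality plateaus:

```python
import hashlib

def hash_embedding(text, dims=64):
    """Deterministic pseudo-embedding built from SHA-256 token hashes.
    No model, no network — but also no notion of meaning: 'bug' and
    'defect' map to unrelated buckets."""
    vec = [0.0] * dims
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % dims
        vec[bucket] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec
```

The same input always yields the same unit vector, so results are fully reproducible — the property the hash approach trades semantic similarity for.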
ML-Based Embeddings
Supabase gte-small model with 384 dimensions. Runs in-database via pg_embedding — zero external API cost. Provides real semantic similarity for meaningful retrieval.
Before/after comparison data will be populated as benchmark runs complete.
Preventing Overfitting
It's easy to accidentally tune a system so it performs well only on the data it was tuned against. We guard against this with multiple layers:
- Train / Validate / Transfer splits — Set A trains, Set B validates, Set C tests generalisability. No data leaks between sets.
- Overfitting threshold — Any configuration where the NDCG gap between training (Set A) and validation (Set B) exceeds 10% is automatically rejected.
- Cross-domain validation — Training uses TypeScript tasks; Set C validates against Python/Django tasks. If metrics don't transfer, the weights are too specialised.
Automation
Benchmarks are not one-off experiments. They run continuously as part of our CI pipeline:
- Run on every merge to main — retrieval benchmarks execute automatically after each merge.
- Results uploaded — benchmark results are stored in Supabase via a service role key, creating a persistent history of performance over time.
- Page updates via ISR — the benchmarks page uses Incremental Static Regeneration with a 1-hour cache, so results appear automatically without a redeploy.
Limitations & Future Work
We believe in being honest about what our benchmarks can and can't tell you today.
- SWE-bench Docker eval not yet live — agent benchmarks currently use exit-code evaluation only. Full Docker-based SWE-bench eval with test-suite verification is planned.
- Cross-model & cross-harness testing — we plan to test across multiple LLM providers (GPT-4o, Gemini, Claude) and agent harnesses (Claude Code, Cursor, Aider) to measure how universally shared knowledge helps.
- Community-contributed task sets — we welcome contributions of new YAML task sets, particularly for languages and frameworks we haven't covered yet.
See the results for yourself
View the latest benchmark numbers, updated automatically on every merge.