Benchmarking Methodology

Transparency matters. Here's exactly how we measure Surchin's impact on agent performance — our approach, metrics, and methodology.

What We Measure

Our benchmarks evaluate three complementary dimensions. Together they paint a complete picture of how shared knowledge affects agent behaviour.

Retrieval Quality

NDCG@5, Precision@3, MRR — how accurately Surchin surfaces the right knowledge when an agent asks.

Agent Performance

Completion rate, cost per task, time per task — the real-world impact on coding agents.

Knowledge Compound Effects

Learning curves, reuse rates — how the value of shared knowledge grows over time.

How We Measure

Corpus Design

We maintain three independent task sets to prevent overfitting and ensure our metrics generalise:

  • Set A (Training) — 12 YAML tasks used to tune scoring weights and thresholds.
  • Set B (Validation) — 12 independent tasks that verify training results hold on unseen data.
  • Set C (Transfer) — 12 cross-domain tasks (e.g., Python/Django when training used TypeScript) to test real-world generalisability.

Each set spans four retrieval categories: error recovery, locality-based, semantic, and mixed. Tasks are defined in YAML with explicit relevance grades (essential=3, helpful=2, marginal=1) for graded relevance scoring.
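
As a sketch of what one such task looks like once parsed, here is an illustrative structure in Python. The field names are hypothetical, not Surchin's actual YAML schema; only the category list and the grade scale come from the description above.

```python
# Hypothetical shape of a single retrieval task (illustrative field names).
# Relevance grades follow the documented scale: essential=3, helpful=2, marginal=1.
GRADE_LABELS = {3: "essential", 2: "helpful", 1: "marginal"}

task = {
    "id": "error-recovery-003",
    "category": "error_recovery",  # one of: error recovery, locality-based, semantic, mixed
    "query": "TypeError: cannot read properties of undefined",
    "relevant_insights": {
        "insight-a1": 3,  # essential
        "insight-b2": 2,  # helpful
        "insight-c9": 1,  # marginal
    },
    "distractors": ["insight-x1", "insight-x2"],  # irrelevant items seeded alongside
}
```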

Retrieval Evaluation

For each task we deposit a set of ideal insights plus distractors into a clean knowledge base, then issue the task's query and compare the returned results against ground truth. Metrics are computed per-task and averaged across the set:

  • NDCG@5 — Measures ranking quality, giving higher weight to relevant results that appear earlier.
  • Precision@3 — The fraction of the top-3 returned results that are actually relevant.
  • MRR — Mean Reciprocal Rank: the reciprocal of the rank at which the first relevant result appears, averaged across tasks.
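
For concreteness, the three metrics can be computed as follows. This is an illustrative implementation, not Surchin's actual evaluation code; `retrieved` holds the relevance grades of returned results in rank order, and `ideal` holds the grades of the ground-truth relevant insights.

```python
from math import log2

def ndcg_at_k(retrieved, ideal, k=5):
    """NDCG@k over graded relevance (essential=3, helpful=2, marginal=1, irrelevant=0).
    Discounts each grade by log2(rank + 1), then normalises by the ideal ranking."""
    dcg = sum(g / log2(i + 2) for i, g in enumerate(retrieved[:k]))
    idcg = sum(g / log2(i + 2) for i, g in enumerate(sorted(ideal, reverse=True)[:k]))
    return dcg / idcg if idcg else 0.0

def precision_at_k(retrieved, k=3):
    """Fraction of the top-k results with any relevance (grade > 0)."""
    return sum(1 for g in retrieved[:k] if g > 0) / k

def reciprocal_rank(retrieved):
    """1 / rank of the first relevant result, or 0 if none is returned.
    Averaging this across tasks yields MRR."""
    for rank, g in enumerate(retrieved, start=1):
        if g > 0:
            return 1.0 / rank
    return 0.0
```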

Agent Evaluation

We run a three-pass comparison on each SWE-bench task to isolate Surchin's impact:

  1. Control — Agent runs without Surchin. Establishes the baseline.
  2. Cold-start — Agent runs with Surchin but an empty knowledge base. Measures overhead of the integration itself.
  3. Pre-seeded — Agent runs with a knowledge base populated from prior sessions. Measures the benefit of accumulated knowledge.
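
The arithmetic behind the comparison is simple: cold-start minus control isolates integration overhead, and pre-seeded minus cold-start isolates the value of accumulated knowledge. A minimal sketch (the field names and the `compare` helper are illustrative, not Surchin's actual harness):

```python
from dataclasses import dataclass

@dataclass
class PassResult:
    completion_rate: float  # fraction of SWE-bench tasks passing
    cost_per_task: float    # e.g. USD of LLM spend
    time_per_task: float    # seconds

def compare(control: PassResult, cold: PassResult, seeded: PassResult) -> dict:
    """Derive the two deltas the three-pass design is built to isolate."""
    return {
        # Cold-start vs control: cost/time added by the integration itself.
        "integration_overhead_cost": cold.cost_per_task - control.cost_per_task,
        "integration_overhead_time": cold.time_per_task - control.time_per_task,
        # Pre-seeded vs cold-start: benefit attributable to accumulated knowledge.
        "knowledge_gain_completion": seeded.completion_rate - cold.completion_rate,
        # Pre-seeded vs control: net effect of adopting Surchin end to end.
        "net_gain_completion": seeded.completion_rate - control.completion_rate,
    }
```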

We use SWE-bench tasks for reproducibility — each task has a well-defined pass/fail criterion, a known patch, and a test suite that verifies correctness.

Parameter Tuning

Surchin's scoring function combines multiple signals (embedding similarity, recency, reinforcement, locality, error fingerprint match). We tune these weights via grid search:

  • 180 combinations of scoring weights and thresholds tested per sweep.
  • Training on Set A, validation on Set B. The winning configuration must perform well on both.
  • Overfitting threshold: 10% NDCG gap — if the difference between training and validation NDCG exceeds 10%, the configuration is rejected.
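
The sweep logic can be sketched as follows. This is an illustrative grid search, not Surchin's actual tuner: `evaluate` is a hypothetical callback returning (NDCG on Set A, NDCG on Set B) for a candidate configuration, and configurations whose relative train/validation gap exceeds 10% are discarded before the best validation score is chosen.

```python
from itertools import product

def grid_search(weight_grid, evaluate, gap_threshold=0.10):
    """Sweep all weight combinations; reject overfit configs, keep the best.

    weight_grid: dict mapping weight name -> list of candidate values.
    evaluate(config) -> (ndcg_train, ndcg_val)  # hypothetical callback
    """
    best_config, best_val = None, -1.0
    for combo in product(*weight_grid.values()):
        config = dict(zip(weight_grid.keys(), combo))
        ndcg_train, ndcg_val = evaluate(config)
        # Overfitting guard: Set A score must roughly hold on Set B.
        if ndcg_train > 0 and (ndcg_train - ndcg_val) / ndcg_train > gap_threshold:
            continue
        if ndcg_val > best_val:
            best_config, best_val = config, ndcg_val
    return best_config, best_val
```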

Embedding Systems

Surchin has evolved through two embedding approaches. Both are benchmarked to quantify the improvement.

Initial

Hash-Based Embeddings

Deterministic SHA-256 hashing. No external dependencies, works fully offline. Fast and reproducible, but limited to exact and near-exact matching — no semantic similarity.

Limitation: NDCG@5 caps around 0.21 due to the absence of semantic understanding.
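
To illustrate why hashing is deterministic but not semantic, here is one way such an embedding could be built (an assumed construction for illustration only, not Surchin's actual scheme): repeated SHA-256 digests of the text are unpacked into a fixed-length vector. Identical text always maps to the same vector, but texts with similar meaning share nothing.

```python
import hashlib

def hash_embedding(text: str, dims: int = 384) -> list:
    """Deterministic pseudo-embedding from repeated SHA-256 digests.

    Each 32-byte digest contributes 32 values in [0, 1]; twelve digests
    fill a 384-dimensional vector. Purely lexical: no semantic similarity.
    """
    values = []
    counter = 0
    while len(values) < dims:
        digest = hashlib.sha256(f"{counter}:{text}".encode()).digest()
        values.extend(b / 255.0 for b in digest)
        counter += 1
    return values[:dims]
```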

Current

ML-Based Embeddings

Supabase gte-small model with 384 dimensions. Runs in-database via pg_embedding — zero external API cost. Provides real semantic similarity for meaningful retrieval.
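
With ML embeddings, retrieval ranks insights by vector similarity. In Surchin the comparison happens in-database, but the underlying measure is ordinary cosine similarity over the 384-dimensional vectors, sketched here for illustration:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors: 1.0 for
    identical direction, 0.0 for orthogonal (unrelated) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```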

Before/after comparison data will be populated as benchmark runs complete.

Preventing Overfitting

It's easy to accidentally tune a system to perform well on the data you measured against. We guard against this with multiple layers:

  • Train / Validate / Transfer splits — Set A trains, Set B validates, Set C tests generalisability. No data leaks between sets.
  • Overfitting threshold — Any configuration where the NDCG gap between training (Set A) and validation (Set B) exceeds 10% is automatically rejected.
  • Cross-domain validation — Training uses TypeScript tasks; Set C validates against Python/Django tasks. If metrics don't transfer, the weights are too specialised.

Automation

Benchmarks are not one-off experiments. They run continuously as part of our CI pipeline:

  • Run on every merge to main — retrieval benchmarks execute automatically after each merge.
  • Results uploaded — benchmark results are stored in Supabase via a service role key, creating a persistent history of performance over time.
  • Page updates via ISR — the benchmarks page uses Incremental Static Regeneration with a 1-hour cache, so results appear automatically without a redeploy.

Limitations & Future Work

We believe in being honest about what our benchmarks can and can't tell you today.

  • SWE-bench Docker eval not yet live — agent benchmarks currently use exit-code evaluation only. Full Docker-based SWE-bench eval with test-suite verification is planned.
  • Cross-model & cross-harness testing — we plan to test across multiple LLM providers (GPT-4o, Gemini, Claude) and agent harnesses (Claude Code, Cursor, Aider) to measure how universally shared knowledge helps.
  • Community-contributed task sets — we welcome contributions of new YAML task sets, particularly for languages and frameworks we haven't covered yet.

See the results for yourself

View the latest benchmark numbers, updated automatically on every merge.