Evals before vibes · Swapnil Surdi

There’s a question that comes up on every AI project, and it’s almost always answered wrong: which model should we use?

The wrong answer is a vibe. Someone tried GPT and Claude side by side on three hand-picked prompts last week, one of them “felt smarter,” and that’s the model now wired into production. I’ve watched this happen, and I’ve watched the bill and the bug reports that follow. The right answer is boring and a little tedious: you build a small evaluation harness, you define what “better” means as numbers, and you let the numbers pick. Evals before vibes.

I spent a stretch at TTI (Treatment Technologies & Insights) benchmarking the field — ChatGPT, Claude, Gemini, Llama, Qwen — for healthcare-compliance workflows, the kind where a wrong answer isn’t a bad demo, it’s a problem. That work taught me the discipline this post is about. Not “which model is best” — that question is malformed — but how to decide, repeatably, per task, and how to keep the decision honest as models and prompts churn underneath you.

“Best” is not a property of a model

The first thing the benchmarking made obvious: there is no best model. There’s a best model for a task, on a metric, at a price you’ll tolerate. Change any of those three and the leaderboard reshuffles.

So before comparing anything, I had to name the axes that actually mattered for the workload:

Latency (p50 and p95). The median tells you the common case; the tail tells you what your users actually feel. A model with a great p50 and a terrible p95 will look fine in a demo and miserable under load. You evaluate the p95 or you’re evaluating a fantasy.
Cost per token. Input and output, priced against the real token volume of the task — not a toy prompt. A model that’s 30% more accurate and 4× the cost is a different decision for a high-volume classifier than for a once-a-day summary, and you can’t make that call without the per-token math in front of you.
Task accuracy. Measured against a fixed golden set, with a metric that fits the task — exact match, F1, a rubric score, whatever actually captures “right” for this job. This is the axis people think they’re judging by feel, and the one feel is worst at.
Capability fit. Some tasks need a long context window. Some need reliable structured output or tool calling. Some need a model you can run on your own hardware for data-residency reasons, which in a compliance setting can override every other axis at once.
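To make the two operational axes concrete, here is a minimal sketch of the arithmetic. The latency samples, token counts, prices, and daily volume are made-up illustrations, not measurements from the benchmarking described here, and `percentile` and `cost_per_request` are hypothetical helper names.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def cost_per_request(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Dollar cost of one call, given prices quoted per million tokens."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

# Hypothetical latencies (ms) from ten calls: nine ordinary, one slow outlier.
latencies_ms = [210, 220, 225, 230, 240, 250, 255, 260, 270, 1900]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

# Hypothetical task shape and prices: 1,200 input tokens and 300 output tokens per call.
per_call = cost_per_request(1200, 300, usd_per_m_input=3.0, usd_per_m_output=15.0)
per_day = per_call * 50_000  # assumed daily volume for a high-volume step

print(f"p50={p50} ms  p95={p95} ms  cost/call=${per_call:.4f}  cost/day=${per_day:.2f}")
```

With these invented numbers the median is 240 ms while the p95 is 1900 ms, which is exactly the gap a demo hides. Multiplying the per-call cost by real volume is what turns a price sheet into a decision.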

Once those are written down, “which model” stops being an argument and becomes a measurement. Different tasks weight the axes differently, so the winner legitimately differs per task — the fast cheap model wins the high-volume classification step, a stronger model wins the gnarly extraction, and a self-hostable one wins anything that can’t leave your boundary. That’s not indecision. That’s the correct answer to a question that was never going to have one winner.
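One way to sketch "different tasks weight the axes differently" is a weighted score with a hard constraint for data residency. Everything below is an assumption made for illustration: the model names, the measurements, the weights, the budgets, and the linear scoring rule itself. It is not the scoring used in the benchmarking described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    """One candidate's results on one task's golden set (hypothetical numbers)."""
    name: str
    accuracy: float        # 0..1
    p95_ms: float
    usd_per_request: float
    self_hostable: bool

@dataclass(frozen=True)
class TaskProfile:
    w_accuracy: float
    w_latency: float
    w_cost: float
    p95_budget_ms: float
    usd_budget: float
    must_self_host: bool = False

def score(m, task):
    """Weighted score, or None when a hard constraint rules the candidate out."""
    if task.must_self_host and not m.self_hostable:
        return None
    # Latency and cost are scored as headroom under the budget, floored at zero.
    latency = max(0.0, 1 - m.p95_ms / task.p95_budget_ms)
    cost = max(0.0, 1 - m.usd_per_request / task.usd_budget)
    return task.w_accuracy * m.accuracy + task.w_latency * latency + task.w_cost * cost

def pick(task, measurements):
    eligible = [(score(m, task), m) for m in measurements]
    eligible = [(s, m) for s, m in eligible if s is not None]
    if not eligible:
        raise ValueError("no candidate satisfies the hard constraints")
    return max(eligible, key=lambda pair: pair[0])[1]

# Accuracy is measured per task, so each task has its own result rows.
classify_results = [
    Measurement("small-fast", 0.93, 400, 0.0004, False),
    Measurement("large-strong", 0.95, 2500, 0.008, False),
    Measurement("open-weights", 0.90, 900, 0.001, True),
]
extract_results = [
    Measurement("small-fast", 0.71, 600, 0.001, False),
    Measurement("large-strong", 0.92, 3000, 0.012, False),
    Measurement("open-weights", 0.68, 1400, 0.002, True),
]

classify = TaskProfile(0.4, 0.3, 0.3, p95_budget_ms=1000, usd_budget=0.002)
extract = TaskProfile(0.8, 0.1, 0.1, p95_budget_ms=5000, usd_budget=0.02)
resident = TaskProfile(0.8, 0.1, 0.1, p95_budget_ms=5000, usd_budget=0.02, must_self_host=True)

print(pick(classify, classify_results).name)
print(pick(extract, extract_results).name)
print(pick(resident, extract_results).name)
```

The same candidates produce three different winners here only because the weights and constraints differ. The exact scoring formula matters less than writing it down before looking at the results.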

Retrieval gets gated before the prompt does

Most of these workflows were retrieval-augmented, and RAG has its own version of the vibes trap — one that’s even more seductive, because when an answer is wrong it’s so tempting to go fiddle with the prompt.

The discipline I held to: retrieval quality gets measured and gated before anyone touches a prompt. If the right context isn’t in the retrieved set, no amount of prompt engineering conjures it back — the model can’t cite what it never saw. So retrieval is evaluated on its own terms, with ranking metrics:

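NDCG (normalized discounted cumulative gain): is the genuinely relevant context ranked near the top, where the model will actually attend to it, not buried at position nine?
MRR (mean reciprocal rank): how high does the first right chunk rank? Each query scores 1/rank of its first relevant result, averaged over all queries, so a first hit at position one scores 1.0 and one at position five scores 0.2.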

A regression in retrieval shows up here, as a number, before it ever reaches the generation step disguised as a “the model got dumber” complaint. I’ve lost real time to debugging a “model problem” that was a retrieval problem the whole way down — the embedding model had changed, recall quietly dropped, and every downstream symptom pointed at the LLM. Gate retrieval on NDCG and MRR first, and you stop misattributing retrieval failures to the model. The order is the discipline: fix what the model can see before you tune how it reasons.
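As a rough sketch of what gating on these two metrics can look like, the snippet below computes NDCG@k (the linear-gain variant) and MRR over a tiny labeled query set and compares them to thresholds. The document ids, relevance grades, and threshold values are invented for illustration, and `retrieval_report` is a hypothetical helper name.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_ids, grades, k):
    """grades maps doc id -> graded relevance for this query; unlabeled docs count as 0."""
    gains = [grades.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def reciprocal_rank(ranked_ids, grades):
    """1 / position of the first relevant result, or 0.0 if none was retrieved."""
    for pos, doc_id in enumerate(ranked_ids, start=1):
        if grades.get(doc_id, 0) > 0:
            return 1.0 / pos
    return 0.0

def retrieval_report(runs, k=5):
    """runs: list of (ranked_ids, grades) pairs, one per golden query."""
    ndcgs = [ndcg_at_k(ranked, grades, k) for ranked, grades in runs]
    rrs = [reciprocal_rank(ranked, grades) for ranked, grades in runs]
    return {"ndcg": sum(ndcgs) / len(ndcgs), "mrr": sum(rrs) / len(rrs)}

# Two hypothetical golden queries: retrieved order, then labeled relevance grades.
runs = [
    (["d3", "d1", "d7"], {"d1": 2, "d3": 1}),  # best chunk ranked second
    (["d9", "d8", "d2"], {"d2": 1}),           # only relevant chunk ranked third
]
report = retrieval_report(runs, k=3)

MIN_NDCG, MIN_MRR = 0.75, 0.60  # assumed thresholds; tune them against your own baseline
retrieval_ok = report["ndcg"] >= MIN_NDCG and report["mrr"] >= MIN_MRR
print(report, "PASS" if retrieval_ok else "FAIL: fix retrieval before touching the prompt")
```

In this toy run the MRR clears its threshold but NDCG does not, so the gate fails before any prompt work begins.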

Treat models as swappable, because you’ll have to swap them

Here’s the architectural consequence that falls out of all this, and it’s the most important one.

If the best model genuinely differs per task — and it does — and if models change underneath you constantly — and they do, new versions, new prices, deprecations, the occasional provider outage — then the only sane architecture treats the model as a swappable component. Not a load-bearing wall you built the house around. A part you can pull and replace.

In practice that meant failover across providers: an abstraction over the model call so a given task can name its preferred model and fall back to a different provider entirely when the first is down, degraded, or priced out of the budget. That’s not a resilience nice-to-have bolted on at the end. It’s the direct architectural consequence of taking the evals seriously — once you accept “best is per-task and impermanent,” provider lock-in stops being a convenience and becomes a liability. You design for substitution from the start, because the whole point of having metrics is that they’re allowed to change the answer, and your architecture has to let them.
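A minimal sketch of that kind of abstraction, assuming each provider is wrapped behind a plain callable. The provider functions below are stubs, not real SDK calls, and `call_with_failover` and `AllProvidersFailed` are hypothetical names. This version fails over only on exceptions; degradation or budget checks would be extra conditions layered on top.

```python
from typing import Callable, List, Sequence, Tuple

ModelCall = Callable[[str], str]  # prompt in, completion text out

class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the route has been tried and failed."""

def call_with_failover(prompt: str, route: Sequence[Tuple[str, ModelCall]]) -> Tuple[str, str]:
    """Try providers in preference order; return (provider_name, answer) from the first success."""
    errors: List[str] = []
    for name, call in route:
        try:
            return name, call(prompt)
        except Exception as exc:  # in real code, catch the provider's specific error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))

# Stub providers standing in for real client wrappers.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary provider timed out")

def steady_secondary(prompt: str) -> str:
    return f"[secondary] answer to: {prompt}"

# Each task names its own preferred order, typically informed by its eval results.
route = [("primary", flaky_primary), ("secondary", steady_secondary)]
served_by, answer = call_with_failover("Classify this note.", route)
print(served_by, answer)
```

Because every provider sits behind the same signature, swapping or reordering models is a change to the route rather than to the calling code.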

What a minimal eval harness actually looks like

None of this requires a platform. The smallest version that works has three parts, and you can stand it up in an afternoon:

1. A golden set. A fixed, version-controlled collection of representative inputs with known-good outputs (or rubrics, where the output is open-ended). This is the single highest-leverage artifact in the whole effort, and the hardest to get right — it has to cover the easy cases, the edge cases, and the failure modes that actually hurt in your domain. For compliance work, that meant the adversarial and ambiguous cases got more weight, not less, because that’s where a wrong answer does damage. Curating this set well is most of the work, and it’s worth every hour.

2. Metrics that fit the task. Accuracy and the operational axes — p50/p95 latency, cost per token — captured on the same run, against the same golden set, for every candidate model. Retrieval steps additionally carry NDCG and MRR. The output is a comparison table, not a feeling: every model, every metric, side by side, same inputs.

3. A regression gate. The part that makes the harness pay rent forever. Wire it into CI so a model swap, a prompt edit, or a retrieval-config change reruns the evals and fails the build if a key metric drops past a threshold. This is what stops the slow, invisible rot — the prompt tweak that fixes one case and quietly breaks five, the model “upgrade” that regresses your tail latency, the embedding change that drops recall. Without a gate, every change is a coin flip you don’t get to see land. With one, “did this make things worse?” has an automatic, numerical answer before anything ships.
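The three parts can be sketched in one short script. This is an illustration under stated assumptions, not the harness used at TTI: the golden cases, their labels and weights, the stub model, the prices, the baseline row, and the gate tolerances are all invented, and `evaluate` and `find_regressions` are hypothetical names.

```python
import statistics
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    expected: str
    weight: float = 1.0  # harder cases can carry more weight

@dataclass(frozen=True)
class Reply:
    text: str
    input_tokens: int
    output_tokens: int

def evaluate(name, call, usd_per_m, cases):
    """Part 2: one run over the golden set yields accuracy, latency, and cost together."""
    latencies_ms, earned, usd = [], 0.0, 0.0
    for case in cases:
        start = time.perf_counter()
        reply = call(case.prompt)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        if reply.text.strip().lower() == case.expected:
            earned += case.weight
        usd += (reply.input_tokens * usd_per_m[0] + reply.output_tokens * usd_per_m[1]) / 1_000_000
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {
        "model": name,
        "accuracy": earned / sum(c.weight for c in cases),
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": cuts[94],
        "usd_per_case": usd / len(cases),
    }

def find_regressions(baseline, current, gates):
    """Part 3: list every gated metric that got worse by more than its tolerance."""
    failures = []
    for metric, (direction, tolerance) in gates.items():
        base, now = baseline[metric], current[metric]
        worse_by = (base - now) if direction == "higher" else (now - base)
        if worse_by > tolerance:
            failures.append(f"{metric}: {base} -> {now} (allowed change {tolerance})")
    return failures

# Part 1: a tiny golden set with hypothetical labels; the adversarial case weighs most.
GOLDEN = [
    GoldenCase("Routine disclosure request", "allow", weight=1.0),
    GoldenCase("Request missing a signature", "deny", weight=2.0),
    GoldenCase("Request citing a revoked authorization", "deny", weight=3.0),
]

def stub_model(prompt):
    """Stand-in for a real model call; it misses the adversarial case on purpose."""
    text = "deny" if "missing" in prompt else "allow"
    return Reply(text, input_tokens=len(prompt.split()), output_tokens=1)

GATES = {"accuracy": ("higher", 0.02), "p95_ms": ("lower", 200.0)}  # assumed tolerances
baseline = {"accuracy": 0.90, "p95_ms": 50.0}  # hypothetical row from the last accepted run

row = evaluate("stub", stub_model, (3.0, 15.0), GOLDEN)
failures = find_regressions(baseline, row, GATES)
exit_code = 1 if failures else 0
for line in failures:
    print("REGRESSION", line)
```

The stub gets two of three cases right, but its weighted accuracy is 0.5 because the case it misses carries half the total weight, so the accuracy gate trips. In CI the script would end with `raise SystemExit(exit_code)` so a regression fails the build.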

The point

Choosing an LLM by feel is how you ship a regression you can’t see, lock yourself to a provider you’ll want to leave, and discover your p95 in an incident channel instead of a dashboard. The alternative isn’t expensive or exotic — a golden set, a handful of metrics that fit the task, retrieval gated on ranking quality before the prompt, and a regression check in CI. That’s the whole thing.

Models will keep getting better, cheaper, and more numerous, and “which one is best” will keep being the wrong question. The durable skill isn’t picking today’s winner. It’s building the harness that picks for you, re-picks when the ground shifts, and tells you — in numbers, before your users do — the moment a change made things worse.