What are AI Evals and Why They Matter (It’s Not Just Testing)

The world of AI is moving fast. Models like GPT-5.x, Claude, and Gemini are no longer just answering questions: they’re writing production code, drafting medical summaries, executing trades, and orchestrating multi-step workflows on real systems. But here’s a question almost nobody asked five years ago: how do we actually know they work?

In July 2025, an AI coding assistant from Replit deleted an entire production database despite being explicitly told not to. The team had run benchmarks. The model “passed.” It passed in the same way a student can ace a multiple-choice test and still be unable to write a real essay. The benchmarks measured something, just not the thing that mattered. This is the gap that evals are designed to close.

When developers first try to test AI systems, they reach for what they know: unit tests. Write inputs, assert expected outputs, run the suite. That worked beautifully for forty years of deterministic software. It does not work for AI. The same prompt can produce a different output on each run. There is no single “correct” answer to “summarize this email” or “write a polite refund response.” Quality is a spectrum, not a checkbox. Traditional testing wasn’t built for systems that reason in probabilities; evals are.

What is an eval, in plain language?

Think of an eval like a driving test. To know if someone can drive safely, you don’t just quiz them on traffic rules. You put them in a real car, in real traffic, and watch how they handle parallel parking, highway merges, and a pedestrian stepping off the curb. You score them on multiple dimensions: control, judgment, awareness. You don’t expect a perfect score. You expect one high enough to trust them on the road.

An eval does the same thing for an AI system. You give it a curated set of realistic situations. You watch what it does. You score the outputs against what “good” looks like. You aggregate those scores into something a team can act on.
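To make that loop concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a real framework: `toy_system` stands in for the AI under test, `Case` and `must_mention` are a hypothetical way of writing down what “good” looks like, and the 0.8 threshold is arbitrary.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Case:
    prompt: str              # a realistic situation the system will face
    must_mention: list[str]  # hypothetical stand-in for "good": facts a good reply includes


def toy_system(prompt: str) -> str:
    """Stand-in for the AI under test; a real one would call a model."""
    if "refund" in prompt.lower():
        return "Sorry for the trouble. Your refund will arrive in 5 days."
    return "Thanks for reaching out."


def score(output: str, case: Case) -> float:
    """Fraction of required facts present in the output, from 0.0 to 1.0."""
    hits = sum(term.lower() in output.lower() for term in case.must_mention)
    return hits / len(case.must_mention)


def run_eval(system: Callable[[str], str], cases: list[Case],
             threshold: float = 0.8) -> dict:
    """Run every case, score each output, and aggregate into one report."""
    scores = [score(system(c.prompt), c) for c in cases]
    avg = mean(scores)
    return {"mean_score": avg, "passed": avg >= threshold}


cases = [
    Case("Customer asks for a refund on order 1182", ["refund", "5 days"]),
    Case("Customer asks when the store opens", ["9am"]),
]
report = run_eval(toy_system, cases)
print(report)  # {'mean_score': 0.5, 'passed': False}
```

The substring check is deliberately crude. In practice the scoring function is where most of the design effort goes, and the aggregate is a number a team can track over time rather than a single green checkmark.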
How an eval works

Technically, an eval is a structured measurement of an AI system’s output quality on a defined task. It has four moving parts:

A test set: curated inputs that represent what the system will actually face. The best ones, called golden datasets, are built from real production failures, not synthetic examples.

The AI system under test: this could be a single model, a RAG pipeline, or a full multi-step agent.

The outputs: what the AI actually produced, such as answers, tool calls, or full execution traces.

The evaluator: the thing that grades. This might be deterministic code, another LLM, a human reviewer, or all three layered together.

The result of an eval is a score: pass rate, accuracy, faithfulness, tone, or whatever dimensions matter for your application.

Traditional testing vs. AI evals

The mismatch between unit tests and AI is worth showing directly. A unit test for add(2, 2) expects exactly 4. Forever. Anything else is a bug.

An eval for “summarize this support ticket” might accept dozens of valid summaries. Some are concise, some are thorough, some emphasize the customer’s frustration, some focus on the technical issue. None of them are wrong. The job of the eval isn’t to pick the one true answer. It’s to measure, across many examples, whether the system tends to produce summaries that are accurate, faithful to the source, and appropriately toned.

This shift has consequences:

Tests are binary; evals are statistical. You’re looking at a distribution of scores, not a green checkmark.

Tests are cheap; evals can be expensive. An LLM-as-judge run on 500 examples costs real money. A human-graded eval costs more.

Tests catch bugs; evals catch behaviors, including ones nobody intended: biases, hallucinations, subtle quality drift after a prompt change.

The four types of evals (and why teams use all of them)

There isn’t one kind of eval. There’s a stack, and each layer trades cost for trust.

1. Deterministic checks. Plain code, plain rules.
Did the model return valid JSON? Did the agent call refund_order instead of cancel_subscription? Does the email contain an @? These checks are cheap, fast, and surprisingly effective for a huge class of failures. This is the “did the AI even attempt the right shape of answer” layer.

2. LLM-as-judge. A frontier model reads the AI’s output and scores it against a written rubric. This is now the default approach for teams that need throughput beyond what humans can provide. It offers roughly 500x to 5000x cost savings over human review while achieving around 80% agreement with human preferences, close to how much two humans agree with each other on the same task. It only works if you calibrate it: you need a sample of human-graded examples to verify the judge isn’t drifting from what real reviewers would say.

3. Benchmark evals. Standardized public tests like MMLU-Pro, GPQA, SWE-bench, and HumanEval. These are useful for picking which base model to start with, like SAT scores when comparing students. They’re a poor fit for measuring whether your own application is any good. As of 2026, frontier models have saturated most benchmarks the industry relied on two years ago, and new ones (HLE, SWE-bench Pro, LiveCodeBench) are constantly being introduced specifically because the old ones stopped discriminating between models.

4. Human evaluation. The gold standard, and the source of truth used to calibrate everything above it. It is slow, expensive, and irreplaceable for high-stakes domains: medical, legal, anywhere a hallucination has real consequences. A medical-specific benchmark like HealthBench exists for exactly this reason: general benchmarks don’t capture domain-specific failure modes.

In production, mature teams stack all four. Benchmarks narrow the choice of base model. Code rules catch the obvious. LLM judges scale. Humans calibrate the judges and handle the edge cases.

Why evals exploded in importance in 2025–2026

For a long time, evals were a nice-to-have, a thing the research lab cared about. Three things changed that.

1.
AI moved from prototype to production. According to LangChain’s 2026 State of AI Agents report, more than half of organizations now have AI agents running in production. The Replit incident wasn’t an outlier; it was a preview. When an AI takes autonomous actions on real systems, “vibe-checking” the outputs in dev stops being acceptable.

2. Frontier models broke the old benchmarks. When everyone scores 95%+ on MMLU, the benchmark stops telling you anything. Teams now have to build their own evals against their own data, because the public ones can’t differentiate quality at the level that matters to a specific product.

3. Regulation arrived. ISO/IEC 42001 and the NIST AI Risk Management Framework are now being baked directly into evaluation pipelines as compliance gates, particularly in finance and healthcare. “We tested it and it seemed fine” is no longer an answer a regulator accepts.

There’s a quote from Greg Brockman, OpenAI’s president, that gets passed around eval circles: “Evals are surprisingly often all you need.” What he means is that the discipline of measuring forces clarity. You can’t build a good eval without first answering “what does good actually look like for this task?” Once you’ve answered that, half the problem is solved.

Where this is going

Evals are shifting from a one-off pre-launch test to a continuous discipline that runs at every stage of the AI lifecycle. A few specific shifts are worth watching:

Continuous evaluation, not pre-launch testing. Quality gates run on every pull request. Sampled production traffic gets scored live. Drift triggers alerts the same way latency spikes do.

Cross-functional ownership. Evals are no longer engineering-only. Product managers validate behavior against requirements, QA owns regression, domain experts flag edge cases. If every change requires an engineer to write a script, engineering becomes the bottleneck for every quality decision.

Domain-specific evals.
HealthBench for medical, specialized benchmarks for law, finance, code. General benchmarks are necessary but never sufficient.

Agent-specific metrics. Single-turn accuracy stops being the right measurement when an AI is taking 30 actions across 5 tools. Metrics like pass@k (does any of k attempts succeed?) and pass^k (do all k attempts succeed?) are becoming standard for agentic systems where consistency, not just capability, is what users feel.

Why this matters

As AI takes on more consequential tasks, the difference between a working demo and a trustworthy product is almost entirely a question of evaluation. Demos get built on confidence. Products get built on evidence.

Where traditional testing gave us reliable code, evals give us accountable AI. They convert “the model felt smart in the demo” into “the model is correct on 89% of real cases, with these specific failure modes, monitored continuously, and gated by compliance checks before each release.”

If you’re building anything with AI, evals aren’t optional anymore. They are the layer that lets AI graduate from clever to trusted, and they are quietly becoming the most important piece of infrastructure in the AI stack.

Originally published in Towards AI on Medium.
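As a closing illustration of the agent metrics mentioned under “Where this is going,” here is a small Python sketch of pass@k and pass^k using the plain definitions given there: any of k attempts succeeds versus all k attempts succeed. The function names and the `results` data are hypothetical, and this is the simple empirical form. Published benchmarks often use sample-based estimators instead.

```python
def pass_at_k(results: list[list[bool]]) -> float:
    """Share of tasks where at least one of the k attempts succeeded."""
    return sum(any(attempts) for attempts in results) / len(results)


def pass_hat_k(results: list[list[bool]]) -> float:
    """Share of tasks where every one of the k attempts succeeded."""
    return sum(all(attempts) for attempts in results) / len(results)


# Hypothetical outcomes: 4 tasks, k = 3 attempts each (True = attempt succeeded).
results = [
    [True, True, True],     # consistent success
    [True, False, True],    # capable but flaky
    [False, False, True],   # succeeds only occasionally
    [False, False, False],  # never succeeds
]

print(f"pass@3 = {pass_at_k(results):.2f}")   # pass@3 = 0.75
print(f"pass^3 = {pass_hat_k(results):.2f}")  # pass^3 = 0.25
```

The gap between the two numbers is the point: this made-up agent can solve three of four tasks, but it solves only one of them reliably, and reliability is what a user running it once actually experiences.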

Source

https://pub.towardsai.net/what-are-ai-evals-and-why-they-matter-its-not-just-testing-23e2093cb91f?source=rss----98111c9905da---4