Evaluating LLM Outputs: Building Evals That Actually Catch Regressions

Shipping an LLM feature is easy. Keeping it good while you tweak prompts, swap models, and add tools is the hard part. A one-word change to a system prompt can silently degrade an extraction pipeline, and you will not notice until a customer does. This guide lays out the evaluation discipline we use at CodeAustral to catch those regressions before they reach production: golden datasets, rubric scoring, LLM-as-judge, and CI gating that actually blocks a bad deploy.

Why "It Looks Fine" Is Not an Eval

Most teams test LLM output the same way they test a chatbot at a demo: type a few prompts, eyeball the answers, ship. This works exactly until the system has more than one moving part. Once you have a prompt, a model version, a retrieval layer, and a handful of tools, the number of ways output can quietly drift exceeds what any human can spot-check by hand.

The failure mode is specific and recurring. You improve the prompt to fix one case, deploy, and three other cases regress because the model now over-applies the new instruction. Without a fixed set of inputs and expected behaviors, you have no way to see the trade. Evals are the regression suite for non-deterministic software. They turn "the output seems worse" into "criterion 4 dropped from 0.92 to 0.71 on the extraction set."

The core idea is borrowed straight from traditional testing: pin down inputs, define what good looks like, run on every change, and fail loudly when a number moves the wrong way. The only twist is that "what good looks like" is rarely an exact string match.

Offline vs Online Evals

There are two distinct evaluation surfaces, and conflating them is a common mistake.

Offline evals run against a curated dataset before code ships. They are deterministic in structure (same inputs every time), fast, and cheap enough to run in CI. They answer: "Did this change make the system better or worse on cases I already understand?"

Online evals run against live production traffic after ship. They sample real requests, score them asynchronously, and surface drift you never anticipated: new input shapes, edge cases your golden set never imagined, and degradation from an upstream model update.

You need both, and they catch different things:

Offline catches *known* regressions: a prompt edit that breaks an extraction format, a model swap that changes tone, a tool description that lowers call rate.
Online catches *unknown* regressions: a new customer segment phrasing requests differently, a slow quality decline, prompt-injection attempts, latency creep.

A practical rule: offline evals gate the deploy; online evals trigger the investigation. Online failures often become tomorrow's golden-dataset entries. The pipeline is a loop, not a line: production surprises flow back into the offline suite so the same surprise never ships twice.
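
One way to pick which live requests get scored is deterministic, hash-based sampling, so a given request ID always gets the same decision. This is a sketch under assumptions rather than a description of our pipeline: the helper names `fnv1a` and `shouldSample` are hypothetical, and the hash is an FNV-1a variant applied to UTF-16 code units.

```typescript
// Hypothetical helpers, not from the original article.

// 32-bit FNV-1a style hash over the string's UTF-16 code units.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Returns true when the request should be sampled for asynchronous scoring.
// `rate` is the target fraction of traffic, between 0 and 1.
function shouldSample(requestId: string, rate: number): boolean {
  if (rate <= 0) return false;
  if (rate >= 1) return true;
  // Map the hash into [0, 1) and compare against the sampling rate.
  return fnv1a(requestId) / 0x100000000 < rate;
}
```

Because the decision depends only on the ID, raising the rate later keeps every previously sampled request in the sample, which makes week-over-week comparisons easier.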

Building a Golden Dataset That Earns Its Keep

A golden dataset is a versioned collection of input/expected-behavior pairs. It is the single most valuable asset in your eval system and the one most often neglected.

Principles we hold to:

Source from reality, not imagination. The best cases come from production logs, support tickets, and bug reports, not from a brainstorm. Synthetic cases fill coverage gaps but should never dominate.
Every fixed bug becomes a case. When you find a regression, add the failing input to the set before you fix it. This is the LLM equivalent of a regression test and the mechanism that makes the suite compound in value.
Stratify deliberately. Include the easy 80%, but weight toward the adversarial edges: ambiguous inputs, multilingual content, malformed data, prompt-injection attempts, and the long-tail formats that broke things before.
Keep it small and sharp. A focused set of 100–300 high-signal cases that runs in two minutes beats 5,000 redundant cases nobody waits for. Coverage is about diversity of failure modes, not raw count.
Version it like code. The dataset lives in git next to the prompts. When you change expected behavior, that is a reviewable diff, not a silent edit.
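
The principles above can be partly enforced in code. The sketch below shows one possible shape for a golden case plus a sanity check to run before the suite. The field names, the `validateDataset` helper, and the 50% ceiling on synthetic cases are all illustrative assumptions.

```typescript
// Hypothetical shape for one golden-dataset entry; field names are illustrative.
interface GoldenCase {
  id: string; // stable identifier used in diffs and reports
  input: string; // the prompt or document fed to the system
  expected: string; // expected value, or a description of expected behavior
  tags: string[]; // strata such as "multilingual" or "prompt-injection"
  source: "production" | "bug-report" | "synthetic";
}

// Returns human-readable problems; an empty list means the set passed the checks.
function validateDataset(cases: GoldenCase[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.id)) problems.push(`duplicate id: ${c.id}`);
    seen.add(c.id);
    if (c.input.trim() === "") problems.push(`empty input: ${c.id}`);
    if (c.tags.length === 0) problems.push(`untagged case: ${c.id}`);
  }
  // Assumed threshold: flag the set when more than half of it is synthetic.
  const synthetic = cases.filter((c) => c.source === "synthetic").length;
  if (cases.length > 0 && synthetic / cases.length > 0.5) {
    problems.push("synthetic cases dominate the set");
  }
  return problems;
}
```

Storing cases as plain JSON or JSONL next to the prompts keeps changes to expected behavior visible as ordinary diffs in review.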

For tasks with a single correct answer (extraction, classification, structured parsing), your expected value is exact and scoring is trivial. The hard cases are open-ended: summaries, rewrites, conversational replies. Those need rubrics.

Scoring: Exact Match, Rubrics, and the Right Tool for Each

Not every output is scored the same way. Match the scorer to the task.

Deterministic scorers handle anything with an objective answer:

Exact string or normalized match for classification labels
JSON schema validation for structured extraction
Numeric tolerance for figures and calculations
Regex or substring checks for required fields

These are fast, free, and flake-free. Use them wherever the task allows. If you constrain output with structured outputs (a JSON schema on the model call), a large class of "is the format right" checks disappears entirely: the format is guaranteed, so your eval can focus on whether the *content* is correct.
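
As a rough illustration, three of the deterministic scorers listed above fit in a few lines each. The function names and the normalization rules (trim and lowercase) are assumptions chosen for the sketch, not a fixed convention.

```typescript
// Normalized exact match for classification labels: ignores case and outer whitespace.
function exactMatch(actual: string, expected: string): boolean {
  const norm = (s: string) => s.trim().toLowerCase();
  return norm(actual) === norm(expected);
}

// Numeric tolerance for figures: passes when the absolute error is within `tol`.
function withinTolerance(actual: number, expected: number, tol: number): boolean {
  return Math.abs(actual - expected) <= tol;
}

// Required-field check: the raw output must be a JSON object in which every
// listed field is present and neither null nor an empty string.
function hasRequiredFields(raw: string, fields: string[]): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // malformed JSON fails the check instead of throwing
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return false;
  }
  const obj = parsed as Record<string, unknown>;
  return fields.every((f) => obj[f] !== undefined && obj[f] !== null && obj[f] !== "");
}
```

Each scorer returns a plain boolean, so results can be aggregated into pass rates the same way as judge scores.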

Rubric scoring handles open-ended output. Instead of comparing to one golden string, you score against a checklist of independently gradeable criteria. A summary rubric might be:

Captures the three key facts from the source (0–1)
Introduces no information absent from the source (0–1)
Stays under the length limit (0–1)
Uses a neutral, professional register (0–1)

The discipline that makes rubrics work: criteria must be *specific and independently checkable*. "The summary is good" is unscorable and produces noisy results. "The summary mentions the refund deadline" is binary and stable. Vague criteria are the number-one cause of flaky LLM-as-judge scores.
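
To make "specific and independently checkable" concrete, a rubric can be represented as a list of named binary checks. The sketch below uses two criteria that happen to be checkable without a model. The criterion names, the `Criterion` type, and `scoreRubric` are hypothetical.

```typescript
// A rubric is a list of named criteria, each binary by design.
interface Criterion {
  name: string;
  check: (output: string) => boolean;
}

// Illustrative criteria only; real rubrics are task-specific.
const summaryRubric: Criterion[] = [
  { name: "mentions_refund_deadline", check: (o) => /refund deadline/i.test(o) },
  { name: "under_80_words", check: (o) => o.trim().split(/\s+/).length < 80 },
];

// Scores one output against every criterion and returns 0 or 1 per criterion name.
function scoreRubric(output: string, rubric: Criterion[]): Record<string, number> {
  const scores: Record<string, number> = {};
  for (const c of rubric) {
    scores[c.name] = c.check(output) ? 1 : 0;
  }
  return scores;
}
```

Criteria that cannot be expressed as a function like this are the ones that need a judge model, but the per-criterion result shape can stay the same.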

LLM-as-Judge: Powerful, but Calibrate It

For open-ended criteria that a regex cannot capture ("does this response stay on-brand?", "is this explanation factually grounded in the provided document?"), the practical scorer is another LLM acting as a judge. You hand it the input, the output, and a rubric, and ask for a per-criterion score with reasoning.

A few hard-won rules:

Judge per criterion, not holistically. Asking for one overall 1–10 score produces mush. Asking five binary yes/no questions produces signal you can act on.
Use a strong model as the judge. A capable model such as claude-opus-4-8 gives far more consistent judgments than a small one, and judge consistency is what your entire eval depends on. The cost is trivial next to the cost of a missed regression.
Force structure. Constrain the judge to emit a JSON object with a score and a short justification per criterion. This makes results parseable and gives you the reasoning to audit disagreements.
Calibrate against humans. Before you trust the judge, have a person score 30–50 cases and compare. If judge and human disagree often, the rubric is ambiguous; fix the rubric, not the threshold.

Here is a minimal, structured LLM-as-judge call using the Anthropic SDK. The schema guarantees the judge returns scores you can parse and aggregate:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const RUBRIC = `Score each criterion 0 or 1. Be strict.
1. grounded: every claim is supported by the SOURCE; no invented facts.
2. complete: mentions all THREE required facts from the SOURCE.
3. concise: stays under 80 words.
4. on_brand: neutral, professional register; no hype.`;

async function judge(source: string, output: string) {
  const res = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    thinking: { type: "adaptive" },
    system: "You are a strict evaluation judge. Score only what the rubric asks.",
    messages: [
      {
        role: "user",
        content: `RUBRIC:\n${RUBRIC}\n\nSOURCE:\n${source}\n\nOUTPUT TO SCORE:\n${output}`,
      },
    ],
    output_config: {
      format: {
        type: "json_schema",
        schema: {
          type: "object",
          properties: {
            grounded: { type: "integer", enum: [0, 1] },
            complete: { type: "integer", enum: [0, 1] },
            concise: { type: "integer", enum: [0, 1] },
            on_brand: { type: "integer", enum: [0, 1] },
            notes: { type: "string" },
          },
          required: ["grounded", "complete", "concise", "on_brand", "notes"],
          additionalProperties: false,
        },
      },
    },
  });

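  // The response can contain non-text blocks (for example thinking), so pick the text block explicitly.
  const block = res.content.find((b) => b.type === "text");
  if (!block || block.type !== "text") {
    throw new Error("judge returned no text block");
  }
  return JSON.parse(block.text) as Record<string, number | string>;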
}

The judge returns a per-criterion breakdown, not a single opaque grade. When a score drops, the notes field tells you *why*, which is the difference between a debuggable eval and a mysterious one.
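
The calibration step from the rules above can be sketched as a per-criterion agreement rate between judge scores and human scores on the same cases. The `agreementByCriterion` helper is hypothetical, and raw agreement is the simplest possible measure: it does not correct for agreement that would occur by chance.

```typescript
type Scores = Record<string, number>; // criterion name -> 0 or 1

// Fraction of cases where judge and human gave the same score, per criterion.
// Assumes both lists are in the same case order and share criterion names.
function agreementByCriterion(judge: Scores[], human: Scores[]): Record<string, number> {
  if (judge.length !== human.length || judge.length === 0) {
    throw new Error("judge and human score lists must be non-empty and aligned");
  }
  const agreement: Record<string, number> = {};
  for (const name of Object.keys(human[0])) {
    let same = 0;
    for (let i = 0; i < human.length; i++) {
      if (judge[i][name] === human[i][name]) same++;
    }
    agreement[name] = same / human.length;
  }
  return agreement;
}
```

A criterion with low agreement is the one to rewrite first, since disagreement usually points at ambiguous wording rather than a bad judge.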

Wiring Evals into CI to Gate Regressions

An eval suite that runs only when someone remembers to run it is theater. The value comes from automatic execution on every change to a prompt, a model ID, a tool definition, or any code in the LLM path.

The shape of a CI gate:

On pull request, run the offline suite against the golden dataset.
Aggregate per-criterion pass rates across all cases.
Compare against a committed baseline (the scores from main).
Fail the build if any metric drops beyond a tolerance.

The key design decision is to gate on aggregates, not individual cases. LLM output has inherent variance; a single case flipping is noise. A criterion's pass rate dropping from 0.94 to 0.78 across 200 cases is signal. Set a tolerance band (we typically allow a dip of 2–3 percentage points to absorb noise) and fail outside it.

# .github/workflows/evals.yml
name: llm-evals
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - name: Run eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: npm run eval -- --baseline=baseline.json --tolerance=0.03
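
The script behind `npm run eval` is not shown here. As a minimal sketch of the comparison it might perform, the code below aggregates per-case scores into pass rates and lists the criteria that fell below the baseline by more than the tolerance. The `aggregate` and `regressions` names and the data shapes are assumptions.

```typescript
type CaseScores = Record<string, number>; // criterion -> 0 or 1 for one case
type PassRates = Record<string, number>; // criterion -> pass rate in [0, 1]

// Mean score per criterion across all cases.
// Assumes every case reports every criterion.
function aggregate(results: CaseScores[]): PassRates {
  const sums: Record<string, number> = {};
  for (const r of results) {
    for (const [name, score] of Object.entries(r)) {
      sums[name] = (sums[name] ?? 0) + score;
    }
  }
  const rates: PassRates = {};
  for (const [name, sum] of Object.entries(sums)) {
    rates[name] = sum / results.length;
  }
  return rates;
}

// Lists every baseline criterion whose pass rate fell by more than `tolerance`.
// A criterion missing from the current run counts as a pass rate of 0.
function regressions(baseline: PassRates, current: PassRates, tolerance: number): string[] {
  return Object.keys(baseline).filter(
    (name) => baseline[name] - (current[name] ?? 0) > tolerance,
  );
}
```

In CI, a non-empty result from `regressions` would translate into a non-zero exit code so the build fails. Drops that land exactly on the tolerance are sensitive to floating-point rounding, so a small epsilon may be worth adding.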

Practical guardrails so the gate stays trustworthy:

Cache and parallelize the generation calls so the suite finishes in minutes. A suite that takes a coffee break to run is a suite people stop running.
Pin the judge model. If the judge version changes, your baseline is no longer comparable. Treat the judge model ID as part of the eval configuration and version it.
Make the baseline explicit and reviewable. When a metric *should* move (you intentionally changed behavior), updating the baseline is a deliberate commit, not an automatic overwrite.
Report the diff in the PR. A comment showing which criteria moved, and by how much, turns the gate from a blocker into a conversation.
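
The PR report can be as simple as one line per criterion. The sketch below assumes pass rates are stored as fractions between 0 and 1; `formatDiff` is a hypothetical helper and the line format is arbitrary. Posting the lines as a PR comment is left out because it depends on your CI provider.

```typescript
// Builds one plain-text line per baseline criterion, showing how it moved.
// A criterion missing from the current run is reported as 0.
function formatDiff(
  baseline: Record<string, number>,
  current: Record<string, number>,
): string[] {
  return Object.keys(baseline)
    .sort()
    .map((name) => {
      const before = baseline[name];
      const after = current[name] ?? 0;
      // Express the change in whole percentage points.
      const deltaPoints = Math.round((after - before) * 100);
      const sign = deltaPoints > 0 ? "+" : "";
      return `${name}: ${before.toFixed(2)} -> ${after.toFixed(2)} (${sign}${deltaPoints} pts)`;
    });
}
```

Sorting by criterion name keeps the comment stable between runs, so reviewers can compare two reports line by line.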

Catching Prompt Regressions Specifically

Prompt regressions are the most frequent and most invisible. They deserve targeted treatment because the blast radius of a one-line edit is genuinely surprising.

What we watch for:

Over-application. You add "always cite the source," and the model now appends citations to outputs that have no source to cite, fabricating them. The grounding criterion catches this.
Instruction collision. A new instruction contradicts an old one; the model resolves it unpredictably. The case-level diff shows which inputs flipped.
Format drift. A reworded prompt subtly changes output structure, breaking a downstream parser. Schema validation in the eval catches it before the parser does in production.
Model-swap interactions. A prompt tuned for one model behaves differently on another. Newer models tend to follow instructions more literally, so aggressive phrasing ("CRITICAL: YOU MUST...") that compensated for an older model's reluctance can cause over-triggering on a newer one. Re-run the full suite on any model change; never assume a prompt ports cleanly.
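
The case-level diff mentioned above can be sketched as a comparison of two runs keyed by case ID. The `RunResults` shape and the `flippedCases` helper are illustrative assumptions.

```typescript
type CaseResult = Record<string, number>; // criterion -> 0 or 1
type RunResults = Record<string, CaseResult>; // case id -> per-criterion scores

interface Flip {
  caseId: string;
  criterion: string;
  from: number;
  to: number;
}

// Lists every (case, criterion) pair whose score changed between two runs.
// Cases or criteria present in only one run are skipped, not counted as flips.
function flippedCases(before: RunResults, after: RunResults): Flip[] {
  const flips: Flip[] = [];
  for (const [caseId, prev] of Object.entries(before)) {
    const next = after[caseId];
    if (next === undefined) continue;
    for (const [criterion, from] of Object.entries(prev)) {
      const to = next[criterion];
      if (to !== undefined && to !== from) {
        flips.push({ caseId, criterion, from, to });
      }
    }
  }
  return flips;
}
```

A single flip is usually noise, but a cluster of flips on cases that share a tag is a strong hint about which instruction the model is now over-applying.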

The workflow that ties it together: keep prompts in version control, attach the eval suite to the prompt files, and require a green run before merge. When something does slip through to production and online evals flag it, the fix is not just patching the prompt; it is also adding the failing input to the golden set so the regression is permanently fenced off.

A Practical Rollout Order

You do not build all of this at once. The order that delivers value fastest:

1. Collect 30–50 real cases from logs and tickets. Write expected behavior for each.
2. Add deterministic scorers for everything with an objective answer. Free, instant, high coverage.
3. Write rubrics for your open-ended outputs: specific, binary criteria only.
4. Stand up an LLM judge, calibrate it against human scores on a sample, fix ambiguous criteria.
5. Commit a baseline and add the CI gate. Start with a generous tolerance, tighten as you learn the noise floor.
6. Add online sampling in production and route surprises back into the golden set.

Each step is useful on its own. Even step two, deterministic scorers on a handful of real cases, will catch regressions you are currently shipping blind.
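
One way to learn the noise floor is to run the suite several times on unchanged code and measure how much each pass rate moves between runs. The sketch below computes the sample standard deviation of those runs. Setting the tolerance to a multiple of it, with a default of `k = 2`, is our assumption for illustration and not a rule from this guide.

```typescript
// Sample standard deviation of repeated pass-rate measurements for one criterion.
function stdDev(values: number[]): number {
  if (values.length < 2) throw new Error("need at least two runs");
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((acc, v) => acc + (v - mean) ** 2, 0) / (values.length - 1);
  return Math.sqrt(variance);
}

// Hypothetical rule of thumb: tolerance = k standard deviations of run-to-run noise.
function suggestedTolerance(passRates: number[], k = 2): number {
  return k * stdDev(passRates);
}
```

For pass rates of 0.90, 0.92, and 0.94 across three identical runs, the standard deviation is about 0.02, so a tolerance tighter than that would fail builds on noise alone.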

Frequently Asked Questions

How many cases does a golden dataset need?

Quality beats quantity. A focused set of 100–300 high-signal cases covering diverse failure modes is far more useful than thousands of redundant ones. Start with 30–50 sourced from real logs and bug reports, then grow the set every time you fix a regression. Prioritize coverage of distinct failure types over raw volume, and keep the suite fast enough to run in CI.

Is LLM-as-judge reliable enough to gate deploys?

Yes, when calibrated. Use a strong model as the judge, score per individual criterion rather than holistically, force structured output, and validate judge scores against human grading on 30–50 cases before trusting it. Gate on aggregate pass rates across many cases, not single results, since per-case variance is expected. An uncalibrated judge with vague rubrics is unreliable; a calibrated one with binary criteria is dependable.

What is the difference between offline and online evals?

Offline evals run against a curated golden dataset before code ships and gate the deploy by catching known regressions. Online evals sample live production traffic after ship and surface unknown drift: new input shapes, slow quality decline, or upstream model changes. Offline answers "did this change break something I understand?"; online answers "what is breaking that I never anticipated?" You need both, and online failures should feed back into the offline set.

How do I stop CI evals from being flaky?

Gate on aggregate metrics across the full dataset, not individual case results, and set a tolerance band (a dip of 2–3 percentage points) to absorb normal variance. Pin the judge model version so baselines stay comparable, constrain outputs with structured schemas, and write binary, independently checkable rubric criteria. Most flakiness traces back to vague criteria or holistic scoring; fix the rubric before loosening the threshold.

When should I update the eval baseline?

Update it only when you have intentionally changed behavior and confirmed the new scores reflect a genuine improvement, not a regression. The baseline update should be a deliberate, reviewable commit, never an automatic overwrite, so a quiet quality drop cannot slip in disguised as a baseline refresh. If a metric moves and you did not intend it to, investigate before touching the baseline.

Working with CodeAustral

We build LLM features that stay reliable under change: evaluation harnesses, golden datasets, CI gating, and the production monitoring that keeps them honest. If you are shipping AI into a product and want it to survive its own iteration, tell us what you are building. Send a short brief at https://codeaustral.com/contact and we will tell you where the regressions are hiding.