Cutting LLM Costs Without Hurting Quality: A Field Guide

Most teams discover their LLM bill is too high the same way: a finance email, a sudden spike on a dashboard, or a free tier that quietly turned into a five-figure monthly invoice. The instinct is to swap everything to the cheapest model and hope. That usually trades a cost problem for a quality problem, and quality problems are far more expensive. This field guide is the playbook we use at CodeAustral when we are asked to cut inference spend on a production system without degrading the output our clients actually ship.

The core idea is simple: cost is not one number, it is a distribution across tasks. You do not optimize "the LLM bill." You optimize each task independently, measure the result, and keep only the cuts that hold quality. Everything below is in service of that loop.

Measure Cost Per Task Before You Touch Anything

You cannot optimize what you cannot attribute. The single highest-leverage move is to tag every model call with the task it serves, then aggregate spend by task rather than by day or by user.

A "task" is a unit of work with a stable definition of done: classify a support ticket, draft a product description, extract fields from an invoice, answer a RAG question. Each task has its own quality bar and its own cost ceiling. A summarization task that costs half a cent and a multi-step agent run that costs forty cents should never sit in the same bucket.

Instrument the call site, not just the gateway. At minimum log:

task_name and a prompt_version
input tokens, output tokens, and the model id used
cached vs. uncached input tokens (these are billed differently)
latency and whether the call was retried
a quality signal where you have one (thumbs, eval score, downstream conversion)

type UsageRecord = {
  taskName: string;
  promptVersion: string;
  model: string;
  inputTokens: number;
  cachedInputTokens: number;
  outputTokens: number;
  latencyMs: number;
  retries: number;
  qualityScore?: number; // optional: eval score, thumbs, or conversion signal where available
};

// Plug your provider's published per-million-token rates here.
// Keep input, cached-input, and output as separate rates — they differ a lot.
const RATES: Record<string, { in: number; cachedIn: number; out: number }> = {
  // values are USD per 1M tokens; update from your provider's current pricing
};

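function costUsd(r: UsageRecord): number {
  const p = RATES[r.model];
  if (!p) throw new Error(`No rate for model ${r.model}`);
  // Assumes inputTokens is the total and already includes the cached portion.
  // Some providers report cached tokens separately; if yours does, skip the subtraction.
  const freshIn = Math.max(0, r.inputTokens - r.cachedInputTokens);
  return (
    (freshIn * p.in + r.cachedInputTokens * p.cachedIn + r.outputTokens * p.out) /
    1_000_000
  );
}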

The metric that matters is cost per successful task, not cost per call. A cheaper model that fails 8% of the time still bills for every failed attempt, and each failure adds a retry on a frontier model. If failures are frequent or go undetected, it can end up costing more than the frontier model alone. Always divide spend by useful output.
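
As a rough illustration of that metric, the sketch below aggregates spend by task and divides by successes. The `TaskCall` shape and the `succeeded` flag are assumptions for this example: one row per call, with `succeeded` marking the call whose output was accepted.

```typescript
// Hypothetical record shape: one row per model call, with its computed cost
// and whether this call produced the output that was finally accepted.
type TaskCall = { taskName: string; costUsd: number; succeeded: boolean };

type TaskSummary = { spendUsd: number; successes: number; costPerSuccess: number };

function costPerSuccessfulTask(calls: TaskCall[]): Map<string, TaskSummary> {
  const byTask = new Map<string, TaskSummary>();
  for (const c of calls) {
    const s = byTask.get(c.taskName) ?? {
      spendUsd: 0,
      successes: 0,
      costPerSuccess: Infinity,
    };
    s.spendUsd += c.costUsd; // failed attempts and retries still count as spend
    if (c.succeeded) s.successes += 1;
    // A task with spend but no successes reports Infinity rather than a misleading zero.
    s.costPerSuccess = s.successes > 0 ? s.spendUsd / s.successes : Infinity;
    byTask.set(c.taskName, s);
  }
  return byTask;
}
```

Feeding it the output of `costUsd` per record gives a per-task table you can sort by spend, which is usually enough to see where the first cut should go.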

Route the Right Model to the Right Job

Model routing is the biggest lever, and it is conceptually obvious: send hard tasks to capable models and easy tasks to cheap ones. The hard part is deciding "hard" reliably.

Across the Claude (Haiku / Sonnet / Opus), GPT, and Gemini families, the per-token price gap between the small and large model in a family is commonly 5x to 20x. If even half your traffic is genuinely easy, routing it down a tier is the largest cut you will find anywhere.
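
To make the size of that cut concrete, here is a back-of-the-envelope sketch. It assumes the routed traffic uses the same number of tokens on either model and that `priceRatio` is a single blended figure, both of which are simplifications.

```typescript
// Fraction of baseline spend left after moving part of the traffic to a cheaper model.
// easyShare: fraction of current spend that is eligible to move (0 to 1).
// priceRatio: large-model price divided by small-model price (for example 5 to 20).
function blendedCostFraction(easyShare: number, priceRatio: number): number {
  if (easyShare < 0 || easyShare > 1) {
    throw new RangeError("easyShare must be between 0 and 1");
  }
  if (priceRatio <= 0) throw new RangeError("priceRatio must be positive");
  // Traffic that stays put costs the same; moved traffic costs 1/priceRatio as much.
  return 1 - easyShare + easyShare / priceRatio;
}

const remainingAt5x = blendedCostFraction(0.5, 5); // half the traffic, 5x gap
const remainingAt20x = blendedCostFraction(0.5, 20); // half the traffic, 20x gap
```

With half the traffic moved, spend falls to roughly 60% of baseline at a 5x gap and about 52.5% at a 20x gap. Most of the saving comes from how much traffic you can move, not from how wide the gap is.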

Static routing by task

Start here. It is boring and it works. Map each task to a tier based on offline evals:

Cheap tier (small models): classification, routing, extraction with a fixed schema, short rewrites, intent detection.
Mid tier: most RAG answers, drafting, code edits with clear context, structured summarization.
Top tier: ambiguous multi-step reasoning, agentic planning, anything where a wrong answer is costly or hard to detect.
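
A static routing table can be as small as the sketch below. The task names and model ids are placeholders invented for this example; the tier assignments should come from your own offline evals.

```typescript
type Tier = "cheap" | "mid" | "top";

// Placeholder model ids; substitute the ids your provider actually exposes.
const MODEL_BY_TIER: Record<Tier, string> = {
  cheap: "small-model",
  mid: "mid-model",
  top: "large-model",
};

// Hypothetical task names, mapped to tiers based on offline eval results.
const TIER_BY_TASK: Record<string, Tier> = {
  "classify-ticket": "cheap",
  "extract-invoice-fields": "cheap",
  "rag-answer": "mid",
  "draft-product-description": "mid",
  "agent-plan": "top",
};

function modelForTask(taskName: string): string {
  // Unmapped tasks default to the top tier: overspending shows up on the
  // dashboard, while a silent quality regression does not.
  const tier = TIER_BY_TASK[taskName] ?? "top";
  return MODEL_BY_TIER[tier];
}
```

Defaulting unknown tasks upward is a design choice, not a requirement. It trades some spend for safety until the new task has been evaluated.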

Dynamic routing and cascades

A cascade runs the cheap model first, then escalates only when a confidence or validation check fails. This pays off when most inputs are easy but a minority genuinely need the big model.

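// Hypothetical helpers: wire these to your own client and validators.
type Question = { text: string };
type Draft = { text: string; selfConfidence: number };
declare function call(model: string, input: Question): Promise<Draft>;
declare function passesValidation(draft: Draft): boolean;

async function answerWithCascade(input: Question): Promise<Draft> {
  const draft = await call("small-model", input);
  if (passesValidation(draft) && draft.selfConfidence >= 0.8) {
    return draft; // most traffic stops here
  }
  return call("large-model", input); // escalate the hard tail
}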

The tradeoff is real: escalated requests pay for two model calls and add latency. A cascade only wins when the cheap model resolves a clear majority of traffic, and the savings shrink with every point of escalation. Once escalation climbs past roughly a third, the double-spend and added latency have usually eaten most of the benefit. Measure it, do not assume it.
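
The token arithmetic behind that tradeoff can be sketched as follows. It counts price only and assumes you already know the average per-request cost of each model on this task. Latency on escalated requests and the option of static routing are not modeled, and both argue for a tighter threshold than the raw break-even.

```typescript
// Expected token cost per request for a two-step cascade.
// smallCost and largeCost are average per-request costs of each model on this task;
// escalationRate is the fraction of requests that fall through to the large model.
function cascadeCost(smallCost: number, largeCost: number, escalationRate: number): number {
  // Every request pays for the small model; escalated ones also pay for the large one.
  return smallCost + escalationRate * largeCost;
}

// Escalation rate at which the cascade costs the same as always calling the large model.
function breakEvenEscalationRate(smallCost: number, largeCost: number): number {
  return 1 - smallCost / largeCost;
}
```

Compare `cascadeCost` against both alternatives you actually have: always calling the large model, and routing statically by task. A cascade that beats the first can still lose to the second.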

Avoid letting a router LLM decide the model on every request. The router call itself costs tokens and latency, and a misroute is a quality regression you will not see until a user complains. Prefer cheap deterministic signals (task type, input length, a fast classifier) over asking a model to grade difficulty.
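
A deterministic router built from those signals might look like this minimal sketch. The task names and the token threshold are hypothetical and should be tuned against your own evals.

```typescript
type RouteInput = { taskName: string; inputTokens: number };

// Hypothetical threshold and task list; both are assumptions for illustration.
const LONG_INPUT_TOKENS = 8_000;
const EASY_TASKS = new Set(["classify-ticket", "detect-intent", "extract-fields"]);

function routeModel(req: RouteInput): "small-model" | "large-model" {
  // Signal 1: task type. Only tasks proven easy offline are eligible for the small model.
  if (!EASY_TASKS.has(req.taskName)) return "large-model";
  // Signal 2: input length. Very long inputs go up a tier even for easy tasks.
  if (req.inputTokens > LONG_INPUT_TOKENS) return "large-model";
  return "small-model";
}
```

The router costs no tokens, adds no measurable latency, and its decisions are reproducible, so a misroute can be traced to a rule instead of a model's judgment.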

Use Prompt Caching for Stable Context

If your prompts share a large, stable prefix (a system prompt, tool definitions, a few-shot block, a long document reused across questions), prompt caching is close to free money. Anthropic, OpenAI, and Google all offer it, and cached input tokens are billed at a steep discount versus fresh input tokens (often around a 90% reduction on the cached portion for Claude and Gemini). The exact discount, the cache lifetime, and any surcharge for writing the cache vary by provider and model, so confirm against current pricing before you budget on it.

To benefit, the cacheable content must be a literal prefix and byte-stable. Order your messages so the invariant part comes first:

System prompt and policies (stable, cache this)
Tool/function definitions (stable)
Few-shot examples or the reference document (stable per session)
The user's actual request (variable, goes last)

Common mistakes that silently break the cache: injecting a timestamp or request id into the system prompt, reordering tools per call, or shuffling few-shot examples. Any of these changes the prefix and you pay full price. In a multi-turn agent, caching the system prompt and tool block across every step is one of the most reliable cuts available, because that prefix is re-sent on every turn.
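
The ordering rule can be sketched with plain strings. This is a stand-in for your provider's message structure, not a real API call, and how caching is switched on differs by provider. `sharedPrefixLength` is a hypothetical helper for checking in a test that two requests really share the prefix you expect to be cached.

```typescript
type PromptParts = {
  systemPrompt: string; // stable
  toolDefinitions: string[]; // stable, must keep a fixed order
  referenceBlock: string; // stable per session
  userRequest: string; // variable, always last
};

// Builds the prompt with the invariant content first so it can act as a cacheable prefix.
function buildPrompt(p: PromptParts): { prefix: string; full: string } {
  const prefix = [p.systemPrompt, ...p.toolDefinitions, p.referenceBlock].join("\n\n");
  return { prefix, full: `${prefix}\n\n${p.userRequest}` };
}

// Length of the identical leading run of two prompts.
function sharedPrefixLength(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a[i] === b[i]) i++;
  return i;
}
```

A regression test that asserts `sharedPrefixLength` covers the whole prefix for two different user requests will catch an injected timestamp or a reordered tool list before it reaches the invoice.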

Compress Prompts Without Losing Signal

Every token in the input is a token you pay for on every call. Prompt compression is about removing tokens that do not change the answer.

Pragmatic wins, in rough order of return:

Trim the obvious bloat. Redundant instructions, politeness padding, repeated schema descriptions, and verbose XML where a compact format would do.
Cut few-shot examples to the minimum that holds eval scores. Teams routinely run 8 examples when 2 perform identically. Test it.
Retrieve less, but better. In RAG, stuffing 20 chunks "to be safe" is the most common cost leak we find. Tighter retrieval and reranking down to the 3 to 5 chunks that matter often *improves* answers while cutting input tokens dramatically.
Cap output length. Output tokens are typically the most expensive tokens. Ask for the format you need — JSON, a single sentence, a bounded list — and set a sensible max. Do not let a model narrate when you only need a value.
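
For the retrieval point, a minimal sketch of trimming reranked chunks to a count and a token budget is shown below. The `Chunk` shape is an assumption: `score` is taken to be a reranker score where higher means more relevant, and `tokens` a precomputed token count.

```typescript
type Chunk = { id: string; text: string; score: number; tokens: number };

// Keeps the highest-scoring chunks, up to maxChunks and within a token budget.
function selectChunks(chunks: Chunk[], maxChunks: number, tokenBudget: number): Chunk[] {
  // Copy before sorting so the caller's array is left untouched.
  const ranked = [...chunks].sort((a, b) => b.score - a.score);
  const kept: Chunk[] = [];
  let used = 0;
  for (const c of ranked) {
    if (kept.length >= maxChunks) break;
    if (used + c.tokens > tokenBudget) continue; // skip chunks that would overflow the budget
    kept.push(c);
    used += c.tokens;
  }
  return kept;
}
```

Start with a `maxChunks` in the 3 to 5 range mentioned above, then move it only when the eval set says the answer quality changed.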

Be cautious with aggressive automated compression tools that paraphrase or drop tokens by perplexity. They can save input cost but occasionally strip the one detail the task depended on. Gate any compression behind your eval set; if quality moves, the savings are not real.

Batch and Defer What Is Not Real-Time

A large share of LLM work is not interactive: nightly classification, backfills, enrichment, evals, report generation. For that work, the batch APIs from major providers offer roughly a 50% discount in exchange for asynchronous turnaround (often within a 24-hour window).

The decision is straightforward:

Real-time, user-facing → synchronous, optimize latency.
Background, can wait minutes to hours → batch API, take the discount.
High-volume, tolerant of delay → batch, and right-size the model per item.

Batching also smooths rate limits and reduces retry storms. The tradeoff is operational: you need a queue, idempotent jobs, and a way to reconcile partial failures. For anything that runs on a schedule rather than on a click, the discount is usually worth that plumbing.
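
The decision above reduces to a small function. In this sketch the `Job` shape is hypothetical, the 24-hour window is an assumed turnaround bound, and the 0.5 discount mirrors the rough figure quoted earlier, not a quoted price.

```typescript
type Job = { userIsWaiting: boolean; maxDelayHours: number };

// Assumed turnaround window for the batch endpoint; check your provider's documented window.
const BATCH_WINDOW_HOURS = 24;

function executionMode(job: Job): "sync" | "batch" {
  if (job.userIsWaiting) return "sync";
  // Conservative rule: only batch work that can tolerate the full window.
  return job.maxDelayHours >= BATCH_WINDOW_HOURS ? "batch" : "sync";
}

// Estimated spend after the batch discount (0.5 is an assumption, not a quoted price).
function batchCostUsd(syncCostUsd: number, discount = 0.5): number {
  return syncCostUsd * (1 - discount);
}
```

The rule is deliberately conservative. Batches often finish well inside the window, but the window is the only bound you can plan around, so jobs with tighter deadlines stay synchronous unless you have measured real turnaround times.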

Use Smaller Models for Sub-Tasks Inside a Pipeline

The most durable savings come from decomposition. A single "do everything" prompt to a frontier model is easy to write and expensive to run. Breaking a job into sub-tasks lets you route each step to the cheapest model that clears its bar.

A document-processing pipeline might look like:

Triage / route with a small model (or a fine-tuned classifier), at a cost of cents per thousand documents.
Extract structured fields with a small-to-mid model and a strict schema.
Synthesize the final judgment with a top-tier model, but only over the small, distilled context the earlier steps produced.

The frontier model now sees a fraction of the tokens and does only the part that needs its capability. This is also where fine-tuning or distillation earns its keep: if a narrow sub-task runs millions of times, distilling a small model on the frontier model's outputs can collapse that step's cost while holding quality. The tradeoff is upfront effort and a model you now have to maintain and re-evaluate as inputs drift — worth it only at volume.
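
To see why decomposition pays, the sketch below prices a single frontier call against a three-step pipeline. Every number in it is a placeholder: the rates are illustrative, not real prices, and the token counts are invented for one hypothetical 20,000-token document.

```typescript
type Tier = "cheap" | "mid" | "top";
type Rate = { in: number; out: number }; // USD per 1M tokens
type Step = { name: string; tier: Tier; inputTokens: number; outputTokens: number };

function pipelineCostUsd(steps: Step[], rates: Record<Tier, Rate>): number {
  let total = 0;
  for (const s of steps) {
    const r = rates[s.tier];
    total += (s.inputTokens * r.in + s.outputTokens * r.out) / 1_000_000;
  }
  return total;
}

// Placeholder rates for illustration only, not real prices.
const EXAMPLE_RATES: Record<Tier, Rate> = {
  cheap: { in: 1, out: 4 },
  mid: { in: 4, out: 16 },
  top: { in: 16, out: 64 },
};

// One "do everything" call: the frontier model reads the whole document.
const monolithic: Step[] = [
  { name: "do-everything", tier: "top", inputTokens: 20_000, outputTokens: 1_000 },
];

// Decomposed: cheaper tiers read the document, the top tier sees only distilled context.
const decomposed: Step[] = [
  { name: "triage", tier: "cheap", inputTokens: 20_000, outputTokens: 20 },
  { name: "extract", tier: "mid", inputTokens: 20_000, outputTokens: 500 },
  { name: "synthesize", tier: "top", inputTokens: 1_500, outputTokens: 500 },
];

const monolithicUsd = pipelineCostUsd(monolithic, EXAMPLE_RATES);
const decomposedUsd = pipelineCostUsd(decomposed, EXAMPLE_RATES);
```

Under these assumed numbers the decomposed pipeline costs well under half of the single call, even though the document is read twice, because the expensive tier only touches the distilled context. Rerun the comparison with your own rates and token counts before committing to the extra moving parts.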

Hold the Quality Line: The Tradeoffs That Bite

Every technique here has a failure mode. Name them before you ship.

Cheaper model, hidden regressions. Small models fail on long-context reasoning and instruction-following edge cases. Always re-run your eval set, not a handful of spot checks.
Cascades that escalate too often. Every escalated request pays for two calls, so a high escalation rate erodes the savings and adds latency. Monitor escalation rate as a first-class metric.
Caching that masks staleness. A cached prefix containing stale policy or pricing can produce confidently wrong answers. Version your cached content and invalidate on change.
Over-compression. Dropped context shows up as subtle quality loss, not loud errors. Gate on evals.
Optimizing the wrong thing. If 80% of spend is one task, shaving 5% off the other tasks is busywork. Follow the cost-per-task distribution.

The discipline that ties it together: a versioned eval set per task and a dashboard of cost-per-successful-task. Make a change, run the evals, compare quality and cost together. Ship only when quality holds. That loop is what separates real savings from a regression you will pay for later in churn.
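
That loop can be reduced to a gate you run in CI. This is a minimal sketch: the `Measurement` shape and the `maxQualityDrop` tolerance are assumptions, and a single eval score stands in for whatever quality metric your task actually uses.

```typescript
type Measurement = { evalScore: number; costPerSuccessUsd: number };

// Ship only when quality holds within a tolerance and cost actually drops.
// maxQualityDrop is a hypothetical knob; 0 means no regression is tolerated.
function shouldShip(
  baseline: Measurement,
  candidate: Measurement,
  maxQualityDrop = 0,
): boolean {
  const qualityHolds = candidate.evalScore >= baseline.evalScore - maxQualityDrop;
  const costImproves = candidate.costPerSuccessUsd < baseline.costPerSuccessUsd;
  return qualityHolds && costImproves;
}
```

Run it per task against the versioned eval set, with the baseline taken from the prompt and model currently in production.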

Frequently Asked Questions

How much can I realistically cut from an LLM bill?

For a system that has never been optimized, 40% to 70% is common without touching quality, mostly from routing easy traffic to smaller models, prompt caching stable prefixes, and tightening RAG retrieval. The exact figure depends on how much of your traffic is genuinely easy. Diminishing returns set in once the largest task is right-sized.

Does prompt caching actually save money or just latency?

Both, but the savings are real. Cached input tokens are billed at a large discount versus fresh input tokens across major providers. The catch is that the cached content must be a stable, byte-identical prefix. If you inject timestamps or reorder tools per request, you break the cache and pay full price without realizing it.

When should I use a batch API instead of real-time calls?

Use batch for any work a user is not actively waiting on: backfills, nightly classification, enrichment, evals, and report generation. Providers typically offer around 50% off in exchange for asynchronous turnaround within a set window. The cost is operational complexity — queues, idempotency, failure reconciliation — which pays off for scheduled or high-volume jobs.

Should a model decide the routing, or should I use fixed rules?

Prefer cheap deterministic signals — task type, input length, a fast classifier — over asking an LLM to grade difficulty on every request. A router model adds its own token cost and latency, and a misroute is a silent quality regression. Reserve dynamic, model-driven escalation for cascades where a cheap model handles most traffic and only the hard tail escalates.

How do I avoid degrading quality while cutting cost?

Keep a versioned evaluation set per task and measure cost per successful task, not cost per call. Make one change at a time, re-run the evals, and compare quality and cost together. Ship only when quality holds. A cheaper model that keeps triggering retries on a frontier model can cost more than never switching.

Working with CodeAustral

We build and optimize production LLM systems for clients across web platforms, AI products, and restaurant tech — including the cost-per-task instrumentation, routing, and eval loops described here. If your inference bill is climbing faster than your usage, or you want a second opinion before swapping models, send us a short brief at codeaustral.com/contact and we will tell you where the real savings are.