A prompt that wins a demo and a prompt that survives production are rarely the same artifact. Demos forgive a clever one-shot string; products punish it the moment real users send inputs you never imagined, the model version shifts under you, or a junior engineer "improves" the wording at 2 a.m. This is a field guide to treating prompts as the load-bearing components they are: structured, versioned, tested, and owned.
The demo-to-production gap is an engineering gap, not a wording gap
Most teams that struggle with LLM features did not write a bad prompt. They wrote a prompt that was never engineered. It lives as a multi-line template literal in a request handler, interpolated with user data, edited in place whenever output looks off, and shipped without a test. It works in the demo because the demo uses three curated inputs.
In production the same prompt meets adversarial users, empty strings, 40 KB pastes, mixed languages, and a model upgrade six weeks later. The output drifts, nobody notices for a while because there are no assertions, and then a customer screenshot lands in support.
The fix is not a better sentence. It is the same discipline you already apply to any other critical dependency: a clear contract, a place where it is defined, a version, a test suite, and a rollback path. A prompt is code that runs on a probabilistic interpreter. Engineer it like it.
Structure the prompt: separate the stable from the volatile
The single most useful mental model is to split a prompt into layers by how often each part changes. This drives both correctness and cost.
- System layer — the role, the rules, the output contract, the refusal policy. Changes on the order of weeks. Identical across every request.
- Few-shot layer — curated input/output examples that pin down format and edge-case behavior. Changes when you learn something.
- Retrieved/context layer — documents, records, or tool results assembled per session.
- User layer — the actual request. Changes every call.
Keeping these physically ordered from most-stable to most-volatile is what makes prompt caching work, because caching is a prefix match: any byte change invalidates everything after it. Interpolate Current date: {now} into the top of your system prompt and you have just made your entire prompt uncacheable on every request. Put volatile values (timestamps, request IDs, the user's question) after the last cache breakpoint, and you can cut input cost on the stable prefix by roughly 90% while keeping latency low.
A concrete shape, using the Anthropic SDK with Claude Opus 4.8:
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
// SYSTEM_PROMPT and FEW_SHOT are frozen module constants — never
// interpolated with per-request data. That keeps the cached prefix stable.
const response = await client.messages.create({
model: "claude-opus-4-8",
max_tokens: 4096,
thinking: { type: "adaptive" },
system: [
{ type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
],
messages: [
...FEW_SHOT, // curated assistant/user example pairs
{ role: "user", content: userInput }, // the only volatile part
],
});Note what is *not* in the system prompt: the date, the user's name, the tenant ID. Anything that varies goes into the message layer, where it invalidates nothing upstream.
Make the output contract enforceable, not aspirational
"Return JSON" in a prompt is a hope. A schema the runtime enforces is a contract. Modern models support structured outputs that constrain the response to a JSON schema, which removes an entire category of production failures: the markdown fence around the JSON, the trailing prose, the renamed field, the hallucinated extra key.
This matters more than it looks. Once output is schema-constrained, the boundary between "the model" and "the rest of your system" becomes a typed interface. You can parse with confidence, your downstream code gets real types, and a malformed response becomes a caught exception instead of a silent corruption three layers deep.
const SCHEMA = {
type: "object",
properties: {
sentiment: { type: "string", enum: ["positive", "neutral", "negative"] },
summary: { type: "string" },
action_required: { type: "boolean" },
},
required: ["sentiment", "summary", "action_required"],
additionalProperties: false,
} as const;
const response = await client.messages.create({
model: "claude-opus-4-8",
max_tokens: 1024,
output_config: { format: { type: "json_schema", schema: SCHEMA } },
messages: [{ role: "user", content: ticketBody }],
});Two production notes. First, schema-constrained output is still subject to truncation: if you hit the token cap the JSON arrives incomplete, so size max_tokens for the worst case and check the stop reason. Second, a safety refusal can return output that does not match your schema — handle the refusal path explicitly rather than assuming every 200 is parseable.
Few-shot examples are a feature surface, not decoration
The instinct in a demo is to keep adding instructions. In production, two or three well-chosen examples usually outperform another paragraph of prose, because they pin behavior on exactly the edge cases that prose describes ambiguously.
Choose examples for what they teach, not for being representative:
- One happy path example that shows the exact output shape.
- One edge case that the model gets wrong without it — the empty field, the ambiguous input, the "I don't know" case.
- One boundary example showing what *not* to do, formatted as the desired refusal or fallback.
Be disciplined about cost and drift. Every example is permanent input tokens on every call, and a stale example that contradicts a newer instruction quietly degrades quality. Treat the few-shot block as code: review changes to it, and delete examples that have stopped earning their tokens. With recent models, over-prescriptive few-shot scaffolding can actually *reduce* quality — when you upgrade, A/B the prompt with some examples removed before assuming more is better.
Version prompts like the dependency they are
If you cannot answer "which exact prompt produced this output, and what changed since last week," you cannot operate an LLM feature. Prompts must be versioned artifacts with the same rigor as a database migration.
Minimum viable discipline:
- Prompts live in source control as named, versioned files — not inline string literals scattered across handlers. One module, one export per prompt, a version identifier in the artifact.
- The version is logged with every call. Store the prompt version alongside the request ID and the model ID in your observability layer. When output quality shifts, you need to know whether the prompt changed, the model changed, or the inputs changed.
- Model ID is part of the prompt's contract. A prompt tuned for one model is not guaranteed on another. Pin the model ID explicitly and treat a model upgrade as a change that re-triggers your test suite.
A pragmatic pattern is a small registry:
export const PROMPTS = {
"ticket-triage@4": {
model: "claude-opus-4-8",
system: SYSTEM_PROMPT_V4,
schema: TRIAGE_SCHEMA,
},
} as const;
// At call time, log which one ran:
logger.info("llm_call", {
prompt: "ticket-triage@4",
model: PROMPTS["ticket-triage@4"].model,
requestId: response._request_id,
});When you change the system prompt, you cut a new version (@5), run the eval suite against both, and roll forward only if it wins. The old version stays available for instant rollback. This is unremarkable for a library dependency; it should be unremarkable for a prompt.
Test prompts with evals, not vibes
You would not ship a payments function whose only test is "it looked right when I tried it." A prompt deserves the same. The catch: outputs are non-deterministic and judged on quality, not exact equality. That changes the *shape* of the test, not the need for one.
Build an eval set: a fixed collection of inputs paired with assertions about acceptable outputs. The assertions are graded in layers, cheapest first:
| Check type | What it catches | How it grades |
|---|---|---|
| Structural | Malformed JSON, missing fields, wrong enum values | Deterministic — schema validation, exact match |
| Property-based | Wrong length, hallucinated entities, leaked PII, off-topic | Deterministic — regex, set membership, programmatic rules |
| Semantic | "Is this summary accurate and on-tone?" | LLM-as-judge against a rubric, or human review |
Most regressions are caught by the cheap deterministic layers — a renamed field, output that exceeds a length bound, a refusal that should have been an answer. Reserve the expensive judge-based checks for the genuinely subjective dimensions. Curate the eval set from real production failures: every time a bad output reaches a user, that input becomes a permanent test case. The set grows into an institutional memory of every way the feature has broken.
Run the eval suite in two places: in CI on any prompt or model change (gate the merge on a pass-rate threshold, not perfection), and continuously against a sample of live traffic to catch drift the static set never anticipated.
Avoid prompt rot: the slow decay nobody schedules
Prompt rot is the gradual divergence between what a prompt says and what the system needs. It is not caused by one bad edit. It accumulates:
- A rule is bolted on to fix one customer complaint, contradicting an earlier rule.
- A few-shot example references a product feature that was removed.
- A "CRITICAL: YOU MUST" instruction was added to overcome an old model's reluctance; the new model follows the system prompt closely and now over-triggers because of it.
- The output schema gained a field, but the prompt's instructions still describe the old shape.
Rot is insidious because nothing throws an error. The feature keeps working, slightly worse, until a threshold is crossed. Defenses that actually work:
- Keep prompts short and orthogonal. Every instruction should earn its place. When you add a rule, look for the one it supersedes and delete it, rather than letting both linger.
- Re-baseline on every model upgrade. A model change is the single most common trigger for latent rot to surface. Run the eval suite, and specifically review aggressive instruction language — newer models are more literal, so the guardrails you wrote for an older, more reluctant model often become liabilities. Dial them back to plain conditional language ("Use the lookup tool when the answer depends on current data") instead of escalating threats.
- Date and own your prompts. A prompt with no owner and no last-reviewed date is a prompt rotting in the dark. Put it in a
CODEOWNERSpath and review it on a cadence. - Treat the eval pass rate as a health metric. A slowly declining pass rate on a fixed eval set is the clearest early signal of rot. Watch it like you watch error rates.
A minimal production checklist
Before an LLM-backed feature ships, it should clear this bar:
- The prompt is a versioned file in source control, not an inline literal.
- The output is schema-constrained, and the parse path handles truncation and refusal.
- The stable prefix is ordered for caching; nothing volatile precedes a cache breakpoint.
- There is an eval set with deterministic checks, gating CI.
- Every call logs the prompt version, model ID, and request ID.
- There is a rollback path — a previous prompt version you can pin in one deploy.
- The model ID is pinned, and upgrades re-run the evals before rollout.
None of this is exotic. It is the same engineering hygiene you apply to any dependency that can fail in production. The only novelty is that the dependency is a prompt, and the interpreter is probabilistic — which is precisely why the discipline matters more, not less.
Frequently Asked Questions
What is the difference between a demo prompt and a production prompt?
A demo prompt is optimized for a handful of curated inputs and judged by eye. A production prompt is a versioned, schema-constrained artifact with an eval suite, structured logging of prompt and model versions, and a rollback path. The wording may be similar; the engineering around it is entirely different and is what makes it survivable.
How do you test prompts when outputs are non-deterministic?
Build an eval set of fixed inputs with layered assertions: deterministic structural checks (schema validity, required fields), property-based rules (length bounds, no leaked PII, correct enums), and reserved LLM-as-judge or human review for genuinely subjective quality. Grade cheapest-first, gate CI on a pass-rate threshold, and grow the set from real production failures.
What causes prompt rot and how do you prevent it?
Prompt rot is the slow divergence between what a prompt says and what the system needs — contradictory bolt-on rules, stale examples, and instructions tuned for a retired model. Prevent it by keeping prompts short and orthogonal, re-baselining on every model upgrade, assigning owners and review dates, and tracking the eval pass rate as a health metric since rot rarely throws an error.
Why should prompts be versioned in source control?
Because you cannot operate a feature you cannot reason about. Versioned prompt files let you answer which exact prompt produced a given output, run evals against old and new versions before rolling forward, roll back in a single deploy, and correlate quality shifts with prompt, model, or input changes through logged version identifiers. Inline string literals make all of that impossible.
Does structured output replace prompt engineering?
No — it removes one failure class. Schema-constrained output guarantees a parseable shape and a typed boundary between the model and your system, eliminating malformed JSON and renamed fields. It does not make the content correct, on-tone, or complete. You still need clear instructions, few-shot examples, evals, and handling for truncation and safety refusals, which can return output that does not match the schema.
Working with CodeAustral
We build AI-backed product features that hold up under real traffic — structured prompts, eval suites, versioning, and the observability to know when something drifts. If you have an LLM feature that demos well but you are not yet confident shipping it, send us a brief at https://codeaustral.com/contact and we will tell you, candidly, what it would take to make it production-grade.

