Retrieval-augmented generation is the workhorse pattern behind most production AI features that need to answer from your own data: support assistants, internal search, document Q&A, agent memory. The idea is simple — retrieve the right context, then let the model reason over it — but the gap between a weekend demo and something you can put in front of paying users is wide. This guide walks the full pipeline as we build it at CodeAustral, with concrete TypeScript you can adapt, and an honest take on where RAG earns its keep versus fine-tuning.
What a Production RAG Pipeline Actually Looks Like
A RAG system is a sequence of stages, and a weakness in any one of them caps the quality of the whole. The end-to-end flow:
1. Ingestion and chunking: split source documents into retrievable units.
2. Embedding: turn each chunk into a vector.
3. Vector storage and indexing: persist vectors with metadata for fast similarity search.
4. Retrieval: fetch candidate chunks for a query (usually hybrid: vector + keyword).
5. Reranking: reorder candidates by true relevance before they hit the prompt.
6. Grounding: assemble a prompt that forces the model to answer from retrieved context and cite it.
7. Evaluation and observability: measure retrieval and answer quality, continuously.
The demo skips 4 through 7. The production system lives or dies by them. Most "RAG doesn't work" complaints trace back to retrieval quality, not the model.
Chunking: The Decision That Quietly Sets Your Ceiling
Chunking determines what can ever be retrieved. Get it wrong and no amount of model quality recovers — the relevant sentence simply isn't in any retrievable unit, or it's buried in a 3,000-token wall of noise.
Principles that hold up in production:
- Chunk on structure, not character count. Split on headings, list items, and paragraph boundaries first. A naive `text.slice(0, 1000)` cuts mid-sentence and mid-table.
- Target 200–500 tokens per chunk for most prose. Smaller chunks improve retrieval precision; larger chunks preserve context. Bias smaller and recover context at retrieval time.
- Overlap by 10–15% so a fact spanning a boundary survives in at least one chunk.
- Carry metadata on every chunk — source URL, document title, section heading, timestamp. You need this for filtering, citation, and debugging.
A pragmatic structure-aware splitter in TypeScript:
```typescript
interface Chunk {
  text: string;
  metadata: {
    sourceId: string;
    title: string;
    heading?: string;
    position: number;
  };
}

function chunkMarkdown(
  doc: { id: string; title: string; body: string },
  { maxChars = 1600, overlap = 200 } = {},
): Chunk[] {
  // Split on headings/blank lines first to respect document structure.
  const blocks = doc.body.split(/\n(?=#{1,6}\s)|\n\s*\n/);
  const chunks: Chunk[] = [];
  let buffer = "";
  let heading: string | undefined;
  let position = 0;

  const flush = () => {
    if (!buffer.trim()) return;
    chunks.push({
      text: buffer.trim(),
      metadata: { sourceId: doc.id, title: doc.title, heading, position: position++ },
    });
    // Carry the tail into the next chunk. slice(-0) would return the whole
    // buffer, so overlap = 0 is handled explicitly.
    buffer = overlap > 0 ? buffer.slice(-overlap) : "";
  };

  for (const block of blocks) {
    // Flush before updating the heading so the finished chunk keeps the
    // heading of the section it came from, not the next one.
    if (buffer.length + block.length > maxChars) flush();
    const h = block.match(/^#{1,6}\s+(.*)$/m);
    if (h) heading = h[1];
    buffer += `\n${block}`;
  }
  flush();
  return chunks;
}
```
At roughly four characters per token, `maxChars = 1600` lands near 400 tokens. Tune for your corpus: dense technical docs want smaller chunks; narrative content tolerates larger ones.
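The sizing arithmetic can be made explicit. The sketch below is illustrative only: `chunkBudget` and `CHARS_PER_TOKEN` are hypothetical names, and the four-characters-per-token ratio is the rule of thumb from the text, not a tokenizer measurement.

```typescript
// Rough rule of thumb from the text; real ratios vary by tokenizer and language.
const CHARS_PER_TOKEN = 4;

// Hypothetical helper: convert a token target and an overlap ratio into the
// character-based options a splitter such as chunkMarkdown expects.
function chunkBudget(targetTokens: number, overlapRatio: number) {
  const maxChars = Math.round(targetTokens * CHARS_PER_TOKEN);
  const overlap = Math.round(maxChars * overlapRatio);
  return { maxChars, overlap };
}

const budget = chunkBudget(400, 0.125);
console.log(budget); // { maxChars: 1600, overlap: 200 }
```

A 400-token target with 12.5% overlap reproduces the defaults used in the splitter, which sits inside the 10–15% overlap range recommended earlier.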
Embeddings and the Vector Store
Each chunk becomes a vector via an embedding model. The two decisions that matter:
- Pick one embedding model and stay on it. Vectors from different models aren't comparable. Switching models means re-embedding the entire corpus — budget for it.
- Match dimensions to your store and your scale. Higher-dimensional embeddings capture more nuance but cost more to store and search. For most corpora under a few million chunks, a 1,024–1,536 dimension model is the sweet spot.
For storage, you do not need a dedicated vector database to start. `pgvector` on Postgres is the right default for most teams: you keep one operational system, get transactional consistency between chunks and metadata, and can filter on structured columns in the same query. Reach for a specialized store (Qdrant, Pinecone, Milvus) when you're past tens of millions of vectors or need sharded, high-QPS search.
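To put the dimension trade-off in numbers, here is a back-of-the-envelope estimate of raw vector storage. It assumes 32-bit floats (4 bytes per dimension) and ignores index and row overhead, so treat the result as a lower bound; the function names are hypothetical.

```typescript
// Assumption: embeddings are stored as 32-bit floats (4 bytes per dimension).
// Index structures and per-row overhead come on top of this raw figure.
const BYTES_PER_FLOAT32 = 4;

function rawVectorBytes(chunkCount: number, dimensions: number): number {
  return chunkCount * dimensions * BYTES_PER_FLOAT32;
}

function toGiB(bytes: number): number {
  return bytes / 1024 ** 3;
}

const at1024 = toGiB(rawVectorBytes(2_000_000, 1024)).toFixed(2);
const at1536 = toGiB(rawVectorBytes(2_000_000, 1536)).toFixed(2);
console.log(at1024); // "7.63"
console.log(at1536); // "11.44"
```

At two million chunks, the raw vectors fit comfortably on a single Postgres instance at either dimension, which is consistent with starting on `pgvector`.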
A minimal pgvector schema with hybrid search support:
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id BIGSERIAL PRIMARY KEY,
  source_id TEXT NOT NULL,
  title TEXT,
  heading TEXT,
  content TEXT NOT NULL,
  embedding VECTOR(1024) NOT NULL,
  tsv TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
  created_at TIMESTAMPTZ DEFAULT now()
);

-- Approximate nearest-neighbour index for vector search
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- Full-text index for keyword search
CREATE INDEX ON chunks USING gin (tsv);
```
The `hnsw` index gives fast approximate nearest-neighbour search; the `gin` index on the generated `tsvector` powers the keyword half of hybrid retrieval.
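Because the column is declared as `VECTOR(1024)`, an embedding of any other length will be rejected at insert time. A small guard on the application side surfaces that earlier and with a clearer message. This is a sketch: `toVectorLiteral` and `EMBEDDING_DIM` are hypothetical names, and the output is pgvector's bracketed text form.

```typescript
// Assumption: the column is VECTOR(1024), matching the schema above.
const EMBEDDING_DIM = 1024;

// Hypothetical helper: serialize an embedding into pgvector's text form,
// e.g. "[0.1,0.2,0.3]", rejecting vectors the column would not accept.
function toVectorLiteral(embedding: number[], expectedDim = EMBEDDING_DIM): string {
  if (embedding.length !== expectedDim) {
    throw new Error(`expected ${expectedDim} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error("embedding contains NaN or Infinity");
  }
  return `[${embedding.join(",")}]`;
}

console.log(toVectorLiteral([0.25, -1, 3], 3)); // "[0.25,-1,3]"
```

A dimension mismatch is also the usual symptom of accidentally switching embedding models, so failing loudly here doubles as a cheap consistency check.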
Retrieval: Go Hybrid From Day One
Pure vector search is excellent at semantic similarity ("how do I cancel" matches "subscription termination") and weak at exact matches — product SKUs, error codes, names, version numbers. Pure keyword search is the opposite. Hybrid retrieval runs both and fuses the results, and it is the single highest-leverage upgrade over a naive pipeline.
The robust fusion method is Reciprocal Rank Fusion (RRF): score each document by the sum of 1 / (k + rank) across each ranked list. It needs no score normalization between the two systems, which is what makes it durable in production.
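Written out, with R the set of ranked lists and rank_r(d) the 1-based position of document d in list r (the notation is ours, chosen for illustration; a document missing from a list contributes nothing for that list):

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \operatorname{rank}_r(d)}
% Worked example with k = 60:
%   ranked 1st by vector search and 3rd by keyword search:
%     1/61 + 1/63 \approx 0.0164 + 0.0159 = 0.0323
%   ranked 1st in one list only:
%     1/61 \approx 0.0164
```

With the common choice of k = 60, a chunk that both retrievers agree on outranks a chunk that tops only one list, which is the behavior you want from fusion.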
```typescript
// Assumption: `sql` is a tagged-template Postgres client (postgres.js style)
// created elsewhere; it is not defined in this snippet. Depending on the
// driver, BIGSERIAL ids may arrive as strings; coerce them if so.
// Assumption: `toVector` serializes a number[] into pgvector's text form.
const toVector = (v: number[]) => `[${v.join(",")}]`;

type Scored = { id: number; rank: number };

function reciprocalRankFusion(lists: Scored[][], k = 60): Map<number, number> {
  const fused = new Map<number, number>();
  for (const list of lists) {
    for (const { id, rank } of list) {
      fused.set(id, (fused.get(id) ?? 0) + 1 / (k + rank));
    }
  }
  return fused; // sort descending by value to get final order
}

async function hybridRetrieve(query: string, queryVec: number[], limit = 40) {
  // Run the vector and keyword searches in parallel.
  const [vectorHits, keywordHits] = await Promise.all([
    sql`SELECT id FROM chunks
        ORDER BY embedding <=> ${toVector(queryVec)} LIMIT ${limit}`,
    sql`SELECT id FROM chunks
        WHERE tsv @@ plainto_tsquery('english', ${query})
        ORDER BY ts_rank(tsv, plainto_tsquery('english', ${query})) DESC
        LIMIT ${limit}`,
  ]);
  // Row order is the ranking; convert it to 1-based ranks for RRF.
  const toRanked = (rows: { id: number }[]) =>
    rows.map((r, i) => ({ id: r.id, rank: i + 1 }));
  const fused = reciprocalRankFusion([toRanked(vectorHits), toRanked(keywordHits)]);
  return [...fused.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```
Retrieve generously here, around 40 candidates, because the next stage exists to trim them down intelligently.
Reranking: Cheap Insurance for Answer Quality
Embedding similarity is a fast approximation of relevance, not relevance itself. A reranker is a model that scores each (query, chunk) pair directly and reorders them. You retrieve 40 candidates cheaply, rerank them, and keep the top 5–8 for the prompt.
This matters because LLMs are sensitive to context quality and ordering. Stuffing 40 loosely related chunks into the prompt degrades the answer and burns tokens; 6 tightly relevant chunks, best-first, produces noticeably better grounding. Reranking is the step most teams skip and most regret skipping — the quality delta is large relative to its cost and latency.
Use a dedicated cross-encoder reranker if you have one; otherwise an LLM-as-reranker works well at small candidate counts. Either way, the rule holds: retrieve wide, rerank, then ground narrow.
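The "retrieve wide, rerank, ground narrow" step can be sketched independently of which reranker you choose. In the example below, `rerank`, `RelevanceScorer`, and `wordOverlap` are hypothetical names; the scorer is injected so a cross-encoder or an LLM call can be swapped in, and the toy word-overlap scorer exists only to make the sketch runnable.

```typescript
interface Candidate {
  id: number;
  text: string;
}

// Hypothetical scorer signature: higher means more relevant. In practice this
// would call a cross-encoder or an LLM; here the caller supplies it.
type RelevanceScorer = (query: string, text: string) => Promise<number>;

async function rerank(
  query: string,
  candidates: Candidate[],
  score: RelevanceScorer,
  keep = 6,
): Promise<Candidate[]> {
  // Score every (query, chunk) pair, then sort best-first and truncate.
  const scored = await Promise.all(
    candidates.map(async (c) => ({ c, s: await score(query, c.text) })),
  );
  return scored
    .sort((a, b) => b.s - a.s)
    .slice(0, keep)
    .map(({ c }) => c);
}

// Toy scorer for demonstration only: counts words shared with the query.
const wordOverlap: RelevanceScorer = async (query, text) => {
  const words = new Set(query.toLowerCase().split(/\s+/));
  return text.toLowerCase().split(/\s+/).filter((w) => words.has(w)).length;
};
```

Calling `rerank(query, candidates, wordOverlap, 6)` on the 40 fused candidates returns the six best, ordered best-first, ready to be numbered for the prompt.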
Grounding: Make the Model Answer From Context — and Cite It
Grounding is where you assemble the final prompt and constrain the model to use only what you retrieved. The system prompt should instruct the model to answer from the provided context, cite the chunks it used, and explicitly say when the context is insufficient rather than fall back on parametric knowledge. That last instruction is what prevents confident hallucinations.
Tag each chunk with a stable identifier so the model can cite it and you can render source links. Here we use the official Anthropic SDK with Claude Opus 4.8:
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// `Chunk` is the interface defined in the chunking section.
async function answer(query: string, chunks: Chunk[]) {
  // Number each chunk so the model can cite it as [n].
  const context = chunks
    .map((c, i) => {
      const { title, heading } = c.metadata;
      const label = heading ? `${title} / ${heading}` : title;
      return `[${i + 1}] (${label})\n${c.text}`;
    })
    .join("\n\n");
  const system =
    "Answer the user's question using ONLY the numbered context below. " +
    "Cite the sources you use inline as [n]. " +
    "If the context does not contain the answer, say so explicitly and do not use outside knowledge.";
  const stream = client.messages.stream({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    system: [{ type: "text", text: `${system}\n\nContext:\n${context}` }],
    messages: [{ role: "user", content: query }],
  });
  return (await stream.finalMessage()).content;
}
```
Two production notes are baked into that snippet. First, we stream and resolve with `finalMessage()`: streaming protects against request timeouts on longer answers while still giving you the complete response. Second, the large, stable instruction block sits at the front of the system prompt; if you serve many queries against a shared corpus segment, adding a `cache_control` breakpoint there turns repeated context into cache reads at roughly a tenth of the input cost.
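On the way back out, the `[n]` markers need to be mapped to sources before you can render links. A minimal sketch, assuming the chunks were numbered 1..N in prompt order as above; `resolveCitations` and the `Source` shape are hypothetical.

```typescript
interface Source {
  title: string;
  sourceId: string;
}

// Extract the distinct [n] markers from an answer and map them back to the
// sources that were numbered 1..N in the prompt. Out-of-range markers are
// returned separately so they can be flagged rather than rendered.
function resolveCitations(answerText: string, sources: Source[]) {
  const cited = new Set<number>();
  for (const match of answerText.matchAll(/\[(\d+)\]/g)) {
    cited.add(Number(match[1]));
  }
  const valid: { n: number; source: Source }[] = [];
  const invalid: number[] = [];
  for (const n of [...cited].sort((a, b) => a - b)) {
    const source = sources[n - 1];
    if (source) valid.push({ n, source });
    else invalid.push(n);
  }
  return { valid, invalid };
}
```

A non-empty `invalid` list means the model cited a chunk that was never in the prompt, which is worth logging as a grounding failure rather than silently dropping.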
Evaluation: You Cannot Improve What You Don't Measure
RAG quality is not a single number, and "it looks good" is not an evaluation. Measure the two stages separately, because they fail differently.
Retrieval metrics (does the right chunk get fetched and ranked highly?):
- Recall@k — is the correct chunk in the top *k* retrieved?
- MRR / nDCG — is it ranked near the top, not just present?
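Both retrieval metrics are a few lines each. The sketch below assumes each golden-set question comes with the set of chunk ids known to answer it; the function names are illustrative. Recall@k is computed as the fraction of relevant chunks found in the top k, which reduces to a plain hit-or-miss when there is one correct chunk.

```typescript
// retrieved: chunk ids in ranked order; relevant: ids known to answer the question.
function recallAtK(retrieved: number[], relevant: Set<number>, k: number): number {
  if (relevant.size === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  let hits = 0;
  for (const id of relevant) if (topK.has(id)) hits++;
  return hits / relevant.size;
}

// Reciprocal rank of the first relevant chunk; 0 if none was retrieved.
function reciprocalRank(retrieved: number[], relevant: Set<number>): number {
  const index = retrieved.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

// Two toy queries: the right chunk is ranked 2nd for the first, 1st for the second.
const runs = [
  { retrieved: [7, 3, 9], relevant: new Set([3]) },
  { retrieved: [4, 5, 6], relevant: new Set([4]) },
];
const mrr = mean(runs.map((r) => reciprocalRank(r.retrieved, r.relevant)));
console.log(mrr); // 0.75
```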
Generation metrics (does the answer use the context correctly?):
- Faithfulness / groundedness — is every claim supported by a retrieved chunk?
- Answer relevance — does it actually address the question?
- Citation accuracy: do the cited `[n]` markers point to chunks that support the claim?
Build a golden set of 50–200 real questions with known-good answers and sources, and run it on every change to chunking, embeddings, or prompts. An LLM-as-judge works well for faithfulness and relevance scoring at scale. Pair offline eval with online observability: log every query, the retrieved chunk IDs, the rerank scores, and the final answer, so you can debug the specific failures users hit.
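For the observability half, one structured record per request is usually enough to replay a failure. A minimal sketch, with a hypothetical `RagTrace` shape and `toTraceLine` helper that emits one JSON line for whatever log sink you already use:

```typescript
interface RagTrace {
  timestamp: string;
  query: string;
  retrievedIds: number[];
  rerankScores: { id: number; score: number }[];
  answer: string;
}

// Hypothetical helper: serialize one request into a single JSON line.
// `now` is injectable so the output is deterministic in tests.
function toTraceLine(input: Omit<RagTrace, "timestamp">, now: Date = new Date()): string {
  const trace: RagTrace = { timestamp: now.toISOString(), ...input };
  return JSON.stringify(trace);
}

const line = toTraceLine(
  {
    query: "how do I cancel",
    retrievedIds: [12, 40, 7],
    rerankScores: [{ id: 40, score: 0.91 }, { id: 12, score: 0.42 }],
    answer: "Go to Settings and choose Cancel [1].",
  },
  new Date("2024-01-01T00:00:00Z"),
);
```

Keeping the retrieved ids and rerank scores next to the answer lets you tell a retrieval miss from a grounding miss when a user reports a bad response.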
When RAG Beats Fine-Tuning (and When It Doesn't)
These solve different problems and are not substitutes. The decision:
- Use RAG when the answer depends on facts — knowledge that changes, is large, is private, or must be cited. RAG updates the moment you update the index, needs no training run, and gives you traceable sources. This covers the large majority of "answer from our data" use cases.
- Use fine-tuning when you need to change *behavior* — a consistent format, tone, domain phrasing, or a narrow classification skill — rather than inject facts. Fine-tuning teaches the model *how* to respond, not *what's true today*.
- Use both when you want grounded facts delivered in a specialized style: RAG supplies the context, fine-tuning shapes the output. This is the right answer more often than teams expect.
The common mistake is fine-tuning to "teach the model our documentation." Models fine-tuned on facts still hallucinate, can't cite, and go stale the day the docs change. For knowledge, retrieval is almost always the better tool. Start with RAG; reach for fine-tuning only when you've identified a genuine behavioral gap that prompting and context can't close.
Frequently Asked Questions
What is the ideal chunk size for RAG?
For most prose, target 200–500 tokens (roughly 800–2,000 characters) with 10–15% overlap. Smaller chunks improve retrieval precision; larger ones preserve context. Bias toward smaller chunks and recover surrounding context at retrieval time. Split on document structure — headings, paragraphs, list items — rather than fixed character counts, which cut mid-sentence and destroy meaning.
Do I need a dedicated vector database?
Usually not at first. pgvector on Postgres handles corpora up to several million chunks well, keeps your stack simple, and lets you filter on metadata in the same query as vector search. Move to a specialized store like Qdrant, Pinecone, or Milvus when you exceed tens of millions of vectors or need sharded, high-throughput search with strict latency targets.
Why is hybrid search better than vector search alone?
Vector search excels at semantic similarity but misses exact matches like product codes, names, and version numbers. Keyword search is the reverse. Hybrid retrieval runs both and fuses results — typically with Reciprocal Rank Fusion, which needs no score normalization. It is the single highest-impact improvement over a naive embedding-only pipeline and is worth implementing from day one.
How do I stop a RAG system from hallucinating?
Combine three things: retrieve and rerank so the right context is actually present; instruct the model to answer only from provided context and to say when it's insufficient rather than guess; and require inline citations so every claim is traceable. Then measure faithfulness against a golden test set on every change. Hallucinations usually trace back to weak retrieval, not the model.
Should I use RAG or fine-tuning?
Use RAG for facts — knowledge that is private, large, changing, or must be cited. Use fine-tuning to change behavior, such as output format, tone, or a narrow classification skill. They are complementary, not competing: RAG keeps answers current and traceable, while fine-tuning shapes how those answers are expressed. For most "answer from our data" needs, start with RAG.
Working with CodeAustral
We build retrieval-augmented systems, AI products, and web platforms for clients worldwide — and we've shipped enough of them to know that the difference between a demo and a dependable product is in the retrieval, reranking, and evaluation work most teams underestimate. If you're planning a RAG feature, an internal knowledge assistant, or an AI product and want a team that has done the unglamorous parts, send us a brief at codeaustral.com/contact. We're happy to talk through your corpus, your constraints, and the simplest architecture that will hold up in production.

