Published: 2026-03-14 · Topic: RAG, Architecture · Read: 22 min · Audience: engineering leads

Production RAG architecture: what survives contact with real users.

A walk through the design choices that separate a working demo from a production retrieval system: chunking, hybrid search, re-ranking, evaluation, and the cost discipline that makes the unit economics work.

Prelude

Every RAG system looks the same on a whiteboard: embed, retrieve, prompt, respond. Most of them work brilliantly in a demo. Then the corpus grows, the queries get messy, the latency budget tightens, the cost dashboard catches up, and someone in the on-call rotation has to explain why the answers got worse this week.

This note is about the patterns that keep a retrieval system honest after that point. None of it is novel. All of it is operationally non-negotiable once you put a RAG system in front of real users.

The pipeline.

We treat RAG as a five-stage pipeline, each stage with explicit observability and an explicit failure mode. The shape rarely changes between projects; the parameters do. Starting from the incoming query, the stages are: query → hybrid retrieve → re-rank → filter / dedupe → prompt build → LLM + stream.

Treating each stage as independently testable is the single most useful operational decision we make. When quality regresses, we can localize the regression to a stage instead of debating the whole system.
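One way to make "independently testable" concrete is to run the stages through a thin wrapper that records timing and output size per stage. The sketch below is illustrative only: `Pipeline`, `StageTrace`, and the stand-in lambdas are hypothetical names, not part of any library, and a real system would send the traces to its metrics backend.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class StageTrace:
    name: str
    seconds: float
    items_out: int


@dataclass
class Pipeline:
    # Ordered (name, function) pairs; each function takes and returns a payload.
    stages: list[tuple[str, Callable[[Any], Any]]]
    traces: list[StageTrace] = field(default_factory=list)

    def run(self, payload: Any) -> Any:
        self.traces.clear()
        for name, fn in self.stages:
            start = time.perf_counter()
            payload = fn(payload)
            elapsed = time.perf_counter() - start
            # Record how many items left the stage (1 for non-sized payloads).
            size = len(payload) if hasattr(payload, "__len__") else 1
            self.traces.append(StageTrace(name, elapsed, size))
        return payload


# Hypothetical stand-in stages; real ones would call a retriever, a re-ranker, etc.
pipeline = Pipeline(stages=[
    ("hybrid_retrieve", lambda q: [f"{q}-doc{i}" for i in range(5)]),
    ("rerank", lambda docs: sorted(docs, reverse=True)),
    ("filter_dedupe", lambda docs: docs[:3]),
])
result = pipeline.run("query")
for t in pipeline.traces:
    print(f"{t.name}: {t.items_out} items in {t.seconds * 1000:.2f} ms")
```

Because each stage is a plain function, it can be unit-tested against fixed inputs, and a regression shows up as a change in one stage's trace rather than in the whole system.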

Chunking is product design, not preprocessing.

The default (split documents into N-token chunks with overlap) is fine for blog corpora and useless for almost everything else. Real corpora have structural meaning: invoices have line items, contracts have clauses, research papers have sections, support docs have headings.

The chunking strategy is the first design surface that determines whether retrieval will hold up:

  • Respect document structure first, token boundaries second. A clause that spans 1,400 tokens is one chunk, not three.
  • Carry hierarchical context. Each chunk should know what document, section, and tenant it came from. Cheap metadata, enormous downstream payoff.
  • Multi-granularity indexing. Embed chunks at two levels: fine-grained for precision, document-level for routing.

Revisit your chunking

If your chunking strategy is "1,000 tokens with 200 overlap" and you've never revisited it, you're leaving 30–40% of retrieval quality on the table for almost any non-trivial corpus.
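The three chunking rules above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the document has already been split into (heading, body) pairs by a format-specific parser, it uses whitespace-separated words as a stand-in for real tokens, and `Chunk` and `chunk_document` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    doc_id: str
    tenant_id: str
    section_path: tuple[str, ...]  # e.g. ("Contract", "Clause 4")
    level: str                     # "fine" for section chunks, "doc" for routing


def chunk_document(doc_id, tenant_id, title, sections, hard_cap=2000):
    """sections: list of (heading, body) pairs already split on document structure.

    Assumption: whitespace-separated words approximate tokens.
    """
    chunks = []
    for heading, body in sections:
        words = body.split()
        if not words:
            continue
        # A structural unit stays whole unless it exceeds the hard cap.
        for start in range(0, len(words), hard_cap):
            piece = " ".join(words[start:start + hard_cap])
            chunks.append(Chunk(piece, doc_id, tenant_id, (title, heading), "fine"))
    # Document-level entry for routing: the title plus its headings.
    outline = title + ": " + "; ".join(h for h, _ in sections)
    chunks.append(Chunk(outline, doc_id, tenant_id, (title,), "doc"))
    return chunks
```

With the default cap, a 1,400-word clause comes out as a single fine-grained chunk, and every chunk carries its document, section, and tenant for filtering downstream.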

Hybrid retrieval: both, always.

Pure semantic search misses exact-match queries (model numbers, names, codes). Pure lexical search misses paraphrase. Hybrid retrieval (semantic and BM25, with the two rankings fused at query time) outperforms either alone on every real corpus we've shipped.

The fusion strategy matters more than the individual retrievers:

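fusion.py (python)
from collections import defaultdict

# Reciprocal rank fusion (RRF) with tunable per-retriever weights.
def rrf(rankings, k=60, weights=None):
    scores = defaultdict(float)
    for i, ranking in enumerate(rankings):
        w = weights[i] if weights else 1.0
        # Ranks are 1-based in RRF, so the top hit contributes w / (k + 1).
        for rank, doc in enumerate(ranking, start=1):
            scores[doc.id] += w / (k + rank)
    # (doc_id, score) pairs, best first.
    return sorted(scores.items(), key=lambda x: -x[1])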

The weights are corpus-specific. Tune them with the evaluation harness, never with vibes.

Re-ranking is where quality is bought.

Retrieval gets you the candidate set. Re-ranking decides which of the candidates the model actually sees. For most production systems, the re-ranker does more for end quality than any other single component.

We default to a cross-encoder re-ranker: typically a hosted Cohere or Voyage model, or a small self-hosted re-ranker for sensitive corpora. The latency cost is real (50–150 ms), but the quality dividend is consistent. The re-ranker is also the right place to enforce policy:

  • Tenant isolation. Drop any candidate not belonging to the requesting tenant. Yes, this should also happen at retrieval: defence in depth.
  • Recency boosts. If a document has been superseded, demote it. Cheap to implement, expensive when it's missing.
  • Source diversity. Avoid five chunks from the same document dominating the prompt.
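The three policies above fit in a single post-re-rank pass. The sketch below is one possible shape, under stated assumptions: scores are non-negative with higher meaning more relevant, superseded documents are flagged upstream, and `Candidate`, `apply_policy`, and the penalty and cap values are hypothetical choices to tune per corpus.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    chunk_id: str
    doc_id: str
    tenant_id: str
    score: float              # re-ranker relevance, assumed non-negative
    superseded: bool = False  # assumed to be set by the ingestion pipeline


def apply_policy(candidates, tenant_id, top_n=5, max_per_doc=2,
                 superseded_penalty=0.5):
    # Tenant isolation: drop anything that does not belong to the requester.
    allowed = [c for c in candidates if c.tenant_id == tenant_id]

    # Recency: demote superseded documents instead of removing them outright.
    def adjusted(c):
        return c.score * superseded_penalty if c.superseded else c.score

    ranked = sorted(allowed, key=adjusted, reverse=True)

    # Source diversity: cap how many chunks one document may contribute.
    per_doc = Counter()
    selected = []
    for c in ranked:
        if per_doc[c.doc_id] >= max_per_doc:
            continue
        per_doc[c.doc_id] += 1
        selected.append(c)
        if len(selected) == top_n:
            break
    return selected
```

The tenant check here is the second line of defence; the retrieval query should already be scoped to the tenant.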

Evaluation harness on day one.

Every RAG project we've shipped had an evaluation harness running in CI from week one. Every one. There is no exception where it was worth skipping. Without it, you ship by feeling, and your feelings will betray you the first time a prompt change looks better but is measurably worse.

The minimal eval set:

  • Retrieval recall@k against a labeled set of queries with known-relevant chunks.
  • Answer faithfulness: does the answer cite the retrieved context, and is every claim supported by it?
  • Tail-query coverage: does it handle the queries the demo set didn't?
  • Latency budget: p50 and p95, per stage.
  • Cost per query: token cost, retrieval cost, total dollars per 1k queries.

The eval harness alone changed how we ship. We stopped guessing whether a prompt change was an improvement.
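The first metric in that list, retrieval recall@k, is small enough to write by hand. The sketch below assumes a labeled set of (query, relevant chunk ids) pairs and a `retrieve` callable that returns ranked chunk ids; both names and the data shape are assumptions for illustration, not a fixed format.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of known-relevant chunks found in the top k results."""
    if not relevant_ids:
        raise ValueError("labeled query has no relevant chunks")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(labeled_set, retrieve, k=10):
    """labeled_set: list of (query, relevant_ids); retrieve: query -> ranked ids."""
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in labeled_set]
    return sum(scores) / len(scores)


def check_recall_gate(labeled_set, retrieve, k=10, minimum=0.8):
    """CI gate: fail the build if mean recall@k falls below the agreed floor."""
    observed = mean_recall_at_k(labeled_set, retrieve, k)
    if observed < minimum:
        raise AssertionError(f"recall@{k} = {observed:.3f}, below {minimum:.3f}")
    return observed
```

In CI, a call such as `check_recall_gate(labeled_set, retrieve, k=10, minimum=0.8)` turns a silent retrieval regression into a failed build; the 0.8 floor is a placeholder to set from your own baseline.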

Cost discipline is architectural.

Most RAG systems we audit have the same problem: cost grows linearly with usage and there is no per-tenant ceiling. That's an architectural choice, not a price-list problem to solve later.

The patterns that keep economics honest:

  • Model tiering with explicit routing. Flagship models for synthesis, fast models for extraction, self-hosted open-source models for sensitive data.
  • Per-tenant cost ceilings. Enforced in code, not in a runbook. The right place is usually the orchestration layer.
  • Aggressive retrieval caching. Most query traffic is non-unique. Cache the retrieval step; you'll save real money.
  • Streaming termination. If the user navigates away, stop the generation. Don't pay for tokens nobody will see.
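Two of these patterns, the per-tenant ceiling and the retrieval cache, can be sketched as small guards in the orchestration layer. This is an in-memory illustration under stated assumptions: `TenantBudget`, `RetrievalCache`, and `BudgetExceeded` are hypothetical names, spend is never reset, and the cache has no eviction or expiry. A production version would likely use a shared store and reset spend per billing period.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    pass


class TenantBudget:
    def __init__(self, ceilings_usd, default_ceiling_usd=0.0):
        self.ceilings = dict(ceilings_usd)
        self.default = default_ceiling_usd
        self.spent = defaultdict(float)

    def charge(self, tenant_id, cost_usd):
        # Refuse the call before spending, not after the invoice arrives.
        ceiling = self.ceilings.get(tenant_id, self.default)
        if self.spent[tenant_id] + cost_usd > ceiling:
            raise BudgetExceeded(f"tenant {tenant_id} would exceed ${ceiling:.2f}")
        self.spent[tenant_id] += cost_usd


class RetrievalCache:
    def __init__(self, retrieve):
        self.retrieve = retrieve  # callable: (tenant_id, query) -> results
        self.store = {}
        self.hits = 0

    def get(self, tenant_id, query):
        # Normalize case and whitespace; keep the tenant in the key so
        # cached results never cross tenant boundaries.
        key = (tenant_id, " ".join(query.lower().split()))
        if key in self.store:
            self.hits += 1
        else:
            self.store[key] = self.retrieve(tenant_id, query)
        return self.store[key]
```

A typical call order is `budget.charge(tenant_id, estimated_cost)` first, then `cache.get(tenant_id, query)`, so that a tenant over its ceiling is stopped before any model or index is touched.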

Closing.

None of this is novel. It's the operational discipline that gets a RAG system from demo to production and keeps it there. If you're past the demo phase and any of the above feels missing, that's the place to start.

We're happy to talk if you're stuck somewhere in this pipeline. Or stay subscribed: the next note picks up where this one stops, with why observability for LLM systems isn't APM.
