RAG in Depth: Retrieval-Augmented Generation from First Principles to Production

Retrieval-augmented generation (RAG) connects a language model to knowledge it was never trained on—or knowledge that changed yesterday—by searching a corpus at query time and injecting the best passages into the prompt. It is the default pattern for enterprise Q&A, support copilots, and internal search because it is cheaper and more auditable than retraining, and more controllable than hoping the model “remembers” your wiki.

In short

RAG = ingest documents → chunk → embed → store vectors → on each user question retrieve top-k relevant chunks → paste them into the LLM context → generate an answer (ideally with citations). Quality lives in chunking, retrieval, access control, and evaluation—not in picking the largest model available.

What is RAG?

Retrieval-augmented generation was named and formalized by Lewis et al. (2020): combine a retriever that finds relevant text with a generator (typically a seq2seq or transformer language model) that conditions its answer on those passages. The model’s weights hold parametric knowledge (patterns learned during training); your vector index holds non-parametric knowledge (documents you control). At inference time, retrieval bridges the gap.

Without RAG, a chatbot answers from whatever was in pre-training plus the prompt. That fails when:

  • Answers must come from private data (HR policy, runbooks, customer tickets).
  • Information changes frequently (pricing, product docs, regulations).
  • You need citations and audit trails for compliance.
  • The domain uses rare vocabulary the base model under-represents.

RAG does not eliminate hallucination, but it gives the model grounded text to read first—so errors become easier to detect (wrong citation, ignored passage) and to fix (better retrieval, not bigger models).

For how LLMs fit into the broader AI stack, see AI Foundation Models in Depth and How to Become an AI Developer.

The RAG pipeline (end to end)

Production RAG is two systems: an offline indexing path and an online query path.

Offline (ingest):

  1. Load — PDF, HTML, Markdown, Confluence, tickets, SQL exports, code repos.
  2. Parse & clean — strip boilerplate, fix encoding, preserve structure (headings, tables).
  3. Chunk — split into segments sized for embedding and context budget.
  4. Enrich metadata — source URL, title, section, ACL, last-updated, product line.
  5. Embed — map each chunk to a dense vector (and optionally sparse keywords).
  6. Index — store vectors + metadata in a vector DB or search engine.

Online (query):

  1. Receive question — user chat, API, or agent tool call.
  2. Optional query rewrite — HyDE, multi-query, or decomposition for harder questions.
  3. Retrieve — similarity search, filters, hybrid BM25 + vector, rerank top candidates.
  4. Assemble context — pack chunks into the prompt under token limits; deduplicate overlaps.
  5. Generate — LLM answers using only provided context (instruction: cite sources).
  6. Post-process — citation links, guardrails, logging, feedback for eval loops.

Latency is usually dominated by retrieval + one LLM call; indexing runs asynchronously when documents change.
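
The two paths above can be sketched in a few lines. This is a toy illustration, not a production design: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, chunking is one paragraph per chunk, and the "index" is a plain Python list. All function names are assumptions made for this example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(docs: dict[str, str]) -> list[dict]:
    """Offline path: chunk (one paragraph per chunk), attach metadata, embed."""
    index = []
    for source, text in docs.items():
        paragraphs = (p.strip() for p in text.split("\n\n") if p.strip())
        for n, para in enumerate(paragraphs):
            index.append({"chunk_id": f"{source}#{n}", "source": source,
                          "text": para, "vector": embed(para)})
    return index

def retrieve(index: list[dict], question: str, k: int = 2) -> list[dict]:
    """Online path: embed the question the same way, rank chunks, keep the top k."""
    q = embed(question)
    return sorted(index, key=lambda c: cosine(q, c["vector"]), reverse=True)[:k]

docs = {
    "runbooks/db-failover.md": "Promote the replica to primary.\n\nUpdate the DNS record after promotion.",
    "hr/leave-policy.md": "Employees accrue leave monthly.",
}
index = build_index(docs)                      # runs when documents change
hits = retrieve(index, "How do we promote the replica?", k=1)  # runs per question
print(hits[0]["chunk_id"], "->", hits[0]["text"])
```

The retrieved chunks would then be packed into the prompt and sent to the LLM; swapping `embed` for a real model and the list for a vector store leaves the shape of both paths unchanged.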

RAG vs fine-tuning vs long context vs tools

Approach           | Best for                                           | Tradeoffs
RAG                | Fresh facts, private docs, citations, many sources | Retrieval quality is the product; needs index ops
Fine-tuning / LoRA | Stable tone, format, tool-calling, domain phrasing | Expensive to refresh knowledge; risk of forgetting
Long context only  | Small corpus that fits in one window               | Cost and latency scale with tokens; “lost in the middle”
Tools / APIs       | Live data (inventory, weather, tickets)            | Requires reliable APIs and agent orchestration

Mature products combine them: RAG for knowledge, tools for live systems, and light fine-tuning or prompts for brand voice. See the adaptation section in the foundation models guide for when to fine-tune instead of retrieve.

Chunking: where most RAG projects win or lose

Embeddings represent a chunk, not a whole document. Bad chunks retrieve irrelevant text even with a perfect embed model.

  • Fixed-size windows — e.g. 512 tokens with 10–20% overlap; simple, works for uniform prose.
  • Structure-aware — split on headings, paragraphs, or Markdown sections; keeps procedures intact.
  • Recursive character splitting — try large separators first (##), then smaller (\n, space); common in LangChain/LlamaIndex defaults.
  • Parent–child — index small chunks for search, attach larger parent span for generation context.
  • Semantic chunking — break when embedding similarity between sentences drops; fewer mid-thought cuts.

Rules of thumb: keep code blocks and tables whole; include section title in every chunk text; store chunk_id and source_id for citations; tune size to your embed model (often 256–1024 tokens).
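
A minimal sketch of the fixed-size strategy, following the rules of thumb above (section title in every chunk, `chunk_id` and `source_id` stored for citations). It assumes whitespace-separated words are an acceptable approximation of tokens; a real pipeline would count tokens with the embedding model's own tokenizer. The function name and default sizes are illustrative.

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 64,
                title: str = "", source_id: str = "doc") -> list[dict]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = text.split()          # assumption: words approximate tokens
    step = size - overlap
    chunks = []
    for n, start in enumerate(range(0, len(tokens), step)):
        body = " ".join(tokens[start:start + size])
        chunks.append({
            "chunk_id": f"{source_id}#{n}",
            "source_id": source_id,
            # Prefix the section title so each chunk carries its own context.
            "text": f"{title}\n{body}" if title else body,
        })
        if start + size >= len(tokens):
            break                  # the last window already reached the end
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
for c in chunk_fixed(sample, size=4, overlap=1, title="Failover", source_id="runbook"):
    print(c["chunk_id"], repr(c["text"]))
```

With `size=4` and `overlap=1` the windows advance three words at a time, so each chunk starts with the last word of the previous one.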

Embeddings and vector search

An embedding model maps text to a high-dimensional vector so that semantically similar passages sit close in cosine distance. Popular families include OpenAI text-embedding-3, Cohere embed, open models (BGE, E5, GTE), and multimodal embedders for images in docs.

  • Same model for index and query — never mix embed models on one index without re-embedding.
  • Normalize vectors — many stores assume unit length for cosine.
  • Dimension vs cost — higher dims can help recall but increase storage and RAM.
  • Domain fit — legal or code-heavy corpora may need models trained or fine-tuned on similar text.
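
To make the normalization point concrete: once vectors are scaled to unit length, a plain dot product equals cosine similarity, which is why some stores expect normalized input. The sketch below is a brute-force search over an in-memory dict, assuming tiny two-dimensional vectors for readability; the names are illustrative and no real vector store works exactly this way.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def dot(a: list[float], b: list[float]) -> float:
    if len(a) != len(b):
        # Typical symptom of mixing embedding models on one index.
        raise ValueError("dimension mismatch between query and index vectors")
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3):
    """Brute-force search; `index` maps chunk_id to an already normalized vector."""
    q = normalize(query_vec)
    scored = sorted(((dot(q, vec), chunk_id) for chunk_id, vec in index.items()),
                    reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]

index = {
    "a": normalize([1.0, 0.0]),
    "b": normalize([1.0, 1.0]),
    "c": normalize([0.0, 1.0]),
}
print(top_k([2.0, 0.0], index, k=2))
```

The query `[2.0, 0.0]` points the same way as chunk `a`, so after normalization its score is 1.0 regardless of the query's original length.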

Vector stores include pgvector (PostgreSQL), Pinecone, Weaviate, Qdrant, Milvus, Redis, OpenSearch k-NN, and cloud offerings (Bedrock Knowledge Bases, Azure AI Search). Choose based on ops maturity, hybrid search needs, filtering, and tenancy—not benchmark hype alone.

Hybrid search blends dense vectors with sparse lexical scores (BM25). It helps on exact SKUs, error codes, and rare tokens that embeddings blur. Many teams retrieve 50–100 candidates hybrid, then rerank with a cross-encoder (Cohere rerank, bge-reranker) down to 5–10 for the LLM.
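
The text does not say how the dense and sparse rankings are blended. One common option is reciprocal rank fusion (RRF), shown here as an assumption rather than as the method any particular store uses. It needs only the rank positions from each retriever, so BM25 scores and cosine scores never have to be put on the same scale. The constant `k=60` is the conventional default; the chunk ids are made up.

```python
def rrf(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse several ranked lists of chunk ids with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]

bm25_ranking = ["sku-123", "faq-7", "guide-2"]       # lexical hits
dense_ranking = ["guide-2", "sku-123", "intro-1"]    # vector hits
candidates = rrf([bm25_ranking, dense_ranking])
print(candidates)
```

Chunks that appear in both lists rise to the top; the fused candidate list is what would be handed to the reranker.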

Retrieval strategies beyond “top-k cosine”

  • Metadata filters — restrict to user’s team, product, or doc type before vector search (security + precision).
  • MMR (maximal marginal relevance) — reduce redundant chunks that say the same thing.
  • Query expansion / multi-query — LLM generates alternate phrasings; union results.
  • HyDE — generate a hypothetical answer, embed it, retrieve similar real docs (useful for vague questions).
  • Step-back prompting — retrieve for a generalized question plus the specific one.
  • GraphRAG — build entity graphs for “how does X relate to Y across the corpus?” summaries.

Start simple: filtered top-k + reranker. Add complexity when evals show recall gaps on real user questions—not because a blog said HyDE is trendy.
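
Of the strategies listed above, MMR is the easiest to show in isolation. The sketch below greedily picks the chunk that best trades off similarity to the query against similarity to chunks already chosen. It assumes the similarities have been computed elsewhere; `query_sim`, `pair_sim`, and the weight `lam` are names chosen for this example.

```python
from typing import Callable

def mmr(query_sim: dict[str, float], pair_sim: Callable[[str, str], float],
        k: int = 3, lam: float = 0.7) -> list[str]:
    """Greedy maximal marginal relevance over candidate chunk ids.

    lam=1.0 ranks purely by relevance; lower values penalize redundancy more.
    """
    selected: list[str] = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def score(c: str) -> float:
            redundancy = max((pair_sim(c, s) for s in selected), default=0.0)
            return lam * query_sim[c] - (1 - lam) * redundancy
        best = max(sorted(candidates), key=score)   # sorted() keeps ties stable
        selected.append(best)
        candidates.remove(best)
    return selected

# Two near-duplicate chunks ("a", "a_dup") and one distinct chunk ("b").
query_sim = {"a": 0.9, "a_dup": 0.88, "b": 0.6}
known = {frozenset(("a", "a_dup")): 0.95}
pair_sim = lambda x, y: known.get(frozenset((x, y)), 0.1)

print(mmr(query_sim, pair_sim, k=2, lam=0.5))   # diversity-aware selection
print(mmr(query_sim, pair_sim, k=2, lam=1.0))   # plain top-k by relevance
```

With `lam=0.5` the near-duplicate is skipped in favor of the distinct chunk; with `lam=1.0` the function reduces to ordinary top-k.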

Prompting and generation

A typical RAG prompt has: system (role, safety, “answer only from context”), context block (numbered chunks with source labels), and user question. Example structure:

System: You are a support assistant. Use ONLY the passages below.
If the answer is not in the passages, say you do not know.

Context:
[1] (source: runbooks/db-failover.md) ...
[2] (source: runbooks/db-failover.md) ...

User: How do we fail over the primary database?

Strong patterns: require inline citations like [1]; forbid blending external knowledge for compliance modes; cap temperature for factual Q&A; stream tokens for UX. For structured outputs (JSON tickets), use schema-constrained decoding or tool definitions.
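
As a rough sketch, the prompt structure above can be assembled from retrieved chunks, and the cite-inline pattern can be checked mechanically after generation. The chunk fields (`source`, `text`) and both function names are assumptions for this example; the system text is the one from the example structure with a citation instruction appended.

```python
import re

SYSTEM = ("You are a support assistant. Use ONLY the passages below.\n"
          "If the answer is not in the passages, say you do not know.\n"
          "Cite passages inline like [1].")

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number the chunks, label their sources, and place the question last."""
    context = "\n".join(
        f"[{i}] (source: {c['source']}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return f"System: {SYSTEM}\n\nContext:\n{context}\n\nUser: {question}"

def invalid_citations(answer: str, n_chunks: int) -> list[int]:
    """Return cited numbers that do not correspond to any provided chunk."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= n_chunks)

chunks = [
    {"source": "runbooks/db-failover.md", "text": "Promote the replica to primary."},
    {"source": "runbooks/db-failover.md", "text": "Update the DNS record after promotion."},
]
print(build_prompt("How do we fail over the primary database?", chunks))
print(invalid_citations("Promote the replica [1], then update DNS [3].", len(chunks)))
```

A non-empty result from `invalid_citations` means the answer cites a passage that was never in the context, which is a cheap signal to log or to trigger a retry.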

Advanced RAG patterns

  • Agentic RAG — model decides when to search, reformulate, or call tools; multiple retrieval rounds.
  • Corrective RAG (CRAG) — grade retrieval quality; if poor, web search or rewrite query before answering.
  • Self-RAG — model reflects on whether retrieval is needed and whether the draft is supported.
  • Fusion-in-decoder / multi-vector — research directions; production often stays naive RAG + good evals.

These add latency and failure modes. Instrument each step; fall back to “I could not find documentation” when retrieval scores are low.
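
The low-score fallback can be a small gate in front of the generator. This is a sketch under stated assumptions: each hit carries a `score` in a range where higher is better, `generate` is a hypothetical callable wrapping the LLM call, and the threshold `0.35` is an arbitrary placeholder that would need tuning against real evals.

```python
from typing import Callable

FALLBACK = "I could not find documentation for that."

def answer_or_fallback(hits: list[dict], generate: Callable[[list[dict]], str],
                       min_score: float = 0.35, min_hits: int = 1) -> dict:
    """Call the generator only when enough hits clear the score threshold."""
    strong = [h for h in hits if h["score"] >= min_score]
    if len(strong) < min_hits:
        return {"answer": FALLBACK, "used": [], "fallback": True}
    return {"answer": generate(strong),
            "used": [h["chunk_id"] for h in strong],
            "fallback": False}

# Stub generator for illustration; a real one would call the LLM.
stub = lambda chunks: f"Answer based on {len(chunks)} passage(s)."

weak_hits = [{"chunk_id": "faq#2", "score": 0.12}]
good_hits = [{"chunk_id": "runbook#0", "score": 0.81}, {"chunk_id": "faq#2", "score": 0.12}]
print(answer_or_fallback(weak_hits, stub))
print(answer_or_fallback(good_hits, stub))
```

Returning the `fallback` flag and the ids actually used makes each step easy to log, which is the instrumentation the paragraph above asks for.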

Evaluation: measure retrieval and answers separately

Teams ship RAG that “looks smart” in demos but fails on edge cases. Build a golden set of real user questions with expected source docs and acceptable answers.

Retrieval metrics:

  • Recall@k — is the right document in the top k chunks?
  • MRR / nDCG — rank quality when multiple docs matter.

Generation metrics:

  • Faithfulness / groundedness — is the answer supported by retrieved text?
  • Answer relevance — does it address the question?
  • Citation accuracy — do cited spans exist and match?
  • LLM-as-judge — useful at scale; calibrate against human labels.

Run evals in CI when you change chunking, embed models, or prompts—same discipline as unit tests for APIs.
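
The retrieval metrics are simple enough to compute by hand in a CI job. The sketch below assumes a golden set where each case lists the ids of its relevant documents, and a `search` callable that returns ranked ids; both shapes are assumptions for this example. Recall@k is computed as the fraction of relevant ids found in the top k, which reduces to the hit-or-miss definition above when each question has one expected document.

```python
from typing import Callable

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        raise ValueError("each golden case needs at least one relevant id")
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(golden: list[dict], search: Callable[[str], list[str]], k: int = 5) -> dict:
    """Average Recall@k and MRR over the golden set."""
    recalls, rrs = [], []
    for case in golden:
        retrieved = search(case["question"])
        recalls.append(recall_at_k(retrieved, case["relevant"], k))
        rrs.append(reciprocal_rank(retrieved, case["relevant"]))
    return {"recall@k": sum(recalls) / len(golden), "mrr": sum(rrs) / len(golden)}

golden = [
    {"question": "q1", "relevant": {"b"}},
    {"question": "q2", "relevant": {"z"}},
]
fake_results = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
print(evaluate(golden, fake_results.__getitem__, k=2))
```

A CI check could then fail the build when either number drops below the value recorded for the previous chunking or embedding configuration.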

Production architecture and operations

  • Ingestion jobs — event-driven on CMS publish, nightly crawl, or webhook; track versions and deletes (tombstone stale chunks).
  • Access control — filter by user ACL at query time; never rely on the LLM to hide sensitive chunks.
  • Caching — embed query once; cache retrieval for identical questions; watch for stale answers.
  • Observability — log query, retrieved ids, scores, prompt token count, latency, model version; trace end-to-end.
  • Cost — embed once per chunk; batch embed jobs; use smaller chat models when retrieval is strong.
  • Security — prompt injection via uploaded docs (“ignore instructions”); sanitize uploads, separate system prompt from user content, consider input/output filters.
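
The access-control rule is worth showing in code because the order of operations matters: chunks the user may not see are dropped before ranking, so they can never reach the prompt. This sketch assumes each chunk stores an `acl` list of group names as metadata and that `score_fn` stands in for whatever similarity scoring is in use; in a real vector store the same filter would be pushed down into the query itself.

```python
from typing import Callable

def allowed(chunk: dict, user_groups: set[str]) -> bool:
    """A chunk is visible if it shares at least one group with the user."""
    return bool(set(chunk["acl"]) & user_groups)

def secure_retrieve(index: list[dict], user_groups: set[str],
                    score_fn: Callable[[dict], float], k: int = 5) -> list[dict]:
    """Filter by ACL first, then rank only the visible chunks."""
    visible = [c for c in index if allowed(c, user_groups)]
    return sorted(visible, key=score_fn, reverse=True)[:k]

index = [
    {"chunk_id": "hr/salaries#0", "acl": ["hr"], "score": 0.95},
    {"chunk_id": "eng/runbook#3", "acl": ["eng", "sre"], "score": 0.70},
    {"chunk_id": "wiki/welcome#0", "acl": ["eng", "hr"], "score": 0.40},
]
hits = secure_retrieve(index, {"eng"}, score_fn=lambda c: c["score"], k=2)
print([h["chunk_id"] for h in hits])
```

The highest-scoring chunk belongs to the `hr` group only, so an `eng` user never receives it, no matter how relevant it is to the question.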

On AWS, these patterns map to S3 (raw docs), Lambda or Glue (ETL), OpenSearch or Aurora with pgvector (index), Bedrock (embed + chat), and IAM and KMS for tenancy. Compare services in the hyperscaler service mapping and Generative AI Foundations.

Common failure modes (and fixes)

  • “Right answer, wrong doc” — improve chunking and metadata; add hybrid search + reranker.
  • “Model ignores context” — stronger system prompt, lower temperature, cite-or-refuse policy, smaller context with only top reranked chunks.
  • “Stale content” — shorten ingest lag; surface “last updated” in UI; version indexes.
  • “Slow and expensive” — cache retrieval; shrink context; use a distilled (smaller) model where quality allows; run the reranker only on the retrieved candidates; right-size the LLM.
  • “Works in English only” — multilingual embed models; same pipeline per locale index.
  • “SQL RAG confusion” — for structured data, use text-to-SQL or APIs instead of embedding table rows naively.

For relational data and pgvector, see the SQL course notes.

Governance and responsible RAG

RAG corpora may contain PII, licensed text, or regulated records. Policies should cover consent, retention, right-to-delete (remove the affected chunks and vectors from the index, and re-embed any document that was edited rather than deleted), and logging of what was shown to whom. Align technical controls with AI management frameworks such as ISO 42001. Tell users when answers are AI-generated and which sources were used.

A minimal build order (learning path)

  1. Index 20–50 Markdown files with fixed chunks + one embed model + pgvector or local Chroma.
  2. Add hybrid search and a reranker; compare Recall@5 on 30 real questions.
  3. Add citations in the UI; run faithfulness checks manually.
  4. Wire ACL filters and ingest webhooks; add tracing and cost dashboards.
  5. Only then explore agents, GraphRAG, or fine-tuning for tone.

Further reading

  • Lewis et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)
  • Gao et al. — surveys on RAG architectures and evaluation
  • LangChain, LlamaIndex, and cloud provider RAG workshops (patterns, not gospel)
  • MTEB leaderboard — embedding model benchmarks (domain-specific validation still required)
