AI/ML · 14 Jun 2026 · Guide · By Babulal Tamang


Large Language Models in Depth: From Tokens to Production Inference

A large language model (LLM) is a neural network trained to predict text—one token at a time—at a scale where fluency, reasoning, and tool use emerge from statistics, not hand-written rules. This guide goes inside the LLM stack: how text becomes numbers, how generation actually works, what training and alignment change, and what engineers must own when these models leave the demo and hit production traffic.

In short

LLMs are decoder-only transformers that model P(next token | context). Everything else—chat, RAG, agents, APIs—is engineering around that core loop. Success means mastering token budgets, decoding parameters, grounding, evaluation, and security—not treating the model as an infallible oracle.

What counts as an LLM?

An LLM is a foundation model specialized for language (and usually code): GPT-family, Claude, Gemini, Llama, Mistral, Qwen, and dozens of open-weight variants. “Large” refers to parameter count and training compute—billions of weights, terabytes of text—not to a fixed threshold. A 7B open model and a frontier closed API are both LLMs; they differ in capability, cost, and operational burden.

LLMs are almost always decoder-only transformers: they read context left-to-right and predict the next token. That differs from encoder-only models (BERT) used for classification and embeddings, and from encoder–decoder models (T5) built for explicit input→output pairs. For broader foundation-model context—modalities, CRFM definition, FM vs traditional ML—see AI Foundation Models in Depth. For history, see From Symbols to Foundation Models.

The generation loop (the whole product in one idea)

At inference time an LLM repeats:

Encode the prompt (system + user + optional retrieved documents) into token IDs.
Forward pass — run the transformer; output a probability distribution over the vocabulary for the next token.
Sample or select one token (greedy, temperature, top-p, etc.).
Append the token to the context and repeat until a stop condition (EOS token, max length, or custom stop sequence).
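The four steps above can be sketched in a few lines. This is a minimal illustration, not a real model: `TOY_NEXT`, `forward`, and `generate` are hypothetical names, and the lookup table stands in for a transformer that would condition on the entire context.

```python
import random

# Hypothetical toy "model": maps the last token ID to next-token probabilities.
# A real LLM conditions on the whole context; this table only illustrates the loop.
TOY_VOCAB = ["<eos>", "the", "cat", "sat"]
TOY_NEXT = {
    1: [0.0, 0.0, 1.0, 0.0],  # after "the" -> "cat"
    2: [0.0, 0.0, 0.0, 1.0],  # after "cat" -> "sat"
    3: [1.0, 0.0, 0.0, 0.0],  # after "sat" -> "<eos>"
}
EOS_ID = 0


def forward(context):
    """Stand-in for the transformer forward pass: return next-token probabilities."""
    return TOY_NEXT[context[-1]]


def generate(prompt_ids, max_new_tokens=8, greedy=True, seed=0):
    rng = random.Random(seed)
    context = list(prompt_ids)            # 1. prompt already encoded as token IDs
    for _ in range(max_new_tokens):       # stop condition: max length
        probs = forward(context)          # 2. forward pass
        if greedy:                        # 3. select or sample one token
            next_id = max(range(len(probs)), key=probs.__getitem__)
        else:
            next_id = rng.choices(range(len(probs)), weights=probs)[0]
        if next_id == EOS_ID:             # stop condition: EOS token
            break
        context.append(next_id)           # 4. append and repeat
    return context


ids = generate([1])
print(" ".join(TOY_VOCAB[i] for i in ids))  # the cat sat
```

Everything a serving stack adds (batching, caching, streaming) is an optimization of this loop rather than a different algorithm.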

Chat UIs, copilots, and agents are wrappers on this loop. An agent adds: model emits a tool call → runtime executes → result appended to context → model continues. The LLM never “runs” your database; it proposes actions your code must validate and execute.

Tokenization: where billing and limits begin

Models do not see characters or words directly. Text is split into tokens, which are subword units learned by algorithms such as Byte Pair Encoding (BPE) or the unigram language model (SentencePiece is a widely used library that implements both). Common patterns:

Frequent words may be one token; rare words split into several.
Code and JSON often consume more tokens than plain prose for the same character count.
Leading spaces, punctuation, and language matter—identical meaning can tokenize differently.
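To see why frequent words collapse into one token while rare words split, here is a toy BPE learner. It is a sketch under simplifying assumptions: it works on characters rather than bytes, has no pre-tokenization or word-boundary markers, and the function names (`learn_bpe`, `merge_pair`, `encode`) are illustrative, not any library's API.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Each distinct word starts as a tuple of characters, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to first seen
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by applying merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["the"] * 10 + ["then"] * 2, num_merges=2)
print(encode("the", merges))   # ['the']
print(encode("then", merges))  # ['the', 'n']
print(encode("hat", merges))   # ['h', 'a', 't']
```

The frequent word ends up as a single token, while the unseen word falls back to one token per character, which is the same effect that makes unusual identifiers and rare languages expensive.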

Why this matters for engineers:

API pricing is per input/output token.
Context windows are token caps, not character caps.
Latency scales with generated tokens and, during prefill, with prompt length.
Special tokens mark roles (<|user|>, <|assistant|>), tool boundaries, or end-of-sequence—each model family has its own chat template; sending the wrong format degrades quality.

Use the tokenizer shipped with the model (or the provider’s counting API) when estimating cost or fitting RAG chunks into context.
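Once you have real token counts, the budget arithmetic is simple. The sketch below assumes per-million-token pricing and a context window shared by prompt and completion; the prices and the names `Pricing`, `estimate_cost`, and `fits_context` are hypothetical placeholders, so substitute your provider's actual rates and limits.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    # Hypothetical prices in USD per 1M tokens; replace with your provider's rates.
    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens, output_tokens, pricing):
    """Cost of one call in USD, given token counts from the model's own tokenizer."""
    return (input_tokens * pricing.input_per_million
            + output_tokens * pricing.output_per_million) / 1_000_000


def fits_context(prompt_tokens, max_output_tokens, context_window):
    """The prompt plus the reserved completion budget must fit in the window."""
    return prompt_tokens + max_output_tokens <= context_window


pricing = Pricing(input_per_million=3.0, output_per_million=15.0)
print(estimate_cost(12_000, 800, pricing))   # about 0.048 USD per call
print(fits_context(12_000, 800, 8_192))      # False: trim the prompt or pick a larger window
```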

Inside the decoder: attention, position, and scale

Each layer applies multi-head self-attention: every token can attend to itself and to prior tokens (a causal mask blocks attention to future positions). Attention lets the model bind pronouns to entities, follow long dependencies, and copy patterns from context, including patterns you did not intend (e.g. injected instructions in a retrieved document).
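A single attention head with a causal mask fits in a short pure-Python sketch. It is illustrative only: it takes query, key, and value vectors as given (a real layer derives them from learned projections and runs many heads on a GPU), and `causal_attention` is a hypothetical name.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: lists of vectors, one per token position (all the same length).
    Position i may only attend to positions 0..i.
    """
    d = len(q[0])
    outputs, weights = [], []
    for i in range(len(q)):
        # Causal mask: scores are computed for j <= i only.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        w = softmax(scores)
        weights.append(w + [0.0] * (len(q) - i - 1))  # future positions get weight 0
        outputs.append([sum(w[j] * v[j][c] for j in range(i + 1))
                        for c in range(len(v[0]))])
    return outputs, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_attention(x, x, x)
print(attn[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
```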

Modern LLMs add engineering refinements:

Rotary positional embeddings (RoPE): encode relative position by rotating query and key vectors; combined with scaling techniques, they extend to longer sequences more readily than learned absolute embeddings.
Grouped-query attention (GQA): fewer key/value heads than query heads; cuts KV-cache memory with small quality tradeoffs.
SwiGLU feed-forward blocks: standard in Llama-class architectures; most of the parameter count lives here.
RMSNorm with pre-normalization: stabilizes training of very deep stacks.

Parameter count (7B, 70B, 405B…) is capacity shorthand. Alignment, data quality, and inference stack often matter more than raw size for a given task.

How LLMs are trained (three phases you should recognize)

1. Pre-training

The model minimizes next-token prediction loss on massive corpora (web text, books, code, forums) that have been filtered, deduplicated, and legally reviewed. It learns grammar, style, coding idioms, and a noisy store of facts. It does not learn to say “I don’t know” politely; that comes later.

Pre-training is cluster-scale work (thousands of GPUs, months). Scaling laws showed that loss often improves predictably with more data and compute until infrastructure or data quality limits dominate.
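The pre-training objective is the average negative log-probability the model assigns to each correct next token, and perplexity is the exponential of that loss. A small sketch, with made-up probabilities standing in for a model's outputs:

```python
import math


def next_token_loss(probs_of_correct_token):
    """Average negative log-likelihood over positions (natural log)."""
    return -sum(math.log(p) for p in probs_of_correct_token) / len(probs_of_correct_token)


def perplexity(loss):
    return math.exp(loss)


# Illustrative values: probability the model gave to the true next token at 4 positions.
loss = next_token_loss([0.5, 0.25, 0.5, 0.25])
print(round(loss, 4), round(perplexity(loss), 4))  # 1.0397 2.8284
```

A perplexity of about 2.83 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 2.83 tokens at each step.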

2. Supervised fine-tuning (SFT)

Curated (instruction, ideal reply) pairs teach the model to follow prompts, use formats, and adopt assistant behavior. SFT is how a base model becomes a chat model.

3. Preference optimization

RLHF (reinforcement learning from human feedback) trains a reward model on human rankings, then optimizes the policy. Alternatives—DPO, ORPO, KTO—skip full RL loops and are popular in open-source fine-tuning. These steps shape tone, refusals, and helpfulness; they do not guarantee factual accuracy on your private data.

Most teams adapt a released model (prompt, RAG, LoRA) rather than pre-training. That is the practical path described in How to Become an AI Developer.

Inference mechanics: KV cache, batching, and throughput

Autoregressive generation would be unbearably slow if every new token recomputed attention over the full history from scratch. Serving systems store key/value (KV) caches per layer for prior tokens and only compute attention for the new position.

Prefill — process the prompt in parallel (often compute-bound); dominates latency for long RAG contexts.
Decode — generate one (or a few) tokens per step (often memory-bandwidth-bound).
Continuous batching — schedulers (vLLM, TensorRT-LLM, TGI) pack requests dynamically to raise GPU utilization.
Speculative decoding — a small draft model proposes tokens; the large model verifies in parallel to cut steps.
Quantization — INT8/FP8/INT4 weights shrink memory and raise tokens/sec; watch eval regressions on your tasks.
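KV-cache size is easy to estimate and often decides how many concurrent requests fit on a GPU. The sketch below uses the standard count of two tensors (keys and values) per layer; the model shape is a hypothetical example, not a specific released model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Keys and values: 2 tensors per layer, each [batch, kv_heads, seq_len, head_dim]."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical 32-layer model, head dimension 128, FP16 cache, 8,192-token sequence.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(mha / 2**30, gqa / 2**30)  # 4.0 1.0 (GiB per sequence)
```

Cutting key/value heads from 32 to 8 shrinks the cache fourfold per sequence, which is the memory saving grouped-query attention buys, and it grows linearly with both sequence length and batch size.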

Self-hosting with vLLM, Ollama, or llama.cpp trades vendor simplicity for control, at the price of running GPU operations yourself. Managed APIs (OpenAI, Anthropic, Bedrock, Azure OpenAI, Vertex) trade margin for time-to-market. For hyperscaler service maps, see AWS vs GCP vs Azure.

Decoding parameters (what “creativity” actually means)

The model outputs logits; your runtime turns them into a token choice:

Parameter	Effect	Typical use
Temperature	Scales logits before softmax; higher = more random	0–0.3 for extraction/JSON; 0.7–1.0 for brainstorming
Top-p (nucleus)	Sample from smallest set of tokens whose cumulative prob ≥ p	0.9–0.95 default for chat
Top-k	Restrict to k highest-probability tokens	Used with or instead of top-p
Max tokens	Hard cap on completion length	Cost and safety guardrail
Stop sequences	End generation when string appears	Prevent run-on; tool-call delimiters
Frequency / presence penalty	Discourage repetition	Long-form writing; use carefully on structured tasks
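The first three rows of the table can be combined in one sampling function. This is a simplified sketch: real runtimes differ in the order they apply filters and in edge-case handling, treating temperature 0 as greedy decoding is a common convention rather than a universal rule, and `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index from raw logits using temperature, top-k and top-p."""
    if temperature == 0:  # treat 0 as greedy decoding
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    # (index, probability) pairs sorted from most to least likely
    ranked = sorted(((i, e / total) for i, e in enumerate(exps)),
                    key=lambda t: t[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for idx, p in ranked:
            kept.append((idx, p))
            cumulative += p
            if cumulative >= top_p:  # smallest set whose mass reaches p
                break
        ranked = kept
    indices = [i for i, _ in ranked]
    weights = [p for _, p in ranked]  # random.choices renormalizes the weights
    return rng.choices(indices, weights=weights)[0]


print(sample_token([1.0, 3.0, 2.0], temperature=0))  # 1
```

Passing a seeded `random.Random` instance as `rng` makes sampling reproducible in tests.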

For structured output (JSON matching a schema), prefer constrained decoding or response-format APIs where available; lower temperature alone does not guarantee valid JSON.

Context windows: capacity is not attention quality

Leading models advertise 128k–1M+ token windows. In practice:

Cost and latency grow with prompt length; long RAG dumps can make prefill dominate.
Lost-in-the-middle — models may under-use information buried mid-prompt; put critical instructions and citations near the start or end.
Effective context for a task is often smaller than the advertised maximum; measure on your eval set.

Strategies: hierarchical summarization, reranking retrieved chunks, metadata filters, and “compress then answer” pipelines instead of stuffing entire wikis into one call.
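One way to act on both the budget and the lost-in-the-middle points is to select chunks greedily by relevance and then place the strongest ones at the edges of the prompt. The sketch below assumes each chunk already carries a reranker score and a token count; `pack_chunks` and the edge-placement heuristic are illustrative choices to validate on your own eval set.

```python
def pack_chunks(chunks, token_budget):
    """chunks: list of (relevance_score, token_count, text).

    Returns (ordered_texts, tokens_used).
    """
    selected, used = [], 0
    for score, tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if used + tokens <= token_budget:  # greedy fill, best-scoring first
            selected.append(text)
            used += tokens
    # Alternate chunks between the front and the back so the weakest land mid-prompt.
    front, back = [], []
    for rank, text in enumerate(selected):
        (front if rank % 2 == 0 else back).append(text)
    return front + back[::-1], used


chunks = [(0.9, 100, "A"), (0.8, 100, "B"), (0.7, 100, "C"), (0.2, 500, "D"), (0.6, 50, "E")]
print(pack_chunks(chunks, token_budget=400))  # (['A', 'C', 'E', 'B'], 350)
```

The best chunk opens the context, the second best closes it, and the oversized low-relevance chunk is dropped instead of crowding out the rest.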

Grounding with RAG (LLM-specific pitfalls)

Retrieval-augmented generation (RAG) feeds the LLM excerpts from your knowledge base before it answers. The LLM still predicts tokens; retrieval reduces—but does not eliminate—hallucination on proprietary facts.

Chunk documents with overlap; store embeddings in a vector index (pgvector, OpenSearch k-NN, Pinecone, etc.).
On query, retrieve top-k chunks (optionally hybrid keyword + vector).
Inject chunks into the prompt with clear delimiters and “answer only from sources” instructions.
Require citations or post-check that claims appear in retrieved text for high-stakes answers.
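The retrieval steps above can be sketched end to end without any external services. Everything here is a stand-in: `embed` is a toy bag-of-words counter where a real system would call an embedding model and a vector index, and the `<source>` delimiters and instruction wording are one possible convention, not a standard.

```python
import math
import re
from collections import Counter


def chunk_text(words, size, overlap):
    """Split a list of words into windows of `size` that share `overlap` words."""
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, sources):
    """Wrap each source in delimiters and instruct the model to stay grounded."""
    blocks = "\n".join(f"<source id={i}>\n{s}\n</source>"
                       for i, s in enumerate(sources, 1))
    return ("Answer only from the sources below and cite source ids. "
            "If the sources do not contain the answer, say so.\n"
            f"{blocks}\nQuestion: {query}")


docs = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
question = "How many days do refunds take?"
top = retrieve(question, docs, k=1)
print(build_prompt(question, top))
```

Delimiters make it easier to tell the model which text is data and which is instruction, but they are not a security boundary: retrieved text can still carry injected instructions.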

Failure modes: wrong chunks retrieved, stale embeddings, duplicated/conflicting sources, and indirect prompt injection when malicious text lives inside a document the model trusts. Refresh pipelines when sources change; version embeddings like schema migrations.

Adaptation beyond prompting

Prompt engineering — system role, few-shot examples, output schema in the prompt; version and A/B like application code.
LoRA / QLoRA — train small adapter matrices on domain tone, tool-call formats, or internal jargon; merge or hot-swap adapters per tenant.
Full fine-tune — rare for most products; risk of catastrophic forgetting; justify with eval gains.
Distillation — train a smaller model to mimic a larger one for cost-sensitive paths.

Fine-tune when prompting cannot stabilize format or behavior; use RAG when facts live outside the model; do not fine-tune to “upload” a knowledge base you could retrieve instead.
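The appeal of LoRA is visible in the parameter arithmetic: instead of updating a full weight matrix, it trains two thin matrices whose product has the same shape. The layer size and rank below are illustrative assumptions.

```python
def lora_param_counts(d_out, d_in, rank):
    """Full weight W is d_out x d_in; LoRA trains B (d_out x r) and A (r x d_in)."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return full, adapter


full, adapter = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full, adapter, f"{adapter / full:.2%}")  # 16777216 65536 0.39%
```

Training well under 1% of the weights per adapted matrix is what makes per-tenant adapters and hot-swapping practical.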

Evaluation: LLMs need product-specific tests

Perplexity on held-out text is a training metric, not a product metric. Build an eval harness:

Golden questions from real users and support tickets.
Rubric-based human review (correctness, tone, safety).
Automated checks: JSON schema validation, citation overlap with retrieved spans, refusal on blocked topics.
LLM-as-judge: useful for scale; weak when the judge shares the same blind spots as the model under test; always spot-check its verdicts with human reviewers.
Regression runs on every model version or prompt change—treat upgrades like dependency bumps with rollback.
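A regression harness for the automated checks can start very small. In this sketch `model_fn` is any callable that maps a prompt to output text (an API call in practice, a stub here), and the check helpers are deliberately crude illustrations; a real suite would validate against a full schema and use fuzzier citation matching.

```python
import json


def check_json_fields(output, required):
    """True if output parses as a JSON object containing all required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required)


def citation_supported(quote, retrieved_spans):
    """Crude grounding check: the quoted text must appear verbatim in a retrieved span."""
    return any(quote.lower() in span.lower() for span in retrieved_spans)


def run_evals(cases, model_fn):
    """cases: list of dicts with 'prompt' and 'check' (a callable on the output).

    Returns (pass_rate, failing_prompts).
    """
    results = [(case["prompt"], bool(case["check"](model_fn(case["prompt"]))))
               for case in cases]
    pass_rate = sum(ok for _, ok in results) / len(results)
    return pass_rate, [prompt for prompt, ok in results if not ok]


def fake_model(prompt):
    # Stub standing in for a real model call.
    return '{"status": "shipped"}' if "order" in prompt else "Sorry, not sure."


cases = [
    {"prompt": "Where is my order?", "check": lambda out: check_json_fields(out, ["status"])},
    {"prompt": "What is the refund window?", "check": lambda out: check_json_fields(out, ["days"])},
]
print(run_evals(cases, fake_model))  # (0.5, ['What is the refund window?'])
```

Running the same cases before and after a prompt or model change turns an upgrade into a diff of pass rates rather than a judgment call.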

Security and abuse

Prompt injection: untrusted content (emails, web pages, tickets) instructs the model to ignore policy. Mitigate by separating system instructions from untrusted channels, filtering outputs, and giving tools least privilege.
Jailbreaks: adversarial prompts bypass refusals. Use defense in depth and monitoring, and do not rely on a single model for safety-critical gates.
Data leakage: PII in prompts can end up in third-party logs. Redact before sending, use regional endpoints, and set retention policies.
Tool abuse: agents with shell or SQL access need allow lists, confirmations, and timeouts.
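For tool abuse in particular, the key habit is validating every model-proposed call before executing it. The sketch below shows only the allow-list part; the registry layout, the `get_order_status` tool, and the `ToolCallRejected` exception are hypothetical, and confirmations and timeouts would wrap this in a real runtime.

```python
ALLOWED_TOOLS = {
    # Hypothetical tool registry: name -> (handler, allowed argument names)
    "get_order_status": (lambda order_id: f"order {order_id}: shipped", {"order_id"}),
}


class ToolCallRejected(Exception):
    pass


def execute_tool_call(call):
    """Validate a model-proposed tool call, then run it."""
    name, args = call.get("name"), call.get("arguments", {})
    if name not in ALLOWED_TOOLS:
        raise ToolCallRejected(f"tool not on allow list: {name!r}")
    handler, allowed_args = ALLOWED_TOOLS[name]
    if not isinstance(args, dict) or set(args) != allowed_args:
        raise ToolCallRejected(f"unexpected arguments for {name}: {args!r}")
    return handler(**args)


print(execute_tool_call({"name": "get_order_status", "arguments": {"order_id": "A17"}}))
# order A17: shipped

try:
    execute_tool_call({"name": "run_shell", "arguments": {"cmd": "rm -rf /"}})
except ToolCallRejected as err:
    print("rejected:", err)
```

The model's output is treated as an untrusted request: anything not explicitly registered, or carrying unexpected arguments, never reaches a handler.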

Governance frameworks such as ISO/IEC 42001 complement technical controls for organizational accountability.

Production stack around the LLM

Client → API gateway (auth, rate limits)
      → Orchestrator (RAG, tools, memory)
      → Model service (API or self-hosted GPU)
      → Vector DB / data plane
      → Observability (tokens, latency, traces)
      → Safety filters + human escalation

Observability — log token usage, latency percentiles, error rates; sample prompts/responses under retention policy.
Caching — identical prompts only; beware stale answers when underlying data changes.
Routing — small model for classification, large model for hard reasoning; fallbacks when primary is down.
FinOps — budgets per team, per-feature token caps; see FinOps: stop guessing what the cloud costs for cost culture that applies to GPU and token bills alike.
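For the caching point, one way to avoid stale answers is to make the cache key cover everything that could change the response, including a version tag for the underlying data. This is a minimal in-memory sketch; `cache_key`, `ResponseCache`, and the `data_version` field are assumptions for illustration, and a production cache would add expiry and shared storage.

```python
import hashlib
import json


def cache_key(model, prompt, params, data_version):
    """Exact-match key: any change to model, prompt, params or data version misses."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params, "data_version": data_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    def __init__(self):
        self._store = {}

    def get_or_call(self, key, call):
        """Return the cached response, or invoke `call()` once and store its result."""
        if key not in self._store:
            self._store[key] = call()
        return self._store[key]


cache = ResponseCache()
calls = []


def fake_llm():
    # Stub standing in for a real model call; records how often it runs.
    calls.append(1)
    return "answer"


k1 = cache_key("model-a", "What is our refund window?", {"temperature": 0}, "kb-v1")
k2 = cache_key("model-a", "What is our refund window?", {"temperature": 0}, "kb-v2")
cache.get_or_call(k1, fake_llm)
cache.get_or_call(k1, fake_llm)  # served from cache
cache.get_or_call(k2, fake_llm)  # knowledge base changed, so this is a miss
print(len(calls))  # 2
```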

Course-level vocabulary from AWS Generative AI Foundations and ML Foundations maps cleanly onto this stack even when you deploy outside AWS.

When an LLM is the wrong tool

Deterministic business rules with audit requirements — use code or rules engines.
Tabular prediction on structured logs — gradient boosting or classical ML often wins on cost and explainability.
Sub-millisecond latency at huge QPS — embeddings + classifiers, not 70B autoregressive decode.
Guaranteed correctness without human review — LLMs are probabilistic pattern completers, not theorem provers.

Open vs closed LLMs (decision sketch)

Choose closed API when…	Choose open weights when…
Speed to market, no GPU team	Data residency, air-gapped, custom fine-tune
Frontier reasoning matters most	Predictable unit economics at high volume
Vendor handles scaling and safety layers	You can operate vLLM/Kubernetes GPU pools

Many mature products use both: frontier API for hard tasks, distilled or open model for high-volume paths.

Where LLMs are heading

Longer context + better retrieval hybrids — raw window size alone does not solve grounding.
Multimodal chat defaults — documents, images, and audio in one thread.
Agents with stronger guardrails — more automation demands platform maturity (identity, audit, SRE).
Smaller capable models — on-device and edge inference for privacy-sensitive flows.
Regulation and procurement scrutiny — model cards, impact assessments, and ISO-style management systems become table stakes.

LLMs are not magic; they are compressed language statistics with a convenient interface. Teams that treat them as managed infrastructure—versioned prompts, measured quality, secured data paths—ship durable products; teams that treat them as oracles ship incident reports.
