AI and ML Terminology: A Practical Glossary for Builders

Job posts, architecture reviews, and vendor decks reuse the same words—often with different meanings. This glossary defines the terms you will see in the wild, groups them by theme, and calls out pairs that people confuse.

In short

AI is the broad field; ML learns from data; deep learning uses neural networks; generative AI creates new outputs. Know training vs inference, parameters vs hyperparameters, RAG vs fine-tuning, and eval vs monitoring—then dive into the deep guides linked below.

How to use this page

Skim the scope diagram and often-confused pairs first. Use the themed sections as a lookup while reading papers, RFCs, or product docs. For implementation depth, follow the links to longer guides on this site—not every term needs a full tutorial here.

Scope: how the terms nest

Artificial Intelligence (AI)
└── Machine Learning (ML) — learn from data
    └── Deep Learning — neural networks with many layers
        └── Generative AI — create text, images, audio, code, …
            └── Large Language Models (LLMs) — language-focused foundation models

Not everything smart is ML. Rule engines, search indexes, and optimization solvers can be “AI” in marketing copy without training on labeled datasets. When someone says “we use AI,” ask whether they mean learned models, heuristics, or vendor APIs.

Often confused (quick reference)

Term A	Term B	Distinction
AI	ML	AI is the umbrella; ML is the subset that improves from data.
ML	Deep learning	Deep learning is ML with deep neural networks—dominant in vision and language today.
Training	Inference	Training adjusts weights offline; inference runs the model to produce outputs in production.
Parameter	Hyperparameter	Parameters are learned during training (weights); hyperparameters are chosen outside the training loop (learning rate, number of layers).
Fine-tuning	RAG	Fine-tuning changes weights; RAG retrieves external documents at query time. See RAG in depth.
Prompting	Fine-tuning	Prompting steers a frozen model; fine-tuning updates weights on domain data.
Embedding	Token	Tokens are the discrete units a model reads and writes; embeddings are dense vectors that represent tokens, passages, or documents.
Accuracy	Loss	Accuracy is a human-facing metric; loss is what optimization minimizes during training.
Model	Dataset	The model is the trained artifact; the dataset is what you train on. Dataset quality caps model quality.
AGI	Gen AI	AGI (artificial general intelligence, still hypothetical) is not the same as generative AI (models that create content).

1. Big picture

Artificial intelligence (AI): Systems that perform tasks associated with human intelligence—perception, language, reasoning, planning. In product language it often means anything automated; in engineering, be specific.
Machine learning (ML): Programs that improve performance on a task through experience (data), without hand-coding every rule.
Deep learning: ML using neural networks with many layers; backbone of modern vision, speech, and language systems.
Generative AI: Models that create new samples (text, images, audio, video, code) rather than only assigning labels. Guide: Generative AI in depth.
Discriminative model: Learns boundaries between classes (e.g. spam vs not spam). Contrasts with generative models that model how data is produced.
Foundation model: Large model pre-trained on broad data, then adapted via prompting, fine-tuning, or RAG. See AI foundation models in depth.
Artificial general intelligence (AGI): Hypothetical systems with human-level generality across tasks—not the same as today’s narrow or foundation models.

2. Learning paradigms

Supervised learning: Train on input–label pairs (e.g. email → spam/ham). Most tabular and classification workflows.
Unsupervised learning: Find structure without labels—clustering, dimensionality reduction, anomaly detection.
Semi-supervised learning: Mix of labeled and unlabeled data—common when labels are expensive.
Self-supervised learning: Create supervision from the data itself (e.g. predict masked words or the next token). Next-token prediction is how most LLMs are pre-trained.
Reinforcement learning (RL): Agent learns via rewards and penalties; used in games, robotics, and RLHF for alignment.
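Reinforcement learning from human feedback (RLHF): Fine-tune models using human preference rankings, typically by training a reward model on those rankings and then optimizing the model against it. A common post-training step for chat models.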
Transfer learning: Reuse a model trained on one task or dataset for another—foundation models are the extreme case.
Online vs batch learning: Online learning updates the model from a stream of new examples; batch learning retrains on scheduled snapshots. Online adapts faster to change, while batch is easier to validate and roll back.
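To make the self-supervised entry concrete, here is a minimal sketch of how raw text can supply its own labels for next-token prediction. The function name `next_token_pairs`, the `context_size` parameter, and the whitespace split (a stand-in for a real subword tokenizer) are illustrative assumptions, not any library's API.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, target) training pairs with no human labels."""
    tokens = text.split()  # assumption: whitespace split stands in for a real tokenizer
    pairs = []
    for i in range(1, len(tokens)):
        # The "label" for each position is simply the token that follows the context.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

Every pair comes from the text itself, which is why this style of pre-training scales to corpora that nobody could label by hand.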

3. Data, features, and labels

Dataset / corpus: Collection of examples used for training or evaluation—quality and diversity matter more than raw size alone.
Label / ground truth: Correct answer for supervised training—errors in labels become errors in the model.
Feature: Input signal the model uses—columns in tabular ML, pixels in vision, tokens in language.
Feature engineering: Designing or transforming inputs (scaling, encoding, aggregations)—still critical for classical ML.
Feature store: Shared, versioned features for training and serving so train/serve skew stays low.
Train / validation / test split: Hold data out to tune hyperparameters (validation) and report final performance (test)—never tune on the test set.
Data leakage: Information from the future or from the target sneaks into features, so offline metrics look great and production fails.
Class imbalance: Rare positive class (fraud, defects)—accuracy misleads; use precision, recall, F1, or cost-sensitive metrics.
Data drift: Input distribution changes over time (new users, new products)—model quality can drop without retraining.
Concept drift: Relationship between inputs and labels changes—the world changed, not just the data mix.
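The train / validation / test entry above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name, the 15% fractions, and the fixed seed are assumptions chosen for the example. For time-ordered data a random shuffle like this can itself leak future information, so a cutoff by date is usually the safer choice there.

```python
import random


def train_val_test_split(examples, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once with a fixed seed, then cut into three disjoint parts."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                    # touched once, for the final report
    val = shuffled[n_test:n_test + n_val]       # used to tune hyperparameters
    train = shuffled[n_test + n_val:]           # used to fit the weights
    return train, val, test


rows = list(range(100))  # hypothetical example IDs
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))
```

Keeping the seed fixed means two experiments can be compared on the same held-out rows.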

4. Models and architecture

Model / artifact: Trained weights plus metadata (version, metrics, lineage)—what you deploy or call via API.
Neural network: Layers of connected units (neurons) with non-linear activations—universal approximators in theory, data-hungry in practice.
Layer / hidden layer: Intermediate representations between input and output—“deep” means many such layers.
Parameter (weight): Learned values in the network—billions in large LLMs (“7B”, “70B” refer to parameter count).
Hyperparameter: Settings chosen outside the training loop, by hand or by a search procedure: learning rate, batch size, architecture depth. Unlike weights, they are not updated by gradient descent during a training run.
CNN (convolutional neural network): Architecture for images and spatial data. Local filters with shared weights give translation equivariance, and pooling adds approximate translation invariance.
RNN / LSTM: Sequence models before transformers dominated—still seen in legacy time-series stacks.
Transformer: Architecture based on self-attention—core of modern LLMs and many multimodal systems.
Attention / self-attention: Mechanism to weigh which parts of the input matter for each output position.
Encoder / decoder: Encoder maps input to representations; decoder generates output—many LLMs are decoder-only.
Multimodal model: Handles more than one modality (text + image, text + audio) in one system.
Ensemble: Combine multiple models (voting, averaging) for stability—common in classical ML competitions.
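As a rough illustration of the attention entry, the sketch below computes single-head scaled dot-product attention in plain Python. It is deliberately simplified: a real transformer derives queries, keys, and values from learned projections of the input and usually adds masking and multiple heads, none of which is shown. The function names and the toy vectors are assumptions for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, without masking."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # how much each position matters for this output
        dim = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs


# Assumption: the raw inputs double as queries, keys, and values (no learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens, tokens, tokens)
print(contextual)
```

Each output row is a weighted average of the value vectors, so it always stays within the range spanned by the inputs.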

5. Training and optimization

Training: Adjust model weights to minimize loss on training data—GPU-heavy for large models.
Pre-training: Large-scale training on broad data (e.g. next-token prediction)—produces a base foundation model.
Fine-tuning: Further training on narrower data to specialize behavior—instruction tuning is a common form.
Instruction tuning: Fine-tune on prompt–response pairs so the model follows user instructions in chat UIs.
Alignment: Post-training to match human values and policies, using methods such as RLHF and constitutional AI, with red-teaming to check the result.
Loss function: Scalar the optimizer minimizes (cross-entropy, MSE, etc.)—proxy for task quality, not always identical to business metrics.
Gradient descent: Update weights in the direction opposite the gradient of the loss, which is the direction that reduces it; backpropagation computes those gradients.
Epoch: One full pass over the training dataset.
Batch size: Examples per gradient update. Larger batches need more memory and change training dynamics (smoother gradient estimates, fewer updates per epoch).
Learning rate: Step size for weight updates—too high diverges; too low trains forever.
Overfitting: Model memorizes training data, fails on new data—regularization, more data, or simpler models help.
Underfitting: Model too simple to capture the signal—needs capacity, features, or longer training.
Regularization: Penalties (L1/L2, dropout) that discourage complexity and reduce overfitting.
Checkpoint: Saved snapshot of weights during training—for resume, A/B, or rollback.
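Several entries in this section (loss function, gradient descent, epoch, learning rate) fit into one small training loop. The sketch below fits a line by full-batch gradient descent on mean squared error; the data, the function name, and the specific learning rates are assumptions picked so the behavior is easy to see, not recommended defaults.

```python
def train_linear(xs, ys, learning_rate=0.05, epochs=500):
    """Fit y ~ w*x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters: learned from data
    n = len(xs)
    history = []
    for _ in range(epochs):  # one epoch = one full pass over the training data
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n          # MSE loss before this update
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w                    # step against the gradient
        b -= learning_rate * grad_b
        history.append(loss)
    return w, b, history


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w, b, history = train_linear(xs, ys)
print(round(w, 3), round(b, 3), history[0], history[-1])

# Same data, learning rate set too high: the loss grows instead of shrinking.
_, _, diverged = train_linear(xs, ys, learning_rate=0.2, epochs=50)
print(diverged[0], diverged[-1])
```

The learning rate and epoch count are hyperparameters here, while `w` and `b` are the parameters; changing only the learning rate is enough to turn a converging run into a diverging one.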

6. Inference and serving

Inference: Running the trained model to get predictions or generations—what users hit in production.
Latency / throughput: Time per request vs requests per second—LLMs care about time-to-first-token and tokens/sec.
Batch inference: Score many inputs together for efficiency—higher latency per item, lower cost per row.
Real-time inference: Low-latency online scoring—often autoscaled endpoints or serverless with cold-start trade-offs.
Model serving: Hosting layer (SageMaker, Triton, vLLM, Bedrock) that loads artifacts and exposes APIs.
Quantization: Lower-precision weights (INT8, INT4) for a smaller memory footprint and faster inference, usually at a small quality cost.
Distillation: Train a smaller “student” model to mimic a larger “teacher”—cheaper deployment.
KV cache: Stores attention keys/values during autoregressive generation—critical for fast LLM inference. See LLMs in depth.
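To show where the quantization trade-off comes from, here is a minimal sketch of symmetric per-tensor INT8 quantization plus the back-of-the-envelope memory arithmetic. Production toolchains typically use per-channel or group-wise scales and calibration data, which this example leaves out; the function names and sample weights are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the shared scale."""
    return [q * scale for q in quantized]


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone (no activations or KV cache), in decimal GB."""
    return num_params * bits_per_param / 8 / 1e9


weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # hypothetical layer weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, max_error)

print(weight_memory_gb(7_000_000_000, 16))  # 14.0
print(weight_memory_gb(7_000_000_000, 4))   # 3.5
```

The rounding error per weight is bounded by half the scale, which is the quality cost; the memory figures show why a "7B" model that needs about 14 GB at 16 bits fits in roughly a quarter of that at 4 bits.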

7. Generative AI and language models

Large language model (LLM): Foundation model for text (and often code)—predicts next tokens; powers chat and agents.
Token: Subword unit the model reads/writes—context limits and API billing are often token-based.
Tokenization: Splitting text into tokens (BPE, SentencePiece)—affects languages, code, and rare words.
Context window: Maximum tokens in one request—prompt + completion must fit.
Prompt: Instructions and context sent to the model—templates and versioning matter in production.
System / user / assistant messages: Chat roles—system sets behavior; user asks; assistant responds.
Completion / generation: Model output stream—often sampled, not deterministic.
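Temperature / top-p / top-k: Sampling controls. Temperature rescales the token distribution (higher is more random, lower is closer to deterministic); top-k keeps only the k most likely tokens; top-p keeps the smallest set of tokens whose cumulative probability reaches p.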
Hallucination: Fluent but false output—mitigate with RAG, citations, evals, and human review.
Grounding: Tying answers to verified sources (docs, DB, tool results).
Closed vs open weights: API-only (GPT-4 class) vs downloadable weights (Llama, Mistral)—privacy, cost, and control differ.
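The sampling controls in this section can be sketched as a single function over raw logits. This is an illustration under assumptions: the order in which filters are applied and the handling of temperature 0 as greedy decoding vary between libraries and APIs, and the function name and toy logits are made up for the example.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token id from raw logits using temperature, then top-k, then top-p."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # (probability, token id) pairs, most likely first.
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k is not None and top_k > 0:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for prob, token_id in ranked:
            kept.append((prob, token_id))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    ids = [token_id for _, token_id in ranked]
    probs = [prob for prob, _ in ranked]
    return rng.choices(ids, weights=probs, k=1)[0]  # choices() renormalizes the weights


logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
rng = random.Random(0)
greedy = sample_token(logits, temperature=0)
samples = [sample_token(logits, temperature=1.5, top_k=2, rng=rng) for _ in range(20)]
print(greedy, samples)
```

With `top_k=2` the least likely token can never appear in `samples`, no matter how high the temperature is set.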

8. RAG, agents, and application patterns

Embedding: Dense vector representation of text or images. Inputs with similar meaning map to nearby vectors, which is what vector search relies on.
Vector database / vector search: Store embeddings and retrieve nearest neighbors—Pinecone, pgvector, OpenSearch k-NN, etc.
Chunking: Splitting documents for retrieval—chunk size and overlap strongly affect RAG quality.
RAG (retrieval-augmented generation): Retrieve relevant chunks, inject into the prompt, then generate—grounds answers in your data. RAG in depth.
Reranking: Second-stage model scores retrieved chunks for relevance before the LLM sees them.
Agent: LLM loop that plans, calls tools (APIs, SQL, code), and iterates—needs guardrails and audit logs.
Tool use / function calling: Model emits structured calls your app executes—calendar, DB, ticket systems.
Prompt injection: Untrusted text tricks the model into ignoring policies. Treat user content, retrieved documents, and tool outputs as untrusted input.
Guardrails: Filters and policies on inputs/outputs—PII redaction, topic blocks, schema validation.
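The chunk, embed, retrieve, and prompt steps described in this section can be sketched end to end with the standard library. Two loud assumptions: the `embed` function below is a bag-of-words counter standing in for a learned embedding model, and the final LLM call is omitted entirely. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=40, overlap=10):
    """Split text into word windows of `size` that share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, size - overlap)]


def embed(text):
    """Stand-in embedding: word counts. Real systems use a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, retrieved):
    """Inject retrieved chunks into the prompt as numbered sources."""
    sources = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


docs = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
query = "How long do refunds take?"
top = retrieve(query, docs, k=1)
prompt = build_prompt(query, top)
print(prompt)

long_text = " ".join(str(i) for i in range(100))
windows = chunk_words(long_text, size=40, overlap=10)
print(len(windows))
```

Note that the word-count stand-in misses "take" versus "takes", which is exactly the kind of near-match a learned embedding is meant to catch.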

9. Evaluation and quality

Metric: Number that summarizes performance—choose metrics aligned with user harm and business goals.
Baseline: Simple reference (rules, random, previous model)—proves the new approach earns its complexity.
Accuracy: Fraction correct—misleading on imbalanced classes.
Precision / recall / F1: Precision: of predicted positives, how many are right. Recall: of actual positives, how many you found. F1 is the harmonic mean of the two.
ROC-AUC / PR-AUC: Threshold-independent summaries for binary classifiers—PR-AUC often better when positives are rare.
RMSE / MAE: Regression errors: root mean squared error vs mean absolute error. RMSE penalizes large errors more heavily; MAE is more robust to outliers.
BLEU / ROUGE: N-gram overlap with reference text—weak alone for open-ended LLM answers.
Perplexity: How surprised the model is by held-out text, computed as the exponential of the average per-token negative log-likelihood (lower is better). Common in language modeling, not always user-facing.
Benchmark: Standardized test suite (MMLU, HumanEval)—useful for comparison, not a substitute for your eval set.
Golden set / eval harness: Curated questions with expected properties—run on every prompt or model change.
LLM-as-judge: Another model scores outputs, which lets evals scale cheaply; validate the judge against human ratings.
A/B test: Live comparison of variants on real traffic—ultimate test for product impact.
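The claim that accuracy misleads on imbalanced classes is easy to check by hand. The sketch below computes accuracy, precision, recall, and F1 from scratch on a made-up fraud dataset (5 positives out of 100); the function names and both sets of predictions are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 100 hypothetical transactions, 5 of them fraudulent (label 1).
y_true = [1] * 5 + [0] * 95

# A "model" that never flags anything still scores 95% accuracy.
always_negative = [0] * 100
print(accuracy(y_true, always_negative))             # 0.95
print(precision_recall_f1(y_true, always_negative))  # (0.0, 0.0, 0.0)

# A model that catches 4 of the 5 frauds at the cost of 2 false alarms.
useful = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93
print(accuracy(y_true, useful))
print(precision_recall_f1(y_true, useful))
```

The two models differ by only two points of accuracy, while recall moves from 0 to 0.8, which is the number a fraud team would actually care about.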

10. Production, MLOps, and platform

MLOps: Practices to ship and operate ML reliably—pipelines, versioning, monitoring, governance.
ML pipeline: Automated flow: ingest → validate → train → evaluate → register → deploy.
Model registry: Catalog of approved artifacts with lineage and promotion stages (staging → prod).
Experiment tracking: Log hyperparameters, metrics, and artifacts per run—Weights & Biases, MLflow, etc.
Train/serve skew: Training features differ from production features—silent quality killer.
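Shadow deployment / canary: Run the new model alongside the old one. A shadow deployment scores mirrored traffic without returning its responses to users; a canary serves a small share of real traffic and grows it gradually once metrics hold.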
Observability: Logs, metrics, traces—plus ML-specific: prediction distributions, data drift, cost per request.
GPU / TPU: Accelerators for training and heavy inference—capacity planning and quotas matter on cloud.
SageMaker / Bedrock (AWS): Managed training/serving vs managed foundation-model APIs—see ML Foundations and Generative AI Foundations.
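One common way to implement the gradual traffic shift of a canary is deterministic hash-based bucketing, sketched below. This is an assumption about how a team might build it rather than a description of any particular platform; the function name, the `"stable"` and `"canary"` labels, and the user IDs are hypothetical. Shadow deployment, where the new model only scores mirrored traffic, is not shown.

```python
import hashlib


def route_request(user_id, canary_percent):
    """Deterministically send a fixed share of users to the canary model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99] for each user
    return "canary" if bucket < canary_percent else "stable"


users = [f"user-{i}" for i in range(10_000)]  # hypothetical user IDs
for percent in (0, 10, 50, 100):
    share = sum(route_request(u, percent) == "canary" for u in users) / len(users)
    print(percent, round(share, 3))
```

Hashing the user ID, rather than drawing a random number per request, keeps each user on the same variant across requests, and raising `canary_percent` only ever moves users from stable to canary.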

11. Safety, ethics, and governance

Bias / fairness: Unequal error or harm across groups—measure disaggregated metrics; fix data and objectives.
Explainability: How much you can justify a prediction—SHAP for tabular; citations and traces for LLMs.
PII / privacy: Personal data in training or logs—minimize, redact, and respect retention policies.
Responsible AI: Safety, transparency, accountability, and compliance built into the lifecycle—not a checklist at launch.
AI management system (e.g. ISO/IEC 42001): Organizational framework for AI risk and governance—see ISO/IEC 42001 AI audits in depth and Lead Auditor notes.
Red teaming: Adversarial testing for misuse, jailbreaks, and harmful outputs before release.

Acronyms at a glance

ML	Machine learning
DL	Deep learning
LLM	Large language model
NLP	Natural language processing
CV	Computer vision
RAG	Retrieval-augmented generation
RLHF	Reinforcement learning from human feedback
MLOps	ML + DevOps practices for production ML
API	Application programming interface (here: model endpoints)
GPU	Graphics processing unit (used for ML compute)
LoRA	Low-rank adaptation—efficient fine-tuning
PEFT	Parameter-efficient fine-tuning

Where to go next

How to become an AI developer — career map, stack, and learning path (includes a shorter glossary).
AI historical paradigms — how the field evolved.
LLMs in depth · Generative AI in depth · RAG in depth

Blog index · Foundation models · AWS ML Foundations
