AI and ML Terminology: A Practical Glossary for Builders

Job posts, architecture reviews, and vendor decks reuse the same words—often with different meanings. This glossary defines the terms you will see in the wild, groups them by theme, and calls out pairs that people confuse.

In short

AI is the broad field; ML learns from data; deep learning uses neural networks; generative AI creates new outputs. Know training vs inference, parameters vs hyperparameters, RAG vs fine-tuning, and eval vs monitoring—then dive into the deep guides linked below.

How to use this page

Skim the scope diagram and often-confused pairs first. Use the themed sections as a lookup while reading papers, RFCs, or product docs. For implementation depth, follow the links to longer guides on this site—not every term needs a full tutorial here.

Scope: how the terms nest

Artificial Intelligence (AI)
└── Machine Learning (ML) — learn from data
    └── Deep Learning — neural networks with many layers
        └── Generative AI — create text, images, audio, code, …
            └── Large Language Models (LLMs) — language-focused foundation models

Not everything smart is ML. Rule engines, search indexes, and optimization solvers can be “AI” in marketing copy without learning anything from data. When someone says “we use AI,” ask whether they mean learned models, heuristics, or vendor APIs.

Often confused (quick reference)

Term A | Term B | Distinction
AI | ML | AI is the umbrella; ML is the subset that improves from data.
ML | Deep learning | Deep learning is ML with deep neural networks; it dominates vision and language today.
Training | Inference | Training adjusts weights offline; inference runs the model to produce outputs in production.
Parameter | Hyperparameter | Parameters are learned (weights); hyperparameters are set by humans (learning rate, layers).
Fine-tuning | RAG | Fine-tuning changes weights; RAG retrieves external documents at query time. See RAG in depth.
Prompting | Fine-tuning | Prompting steers a frozen model; fine-tuning updates weights on domain data.
Embedding | Token | Tokens are model input units; embeddings are dense vectors (often of tokens or documents).
Accuracy | Loss | Accuracy is a human-facing metric; loss is what optimization minimizes during training.
Model | Dataset | The model is the trained artifact; the dataset is what you train on, and dataset quality caps model quality.
AGI | Gen AI | AGI (aspirational general intelligence) ≠ generative AI (models that create content).

1. Big picture

Artificial intelligence (AI)
Systems that perform tasks associated with human intelligence—perception, language, reasoning, planning. In product language it often means anything automated; in engineering, be specific.
Machine learning (ML)
Programs that improve performance on a task through experience (data), without hand-coding every rule.
Deep learning
ML using neural networks with many layers; backbone of modern vision, speech, and language systems.
Generative AI
Models that create new samples (text, images, audio, video, code) rather than only assigning labels. Guide: Generative AI in depth.
Discriminative model
Learns boundaries between classes (e.g. spam vs not spam). Contrasts with generative models that model how data is produced.
Foundation model
Large model pre-trained on broad data, then adapted via prompting, fine-tuning, or RAG. See AI foundation models in depth.
Artificial general intelligence (AGI)
Hypothetical systems with human-level generality across tasks—not the same as today’s narrow or foundation models.

2. Learning paradigms

Supervised learning
Train on input–label pairs (e.g. email → spam/ham). Most tabular and classification workflows.
Unsupervised learning
Find structure without labels—clustering, dimensionality reduction, anomaly detection.
Semi-supervised learning
Mix of labeled and unlabeled data—common when labels are expensive.
Self-supervised learning
Create supervision from the data itself (e.g. predict masked words)—how most LLMs are pre-trained.
Reinforcement learning (RL)
Agent learns via rewards and penalties; used in games, robotics, and RLHF for alignment.
Reinforcement learning from human feedback (RLHF)
Fine-tune models using human preference rankings—common post-training step for chat models.
Transfer learning
Reuse a model trained on one task or dataset for another—foundation models are the extreme case.
Online vs batch learning
Online updates from a stream; batch retrains on scheduled snapshots—production trade-offs differ.
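
To make self-supervised learning concrete, here is a minimal Python sketch of how raw text can be turned into training pairs without human labels. The function names (next_token_pairs, masked_pairs) and the [MASK] placeholder are illustrative assumptions, and whitespace splitting stands in for a real tokenizer. Real masked-language-model training usually masks a random subset of positions; this sketch masks each position in turn so the output is easy to read.

```python
def next_token_pairs(tokens, context_size=3):
    """Build (context, target) pairs: the label is simply the next token."""
    return [
        (tokens[i - context_size:i], tokens[i])
        for i in range(context_size, len(tokens))
    ]


def masked_pairs(tokens, mask="[MASK]"):
    """Build (masked sentence, hidden token) pairs, one per position."""
    return [
        (tokens[:i] + [mask] + tokens[i + 1:], tokens[i])
        for i in range(len(tokens))
    ]


tokens = "the cat sat on the mat".split()
causal = next_token_pairs(tokens)   # next-token prediction, as in LLM pre-training
masked = masked_pairs(tokens)       # masked-word prediction

print(causal[0])   # (['the', 'cat', 'sat'], 'on')
print(masked[1])   # (['the', '[MASK]', 'sat', 'on', 'the', 'mat'], 'cat')
```

In both cases the "label" comes from the text itself, which is why this approach scales to corpora nobody has annotated.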

3. Data, features, and labels

Dataset / corpus
Collection of examples used for training or evaluation—quality and diversity matter more than raw size alone.
Label / ground truth
Correct answer for supervised training—errors in labels become errors in the model.
Feature
Input signal the model uses—columns in tabular ML, pixels in vision, tokens in language.
Feature engineering
Designing or transforming inputs (scaling, encoding, aggregations)—still critical for classical ML.
Feature store
Shared, versioned features for training and serving so train/serve skew stays low.
Train / validation / test split
Hold data out to tune hyperparameters (validation) and report final performance (test)—never tune on the test set.
Data leakage
Information from the future or target sneaks into features—metrics look great, production fails.
Class imbalance
Rare positive class (fraud, defects)—accuracy misleads; use precision, recall, F1, or cost-sensitive metrics.
Data drift
Input distribution changes over time (new users, new products)—model quality can drop without retraining.
Concept drift
Relationship between inputs and labels changes—the world changed, not just the data mix.
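
The train / validation / test split above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended library API: the function name and the 70/15/15 fractions are assumptions. Random shuffling is only appropriate when examples are independent; for time-ordered data, split by time instead, otherwise the shuffle itself can leak future information into training.

```python
import random


def train_val_test_split(examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a fixed seed, then cut into three disjoint parts."""
    items = list(examples)
    random.Random(seed).shuffle(items)   # fixed seed keeps the split reproducible
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]                   # touched once, for the final report
    val = items[n_test:n_test + n_val]      # used to tune hyperparameters
    train = items[n_test + n_val:]          # used to fit weights
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))   # 70 15 15
```

Keeping the seed fixed matters in practice: if the split changes between runs, examples can migrate from test into train and quietly inflate reported metrics.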

4. Models and architecture

Model / artifact
Trained weights plus metadata (version, metrics, lineage)—what you deploy or call via API.
Neural network
Layers of connected units (neurons) with non-linear activations—universal approximators in theory, data-hungry in practice.
Layer / hidden layer
Intermediate representations between input and output—“deep” means many such layers.
Parameter (weight)
Learned values in the network—billions in large LLMs (“7B”, “70B” refer to parameter count).
Hyperparameter
Human-chosen settings such as learning rate, batch size, and architecture depth. They are fixed before a training run (or tuned across runs), not learned by gradient descent the way weights are.
CNN (convolutional neural network)
Architecture for images and spatial data, built on local filters with shared weights; convolution is translation-equivariant, and pooling adds approximate translation invariance.
RNN / LSTM
Sequence models before transformers dominated—still seen in legacy time-series stacks.
Transformer
Architecture based on self-attention—core of modern LLMs and many multimodal systems.
Attention / self-attention
Mechanism to weigh which parts of the input matter for each output position.
Encoder / decoder
Encoder maps input to representations; decoder generates output—many LLMs are decoder-only.
Multimodal model
Handles more than one modality (text + image, text + audio) in one system.
Ensemble
Combine multiple models (voting, averaging) for stability—common in classical ML competitions.
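
As a rough illustration of the attention entry above, here is scaled dot-product attention in plain Python. It is a teaching sketch under simplifying assumptions: a real transformer first maps inputs through learned query, key, and value projections and uses several heads, while here the raw inputs play all three roles. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, weigh every value by how well its key matches the query."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Self-attention: the same three 2-d vectors act as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

Each row of weights says how much one position attends to every other position; the first vector attends more to itself and the third vector than to the second, because their dot products are larger.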

5. Training and optimization

Training
Adjust model weights to minimize loss on training data—GPU-heavy for large models.
Pre-training
Large-scale training on broad data (e.g. next-token prediction)—produces a base foundation model.
Fine-tuning
Further training on narrower data to specialize behavior—instruction tuning is a common form.
Instruction tuning
Fine-tune on prompt–response pairs so the model follows user instructions in chat UIs.
Alignment
Post-training to match human values and policies, using methods such as RLHF and constitutional AI, with red-teaming to test the result.
Loss function
Scalar the optimizer minimizes (cross-entropy, MSE, etc.)—proxy for task quality, not always identical to business metrics.
Gradient descent
Update weights in the direction that reduces loss—backpropagation computes gradients.
Epoch
One full pass over the training dataset.
Batch size
Examples per gradient update. Larger batches need more memory and change training dynamics.
Learning rate
Step size for weight updates. Too high and training diverges; too low and it converges very slowly.
Overfitting
Model memorizes training data, fails on new data—regularization, more data, or simpler models help.
Underfitting
Model too simple to capture the signal—needs capacity, features, or longer training.
Regularization
Penalties (L1/L2, dropout) that discourage complexity and reduce overfitting.
Checkpoint
Saved snapshot of weights during training—for resume, A/B, or rollback.
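
Several of the terms above (parameter, hyperparameter, loss, gradient descent, epoch, learning rate) fit into one small Python sketch. It fits a single weight w so that w * x approximates y, using mean squared error and full-batch gradient descent, so each epoch is exactly one update. The data, function name, and default values are illustrative assumptions.

```python
def train(xs, ys, learning_rate=0.05, epochs=200):
    """Fit y ≈ w * x by full-batch gradient descent on mean squared error."""
    w = 0.0                      # parameter: learned from data
    n = len(xs)
    for _ in range(epochs):      # epochs and learning_rate are hyperparameters
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad            # step against the gradient
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, loss


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]        # generated by y = 2x

w, loss = train(xs, ys)                                       # converges to w ≈ 2
w_bad, loss_bad = train(xs, ys, learning_rate=0.2, epochs=20)  # step too large: diverges
```

The second call shows the learning-rate failure mode directly: with the larger step, each update overshoots the minimum by more than the previous error, so the loss grows instead of shrinking.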

6. Inference and serving

Inference
Running the trained model to get predictions or generations—what users hit in production.
Latency / throughput
Time per request vs requests per second—LLMs care about time-to-first-token and tokens/sec.
Batch inference
Score many inputs together for efficiency—higher latency per item, lower cost per row.
Real-time inference
Low-latency online scoring—often autoscaled endpoints or serverless with cold-start trade-offs.
Model serving
Hosting layer (SageMaker, Triton, vLLM, Bedrock) that loads artifacts and exposes APIs.
Quantization
Lower-precision weights (INT8, INT4) for a smaller memory footprint and faster inference, usually at some cost in quality.
Distillation
Train a smaller “student” model to mimic a larger “teacher”—cheaper deployment.
KV cache
Stores attention keys/values during autoregressive generation—critical for fast LLM inference. See LLMs in depth.
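
To show what quantization does to weights, here is a minimal Python sketch of symmetric INT8 quantization with one scale for the whole list. This is one simple scheme among several; serving stacks typically use more refined variants (per-channel scales, calibration data), and the function names here are assumptions for illustration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats; the rounding error is what quality loss means."""
    return [v * scale for v in q]


weights = [0.82, -1.27, 0.003, 0.5]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)   # [82, -127, 0, 50]
```

Note how the smallest weight (0.003) rounds to zero. Each stored value drops from 32 bits to 8, and the price is a rounding error of at most half the scale per weight.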

7. Generative AI and language models

Large language model (LLM)
Foundation model for text (and often code)—predicts next tokens; powers chat and agents.
Token
Subword unit the model reads/writes—context limits and API billing are often token-based.
Tokenization
Splitting text into tokens (BPE, SentencePiece)—affects languages, code, and rare words.
Context window
Maximum tokens in one request—prompt + completion must fit.
Prompt
Instructions and context sent to the model—templates and versioning matter in production.
System / user / assistant messages
Chat roles—system sets behavior; user asks; assistant responds.
Completion / generation
Model output stream—often sampled, not deterministic.
Temperature / top-p / top-k
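Sampling controls. Higher temperature makes output more random and lower makes it more deterministic; top-k keeps only the k most likely tokens, and top-p keeps the smallest set of tokens whose cumulative probability reaches p.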
Hallucination
Fluent but false output—mitigate with RAG, citations, evals, and human review.
Grounding
Tying answers to verified sources (docs, DB, tool results).
Closed vs open weights
API-only (GPT-4 class) vs downloadable weights (Llama, Mistral)—privacy, cost, and control differ.
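
The sampling controls in this section can be sketched as one function that reshapes a next-token distribution. This is an illustrative Python sketch, not any vendor's implementation: the function name, the toy logits, and the order in which the filters are applied are assumptions, and temperature must be greater than zero.

```python
import math
import random


def next_token_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )
    if top_k is not None:
        ranked = ranked[:top_k]              # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:                # keep the smallest set reaching top_p
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    norm = sum(p for _, p in ranked)         # renormalize what survived
    return {tok: p / norm for tok, p in ranked}


logits = {"cat": 2.0, "dog": 1.0, "car": 0.0, "xylophone": -2.0}
sharp = next_token_probs(logits, temperature=0.5)   # more peaked
flat = next_token_probs(logits, temperature=2.0)    # more uniform
top2 = next_token_probs(logits, top_k=2)            # only cat and dog remain
nucleus = next_token_probs(logits, top_p=0.95)      # drops the unlikely tail

rng = random.Random(0)
tokens, probs = zip(*top2.items())
sample = rng.choices(tokens, weights=probs, k=1)[0]  # draw one token
```

Because the final step is a random draw, the same prompt can produce different completions unless the temperature is very low or the seed is fixed.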

8. RAG, agents, and application patterns

Embedding
Dense vector representation of text or images—similar meaning → nearby vectors in search.
Vector database / vector search
Store embeddings and retrieve nearest neighbors—Pinecone, pgvector, OpenSearch k-NN, etc.
Chunking
Splitting documents for retrieval—chunk size and overlap strongly affect RAG quality.
RAG (retrieval-augmented generation)
Retrieve relevant chunks, inject into the prompt, then generate—grounds answers in your data. RAG in depth.
Reranking
Second-stage model scores retrieved chunks for relevance before the LLM sees them.
Agent
LLM loop that plans, calls tools (APIs, SQL, code), and iterates—needs guardrails and audit logs.
Tool use / function calling
Model emits structured calls your app executes—calendar, DB, ticket systems.
Prompt injection
Untrusted text tricks the model into ignoring policies—treat user content as hostile input.
Guardrails
Filters and policies on inputs/outputs—PII redaction, topic blocks, schema validation.
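
The chunk, embed, retrieve, and inject steps described in this section can be sketched end to end in Python. Everything here is a toy under stated assumptions: embed uses word counts as a stand-in for a real embedding model, the in-memory list stands in for a vector database, and the prompt template and sample documents are made up for illustration. No LLM is called; the sketch stops at the assembled prompt.

```python
import math
from collections import Counter


def chunk_words(text, size=8, overlap=2):
    """Split text into word windows of `size` that share `overlap` words."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


docs = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available by email on weekdays.",
    "Shipping takes 3 to 5 business days.",
]
question = "how many days until refunds are issued"
top = retrieve(question, docs, k=1)
prompt = (
    "Answer using only this context:\n"
    + "\n".join(top)
    + "\n\nQuestion: " + question
)
pieces = chunk_words("a b c d e f g h i j", size=4, overlap=1)
```

The overlap in chunk_words exists so that a sentence straddling a chunk boundary still appears whole in at least one chunk, which is one reason chunk size and overlap affect retrieval quality.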

9. Evaluation and quality

Metric
Number that summarizes performance—choose metrics aligned with user harm and business goals.
Baseline
Simple reference (rules, random, previous model)—proves the new approach earns its complexity.
Accuracy
Fraction correct—misleading on imbalanced classes.
Precision / recall / F1
Precision: of predicted positives, how many are right. Recall: of actual positives, how many you found. F1 balances both.
ROC-AUC / PR-AUC
Threshold-independent summaries for binary classifiers—PR-AUC often better when positives are rare.
RMSE / MAE
Regression errors: root mean squared error vs mean absolute error. RMSE penalizes large errors more heavily.
BLEU / ROUGE
N-gram overlap with reference text—weak alone for open-ended LLM answers.
Perplexity
How surprised the model is by held-out text—common in language modeling, not always user-facing.
Benchmark
Standardized test suite (MMLU, HumanEval)—useful for comparison, not a substitute for your eval set.
Golden set / eval harness
Curated questions with expected properties—run on every prompt or model change.
LLM-as-judge
Another model scores outputs, which scales evals; validate the judge against human ratings.
A/B test
Live comparison of variants on real traffic—ultimate test for product impact.
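
The warning that accuracy misleads on imbalanced classes is easy to verify with a small Python sketch. The data below is made up for illustration: 100 transactions, 5 of them fraud. A "model" that never flags fraud scores 95% accuracy while finding nothing, which is why a baseline and recall belong next to every accuracy number.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1] * 5 + [0] * 95                     # 5 fraud cases out of 100
always_negative = [0] * 100                     # baseline: never flag anything
model_pred = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93   # 3 caught, 2 missed, 2 false alarms

baseline = binary_metrics(y_true, always_negative)
model = binary_metrics(y_true, model_pred)

print(baseline["accuracy"], baseline["recall"])   # 0.95 0.0
print(model["accuracy"], model["precision"])      # 0.96 0.6
```

Accuracy moves by one point between the two, while recall goes from 0 to 0.6. On rare-positive problems the second number is the one that tracks user harm.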

10. Production, MLOps, and platform

MLOps
Practices to ship and operate ML reliably—pipelines, versioning, monitoring, governance.
ML pipeline
Automated flow: ingest → validate → train → evaluate → register → deploy.
Model registry
Catalog of approved artifacts with lineage and promotion stages (staging → prod).
Experiment tracking
Log hyperparameters, metrics, and artifacts per run—Weights & Biases, MLflow, etc.
Train/serve skew
Training features differ from production features—silent quality killer.
Shadow deployment / canary
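Shadow: run the new model alongside the old on live traffic without serving its outputs. Canary: send a small share of traffic to the new model and shift more once metrics hold.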
Observability
Logs, metrics, traces—plus ML-specific: prediction distributions, data drift, cost per request.
GPU / TPU
Accelerators for training and heavy inference—capacity planning and quotas matter on cloud.
SageMaker / Bedrock (AWS)
Managed training/serving vs managed foundation-model APIs—see ML Foundations and Generative AI Foundations.
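
One way to turn "monitor data drift" from the observability entry into a number is the population stability index (PSI), which compares how a feature is distributed in training data versus recent production data. The Python sketch below is a simplified version under stated assumptions: equal-width bins taken from the training range, a small epsilon for empty bins, and synthetic data. A common rule of thumb treats values above roughly 0.25 as a significant shift, but treat that as a starting point to calibrate, not a standard.

```python
import math


def psi(expected, actual, bins=5, eps=1e-6):
    """Population stability index between a reference sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0          # guard against a constant feature

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [i / 100 for i in range(100)]         # spread evenly over [0, 1)
same = [i / 50 for i in range(50)]               # same shape, fewer rows
shifted = [0.5 + i / 200 for i in range(100)]    # everything moved to the upper half

stable_score = psi(training, same)
drift_score = psi(training, shifted)
```

A check like this only detects data drift in one feature. Concept drift, where the input distribution looks the same but the right answer changed, needs labeled outcomes to catch.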

11. Safety, ethics, and governance

Bias / fairness
Unequal error or harm across groups—measure disaggregated metrics; fix data and objectives.
Explainability
How much you can justify a prediction—SHAP for tabular; citations and traces for LLMs.
PII / privacy
Personal data in training or logs—minimize, redact, and respect retention policies.
Responsible AI
Safety, transparency, accountability, and compliance built into the lifecycle—not a checklist at launch.
AI management system (e.g. ISO/IEC 42001)
Organizational framework for AI risk and governance—see ISO/IEC 42001 AI audits in depth and Lead Auditor notes.
Red teaming
Adversarial testing for misuse, jailbreaks, and harmful outputs before release.
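
"Measure disaggregated metrics" from the bias / fairness entry can be sketched as computing the same metric per group instead of once overall. The Python below is a minimal illustration: the group labels and records are synthetic, and recall is just one of several metrics worth splitting this way.

```python
from collections import defaultdict


def recall_by_group(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 labels."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:
            positives[group] += 1
            hits[group] += int(y_pred == 1)
    return {g: hits[g] / positives[g] for g in positives}


records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 0), ("B", 1, 0), ("B", 0, 0),
]
by_group = recall_by_group(records)

# Overall recall hides the gap between groups.
positives = [r for r in records if r[1] == 1]
overall = sum(1 for r in positives if r[2] == 1) / len(positives)
```

Here overall recall is 0.5, but group A sits near 0.67 and group B near 0.33. The aggregate number alone would not reveal that the model misses twice as many true cases in one group.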

Acronyms at a glance

ML | Machine learning
DL | Deep learning
LLM | Large language model
NLP | Natural language processing
CV | Computer vision
RAG | Retrieval-augmented generation
RLHF | Reinforcement learning from human feedback
MLOps | ML + DevOps practices for production ML
API | Application programming interface (here: model endpoints)
GPU | Graphics processing unit (used for ML compute)
LoRA | Low-rank adaptation, an efficient fine-tuning method
PEFT | Parameter-efficient fine-tuning

Where to go next

Blog index · Foundation models · AWS ML Foundations
