Velero in Depth: Backup, Restore, and Disaster Recovery for Kubernetes
Velero is the open-source tool most platform teams reach for when they need portable Kubernetes backups: cluster resources via the API, optional persistent volume snapshots or file-system copies, schedules, and restores into the same cluster or a new one. It does not replace your cloud provider’s control plane, but it gives you application-level DR you can test, audit, and run on a schedule.
In short
Install Velero with a BackupStorageLocation (object store) and optionally a VolumeSnapshotLocation → create Backup or Schedule resources (or use the CLI) → Velero serializes API objects and volume data to S3/GCS/Azure/MinIO → Restore recreates resources and rehydrates volumes. Pair with RBAC, encryption, restore drills, and clear scope (namespaces vs cluster). etcd snapshots and Velero solve different problems—use both in mature DR design.
Why Velero exists
Kubernetes state lives in etcd (the control plane datastore) and in PersistentVolumes (data on disks, CSI drivers, cloud volumes). When something goes wrong—a bad upgrade, accidental kubectl delete, AZ loss, ransomware, or a cluster you must abandon—you need a way to put workloads back.
Options teams confuse:
- etcd backup — Point-in-time recovery of the entire API object store; powerful for control-plane disaster, heavy operationally, not portable across cloud SKUs, and often restricted on managed Kubernetes (EKS, GKE, AKS).
- Cloud “cluster backup” add-ons — Convenient but vendor-shaped; harder to restore into a lab or another region with different networking.
- GitOps repos — Great for desired state of Deployments and ConfigMaps you committed; useless for PVC contents, Secrets you never committed, CRD instance data, or resources created only in-cluster.
- Volume snapshots alone — Restore disks but not Deployments, Services, Ingress, or RBAC that pointed at them.
Velero (originally Heptio Ark) backs up and restores the Kubernetes API resources you select, plus persistent volume data through CSI snapshots or node-level file-system backup (Kopia). Backups land in object storage you control, so restores are repeatable across clusters, regions, and providers when plugins and storage classes align.
For cluster anatomy and what Velero does not back up (nodes, kubelet config), see Kubernetes architecture in simple terms. For volume mechanics, see PV, PVC, and StorageClass and CRI and CSI.
Velero vs etcd backup vs application-level backup
| Approach | What is captured | Typical use | Limitation |
|---|---|---|---|
| Velero | Selected API resources + PV data (snapshot or FS) | Namespace DR, migration, pre-upgrade safety net | Must include CRDs/operators; restore order and storage class mapping need planning |
| etcd snapshot | Whole cluster state in etcd | Control-plane catastrophe on self-managed clusters | Managed K8s often disallows; not granular; version skew sensitive |
| Database / app backup | Logical data (Postgres dump, S3 replication) | RPO/RTO for business data | Does not recreate K8s Services, Ingress, or Helm releases |
| GitOps | Manifests in Git | Day-to-day desired state | No PVC bytes; drift if someone edited live objects |
Production DR usually layers these: GitOps for steady state, Velero for namespace-scoped “rewind,” application backup for databases, and provider snapshots or replication where RPO demands it. Velero is the Kubernetes-native glue between “we have YAML in Git” and “we have data on disk.”
Architecture: server, node agent, storage locations, plugins
A typical Velero install includes:
- Velero server: Deployment in the velero namespace; runs controllers that process Backup, Restore, Schedule, and related CRDs; talks to the Kubernetes API and object storage.
- Node agent: DaemonSet on each node (or eligible nodes) that performs file-system volume backups using Kopia (successor to Restic in modern Velero). Required when you use defaultVolumesToFsBackup or pod annotations for FS backup instead of CSI snapshots.
- BackupStorageLocation (BSL): Where backup tarballs and metadata JSON go (S3, GCS, Azure Blob, MinIO, etc.).
- VolumeSnapshotLocation (VSL): Optional; tells Velero which cloud snapshot API to use per provider plugin (EBS, PD, Azure Disk, Portworx, etc.).
- Plugins: Init containers or separate images that register provider-specific object-store and snapshot behavior (AWS, Azure, GCP, vSphere, CSI, …).
kubectl API Velero server Object store (BSL)
│ │ │
│◄── list/watch resources ───────┤──► backup tarball + metadata ─►│ S3 / GCS / …
│ │
CSI / node agent ◄── volume data ────┤──► optional snapshot in cloud ─►│ EBS / PD / …
Backup flow in plain language:
- You trigger a backup (CLI or Backup CR).
- Velero discovers resources matching scope (whole cluster, namespaces, label selectors, resource types).
- It backs up API objects in a defined order (CRDs and namespaces before dependents where possible).
- For each PVC, Velero either creates a volume snapshot via VSL/CSI or runs a pod volume backup via the node agent into the BSL.
- Metadata and object JSON are compressed into files under a backup prefix in the bucket.
Restore reverses the flow: objects are recreated in the target cluster; snapshots are cloned or FS data is copied into new PVCs according to your restore flags and storage class mappings.
Core CRDs and objects
Velero extends the API with custom resources in the velero.io API group (commonly velero.io/v1; the exact version follows your install):
- Backup — One-shot backup; status fields show phase, errors, volume counts, expiration.
- Restore — Recreates resources from a named backup; supports namespace mapping, resource filters, and item actions.
- Schedule — Cron-driven recurring backups; links to a Backup spec template.
- BackupStorageLocation — Bucket/prefix, credentials, default flag.
- VolumeSnapshotLocation — Region/project and snapshot class for a cloud.
- DeleteBackupRequest — Async removal of backup data from object storage.
- DownloadRequest — Pull logs or partial backup contents for support.
You can operate entirely through velero CLI (which creates these CRs) or manage them with GitOps like any other cluster resource—see GitOps principles.
Installation on a lab cluster
Official install uses the Velero CLI to lay down server, RBAC, and CRDs, then register BSL/VSL. Example for AWS (EKS) with S3 and EBS snapshots—adjust bucket, region, and ARNs:
# Install CLI (macOS example)
brew install velero
# Velero server + AWS plugin (credentials via env or IRSA on the server SA)
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.10.0 \
--bucket my-cluster-backups \
--backup-location-config region=eu-west-1,s3ForcePathStyle=false \
--snapshot-location-config region=eu-west-1 \
--secret-file ./credentials-velero
kubectl -n velero get pods
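The install flags above end up as two custom resources in the velero namespace. A minimal sketch of their declarative equivalents, reusing the bucket and region from the command; the resource name default is an assumption about your install, so check the fields against the CRD version you actually run:

```yaml
# Sketch only: mirrors --bucket, --backup-location-config and
# --snapshot-location-config from the install command.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default              # assumed name
  namespace: velero
spec:
  provider: aws
  default: true
  objectStorage:
    bucket: my-cluster-backups
  config:
    region: eu-west-1
    s3ForcePathStyle: "false"   # config values are strings
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default              # assumed name
  namespace: velero
spec:
  provider: aws
  config:
    region: eu-west-1
```

Keeping these in Git makes a bucket or region change reviewable like any other manifest.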
On EKS, prefer IRSA (IAM Roles for Service Accounts) over long-lived keys in a Secret—aligns with least-privilege RBAC and IAM policy design. The Velero service account assumes a role that can write to the backup bucket and create/describe snapshots in the target region.
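With IRSA, the link between the Velero service account and the IAM role is a single annotation. A hedged sketch; the account ID and role name are placeholders, and the service account name velero assumes a default install:

```yaml
# Sketch only: the role ARN below is hypothetical.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: velero               # assumed default service account name
  namespace: velero
  annotations:
    # The IAM role needs S3 access to the backup bucket and
    # EC2 snapshot permissions in the target region.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/velero-backup
```

When credentials come from the role, the --secret-file flag from the install command is no longer needed.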
Helm chart (vmware-tanzu/velero) is common in production: pin image and plugin versions, set resources, node agent tolerations for tainted nodes, and metrics ServiceMonitor if you use Prometheus.
Backup scope: what gets included
Velero backs up resources by querying the API server—not by ssh-ing to nodes. You control scope with flags or the Backup spec:
- --include-namespaces / --exclude-namespaces: Most teams back up production namespaces and skip kube-system, velero, and monitoring sandboxes unless required.
- --include-resources / --exclude-resources: e.g. skip events and endpointslices to shrink size.
- --selector: Label selector on resources (useful for tenant-based backup).
- Cluster-scoped resources (CRDs, ClusterRoles, PVs): include when restoring into a fresh cluster needs them; exclude when only app namespaces matter.
- Hooks: Pre/post backup exec hooks in Pods (quiesce databases, flush buffers); see below.
# Namespace-scoped backup with FS default for all PVCs in that backup
velero backup create app-daily \
--include-namespaces production,staging \
--default-volumes-to-fs-backup \
--ttl 720h0m0s
velero backup describe app-daily --details
TTL on backups and schedules garbage-collects old restore points in object storage—set explicitly so buckets do not grow without bound (FinOps tie-in: FinOps in plain English).
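The same scope can be written as a Backup resource instead of CLI flags. A minimal sketch that mirrors the app-daily command above; the excludedResources entry is an illustrative addition, not part of the original command:

```yaml
# Sketch only: declarative form of "velero backup create app-daily ...".
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: app-daily
  namespace: velero
spec:
  includedNamespaces:
    - production
    - staging
  excludedResources:
    - events                 # illustrative: trims noisy objects
  defaultVolumesToFsBackup: true
  ttl: 720h0m0s              # 30 days, then garbage-collected
```

Applying this with kubectl has the same effect as the CLI call, since the CLI creates the same kind of object.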
Persistent volumes: snapshots, CSI, and file-system backup
This is where most backup designs succeed or fail.
Cloud snapshot path (VSL / CSI)
When a VSL or CSI snapshot driver is configured, Velero asks the storage layer for a snapshot of the volume backing a PVC. Restore creates a new volume from that snapshot (or from a copied snapshot in another region if you replicate). Requirements:
- Storage class and CSI driver supported by your Velero plugin.
- IAM permissions for snapshot create/describe/delete.
- Same cloud account/region—or a documented copy process for cross-region DR.
Fast and efficient for large databases on EBS/Persistent Disk; not all drivers expose snapshot APIs Velero can call uniformly.
File-system backup (node agent / Kopia)
For local-path volumes, NFS without snapshots, or when snapshot APIs are unavailable, the node agent running on the Pod’s node reads the volume through the kubelet’s pod volume directory on the host and streams file data to the BSL using Kopia. Enable per backup with --default-volumes-to-fs-backup or per-Pod annotation:
metadata:
  annotations:
    backup.velero.io/backup-volumes: data,config
Trade-offs: backup duration scales with data size and disk IO; node agent must run on nodes hosting the Pods; restores need compatible StorageClasses and enough node disk for restore pods.
Opt-out
Ephemeral caches or emptyDir-only apps can skip volume backup:
backup.velero.io/backup-volumes-excludes: scratch
Document which StatefulSets require snapshots vs FS backup in your platform runbook—operators like Postgres often need hooks plus consistent snapshot timing.
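The annotations name Pod volumes, not PVCs, and they belong on the Pod template rather than on the workload object. A sketch of a StatefulSet that opts two volumes in; the workload name, image, and claim names are hypothetical:

```yaml
# Sketch only: names and image are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: reports
  namespace: production
spec:
  serviceName: reports
  replicas: 1
  selector:
    matchLabels:
      app: reports
  template:
    metadata:
      labels:
        app: reports
      annotations:
        # Values must match entries under spec.volumes / volumeClaimTemplates.
        backup.velero.io/backup-volumes: data,config
    spec:
      containers:
        - name: app
          image: registry.example.com/reports:1.0
          volumeMounts:
            - {name: data, mountPath: /var/lib/reports}
            - {name: config, mountPath: /etc/reports}
            - {name: scratch, mountPath: /tmp/scratch}
      volumes:
        - name: config
          persistentVolumeClaim:
            claimName: reports-config
        - name: scratch        # not listed in the annotation, so not backed up
          emptyDir: {}
  volumeClaimTemplates:
    - metadata:
        name: data             # this is the Pod volume name Velero sees
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

If the backup already sets --default-volumes-to-fs-backup, the opt-in annotation is redundant and backup-volumes-excludes is the one that matters.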
Hooks: quiescing applications
Crash-consistent copies of running databases risk corruption. Velero supports backup hooks—commands executed in containers before/after backup:
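apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Hook commands run as processes in the container, so SQL has to go through psql.
    # Assumes PostgreSQL 14 or earlier (exclusive backup mode; these functions were
    # removed in PostgreSQL 15) and a local "postgres" superuser.
    pre.hook.backup.velero.io/container: postgres
    pre.hook.backup.velero.io/command: '["/bin/bash","-c","psql -U postgres -c \"SELECT pg_start_backup(''velero'', true);\""]'
    post.hook.backup.velero.io/container: postgres
    post.hook.backup.velero.io/command: '["/bin/bash","-c","psql -U postgres -c \"SELECT pg_stop_backup();\""]'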
Use provider-native tools when hooks are insufficient (e.g. logical dumps to S3 alongside Velero for Postgres). Hooks are not a substitute for application RPO testing.
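Hooks can also live in the Backup spec, which keeps them out of the workload manifests. A hedged sketch that runs a CHECKPOINT before the volume copy; the Pod label, database user, and backup name are assumptions, and a checkpoint only shortens crash recovery, it does not make the copy application-consistent:

```yaml
# Sketch only: label, user, and backup name are hypothetical.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: app-daily-hooked
  namespace: velero
spec:
  includedNamespaces:
    - production
  hooks:
    resources:
      - name: postgres-checkpoint
        includedNamespaces:
          - production
        labelSelector:
          matchLabels:
            app: postgres          # assumed Pod label
        pre:
          - exec:
              container: postgres
              command:
                - /bin/bash
                - -c
                - psql -U postgres -c "CHECKPOINT;"
              onError: Fail        # stop backing up this Pod if the hook fails
              timeout: 60s
```

Setting onError to Fail surfaces a broken hook in the backup status instead of silently taking an unquiesced copy.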
Schedules and backup sync
velero schedule create nightly-prod \
--schedule="0 2 * * *" \
--include-namespaces production \
--ttl 336h0m0s
Velero’s backup sync controller (when enabled) can mirror object storage backups into another cluster’s view—useful for hub-and-spoke DR where a standby cluster lists backups taken elsewhere. Design encryption in transit (TLS to S3) and at rest (SSE-KMS, GCS CMEK) on the bucket—see S3 in depth for bucket policies and encryption patterns.
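The schedule command above corresponds to a Schedule resource whose template is an ordinary Backup spec. A minimal sketch with the same cron expression, namespace, and TTL:

```yaml
# Sketch only: declarative form of "velero schedule create nightly-prod ...".
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-prod
  namespace: velero
spec:
  schedule: "0 2 * * *"        # daily at 02:00
  template:
    includedNamespaces:
      - production
    ttl: 336h0m0s              # 14 days
```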
Restore: same cluster, new namespace, or disaster cluster
# Restore into same cluster (failed namespace delete, bad deploy)
velero restore create restore-app-daily --from-backup app-daily
# Map namespaces on restore (drill: prod → prod-drill)
velero restore create drill-2026-05 \
--from-backup app-daily \
--namespace-mappings production:production-drill
# Restore only some resources
velero restore create partial \
--from-backup app-daily \
--include-resources deployments,services,configmaps,secrets,persistentvolumeclaims
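The drill restore above can also be expressed as a Restore resource, which is convenient when drills are triggered from a pipeline. A minimal sketch using the same backup and namespace mapping; restorePVs is shown explicitly as an illustration:

```yaml
# Sketch only: declarative form of the drill-2026-05 restore.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: drill-2026-05
  namespace: velero
spec:
  backupName: app-daily
  includedNamespaces:
    - production
  namespaceMapping:
    production: production-drill   # source namespace: target namespace
  restorePVs: true
```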
Important restore behaviors:
- Storage class mapping: when the target cluster uses different StorageClasses (e.g. gp2 → gp3), map them with Velero’s change-storage-class plugin ConfigMap in the velero namespace (label velero.io/change-storage-class: RestoreItemAction).
- Node ports and LoadBalancers: May conflict if restored alongside live Services; use namespace mapping or selective resource restore for drills.
- Admission webhooks and CRDs: Restore CRDs before CR instances; Velero’s restore order handles much of this, but custom validating webhooks can block creates, so test restores quarterly.
- Resource policies: Velero 1.12+ resource policies can skip cluster-scoped objects or transform labels on restore; worth adopting for multi-tenant platforms.
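For the storage class mapping item in the list above, the mapping is typically supplied as a labeled ConfigMap that Velero’s restore item action reads. A sketch for the gp2 to gp3 case; the ConfigMap name is arbitrary, and the labels should be checked against the documentation for your Velero version:

```yaml
# Sketch only: maps StorageClass names during restore.
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config   # arbitrary name
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  # <StorageClass in the backup>: <StorageClass in the target cluster>
  gp2: gp3
```

Create it in the target cluster before starting the restore; PVCs and PVs that reference gp2 should then come back bound to gp3.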
Cross-cluster migration: install Velero on destination with access to the same BSL (or replicated bucket), then restore with mappings for Ingress class, TLS Secrets, and external DNS. Pair with incident and disaster response runbooks so restores are practiced, not invented during an outage.
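On the destination cluster, pointing a read-only BackupStorageLocation at the source bucket lets you restore without any risk of the standby writing into or deleting from it. A minimal sketch, assuming the AWS bucket and region from the install example; the location name is hypothetical:

```yaml
# Sketch only: applied on the destination (standby) cluster.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: source-cluster         # hypothetical name
  namespace: velero
spec:
  provider: aws
  accessMode: ReadOnly         # restores allowed, no new backups or deletions
  objectStorage:
    bucket: my-cluster-backups
  config:
    region: eu-west-1
```

Once the location reports as available, backups taken by the source cluster appear in velero backup get and can be passed to --from-backup.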
CLI reference (commands you use weekly)
| Command | Purpose |
|---|---|
| velero backup create | On-demand backup |
| velero backup get / describe / logs | Status and troubleshooting |
| velero backup delete | Remove backup from bucket (async) |
| velero restore create | Restore from backup |
| velero restore get / describe / logs | Restore progress and errors |
| velero schedule create / get | Cron backups |
| velero snapshot-location get | Verify VSL |
| velero plugin add / get | Provider plugins |
| velero version | Client/server compatibility |
Security, RBAC, and compliance
- Velero’s cluster RBAC is broad by necessity—it can read Secrets and all namespaced resources you include. Restrict who can create Backup/Restore CRs; use separate service accounts per environment.
- Object store IAM — Write-only for backup jobs, read for restore roles; deny public ACLs; enable versioning and MFA delete on production buckets where policy allows.
- Encryption — SSE on bucket; optionally client-side encryption features in Velero for regulated data.
- Secrets in backups — Backups contain Secret objects; treat buckets as sensitive as etcd. Some teams exclude Secrets and rely on External Secrets Operator to rehydrate—document the trade-off.
- Audit — Log Velero API activity; alert on unexpected Restore objects in production.
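Restricting who can create Backup and Restore objects is ordinary Kubernetes RBAC on the velero.io API group. A hedged sketch that lets one group create backups but only read restores; the role and group names are hypothetical:

```yaml
# Sketch only: role and group names are placeholders.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: velero-backup-operator
  namespace: velero
rules:
  - apiGroups: ["velero.io"]
    resources: ["backups", "schedules"]
    verbs: ["get", "list", "watch", "create"]
  - apiGroups: ["velero.io"]
    resources: ["restores"]
    verbs: ["get", "list", "watch"]   # no create: restores need a separate role
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: velero-backup-operator
  namespace: velero
subjects:
  - kind: Group
    name: platform-oncall             # hypothetical group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: velero-backup-operator
  apiGroup: rbac.authorization.k8s.io
```

Anyone who can create a Backup can indirectly export Secrets from the included namespaces, so treat even this role as privileged.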
Observability and operations
- Expose Velero metrics (velero_backup_attempt_total, velero_restore_success_total, failure counters) to Prometheus; dashboard backup last success time per schedule.
- Alert when scheduled backup phase is Failed or PartiallyFailed for N runs.
- Run restore drills into isolated namespaces monthly; measure RTO honestly (large FS restores are not instant).
- Version skew: keep CLI within one minor version of server; upgrade server before node agents.
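The alerting items above can be sketched as a PrometheusRule, assuming the Prometheus Operator is installed and Velero’s metrics are already scraped. The metric and label names follow Velero’s velero_ prefix convention but should be verified against your /metrics output; the thresholds and rule names are illustrative:

```yaml
# Sketch only: verify metric names and the schedule label before relying on this.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-alerts          # hypothetical name
  namespace: velero
spec:
  groups:
    - name: velero.backups
      rules:
        - alert: VeleroBackupStale
          # No successful nightly-prod backup in the last 26 hours.
          expr: time() - velero_backup_last_successful_timestamp{schedule="nightly-prod"} > 93600
          for: 15m
          labels:
            severity: page
          annotations:
            summary: No recent successful Velero backup for nightly-prod
        - alert: VeleroBackupFailed
          # Any failed backup for the schedule in the last day.
          expr: increase(velero_backup_failure_total{schedule="nightly-prod"}[24h]) > 0
          labels:
            severity: ticket
          annotations:
            summary: Velero backup failures for nightly-prod in the last 24h
```

Alerting on staleness catches the case where the schedule itself stops firing, which a failure counter alone would miss.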
Production readiness checklist
- BSL in a dedicated account/region with replication or cross-region copy for DR targets.
- VSL or CSI snapshot path tested for every StorageClass used in production.
- FS backup node agent scheduled on all node pools that run stateful Pods (taints/tolerations verified).
- Schedules with TTL; lifecycle rules on bucket for cost control.
- Hooks or logical backup for databases with strict consistency requirements.
- Documented namespace include/exclude list; no accidental full-cluster backup of every Event.
- Restore runbook with namespace mappings, storage class map, and Ingress/DNS steps.
- GitOps still owns steady-state manifests; Velero complements, not replaces, Git.
Troubleshooting playbook
- Backup PartiallyFailed: velero backup logs <name>; check volume snapshot permissions, pod volume backup timeouts, or excluded resources.
- Volume snapshot errors: Verify VSL region, CSI driver, and that PVC is Bound; IAM ec2:CreateSnapshot (AWS) or equivalent.
- FS backup stuck: Node agent not on node; Pod has restrictive affinity; insufficient disk; annotation typo on volume names.
- Restore errors on webhooks: Temporarily relax validating webhook or restore CRDs first; use the --preserve-nodeports flag only when understood.
- Empty bucket prefix: Wrong BSL credentials or region; path-style vs virtual-host S3 config.
- Plugin mismatch: Server image and velero-plugin-for-* version incompatibility after upgrade.
kubectl get backup,restore,schedule -n velero
velero backup describe <name> --details
velero backup logs <name>
kubectl logs -n velero deploy/velero -c velero --tail=200
kubectl logs -n velero daemonset/node-agent --tail=100
For general Pod and PVC debugging after restore, use the Kubernetes troubleshooting playbook.
Common pitfalls
- Assuming Velero backs up etcd — It backs up API objects you scope; cluster membership and some control-plane state are out of band on managed K8s.
- Never testing restore — Backups without tested restores are wishful thinking.
- Backing up everything including kube-system — Bloated backups and restore conflicts; be deliberate.
- Ignoring CRD/operator order — Restore app CRs before operator is running → failed restores.
- Cross-cloud restore without planning — Snapshots are provider-specific; FS backups help portability but need compatible storage.
- Secrets in object storage without bucket hardening — Equivalent to leaking kube Secrets.
- Relying on Velero alone for Postgres — Combine hooks, logical backup, or managed DB PITR.
How Velero fits your platform stack
- GitOps (Argo CD / Flux) — Manifests in Git; Velero for PVC data and disaster rewind.
- Multi-cluster (EKS/GKE/AKS) — Same BSL pattern per environment; replication for standby region.
- Karpenter / cluster autoscaler — Unrelated to backup, but restores may spike Pending Pods until nodes exist—plan capacity for DR exercises.
- Terraform — Provisions buckets, IAM, and optionally Helm release for Velero; keep backup bucket outside cluster lifecycle—see Terraform IaC for everyone.
Velero is the default open-source answer for Kubernetes backup and migration. The detail that sticks: scope your backups, prove your restores, and treat object storage as part of your security boundary.
Further reading
- Velero documentation — install, backup, restore, plugins
- Velero GitHub repository — releases and issue tracker
- Backup API type — spec and status fields
- CNCF Velero project page
- Kubernetes volumes — PVC/PV model Velero protects