Kubernetes HPA in Depth: Manifest Anatomy and Scaling Behavior
The Horizontal Pod Autoscaler (HPA) is the built-in Kubernetes object that changes replica count for a workload based on metrics—usually CPU or memory, but also custom and external signals when you wire a metrics pipeline. This guide walks the full autoscaling/v2 manifest field by field, with emphasis on behavior: how scale-up and scale-down are rate-limited, stabilized, and combined so clusters do not flap.
In short
Declare an HPA pointing at a Deployment (or similar) → set minReplicas/maxReplicas and one or more metrics → optional behavior tunes how fast replicas change → the HPA controller in kube-controller-manager reconciles every ~15s and patches spec.replicas. You need resource requests on Pods and metrics-server (or equivalent) for resource metrics; tune behavior when scale-in is too aggressive or scale-out lags traffic spikes.
What the HPA does (and does not)
The HPA answers one question: how many Pods should this workload run right now? It does not change CPU limits, image tags, or node size—that is VPA (vertical) and cluster/node autoscalers (capacity). It scales horizontally by editing the scale subresource of supported targets:
- Deployment, StatefulSet, ReplicaSet (via owner)
- ReplicationController (legacy)
- Custom resources that implement the scale subresource (e.g. some operators)
Each reconciliation cycle the controller:
- Reads current replica count and metric values from the metrics APIs.
- Computes a desired replica count per metric, then typically takes the maximum across metrics (most demanding signal wins).
- Applies behavior policies so the allowed change this cycle may be smaller than the raw desired delta.
- Patches the target’s spec.replicas (clamped to minReplicas and maxReplicas).
For cluster fundamentals, see Kubernetes architecture in simple terms. For requests, limits, and probes that HPA still depends on, see Kubernetes hands-on: day-one practices.
HPA vs VPA vs KEDA vs node autoscalers
| Mechanism | What changes | Typical signal | Notes |
|---|---|---|---|
| HPA | Pod replica count | CPU, memory, custom, external metrics | Native; you own the manifest |
| VPA | Container CPU/memory requests (and sometimes limits) | Historical usage | Often disabled alongside HPA on same workload—conflicting controllers |
| KEDA | Pod replicas via managed HPA | Queue lag, Prometheus, cloud metrics, cron | See KEDA in depth—do not hand-edit KEDA-owned HPAs |
| Cluster Autoscaler / Karpenter | Nodes | Pending Pods, utilization | Downstream—HPA adds Pods; something must schedule them (Karpenter in depth) |
Prerequisites teams skip
- Resource requests on every container in the Pod template. Without requests, CPU utilization metrics are undefined or misleading and the scheduler cannot place scaled-out Pods reliably.
- metrics-server (or compatible metrics pipeline) in the cluster for Resource metrics. Verify that kubectl top pods works in the namespace; see metrics-server in depth.
- API aggregation for custom.metrics.k8s.io and external.metrics.k8s.io if you use Pods/Object/External metric types (Prometheus Adapter, KEDA adapter, etc.).
- RBAC: the HPA controller runs with system credentials; your team needs read access to HPA status and metrics for debugging (cluster RBAC).
Reference manifest: every top-level field
Production clusters should use autoscaling/v2 (stable since Kubernetes 1.23). autoscaling/v2beta2 is no longer served as of Kubernetes 1.26; autoscaling/v1 supports only a CPU utilization target.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
  namespace: production
  labels:
    app: api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Min
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
status:
  # Populated by the controller; see "Status and observability"
metadata
The name must be unique among HPAs in its namespace. Labels are optional but help GitOps and dashboards. The HPA must live in the same namespace as the workload it scales, because scaleTargetRef has no namespace field.
spec.scaleTargetRef
Points at the scalable resource. Required fields:
- apiVersion: e.g. apps/v1
- kind: e.g. Deployment
- name: must match the workload
The HPA controller resolves the scale subresource. If you reference a Deployment with no Pods yet, the HPA still exists but may report ScalingActive: False until replicas exist.
spec.minReplicas and spec.maxReplicas
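- maxReplicas: Required ceiling. Set from capacity planning, cost caps, and downstream limits (DB connections, license seats).
- minReplicas: Optional floor; defaults to 1 if omitted. Use minReplicas: 2 for HA even at low traffic. The built-in HPA does not scale to zero by default (minReplicas: 0 is only accepted behind the alpha HPAScaleToZero feature gate, with an Object or External metric); otherwise that requires KEDA or manual scaling.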
Initial Deployment spec.replicas can differ from HPA bounds; on first sync the HPA will move replicas toward the computed desired value within [min, max].
Metrics: spec.metrics[] anatomy
Each entry has type and a type-specific block. With multiple metrics, the controller derives a desired replica count per metric and uses the maximum, so the most demanding signal always wins.
Resource metrics (type: Resource)
Most common: average resource usage across the target's Pods compared with the target. Supported names: cpu, memory.
target.type options:
- Utilization: Percentage of the requested resource across Pods (CPU: millicores used / millicores requested; memory: bytes used / bytes requested). Requires requests on containers.
- AverageValue: Absolute average per Pod (e.g. 500m CPU or 512Mi memory).
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # scale out when avg CPU > 70% of request
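For comparison, the same kind of metric block can target an absolute per-Pod average instead of a percentage of requests. The fragment below is a minimal sketch of the AverageValue option described above; the 512Mi figure is an illustrative assumption, not a recommendation.

```yaml
metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 512Mi   # assumed value; scale out when average memory per Pod exceeds this
```

Because AverageValue compares raw usage with a fixed quantity, the math itself does not depend on requests, although requests still matter for scheduling the extra Pods.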
Utilization formula (per metric):
desiredReplicas = ceil( currentReplicas × ( currentMetricValue / targetValue ) )
Example: 4 replicas averaging 140% CPU utilization (relative to requests) against a 70% target → ceil(4 × 140/70) = 8 desired replicas.
Container resource metrics (type: ContainerResource)
Same as Resource but scoped to one container in the Pod template (sidecars excluded from skewing app CPU).
metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: app
      target:
        type: Utilization
        averageUtilization: 75
Pods metrics (type: Pods)
Average value of a metric across Pods, from custom.metrics.k8s.io. Example: HTTP requests per second per Pod from Prometheus Adapter.
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
Object metrics (type: Object)
A metric describing a single Kubernetes object (often an Ingress or Service), still via custom metrics API. Target can be Value or AverageValue depending on whether the metric is absolute or per-Pod average.
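A minimal sketch of an Object metric entry, assuming an adapter already exposes a per-Ingress request rate. The metric name, the Ingress name, and the threshold are hypothetical and only illustrate the shape of the block.

```yaml
metrics:
  - type: Object
    object:
      metric:
        name: requests_per_second        # hypothetical metric name exposed by your adapter
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: api-gateway                # hypothetical Ingress name
      target:
        type: Value
        value: "2k"                      # assumed threshold for the whole Ingress
```

With type: Value the metric is compared with the target as a whole; switching to AverageValue divides the metric by the current Pod count before comparing.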
External metrics (type: External)
Metrics about systems outside the cluster, served by external.metrics.k8s.io: cloud queue depth, global lag, etc. KEDA’s adapter registers here. Target is usually AverageValue or Value.
metrics:
  - type: External
    external:
      metric:
        name: sqs_approximate_number_of_messages_visible
        selector:
          matchLabels:
            queue: orders
      target:
        type: AverageValue
        averageValue: "30"
Metric status and tolerance
The controller applies a tolerance (10% by default, set cluster-wide with the kube-controller-manager flag --horizontal-pod-autoscaler-tolerance) so tiny metric jitter does not change replicas: while currentMetricValue / targetValue stays between 0.9 and 1.1, the replica count is left unchanged. A per-HPA tolerance is not part of the stable HPA spec, so behavior stabilization handles most flapping instead.
spec.behavior: scale-up and scale-down in detail
Before the behavior field existed, scaling speed could only be tuned cluster-wide through kube-controller-manager flags, which often left individual workloads with flapping (rapid up/down) or aggressive scale-in during brief lulls. behavior adds two per-HPA layers of control:
- Stabilization window: Remember recent recommendations and pick a conservative value before applying policies.
- Policies: Cap how many Pods or what percent of current replicas may change per periodSeconds.
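If behavior is omitted, Kubernetes applies built-in defaults: scale-down stabilization of 300s (or the value of the controller's --horizontal-pod-autoscaler-downscale-stabilization flag); scale-up stabilization of 0s; scale-up limited to the larger of 100% or 4 Pods per 15s (selectPolicy: Max); scale-down limited only by a single 100% per 15s policy, so the stabilization window is the only real brake on scale-in. Always verify defaults for your cluster minor version in the official HPA docs.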
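Written out explicitly, the defaults documented upstream correspond roughly to the block below. Treat it as a reference sketch, assuming an unmodified kube-controller-manager, and check it against the documentation for your cluster version.

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # from the controller's downscale-stabilization setting
    policies:
      - type: Percent
        value: 100                    # all surplus replicas may go within one 15s period
        periodSeconds: 15
  scaleUp:
    stabilizationWindowSeconds: 0
    selectPolicy: Max
    policies:
      - type: Percent
        value: 100                    # at most double the replica count per 15s
        periodSeconds: 15
      - type: Pods
        value: 4                      # or add 4 Pods per 15s, whichever is larger
        periodSeconds: 15
```

Starting from this block and changing one value at a time makes it easier to attribute a change in scaling behavior to a specific setting.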
Structure: scaleUp and scaleDown
Each side supports:
- stabilizationWindowSeconds: How long (seconds) to look back at past desired replica recommendations. For scale-down, a longer window (e.g. 300) prevents cutting replicas during short dips. For scale-up, 0 means react immediately to spikes (default).
- policies[]: Rate limits. Each policy has:
  - type: Percent caps the change as a percentage of current replicas in the period; Pods caps it as an absolute Pod count.
  - value: The number for that type (percent or Pods).
  - periodSeconds: Rolling window length for that policy.
- selectPolicy: When multiple policies apply: Max (allow the largest change, typical for scale-up), Min (allow the smallest change, typical for scale-down), or Disabled (turn off scaling in that direction entirely).
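As an illustration of selectPolicy: Disabled, the sketch below pins capacity by switching off scale-down while leaving scale-up on its defaults. Using it during a launch or an incident is an assumed use case, not something the HPA requires.

```yaml
behavior:
  scaleDown:
    selectPolicy: Disabled   # the HPA never removes replicas; scale-up is unaffected
```

Remember to remove the setting afterwards, otherwise the workload stays at its peak replica count indefinitely.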
How the controller applies behavior (mental model)
Raw metric formula → desiredReplicas (per metric, take max)
↓
Stabilization window → stabilizedRecommendation
↓
Policies (Percent + Pods) → allowed change this sync
↓
currentReplicas ± allowed → patch scaleTargetRef (clamp min/max)
Scale-up example: Current 10 replicas; raw desired 18; the only scale-up policy allows +4 Pods per 15s → the controller goes to 14 in this period, not 18. Later periods can continue up if metrics stay hot.
Scale-down example: Raw desired 3; the stabilization window still “remembers” recommendations of 8–10 from the last 300s → the effective recommendation stays at the highest of them; policies with selectPolicy: Min then remove at most 10% or 2 Pods per minute, whichever is smaller → gradual, safer scale-in.
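To see how selectPolicy combines the two policy types, the annotated fragment below reuses the policy values from the reference manifest and walks through a few replica counts in comments. The replica counts are illustrative assumptions, and the arithmetic is a simplified reading of the documented rules (it ignores changes already made earlier in the same period).

```yaml
behavior:
  scaleUp:
    selectPolicy: Max          # pick whichever policy permits the larger increase
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15      # at 3 replicas: +3;  at 10 replicas: +10
      - type: Pods
        value: 4
        periodSeconds: 15      # at 3 replicas: +4;  at 10 replicas: +4
    # Max => from 3 replicas up to 7 within 15s; from 10 replicas up to 20
  scaleDown:
    selectPolicy: Min          # pick whichever policy permits the smaller decrease
    policies:
      - type: Percent
        value: 10
        periodSeconds: 60      # at 50 replicas: -5;  at 10 replicas: -1
      - type: Pods
        value: 2
        periodSeconds: 60      # at 50 replicas: -2;  at 10 replicas: -2
    # Min => from 50 replicas down to 48 within 60s; from 10 replicas down to 9
```

In every case the result is still clamped to minReplicas and maxReplicas, and it never moves past the stabilized recommendation.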
Recommended patterns
| Workload | scaleUp | scaleDown |
|---|---|---|
| Latency-sensitive API | Fast: low periodSeconds, selectPolicy: Max, stabilization 0 | Slow: stabilizationWindowSeconds: 300–600, small Percent/Pods, selectPolicy: Min |
| Batch workers | Moderate caps to avoid overshooting queue drain | Longer stabilization so job batches finish |
| Stateful consumers | Match broker partition count and max consumer lag | Pair with graceful termination and preStop—HPA does not wait for in-flight work |
Conservative scale-down manifest
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600
    selectPolicy: Min
    policies:
      - type: Percent
        value: 10
        periodSeconds: 60
      - type: Pods
        value: 1
        periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0
    selectPolicy: Max
    policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 5
        periodSeconds: 30
This pattern keeps capacity during noisy metrics but still allows burst scale-out. Tune from load tests—not guesses.
Status and observability
status is read-only and essential for debugging:
- currentReplicas / desiredReplicas: What the controller sees vs wants.
- currentMetrics: Last read values per metric.
- conditions: e.g. AbleToScale, ScalingActive, ScalingLimited (at max/min).
- lastScaleTime: When replica count last changed.
kubectl get hpa -n production
kubectl describe hpa api-gateway -n production
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/production/pods" | head
Events on the HPA object often say failed to get cpu utilization (missing requests or metrics-server) or the HPA was unable to compute the replica count (unknown metric).
Worked examples
CPU-only Deployment HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
Deployment must set container requests, e.g. resources.requests.cpu: 250m. Without them, HPA stays idle or mis-scales.
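A matching Deployment could look roughly like the sketch below. Only the 250m CPU request comes from the text above; the container name, image, labels, and memory figures are placeholders chosen for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  selector:
    matchLabels:
      app: checkout-api                      # assumed label
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: app                          # hypothetical container name
          image: registry.example.com/checkout-api:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 250m                      # HPA utilization is measured against this
              memory: 256Mi                  # assumed value
            limits:
              memory: 512Mi                  # assumed value
```

With a 250m request and a 65% target, the HPA starts adding replicas once average usage per Pod rises meaningfully above about 162m, subject to the tolerance band.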
CPU + memory (either can drive scale-out)
With both metrics, whichever implies more replicas wins—useful when memory spikes without CPU (caches, JVM heap).
Prometheus custom metric (Pods type)
Requires Prometheus Adapter (or equivalent) registering http_requests_per_second. Validate metric exists before applying HPA:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .
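If the query above returns nothing, the adapter probably has no rule producing that metric. The following is a sketch of a Prometheus Adapter rule, assuming the application exports a counter named http_requests_total carrying namespace and pod labels; the series name, label names, and the 2m rate window are all assumptions to adapt to your setup.

```yaml
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # assumed counter and labels
    resources:
      overrides:
        namespace: {resource: "namespace"}   # map the Prometheus label to the Kubernetes resource
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"                  # exposes http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The rules list lives in the adapter's own configuration file (commonly mounted from a ConfigMap), not in the HPA manifest.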
autoscaling/v1 vs v2
| Feature | v1 | v2 |
|---|---|---|
| CPU target | targetCPUUtilizationPercentage | metrics[] Resource |
| Memory / custom / external | No | Yes |
| behavior | No | Yes |
| Container-scoped CPU | No | ContainerResource |
Migrate by converting to autoscaling/v2 and moving CPU into metrics; add behavior when replacing informal “wait 5 minutes before scale down” runbooks.
Interaction with rollouts, PDBs, and quotas
- Rolling updates: HPA may change replicas during a rollout; combine with sensible maxSurge/maxUnavailable so you do not violate availability.
- PodDisruptionBudgets: HPA scale-in deletes Pods through the workload controller rather than the eviction API, so PDBs do not hold it back. Aggressive scale-down plus a tight PDB can still slow node drains elsewhere, because fewer healthy replicas leave less disruption budget.
- ResourceQuota: If namespace quota blocks new Pods, scale-out stalls. It looks like “HPA stuck at N”, but the root cause is quota; the rejected Pods are never created, so check ReplicaSet events rather than looking for Pending Pods.
- Cluster capacity: Desired 30 replicas with only 20 schedulable slots → Pending Pods; fix nodes (Karpenter) or quotas, not HPA YAML alone.
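To make the PodDisruptionBudget point concrete, here is a minimal sketch of a PDB sized against the reference manifest's minReplicas: 2. The Pod label and the minAvailable value are assumptions.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway
  namespace: production
spec:
  minAvailable: 1            # assumed; keep below the HPA's minReplicas (2 in the reference manifest)
  selector:
    matchLabels:
      app: api-gateway       # assumed Pod label
```

If minAvailable equalled minReplicas, a workload scaled in to its floor would have no disruption budget left, and node drains would wait until the HPA scaled out again.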
Production checklist
| Check | Why |
|---|---|
| Requests (and limits) set per container | Utilization math and scheduling |
| maxReplicas aligned with dependency limits | DB, cache, API keys |
| minReplicas for HA and cold-start budget | Avoid single-Pod failure domain |
| behavior.scaleDown tuned for your traffic shape | Prevents customer-facing scale-in shocks |
| Load test at expected RPS with HPA enabled | Validates targets and stabilization |
| Alerts on ScalingLimited at max | Capacity or mis-set threshold |
| Graceful shutdown for scale-in | terminationGracePeriodSeconds, preStop |
| HPA in GitOps; no manual kubectl autoscale drift | See GitOps principles |
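For the graceful-shutdown row, a Pod template fragment along these lines is a common starting point. The durations and the container name are assumptions, and the sleep command presumes the image ships a shell.

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60      # assumed; must exceed preStop plus in-flight drain time
      containers:
        - name: app                          # hypothetical container name
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]   # assumed delay so endpoints update before SIGTERM
```

The preStop delay gives load balancers time to stop routing to the Pod before the application receives SIGTERM during scale-in.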
Troubleshooting playbook
- ScalingActive is False: Missing metrics-server; no Pods; wrong scaleTargetRef; metric name typo.
- unable to get metrics: Install/fix metrics-server; check APIService; for custom metrics verify adapter pods.
- Replicas never exceed min: Utilization below target; wrong container requests (too high → never “looks busy”); metric tolerance band.
- Replicas pinned at max: Target too low; traffic sustained; need higher maxReplicas or optimize app.
- Flapping: Increase scale-down stabilization; tighten scale-down policies; review CPU throttling vs requests.
- Slow scale-out: Reduce scale-up periodSeconds; increase Percent/Pods policy; check image pull and scheduling latency.
- Pending after scale-out: Nodes, affinity, quotas; see the troubleshooting playbook.
Common pitfalls
- HPA on Deployments without CPU requests: #1 production mistake.
- Using limits instead of requests for “sizing”: HPA utilization is vs requests, not limits.
- Sidecar CPU drowning app signal: Use ContainerResource on the app container.
- Same workload with VPA and HPA: Fighting controllers; pick one or use VPA “Off” recommendation mode only.
- Editing KEDA-managed HPAs: Changes reconcile away; edit ScaledObject.
- No behavior on bursty APIs: Scale-down defaults may be too eager for your SLO.
- maxReplicas = 100 without load test: Cost and thundering herd on dependencies.
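For the VPA pitfall, recommendation-only mode can be sketched as follows, assuming the VPA custom resource definitions are installed in the cluster. The object and target names are illustrative and reuse the checkout-api example.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  updatePolicy:
    updateMode: "Off"      # compute recommendations only; never evict or resize Pods
```

In this mode the VPA only publishes suggested requests in its status, so the HPA remains the sole controller changing the workload.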
Further reading
- Kubernetes HPA documentation — algorithms, default behavior, API reference
- HPA v2 API reference — field-level spec
- KEDA in depth — external metrics and event-driven scaling on top of HPA
- Karpenter in depth — node capacity behind HPA scale-out