Platform & Kubernetes · 22 May 2026 · Guide · By Babulal Tamang


Kubernetes HPA in Depth: Manifest Anatomy and Scaling Behavior

The Horizontal Pod Autoscaler (HPA) is the built-in Kubernetes object that changes replica count for a workload based on metrics—usually CPU or memory, but also custom and external signals when you wire a metrics pipeline. This guide walks the full autoscaling/v2 manifest field by field, with emphasis on behavior: how scale-up and scale-down are rate-limited, stabilized, and combined so clusters do not flap.

In short

Declare an HPA pointing at a Deployment (or similar) → set minReplicas/maxReplicas and one or more metrics → optional behavior tunes how fast replicas change → the HPA controller in kube-controller-manager reconciles every ~15s and patches spec.replicas. You need resource requests on Pods and metrics-server (or equivalent) for resource metrics; tune behavior when scale-in is too aggressive or scale-out lags traffic spikes.

What the HPA does (and does not)

The HPA answers one question: how many Pods should this workload run right now? It does not change CPU limits, image tags, or node size—that is VPA (vertical) and cluster/node autoscalers (capacity). It scales horizontally by editing the scale subresource of supported targets:

Deployment, StatefulSet, ReplicaSet (via owner)
ReplicationController (legacy)
Custom resources that implement scale subresources (e.g. some operators)

Each reconciliation cycle the controller:

Reads current replica count and metric values from the metrics APIs.
Computes a desired replica count per metric, then typically takes the maximum across metrics (most demanding signal wins).
Applies behavior policies so the allowed change this cycle may be smaller than the raw desired delta.
Patches the target’s spec.replicas (clamped to minReplicas and maxReplicas).
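
The cycle above can be sketched in a few lines of Python. This is a simplified model, not the controller's actual code: it ignores tolerance, missing or unready Pods, and behavior policies, and the function names are hypothetical.

```python
import math

def desired_for_metric(current_replicas, current_value, target_value):
    """Raw recommendation for one metric: ceil(current * value / target)."""
    return math.ceil(current_replicas * current_value / target_value)

def reconcile_once(current_replicas, metrics, min_replicas, max_replicas):
    """metrics is a list of (current_value, target_value) pairs, one per metric."""
    proposals = [desired_for_metric(current_replicas, v, t) for v, t in metrics]
    desired = max(proposals)  # the most demanding metric wins
    return max(min_replicas, min(max_replicas, desired))  # clamp to [min, max]

# 5 replicas; CPU at 90% vs a 70% target, memory at 60% vs an 80% target.
print(reconcile_once(5, [(90, 70), (60, 80)], min_replicas=2, max_replicas=20))  # prints 7
```

CPU alone asks for 7 replicas and memory for 4, so the result is 7.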

For cluster fundamentals, see Kubernetes architecture in simple terms. For requests, limits, and probes that HPA still depends on, see Kubernetes hands-on: day-one practices.

HPA vs VPA vs KEDA vs node autoscalers

Mechanism	What changes	Typical signal	Notes
HPA	Pod replica count	CPU, memory, custom, external metrics	Native; you own the manifest
VPA	Container CPU/memory requests (and sometimes limits)	Historical usage	Often disabled alongside HPA on same workload—conflicting controllers
KEDA	Pod replicas via managed HPA	Queue lag, Prometheus, cloud metrics, cron	See KEDA in depth—do not hand-edit KEDA-owned HPAs
Cluster Autoscaler / Karpenter	Nodes	Pending Pods, utilization	Downstream—HPA adds Pods; something must schedule them (Karpenter in depth)

Prerequisites teams skip

Resource requests on every container in the Pod template. Without requests, CPU utilization metrics are undefined or misleading and the scheduler cannot place scaled-out Pods reliably.
metrics-server (or compatible metrics pipeline) in the cluster for Resource metrics. Verify: kubectl top pods works in the namespace—see metrics-server in depth.
API aggregation for custom.metrics.k8s.io and external.metrics.k8s.io if you use Pod/Object/External metric types (Prometheus Adapter, KEDA adapter, etc.).
RBAC — The HPA controller runs inside kube-controller-manager with system-level credentials, so it needs no extra grants from you; your team needs read access to HPA status and metrics for debugging (cluster RBAC).

Reference manifest: every top-level field

Production clusters should use autoscaling/v2 (stable since Kubernetes 1.23). autoscaling/v2beta2 was removed in Kubernetes 1.26; autoscaling/v1 is still served but supports only a CPU utilization target.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
  namespace: production
  labels:
    app: api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Min
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
status:
  # Populated by controller — see "Status and observability"

`metadata`

The name must be unique among HPAs in its namespace. Labels are optional but help GitOps and dashboards. The HPA must live in the same namespace as the workload it scales, because scaleTargetRef has no namespace field.

`spec.scaleTargetRef`

Points at the scalable resource. Required fields:

apiVersion — e.g. apps/v1
kind — e.g. Deployment
name — must match the workload

The HPA controller resolves the scale subresource. If you reference a Deployment with no Pods yet, the HPA still exists but may report ScalingActive: False until replicas exist.

`spec.minReplicas` and `spec.maxReplicas`

maxReplicas — Required ceiling. Set from capacity planning, cost caps, and downstream limits (DB connections, license seats).
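minReplicas — Optional floor; defaults to 1 if omitted. Use minReplicas: 2 for HA even at low traffic. By default the built-in HPA cannot scale to zero: minReplicas: 0 is accepted only with the alpha HPAScaleToZero feature gate and at least one Object or External metric, so in practice scale-to-zero means KEDA or a manual scale.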

Initial Deployment spec.replicas can differ from HPA bounds; on first sync the HPA will move replicas toward the computed desired value within [min, max].

Metrics: `spec.metrics[]` anatomy

Each entry has a type and a type-specific block. With multiple metrics, the controller derives a desired replica count per metric and uses the maximum, so the most demanding signal always sets capacity. Label selectors on Pods, Object, or External metrics only narrow which series is read; they do not change this rule.

Resource metrics (`type: Resource`)

Most common: the average resource usage across the target workload's Pods, compared against the target. Supported names: cpu, memory.

target.type options:

Utilization — Percentage of the requested resource across Pods (CPU: millicores used / millicores requested; memory: bytes used / bytes requested). Requires requests on containers.
AverageValue — Absolute average per Pod (e.g. 500m CPU or 512Mi memory).

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when avg CPU > 70% of request

Utilization formula (per metric):

desiredReplicas = ceil( currentReplicas × ( currentMetricValue / targetValue ) )

Example: 4 replicas at an average CPU utilization of 140% of requests, against a 70% target → ceil(4 × 140/70) = 8 desired replicas.
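
The same arithmetic as a small Python sketch, starting one step earlier from per-Pod usage and requests. It assumes utilization is aggregated as total usage over total requests, and the Pod numbers are made up for illustration.

```python
import math

def utilization_percent(usage_millicores, request_millicores):
    """Average utilization as total usage over total requests, in percent."""
    return 100 * sum(usage_millicores) / sum(request_millicores)

def desired_replicas(current_replicas, current_utilization, target_utilization):
    return math.ceil(current_replicas * current_utilization / target_utilization)

# Four Pods, each requesting 250m CPU and currently using 350m.
usage = [350, 350, 350, 350]
requests = [250, 250, 250, 250]

current = utilization_percent(usage, requests)
print(current, desired_replicas(len(usage), current, 70))  # prints 140.0 8
```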

Container resource metrics (`type: ContainerResource`)

Same as Resource but scoped to one named container in the Pod template, so sidecar usage does not skew the application's CPU signal.

metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: app
      target:
        type: Utilization
        averageUtilization: 75
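
To see why the container scope matters, the sketch below compares Pod-level and container-level utilization for one hypothetical Pod. The container names and millicore figures are invented for illustration.

```python
def utilization_percent(usage_m, request_m):
    return 100 * usage_m / request_m

# Hypothetical Pod: a busy "app" container and a mostly idle "proxy" sidecar.
pod = {
    "app":   {"usage": 450, "request": 500},
    "proxy": {"usage": 50,  "request": 500},
}

pod_level = utilization_percent(
    sum(c["usage"] for c in pod.values()),
    sum(c["request"] for c in pod.values()),
)
app_only = utilization_percent(pod["app"]["usage"], pod["app"]["request"])

print(pod_level, app_only)  # prints 50.0 90.0
```

Against a 75% target, the Pod-level figure (50%) would not trigger scale-out, while the app container alone (90%) would.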

Pods metrics (`type: Pods`)

Average value of a metric across Pods, from custom.metrics.k8s.io. Example: HTTP requests per second per Pod from Prometheus Adapter.

metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
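
For an AverageValue target, the usual formula reduces to dividing the metric total by the per-Pod target. A minimal sketch, with made-up per-Pod readings and a hypothetical function name:

```python
import math

def desired_from_per_pod_values(per_pod_values, target_average):
    # currentReplicas x (average / target) simplifies to total / target.
    return math.ceil(sum(per_pod_values) / target_average)

rps = [1400, 1250, 1450]  # hypothetical requests per second on three Pods
print(desired_from_per_pod_values(rps, 1000))  # prints 5
```

Three Pods carrying 4100 requests per second in total need five Pods to get back under 1000 each.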

Object metrics (`type: Object`)

A metric describing a single Kubernetes object (often an Ingress or Service), still via custom metrics API. Target can be Value or AverageValue depending on whether the metric is absolute or per-Pod average.

External metrics (`type: External`)

Metrics that come from outside the cluster and are not tied to any Kubernetes object, served through external.metrics.k8s.io: cloud queue depth, global lag, and so on. KEDA’s adapter registers here. Target is usually AverageValue or Value.

metrics:
  - type: External
    external:
      metric:
        name: sqs_approximate_number_of_messages_visible
        selector:
          matchLabels:
            queue: orders
      target:
        type: AverageValue
        averageValue: "30"
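
The two target types lead to different arithmetic. The sketch below contrasts them under the simplifying assumption that every Pod is ready; the queue depths are illustrative.

```python
import math

def desired_for_average_value(metric_total, target_per_pod):
    """AverageValue: spread the total across Pods until each sees <= target."""
    return math.ceil(metric_total / target_per_pod)

def desired_for_value(current_replicas, metric_value, target_value):
    """Value: scale the current replica count by metric / target."""
    return math.ceil(current_replicas * metric_value / target_value)

# 400 visible messages, target of 30 per Pod.
print(desired_for_average_value(metric_total=400, target_per_pod=30))  # prints 14

# Same 400 messages against an absolute target of 300, with 4 replicas running.
print(desired_for_value(current_replicas=4, metric_value=400, target_value=300))  # prints 6
```

AverageValue is independent of the current replica count, which usually suits queue-depth metrics.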

Metric status and tolerance

The controller applies a tolerance so tiny metric jitter does not change replicas: when the ratio of current to target value is within the tolerance of 1.0, the replica count stays unchanged. The default is 10% (0.1), set cluster-wide with the kube-controller-manager flag --horizontal-pod-autoscaler-tolerance. Custom tolerance per metric is not in the stable HPA spec, so behavior stabilization handles most flapping instead.
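
As a rough sketch, the tolerance acts as a dead band around a current-to-target ratio of 1.0. The 0.1 value is the assumed default and the function name is hypothetical.

```python
import math

TOLERANCE = 0.1  # assumed default

def desired_with_tolerance(current_replicas, current_value, target_value,
                           tolerance=TOLERANCE):
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # inside the dead band: leave replicas alone
    return math.ceil(current_replicas * ratio)

print(desired_with_tolerance(10, 75, 70))  # ratio about 1.07, prints 10
print(desired_with_tolerance(10, 80, 70))  # ratio about 1.14, prints 12
```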

`spec.behavior`: scale-up and scale-down in detail

Before the behavior field existed, scaling speed could be tuned only cluster-wide through kube-controller-manager flags, so individual workloads often flapped (rapid up/down) or scaled in aggressively during brief lulls. behavior adds two layers of per-HPA control:

Stabilization window — Remember recent recommendations and pick a conservative value before applying policies.
Policies — Cap how many Pods or what percent of current replicas may change per periodSeconds.

If behavior is omitted, Kubernetes uses these defaults: scale-down stabilization 300s; scale-up stabilization 0s; scale-up allows the larger of 100% or 4 Pods per 15s (selectPolicy: Max); scale-down allows 100% of current replicas per 15s, so only the stabilization window and minReplicas slow scale-in. Always verify defaults for your cluster minor version in the official HPA docs.

Structure: `scaleUp` and `scaleDown`

Each side supports:

stabilizationWindowSeconds — How long (seconds) to look back at past desired replica recommendations. For scale-down, a longer window (e.g. 300) prevents cutting replicas during short dips. For scale-up, 0 means react immediately to spikes (default).
policies[] — Rate limits. Each policy has:
- type: Percent — Max percent of current replicas to add/remove in the period.
- type: Pods — Max absolute Pod count to add/remove in the period.
- value — The number for that type (percent or pods).
- periodSeconds — Rolling window length for that policy.
selectPolicy — When multiple policies apply: Max (allow the largest change; the default, typical for scale-up), Min (allow the smallest change; typical for scale-down), or Disabled (turn off scaling in that direction entirely).
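
A minimal sketch of how the policies and selectPolicy could combine into a per-period limit. It assumes no other scale events fell inside the period, and the rounding (up for Percent scale-up, integer division for Percent scale-down) reflects one reading of the upstream controller rather than a guarantee. Function names are hypothetical.

```python
import math

def scale_up_limit(start_replicas, policies, select_policy="Max"):
    """Highest replica count the scaleUp policies allow within one period."""
    if select_policy == "Disabled":
        return start_replicas  # scaling in this direction is switched off
    proposals = []
    for p in policies:
        if p["type"] == "Pods":
            proposals.append(start_replicas + p["value"])
        else:  # Percent
            proposals.append(math.ceil(start_replicas * (1 + p["value"] / 100)))
    return max(proposals) if select_policy == "Max" else min(proposals)

def scale_down_limit(start_replicas, policies, select_policy="Max"):
    """Lowest replica count the scaleDown policies allow within one period."""
    if select_policy == "Disabled":
        return start_replicas
    proposals = []
    for p in policies:
        if p["type"] == "Pods":
            proposals.append(start_replicas - p["value"])
        else:  # Percent
            proposals.append(start_replicas * (100 - p["value"]) // 100)
    # Max = largest change = lowest floor; Min = smallest change = highest floor.
    return min(proposals) if select_policy == "Max" else max(proposals)

up = [{"type": "Percent", "value": 100}, {"type": "Pods", "value": 4}]
down = [{"type": "Percent", "value": 10}, {"type": "Pods", "value": 2}]

print(scale_up_limit(3, up, "Max"))       # prints 7  (Pods policy wins: 3 + 4)
print(scale_up_limit(10, up, "Max"))      # prints 20 (Percent policy wins: 10 x 2)
print(scale_down_limit(10, down, "Min"))  # prints 9  (10% removes 1, fewer than 2 Pods)
```

The policy values match the reference manifest, which shows why pairing Percent with Pods under Max helps small workloads scale out quickly.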

How the controller applies behavior (mental model)

Raw metric formula → desiredReplicas (per metric, take max)
        ↓
Stabilization window → stabilizedRecommendation
        ↓
Policies (Percent + Pods) → allowed change this sync
        ↓
currentReplicas ± allowed → patch scaleTargetRef (clamp min/max)

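Scale-up example: Current 10 replicas; raw desired 18; a single Pods policy allows at most +4 Pods per 15s → this cycle only goes to 14, not 18. Next cycles can continue up if metrics stay hot. With the reference manifest above (100% or 4 Pods, selectPolicy: Max), the Percent policy would allow up to 20, so the controller could go straight to 18.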

Scale-down example: Raw desired 3; the stabilization window still “remembers” recommendations of 8–10 for 300s → the effective recommendation stays higher until those age out; policies then remove at most 10% or 2 Pods per minute, whichever is smaller under selectPolicy: Min → gradual, safer scale-in.
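
The stabilization step can be modelled as choosing the most conservative recommendation inside the look-back window: the highest for scale-down, the lowest for scale-up. A simplified sketch with a hypothetical history sampled once a minute:

```python
def stabilized(history, now, window_seconds, direction):
    """history: list of (timestamp_seconds, recommended_replicas)."""
    recent = [r for t, r in history if now - window_seconds <= t <= now]
    return max(recent) if direction == "down" else min(recent)

# Load drops after t=60: raw recommendations fall from 10 to 3.
history = [(0, 10), (60, 8), (120, 3), (180, 3),
           (240, 3), (300, 3), (360, 3), (420, 3)]

for now in (240, 360, 420):
    print(now, stabilized(history, now, 300, "down"))
# prints:
# 240 10
# 360 8
# 420 3
```

The recommendation only reaches 3 once every higher value has aged out of the 300s window; policies then pace the actual removal.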

Recommended patterns

Workload	scaleUp	scaleDown
Latency-sensitive API	Fast: low `periodSeconds`, `selectPolicy: Max`, stabilization 0	Slow: `stabilizationWindowSeconds: 300–600`, small Percent/Pods, `selectPolicy: Min`
Batch workers	Moderate caps to avoid overshooting queue drain	Longer stabilization so job batches finish
Stateful consumers	Match broker partition count and max consumer lag	Pair with graceful termination and `preStop`—HPA does not wait for in-flight work

Conservative scale-down manifest

behavior:
  scaleDown:
    stabilizationWindowSeconds: 600
    selectPolicy: Min
    policies:
      - type: Percent
        value: 10
        periodSeconds: 60
      - type: Pods
        value: 1
        periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0
    selectPolicy: Max
    policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 5
        periodSeconds: 30

This pattern keeps capacity during noisy metrics but still allows burst scale-out. Tune from load tests—not guesses.
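
Before load testing, it can help to estimate how long scale-in takes under this manifest. The sketch below steps once per 60s period and applies the two scaleDown policies with selectPolicy: Min. It is an approximation that ignores the controller's 15s sync granularity, and the function name is hypothetical.

```python
def minutes_to_scale_in(start, target, percent=10, pods=1):
    """Count 60s periods needed to go from start to target replicas."""
    replicas, minutes = start, 0
    while replicas > target:
        by_percent = replicas * (100 - percent) // 100
        by_pods = replicas - pods
        floor = max(by_percent, by_pods)  # Min = smallest change wins
        replicas = max(floor, target)
        minutes += 1
    return minutes

print(minutes_to_scale_in(20, 5))  # prints 15
```

Going from 20 to 5 replicas takes about 15 minutes of stepping, on top of the 600s stabilization window, so roughly 25 minutes end to end under these assumptions.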

Status and observability

status is read-only and essential for debugging:

currentReplicas / desiredReplicas — What the controller sees vs wants.
currentMetrics — Last read values per metric.
conditions — e.g. AbleToScale, ScalingActive, ScalingLimited (at max/min).
lastScaleTime — When replica count last changed.

kubectl get hpa -n production
kubectl describe hpa api-gateway -n production
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/production/pods" | head

Events on the HPA object often say failed to get cpu utilization (missing requests or metrics-server) or the HPA was unable to compute the replica count (unknown metric).
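
For scripted checks, the same status can be read from the JSON form of the object (for example the output of kubectl get hpa api-gateway -n production -o json). The sketch below flags conditions that usually need attention; the sample document is a hypothetical, trimmed status, not real cluster output.

```python
import json

def scaling_problems(hpa):
    """Return (type, reason) for conditions that are in an unhealthy state."""
    problems = []
    for cond in hpa.get("status", {}).get("conditions", []):
        ctype, status = cond["type"], cond["status"]
        # AbleToScale and ScalingActive should be True; ScalingLimited should be False.
        healthy = "False" if ctype == "ScalingLimited" else "True"
        if status != healthy:
            problems.append((ctype, cond.get("reason", "")))
    return problems

# Hypothetical sample for illustration only.
sample = json.loads("""
{
  "status": {
    "currentReplicas": 20,
    "desiredReplicas": 20,
    "conditions": [
      {"type": "AbleToScale", "status": "True", "reason": "ReadyForNewScale"},
      {"type": "ScalingActive", "status": "True", "reason": "ValidMetricFound"},
      {"type": "ScalingLimited", "status": "True", "reason": "TooManyReplicas"}
    ]
  }
}
""")

print(scaling_problems(sample))  # prints [('ScalingLimited', 'TooManyReplicas')]
```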

Worked examples

CPU-only Deployment HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65

The Deployment must set container requests, e.g. resources.requests.cpu: 250m. Without them the controller cannot compute utilization, so the HPA reports a metrics error and does not scale on that metric.
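
A quick pre-flight check could scan a Deployment (parsed from kubectl get deployment ... -o json) for containers without the request the HPA depends on. This is a sketch; the Deployment dict and container names are hypothetical.

```python
def containers_missing_request(deployment, resource="cpu"):
    """Names of containers in the Pod template that lack a request for resource."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    missing = []
    for c in containers:
        requests = (c.get("resources") or {}).get("requests") or {}
        if resource not in requests:
            missing.append(c["name"])
    return missing

# Hypothetical, trimmed Deployment for illustration.
deployment = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "checkout-api", "resources": {"requests": {"cpu": "250m"}}},
        {"name": "log-shipper", "resources": {}},
    ]}}}
}

print(containers_missing_request(deployment))  # prints ['log-shipper']
```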

CPU + memory (either can drive scale-out)

With both metrics, whichever implies more replicas wins—useful when memory spikes without CPU (caches, JVM heap).

Prometheus custom metric (Pods type)

Requires Prometheus Adapter (or equivalent) registering http_requests_per_second. Validate metric exists before applying HPA:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .

autoscaling/v1 vs v2

Feature	v1	v2
CPU target	`targetCPUUtilizationPercentage`	`metrics[]` Resource
Memory / custom / external	No	Yes
`behavior`	No	Yes
Container-scoped CPU	No	`ContainerResource`

Migrate by converting to autoscaling/v2 and moving CPU into metrics; add behavior when replacing informal “wait 5 minutes before scale down” runbooks.
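
The field move can be sketched as a small transformation over the parsed manifest. This covers only the CPU target described above; it is an illustration with a hypothetical function name, not a complete migration tool.

```python
import copy

def v1_to_v2(hpa_v1):
    """Convert an autoscaling/v1 HPA dict into an autoscaling/v2 one."""
    hpa_v2 = copy.deepcopy(hpa_v1)
    hpa_v2["apiVersion"] = "autoscaling/v2"
    spec = hpa_v2["spec"]
    cpu_target = spec.pop("targetCPUUtilizationPercentage", None)
    if cpu_target is not None:
        spec["metrics"] = [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": cpu_target},
            },
        }]
    return hpa_v2

old = {
    "apiVersion": "autoscaling/v1",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "checkout-api"},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment",
                           "name": "checkout-api"},
        "minReplicas": 3,
        "maxReplicas": 50,
        "targetCPUUtilizationPercentage": 65,
    },
}

new = v1_to_v2(old)
print(new["apiVersion"], new["spec"]["metrics"][0]["resource"]["target"])
# prints autoscaling/v2 {'type': 'Utilization', 'averageUtilization': 65}
```

The input dict is left untouched, so the old and new manifests can be diffed before committing.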

Interaction with rollouts, PDBs, and quotas

Rolling updates — HPA may change replicas during a rollout; combine with sensible maxSurge/maxUnavailable so you do not violate availability.
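PodDisruptionBudgets — HPA scale-in does not go through the eviction API, so PDBs do not limit it; PDBs govern voluntary evictions such as node drains. Aggressive scale-down combined with a tight PDB can still leave too little headroom and slow node drains elsewhere.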
ResourceQuota — Scale-out stalls when namespace quota rejects new Pods: the ReplicaSet logs FailedCreate events and replicas stay below the desired count. It looks like “HPA stuck at N” but the root cause is quota.
Cluster capacity — Desired 30 replicas with only 20 schedulable slots → Pending Pods; fix nodes (Karpenter) or quotas, not HPA YAML alone.

Production checklist

Check	Why
Requests (and limits) set per container	Utilization math and scheduling
`maxReplicas` aligned with dependency limits	DB, cache, API keys
`minReplicas` for HA and cold-start budget	Avoid single-Pod failure domain
`behavior.scaleDown` tuned for your traffic shape	Prevents customer-facing scale-in shocks
Load test at expected RPS with HPA enabled	Validates targets and stabilization
Alerts on `ScalingLimited` at max	Capacity or mis-set threshold
Graceful shutdown for scale-in	`terminationGracePeriodSeconds`, `preStop`
HPA in GitOps; no manual `kubectl autoscale` drift	See GitOps principles

Troubleshooting playbook

ScalingActive: False — Missing metrics-server; no Pods; wrong scaleTargetRef; metric name typo.
unable to get metrics — Install/fix metrics-server; check APIService; for custom metrics verify adapter pods.
Replicas never exceed min — Utilization below target; wrong container requests (too high → never “looks busy”); metric tolerance band.
Replicas pinned at max — Target too low; traffic sustained; need higher maxReplicas or optimize app.
Flapping — Increase scale-down stabilization; tighten scale-down policies; review CPU throttling vs requests.
Slow scale-out — Reduce scale-up periodSeconds; increase Percent/Pods policy; check image pull and scheduling latency.
Pending after scale-out — Check nodes, affinity, and quotas; see the troubleshooting playbook.

Common pitfalls

HPA on Deployments without CPU requests — #1 production mistake.
Using limits instead of requests for “sizing” — HPA utilization is vs requests, not limits.
Sidecar CPU drowning app signal — Use ContainerResource on the app container.
Same workload with VPA and HPA — Fighting controllers; pick one or use VPA “Off” recommendation mode only.
Editing KEDA-managed HPAs — Changes reconcile away; edit ScaledObject.
No behavior on bursty APIs — Scale-down defaults may be too eager for your SLO.
maxReplicas = 100 without load test — Cost and thundering herd on dependencies.
