Karpenter in Depth: Node Provisioning Built Around the Scheduler

Karpenter is an open-source Kubernetes node autoscaler that reacts to unschedulable Pods within seconds. Instead of resizing fixed Auto Scaling Groups, it watches Pending Pods, simulates what the scheduler needs, and launches right-sized instances that match those constraints. It is the tool many EKS and multi-cloud teams adopt when Cluster Autoscaler’s “one instance type per node group” model leaves money on the table and Pods stuck in Pending for minutes.

In short

Karpenter extends the scheduling loop: unschedulable Pod → provisioner picks instance type + zone → cloud API launches → node joins → Pod binds. You declare capacity with NodePools and cloud settings with provider classes (e.g. EC2NodeClass on AWS). Consolidation removes or replaces underused nodes. It complements KEDA (workload scale) and replaces much of what teams used Cluster Autoscaler for on elastic node fleets.

The problem Karpenter solves

Kubernetes schedules Pods onto nodes that already exist. When no node has enough CPU, memory, GPUs, or matching labels/taints, the Pod stays Pending with events like 0/12 nodes are available: 3 Insufficient memory, 2 node(s) didn't match Pod's node affinity. Something must add capacity.

For years that “something” was often the Cluster Autoscaler (CA): it watches Pending Pods and scales predefined node groups (ASGs on AWS, MIGs on GCP, VMSS on Azure). CA is battle-tested, but its model has friction:

  • Homogeneous groups — Each node group is usually one instance family and size, so CA adds a node of that fixed shape even when a single Pod only needs 2 vCPU.
  • Slow scale-out — ASG launch + bootstrap + kubelet register can take several minutes; burst traffic waits.
  • Operational overhead — Teams maintain many node groups (on-demand, spot, GPU, arm64) and still hit mismatches when Pod specs change.
  • Weak bin-packing on scale-down — CA evicts and drains; it does not re-architect the fleet around workload shape the way a scheduler-aware provisioner can.

Karpenter, originally built by AWS and now maintained under Kubernetes SIG Autoscaling, flips the design: treat node provisioning as a scheduling decision first and an infrastructure API call second. The controller asks, “What machine would make this Pod schedulable?” and then creates that machine.

Karpenter vs Cluster Autoscaler vs KEDA

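These tools work at different layers (Karpenter and Cluster Autoscaler are alternatives for scaling nodes; KEDA and HPA scale workloads); confusing them causes architecture mistakes.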

| Tool | Scales | Trigger | Typical owner |
| --- | --- | --- | --- |
| Karpenter | Nodes (machines) | Unschedulable Pods; consolidation policy on idle capacity | Platform / cluster team |
| Cluster Autoscaler | Node groups (pools of similar nodes) | Pending Pods + utilization thresholds per group | Platform team (legacy or regulated environments) |
| KEDA | Workload replicas (Deployments, Jobs, etc.) | External metrics (queue depth, Prometheus, cloud monitors) | Application / service team |
| HPA | Replicas from CPU/memory/custom metrics | Resource or custom metrics API | Application team |

Common pattern: KEDA or HPA scales Pods up → cluster needs more nodes → Karpenter provisions them. Scale down: HPA/KEDA shrinks Pods → Karpenter consolidates empty or fragmented nodes. For architecture context, see Kubernetes architecture in simple terms and the troubleshooting playbook when Pods stay Pending after nodes exist.

Architecture: control plane in the cluster, feet in the cloud

Karpenter runs as one or more controller Pods (typically in the karpenter namespace) with credentials for your cloud provider. It does not replace the Kubernetes scheduler; it feeds it new Nodes.

  1. Watch — Informer caches Pods, Nodes, NodePools, NodeClaims, and provider-specific classes.
  2. Detect unschedulable Pods — Pods the default scheduler cannot place (often with condition PodScheduled=False, reason Unschedulable).
  3. Simulate — Karpenter’s scheduling logic (conceptually aligned with kube-scheduler constraints) evaluates which NodePool could host the Pod and which instance types satisfy requests, limits, affinity, topology spread, and taints/tolerations.
  4. Provision — Creates a NodeClaim (desired node) and calls the cloud API (on AWS, the EC2 Fleet API via CreateFleet) with a launch template or equivalent derived from the EC2NodeClass.
  5. Register — New instance joins the cluster (EKS: via aws-auth / access entries or node IAM role; userData/bootstrap starts kubelet). Node becomes Ready.
  6. Bind — Default scheduler places Pending Pods onto the new Node.

Latency-sensitive teams often see nodes become Ready in under a minute on AWS when AMIs, subnets, and IAM are already in place, which is far faster than adding capacity to a large ASG that runs a full launch template pipeline on every burst.

Core APIs (Karpenter v1)

Karpenter v1 stabilized the CRDs around NodePool, NodeClaim, and provider-specific node classes. Older clusters may still reference Provisioner and Machine (v1alpha5) or the v1beta1 versions of NodePool and NodeClaim; new installs should use v1.

NodePool — who may be provisioned and how

A NodePool is the main policy object: template for nodes Karpenter may create, plus limits and disruption behavior.

  • Template spec — nodeClassRef (points to EC2NodeClass), requirements (instance types, zones, arch, capacity type spot/on-demand), labels, annotations, taints.
  • Limits — cpu, memory, or nodes caps so one noisy namespace cannot bankrupt the account.
  • Disruption — consolidationPolicy (WhenEmpty, WhenEmptyOrUnderutilized), consolidateAfter delay, budgets for max concurrent node disruption.
  • Weight — When multiple NodePools match, higher weight wins (useful for preferring spot vs on-demand fallbacks).

EC2NodeClass — AWS substrate

On EKS, EC2NodeClass binds Kubernetes to AWS: AMI selection (AL2023, Bottlerocket), subnet selector, security group selector, instance profile / IAM role, metadata options, block devices, and optional userData. Karpenter resolves selectors against tags (e.g. karpenter.sh/discovery: my-cluster) so you do not hard-code subnet IDs in ten places.

NodeClaim — one requested node

Each provisioning action materializes as a NodeClaim (name often like default-abc12). It tracks the lifecycle: created → launched → registered → ready → terminating. When debugging scale-out, run kubectl describe nodeclaim alongside the Karpenter controller logs.

Minimal EKS example

Illustrative manifests—adjust ARNs, discovery tags, and cluster name for your account.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  metadataOptions:
    httpEndpoint: enabled
    httpTokens: required
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.large", "m6i.xlarge", "m7i.large", "c6i.large"]
      expireAfter: 720h
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - nodes: "10%"

A Deployment whose Pods are unschedulable and carry accurate resource requests triggers provisioning. Without requests, Karpenter and the scheduler treat the Pod as needing essentially no CPU or memory, so nodes end up undersized; see day-one best practices.
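As a rough illustration of a workload that would trigger provisioning against the NodePool above, here is a minimal sketch of a Deployment with explicit requests. The name inflate, the replica count, and the request sizes are placeholders chosen for the example, not values from any real cluster.

```yaml
# Hypothetical test workload: name, replicas, and sizes are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 5
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            # Requests (not limits) drive the sizing math.
            requests:
              cpu: "1"
              memory: 1Gi
```

If existing nodes cannot fit five Pods of 1 vCPU and 1 GiB each, the Pods go Pending and Karpenter should create one or more NodeClaims sized for them.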

How instance type selection works

Karpenter evaluates a flexible set of instance types allowed in the NodePool (or inferred from Pod constraints). It filters and ranks options by:

  • Schedulability — Can all pending Pods that triggered provisioning fit, including init containers, overhead, and daemonsets?
  • Constraints — nodeSelector, affinity, anti-affinity, topology spread, resource requests, PVC topology (zone), GPU requests.
  • Price and priority — Among valid options, prefer cheaper types; when both capacity types are allowed, prefer spot and fall back to on-demand if spot capacity is unavailable.
  • Bin-packing — Often picks a size that minimizes waste but avoids fragmentation that blocks later Pods.

You can narrow or widen the search space:

  • Explicit allow-list — node.kubernetes.io/instance-type In [...] for compliance or reserved capacity programs.
  • Category rules — karpenter.k8s.aws/instance-category In [c,m,r] and karpenter.k8s.aws/instance-generation Gt 5 instead of enumerating every SKU.
  • Architecture — kubernetes.io/arch In [arm64] for Graviton fleets.
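The category and architecture rules above could be written roughly as follows. This is a fragment of a NodePool (the requirements list under spec.template.spec), combining the three rules into one pool purely for illustration; in practice you might split them across pools.

```yaml
# Fragment only: goes under spec.template.spec in a NodePool.
requirements:
  # Compute-, general-, and memory-optimized families.
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["c", "m", "r"]
  # Generations newer than 5 (Gt takes a single integer as a string).
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values: ["5"]
  # Graviton only; container images must be built for arm64.
  - key: kubernetes.io/arch
    operator: In
    values: ["arm64"]
```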

This is where FinOps meets operations: the same cluster can run mostly spot and right-sized nodes without maintaining separate ASGs per size. Pair with tagging and cost allocation from FinOps in plain English so savings show up in reports.

Consolidation: scale-in that understands utilization

Provisioning is half the story. Karpenter consolidation continuously tries to reduce fleet cost by:

  • Removing empty nodes — After consolidateAfter, delete nodes with no non-DaemonSet Pods.
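  • Replacing underutilized nodes — Move Pods onto fewer or cheaper nodes (single-node or multi-node consolidation), respecting PodDisruptionBudgets and NodePool disruption budgets. Drift, which replaces nodes whose configuration no longer matches their NodePool or EC2NodeClass, is a separate disruption mechanism.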

Disruption budgets (spec.disruption.budgets) cap how many nodes may be disrupted at once—critical during business hours. Tune consolidateAfter: too aggressive causes churn; too slow leaves expensive air in the cluster.
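A sketch of how budgets might protect business hours is shown below. It is a fragment of a NodePool's spec.disruption block; the 09:00 start, the 8-hour window, and the 5m consolidateAfter are assumed values for illustration, and the schedule is interpreted in UTC.

```yaml
# Fragment only: goes under spec in a NodePool.
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 5m
  budgets:
    # Baseline: at most 10% of this pool's nodes disrupted at once.
    - nodes: "10%"
    # Assumed business hours: no voluntary disruption for 8 hours
    # starting 09:00 UTC, Monday to Friday.
    - nodes: "0"
      schedule: "0 9 * * mon-fri"
      duration: 8h
```

When several budgets are active at the same time, the most restrictive one applies, so the zero-node budget wins during its window.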

PodDisruptionBudgets (PDBs) still matter: consolidation respects minAvailable / maxUnavailable. Missing PDBs on stateful services is a common source of “why did Karpenter restart my node during consolidation?” incidents.
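For reference, a minimal PDB for a stateful service might look like this. The name orders-db and the minAvailable value are hypothetical; pick numbers that match your replica count and quorum needs.

```yaml
# Hypothetical service: name, label, and minAvailable are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-db-pdb
spec:
  # Voluntary evictions (including consolidation drains) are blocked
  # if they would leave fewer than 2 matching Pods available.
  minAvailable: 2
  selector:
    matchLabels:
      app: orders-db
```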

Scheduling integration details operators should know

Resource requests are non-negotiable

Karpenter uses Pod resources.requests (not limits) for CPU/memory math. Pods with no requests may schedule on small nodes in practice but behave badly under contention and break autoscaling signals. Set requests per container; use LimitRanges or policy (OPA/Kyverno) in production.
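One way to backstop missing requests is a LimitRange that injects defaults per namespace. The sketch below assumes a namespace called team-a and arbitrary default sizes; it is a safety net, not a substitute for measuring real usage.

```yaml
# Hypothetical namespace and sizes, for illustration only.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests
  namespace: team-a
spec:
  limits:
    - type: Container
      # Applied as requests when a container sets none.
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      # Applied as limits when a container sets none.
      default:
        cpu: 500m
        memory: 256Mi
```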

Taints, tolerations, and dedicated pools

NodePool template taints isolate workloads—GPU, batch, system. Example: GPU pool with nvidia.com/gpu=true:NoSchedule and a matching NodePool requirements block for GPU instance types. Pods need tolerations and often node affinity.
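The GPU example could be sketched as a tainted NodePool plus a Pod that tolerates it. This assumes the EC2NodeClass named default from the earlier manifest, an NVIDIA device plugin already running in the cluster, and a placeholder image; the GPU limit of 8 is arbitrary.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      # Keeps non-GPU workloads off these nodes.
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    nvidia.com/gpu: 8   # assumed cap on total GPUs in this pool
---
apiVersion: v1
kind: Pod
metadata:
  name: cuda-job        # hypothetical workload
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: main
      image: example.com/cuda-job:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```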

Topology and storage

Zone spread constraints and EBS volumes tie Pods to zones. Karpenter must launch nodes in zones where scheduling is feasible; PVCs with volumeBindingMode: WaitForFirstConsumer interact with this—see Kubernetes storage (PV, PVC, StorageClass).
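A minimal sketch of a StorageClass that defers volume binding until a Pod is scheduled, assuming the AWS EBS CSI driver is installed; the class name gp3-wffc is made up for the example.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wffc        # hypothetical name
provisioner: ebs.csi.aws.com
# The volume is created only once a Pod using the PVC is scheduled,
# so the volume's zone follows the node rather than the reverse.
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```

Once a volume exists, it pins the Pod to that zone, so later replacement nodes for the same Pod must be launched there.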

DaemonSets and overhead

Every node runs DaemonSets (CNI, kube-proxy, monitoring agents). Karpenter accounts for that overhead when sizing nodes—another reason to keep DaemonSet resource requests honest.

AWS / EKS installation checklist

  1. Tags — Subnets and security groups tagged karpenter.sh/discovery: <cluster-name> (or your chosen discovery key).
  2. Node IAM role — EC2 instances assume a role with EKS worker policies; trust policy allows EC2.
  3. Karpenter controller IAM — IRSA role with scoped EC2, SSM, pricing, and PassRole permissions (use upstream policy documents; avoid AdministratorAccess).
  4. Interruption handling — SQS queue for spot interruption, scheduled maintenance, and instance termination signals so Karpenter can cordon/drain proactively.
  5. Helm install — Official chart with cluster name, endpoint, interruption queue name; pin versions in GitOps—see GitOps best principles.
  6. Remove or narrow CA — Running Karpenter and Cluster Autoscaler on the same node groups fights; migrate node groups to Karpenter-managed capacity.

On EKS, prefer EKS access entries over legacy aws-auth ConfigMap where possible; Karpenter docs track both paths.

Spot, interruption, and availability

Spot saves money; interruption is a feature, not an accident. Karpenter:

  • Provisions spot when NodePool allows karpenter.sh/capacity-type: spot.
  • Receives spot interruption notices (roughly two minutes of warning) through the SQS interruption queue and begins draining.
  • Falls back to on-demand when spot capacity is unavailable (if both are allowed).

Run fault-tolerant workloads on spot; keep strict SLAs on on-demand NodePools with higher weight or dedicated pools. Combine with PDBs and sensible terminationGracePeriodSeconds.

Multi-NodePool patterns

| Pattern | Configuration sketch |
| --- | --- |
| Spot-first, on-demand safety net | Two NodePools; spot pool higher weight; on-demand pool same classes, lower weight |
| Graviton (arm64) savings | kubernetes.io/arch In [arm64]; ensure images support arm64 |
| GPU / accelerators | Separate NodePool + taints; instance-type or category rules for g families |
| Batch / CI burst | High CPU limits, short expireAfter, aggressive consolidation |
| Regulated / static footprint | Tight instance allow-list, on-demand only, low disruption budgets |
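The spot-first pattern listed above might be sketched as two NodePools that share one EC2NodeClass. The pool names, weights, and CPU limits here are assumptions; the only fixed rule is that the higher weight is tried first.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-first
spec:
  weight: 50            # higher weight: considered first
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
  limits:
    cpu: 500
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-fallback
spec:
  weight: 10            # lower weight: used when the spot pool cannot serve
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: 200
```

A single NodePool that allows both capacity types also prefers spot and falls back to on-demand; two pools are mainly useful when you want separate limits or disruption settings for each.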

Observability and debugging

When Pods sit Pending, walk this order:

  1. kubectl describe pod <name> -n <ns> — scheduler events (affinity, resources, taints).
  2. kubectl get nodeclaims and kubectl describe nodeclaim <name> — provisioning errors (limits, IAM, subnet, launch failure).
  3. kubectl logs -n karpenter deploy/karpenter -f — reconciliation errors, consolidation skips.
  4. kubectl get nodepools — verify limits not exhausted (cpu at cap shows in status).
  5. AWS EC2 console / CloudTrail — RunInstances denied, insufficient capacity, spot fulfillment.

Export Prometheus metrics from Karpenter (provisioning duration, scheduling simulation failures, interruption counts) and alert on sustained Pending Pods with no NodeClaims. Tie node count and instance type mix to cost dashboards.

Security and governance

  • Least-privilege IAM — Controller role scoped to cluster tags; node role standard worker permissions.
  • IMDSv2 — Require httpTokens: required on EC2NodeClass.
  • Network policy — Controller talks to Kubernetes API and AWS; no need for wide inbound on nodes.
  • Policy-as-code — Kyverno/OPA to require labels, forbid overly broad instance-type In lists, or mandate limits on NodePools.
  • Multi-tenancy — Separate NodePools per team with limits; use ResourceQuota on namespaces for Pod count.

RBAC for who may edit NodePools should be tight—misconfiguration can launch expensive SKUs. See Kubernetes cluster RBAC.
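For the multi-tenancy point above, a namespace ResourceQuota might look like the following sketch. The namespace team-a and all the numbers are placeholders.

```yaml
# Hypothetical namespace and caps, for illustration only.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    pods: "200"
    # Caps the sum of requests, which is what drives node provisioning.
    requests.cpu: "100"
    requests.memory: 200Gi
```

The quota bounds what a team can ask for, while the NodePool limits bound what Karpenter may launch in response.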

Production pitfalls

  • No resource requests — Autoscaling and scheduling both degrade; appears as “Karpenter is broken.”
  • CA and Karpenter together — Double scaling, or both autoscalers fighting over the same node groups.
  • Missing discovery tags — Subnets/SGs not tagged → no launch targets.
  • Over-broad instance requirements — Allowing every family and generation admits old or unusual instance types, widens the test matrix, and complicates support; prefer categories + generations.
  • No limits on NodePool — Runaway spend during attacks or misconfigured HPA max replicas.
  • Ignoring PDBs — Consolidation or termination surprises during deploys.
  • Hard-coded AMIs without drift plan — Use alias selectors or versioned AMIs with tested upgrades.
  • Zone imbalance — All subnets in one AZ for “simplicity” → AZ failure takes the pool.

Migration from Cluster Autoscaler

A pragmatic path:

  1. Install Karpenter alongside; create NodePools + EC2NodeClass mirroring one existing node group’s subnets/SGs/role.
  2. Cordon old CA-managed nodes; drain workloads to Karpenter nodes (or let natural churn replace).
  3. Disable CA on migrated groups; remove node group autoscaling tags CA depends on.
  4. Expand NodePools (spot, arm64, GPU); tune consolidation after observing PDB and workload behavior.

Keep CA only where required (non-Karpenter-supported providers, strict change-control windows, or static VMware/bare-metal pools).

Other clouds and the future

Karpenter’s provider model extends beyond AWS: community and vendor implementations target Azure, GCP, and others with provider-specific node classes. The mental model—NodePool + NodeClaim + consolidation—stays consistent; only the cloud binding changes.

Upstream work continues on capacity reservations, reserved instances awareness, finer-grained interruption prediction, and tighter integration with workload autoscaling signals. Even if APIs evolve, the principle remains: provision from workload shape, not from a fixed menu of node groups.

How this fits your platform stack

Karpenter sits beside GitOps-delivered addons (GitOps principles), Terraform-built networks (Terraform & IaC), and application scaling (KEDA/HPA). It is one of the highest-leverage knobs for cost, speed, and resilience on Kubernetes—because every Deploy, Job, and ML training run ultimately lands on a node something must create.

If you operate EKS today with three ASGs and frequent Pending Pods, Karpenter is less a trendy swap than a scheduling-aligned autoscaler worth a disciplined pilot: one NodePool, measured consolidation, clear limits, and observability before you bet the production fleet on it.

Further reading

Blog index · Kubernetes architecture · CRI and CSI · Troubleshooting playbook · GitOps principles · FinOps