Disruption

Understand different ways Karpenter disrupts nodes

Control Flow

Karpenter sets a Kubernetes finalizer on each node it provisions. The finalizer blocks deletion of the node object while the Termination Controller cordons and drains the node, before removing the underlying NodeClaim. Disruption is triggered by the Disruption Controller, by the user through manual disruption, or through an external system that sends a delete request to the node object.

Disruption Controller

Karpenter automatically discovers disruptable nodes and spins up replacements when needed. Karpenter disrupts nodes by executing one automatic method at a time, in order of Expiration, Drift, and then Consolidation. Each method varies slightly, but they all follow the standard disruption process:

  1. Identify a list of prioritized candidates for the disruption method.
    • If there are pods that cannot be evicted on the node, Karpenter will ignore the node and try disrupting it later.
    • If there are no disruptable nodes, continue to the next disruption method.
  2. For each disruptable node, execute a scheduling simulation with the pods on the node to find if any replacement nodes are needed.
  3. Cordon the node(s) to prevent pods from scheduling to them.
  4. Pre-spin any replacement nodes needed as calculated in Step (2), and wait for them to become ready.
    • If a replacement node fails to initialize, un-cordon the node(s), and restart from Step (1), starting at the first disruption method again.
  5. Delete the node(s) and wait for the Termination Controller to gracefully shut down the node(s).
  6. Once the Termination Controller terminates the node, go back to Step (1), starting at the first disruption method again.
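
The ordering rule in this process (try Expiration, then Drift, then Consolidation, and start again from the first method after every action) can be sketched as follows. This is an illustrative sketch only, not Karpenter's implementation: `find_candidates` is a hypothetical callback standing in for Step (1), and the node names are made up.

```python
from typing import Callable, List, Optional, Tuple

# Order in which the automatic methods are tried, taken from the text above.
METHOD_ORDER = ["expiration", "drift", "consolidation"]


def next_disruption(
    find_candidates: Callable[[str], List[str]],
) -> Optional[Tuple[str, List[str]]]:
    """Return (method, candidates) for the first method that has candidates.

    `find_candidates` is a hypothetical stand-in for Step (1): it returns the
    names of the disruptable nodes for the given method.
    """
    for method in METHOD_ORDER:
        candidates = find_candidates(method)
        if candidates:
            return method, candidates
    return None  # no method has disruptable nodes in this pass


# Illustrative data only: nothing has expired, one node has drifted.
cluster = {"expiration": [], "drift": ["node-b"], "consolidation": ["node-a", "node-c"]}
print(next_disruption(lambda method: cluster[method]))  # ('drift', ['node-b'])
```

Because the function always scans from the first method, calling it again after each completed or failed action mirrors the "go back to Step (1)" behavior.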

Termination Controller

When a Karpenter node is deleted, the Karpenter finalizer will block deletion and the APIServer will set the DeletionTimestamp on the node, allowing Karpenter to gracefully shut down the node, modeled after Kubernetes Graceful Node Shutdown. Karpenter’s graceful shutdown process will:

  1. Cordon the node to prevent pods from scheduling to it.
  2. Begin evicting the pods on the node with the Kubernetes Eviction API to respect PDBs, while ignoring all daemonset pods and static pods. Wait for the node to be fully drained before proceeding to Step (3).
    • While waiting, if the underlying NodeClaim for the node no longer exists, remove the finalizer to allow the APIServer to delete the node, completing termination.
  3. Terminate the NodeClaim in the Cloud Provider.
  4. Remove the finalizer from the node to allow the APIServer to delete the node, completing termination.
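
The shutdown sequence can be modeled in a few lines. The sketch below is a simplified in-memory model, not Karpenter code. It assumes that every eviction succeeds immediately (no PDB blocks it), that DaemonSet-owned pods and static pods (owned by the Node object) are the ones skipped during the drain, and that the finalizer is named `karpenter.sh/termination`. The `Pod` and `Node` classes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

FINALIZER = "karpenter.sh/termination"  # assumed finalizer name


@dataclass
class Pod:
    name: str
    owner_kind: str = "ReplicaSet"


@dataclass
class Node:
    name: str
    pods: List[Pod] = field(default_factory=list)
    unschedulable: bool = False
    finalizers: List[str] = field(default_factory=lambda: [FINALIZER])


def evictable(pod: Pod) -> bool:
    # Assumption: DaemonSet pods and static pods (owned by the Node) are skipped.
    return pod.owner_kind not in ("DaemonSet", "Node")


def graceful_shutdown(node: Node, nodeclaim_exists: bool = True) -> List[str]:
    """Apply the four steps to the in-memory node and return a log of actions."""
    steps = ["cordon"]
    node.unschedulable = True  # Step 1: stop new pods from landing here
    for pod in [p for p in node.pods if evictable(p)]:
        node.pods.remove(pod)  # Step 2: eviction always succeeds in this model
        steps.append(f"evict {pod.name}")
    if nodeclaim_exists:
        steps.append("terminate nodeclaim")  # Step 3: skipped if already gone
    node.finalizers.remove(FINALIZER)  # Step 4: lets the APIServer delete the node
    steps.append("remove finalizer")
    return steps


node = Node("n1", [Pod("web"), Pod("logs", "DaemonSet"), Pod("kube-proxy", "Node")])
print(graceful_shutdown(node))
```

In a real cluster the drain in Step (2) can take a long time, because the Eviction API refuses evictions that would violate a PDB.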

Manual Methods

  • Node Deletion: You can use kubectl to manually remove a single Karpenter node or multiple nodes:

    # Delete a specific node
    kubectl delete node $NODE_NAME
    
    # Delete all nodes owned by any nodepool
    kubectl delete nodes -l karpenter.sh/nodepool
    
    # Delete all nodes owned by a specific nodepool
    kubectl delete nodes -l karpenter.sh/nodepool=$NODEPOOL_NAME
    
  • NodePool Deletion: Nodes are owned, through an owner reference, by the NodePool that launched them. Karpenter will gracefully terminate nodes through cascading deletion when the owning NodePool is deleted.

Automated Methods

  • Expiration: Karpenter will mark nodes as expired and disrupt them after they have lived for the length of time set by the NodePool’s spec.disruption.expireAfter value. You can use node expiry to periodically recycle nodes, for example to address security concerns.
  • Consolidation: Karpenter works to actively reduce cluster cost by identifying when:
    • Nodes can be removed because they are empty.
    • Nodes can be removed because their workloads will run on other nodes in the cluster.
    • Nodes can be replaced with cheaper variants due to a change in the workloads.
  • Drift: Karpenter will mark nodes as drifted and disrupt nodes that have drifted from their desired specification. See the Drift section for the fields that are considered.
  • Interruption: Karpenter will watch for upcoming interruption events that could affect your nodes (health events, spot interruption, etc.) and will cordon, drain, and terminate the node(s) ahead of the event to reduce workload disruption.
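
As an illustration of the Expiration check, the sketch below decides whether a node has outlived its expireAfter value. It is not Karpenter code. It assumes that expireAfter is written as a duration string using hours, minutes, and seconds (for example `720h`) or as the literal `Never`; `parse_duration` and `is_expired` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_duration(value: str) -> Optional[timedelta]:
    """Parse strings such as '720h' or '1h30m'; return None for 'Never'."""
    if value == "Never":
        return None
    parts = re.findall(r"(\d+)([hms])", value)
    # Reject input with anything other than number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unsupported duration: {value!r}")
    total = timedelta()
    for number, unit in parts:
        total += timedelta(**{_UNITS[unit]: int(number)})
    return total


def is_expired(created: datetime, expire_after: str, now: datetime) -> bool:
    """A node is expired once its age reaches the configured lifetime."""
    ttl = parse_duration(expire_after)
    return ttl is not None and now - created >= ttl


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(is_expired(created, "720h", created + timedelta(days=30)))  # True
print(is_expired(created, "Never", created + timedelta(days=3650)))  # False
```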

Consolidation

Karpenter has two mechanisms for cluster consolidation:

  1. Deletion - A node is eligible for deletion if all of its pods can run on free capacity of other nodes in the cluster.
  2. Replace - A node can be replaced if all of its pods can run on a combination of free capacity of other nodes in the cluster and a single cheaper replacement node.
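
The difference between the two mechanisms can be shown with a deliberately simplified model. The sketch below considers only CPU requests (in millicores) and uses a first-fit-decreasing packing, which is a stand-in for Karpenter's scheduling simulation rather than a description of it. The instance type names, sizes, and prices are invented for the example.

```python
from typing import Dict, List, Optional, Tuple


def unplaced_pods(pod_cpu: List[int], free_cpu: List[int]) -> List[int]:
    """First-fit decreasing: return the pod requests that fit on no other node."""
    free = list(free_cpu)
    leftovers = []
    for request in sorted(pod_cpu, reverse=True):
        for i, capacity in enumerate(free):
            if capacity >= request:
                free[i] -= request
                break
        else:
            leftovers.append(request)
    return leftovers


def consolidation_action(
    pod_cpu: List[int],
    other_free_cpu: List[int],
    node_price: float,
    instance_types: Dict[str, Tuple[int, float]],  # name -> (cpu, hourly price)
) -> Tuple[str, Optional[str]]:
    leftovers = unplaced_pods(pod_cpu, other_free_cpu)
    if not leftovers:
        return ("delete", None)  # everything fits on existing free capacity
    needed = sum(leftovers)
    cheaper = [
        (price, name)
        for name, (cpu, price) in instance_types.items()
        if cpu >= needed and price < node_price
    ]
    if cheaper:
        return ("replace", min(cheaper)[1])  # cheapest single replacement
    return ("none", None)


types = {"small": (2000, 0.05), "large": (8000, 0.20)}  # hypothetical catalog
print(consolidation_action([500, 500], [1000, 600], 0.20, types))  # ('delete', None)
print(consolidation_action([1500, 500], [600], 0.20, types))  # ('replace', 'small')
```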

Consolidation has three mechanisms that are performed in order to attempt to identify a consolidation action:

  1. Empty Node Consolidation - Delete any entirely empty nodes in parallel
  2. Multi Node Consolidation - Try to delete two or more nodes in parallel, possibly launching a single replacement that is cheaper than the price of all nodes being removed
  3. Single Node Consolidation - Try to delete any single node, possibly launching a single replacement that is cheaper than the price of that node

It’s impractical to examine all possible consolidation options for multi-node consolidation, so Karpenter uses a heuristic to identify a likely set of nodes that can be consolidated. For single-node consolidation, Karpenter considers each node in the cluster individually.

When there are multiple nodes that could be potentially deleted or replaced, Karpenter chooses to consolidate the node that overall disrupts your workloads the least by preferring to terminate:

  • Nodes running fewer pods
  • Nodes that will expire soon
  • Nodes with lower priority pods
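
One way to picture these preferences is as a sort key over candidate nodes. The text does not say how Karpenter weighs the three criteria against each other, so the lexicographic ordering below (pod count first, then time until expiry, then total pod priority) is purely an assumption for illustration, and `Candidate` is a hypothetical type.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    name: str
    pod_priorities: List[int]  # one priority value per pod on the node
    seconds_until_expiry: float


def disruption_sort_key(c: Candidate) -> Tuple[int, float, int]:
    # Assumed ordering: fewer pods, then sooner expiry, then lower total priority.
    return (len(c.pod_priorities), c.seconds_until_expiry, sum(c.pod_priorities))


def order_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates with the least disruptive node first."""
    return sorted(candidates, key=disruption_sort_key)


nodes = [
    Candidate("a", [0, 0, 0], 100),
    Candidate("b", [1000], 5000),
    Candidate("c", [0], 5000),
    Candidate("d", [0], 60),
]
print([c.name for c in order_candidates(nodes)])  # ['d', 'c', 'b', 'a']
```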

If consolidation is enabled, Karpenter periodically reports events against nodes that indicate why the node can’t be consolidated. These events can be used to investigate nodes that you expect to have been consolidated, but still remain in your cluster.

Events:
  Type     Reason                   Age                From             Message
  ----     ------                   ----               ----             -------
  Normal   Unconsolidatable         66s                karpenter        pdb default/inflate-pdb prevents pod evictions
  Normal   Unconsolidatable         33s (x3 over 30m)  karpenter        can't replace with a cheaper node

Drift

Drift on most fields is only triggered by changes to the owning CustomResource. Some special cases are reconciled two ways, triggered by either NodeClaim/Node/Instance changes or NodePool/EC2NodeClass changes. For one-way reconciliation, values in the CustomResource are reflected in the NodeClaim in the same way that they’re set. A NodeClaim will be detected as drifted if the values in the CustomResource do not match the values in the NodeClaim. By default, fields are checked for drift using one-way reconciliation.

Two-way Reconciliation

Fields that use two-way reconciliation can correspond to multiple values and must be handled differently. Two-way reconciliation can create cases where drift occurs without changes to the CustomResource, or where CustomResource changes do not result in drift. For example, if a NodeClaim has node.kubernetes.io/instance-type: m5.large, and requirements change from node.kubernetes.io/instance-type In [m5.large] to node.kubernetes.io/instance-type In [m5.large, m5.2xlarge], the NodeClaim will not be drifted because its value is still compatible with the new requirements. Conversely, for an AWS installation, if a NodeClaim is using the image ami: ami-abc, but a new image is published, Karpenter’s EC2NodeClass AMI selector terms will discover that the new correct value is ami: ami-xyz, and detect the NodeClaim as drifted.
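
The contrast between the two modes can be sketched with plain comparisons. This is a minimal illustration, not Karpenter's drift logic: one-way fields are modeled as an exact match against the CustomResource, and two-way fields as a membership test against the set of currently acceptable values. The function names are hypothetical.

```python
from typing import Dict, Iterable


def one_way_drifted(resource_fields: Dict[str, object], nodeclaim_fields: Dict[str, object]) -> bool:
    """One-way: drifted when any CustomResource value differs on the NodeClaim."""
    return any(nodeclaim_fields.get(key) != value for key, value in resource_fields.items())


def two_way_drifted(nodeclaim_value: str, acceptable_values: Iterable[str]) -> bool:
    """Two-way: drifted only when the NodeClaim's value is no longer acceptable."""
    return nodeclaim_value not in set(acceptable_values)


# Requirements widened from [m5.large] to [m5.large, m5.2xlarge]: still compatible.
print(two_way_drifted("m5.large", ["m5.large", "m5.2xlarge"]))  # False

# A new image was published, so the selector now resolves to ami-xyz only.
print(two_way_drifted("ami-abc", ["ami-xyz"]))  # True

# A one-way field (labels) changed on the NodePool.
print(one_way_drifted({"labels": {"team": "b"}}, {"labels": {"team": "a"}}))  # True
```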

Behavioral Fields

Behavioral Fields are treated as over-arching settings on the NodePool to dictate how Karpenter behaves. These fields don’t correspond to settings on the NodeClaim or instance. They’re set by the user to control Karpenter’s Provisioning and disruption logic. Since these don’t map to a desired state of NodeClaims, behavioral fields are not considered for Drift.

Read the Drift Design for more.

NodePool
Fields                   One-way   Two-way
------                   -------   -------
Startup Taints              x
Taints                      x
Labels                      x
Annotations                 x
Node Requirements                     x
Kubelet Configuration       x

Behavioral Fields

  • Weight
  • Limits
  • ConsolidationPolicy
  • ConsolidateAfter
  • ExpireAfter

EC2NodeClass
Fields                          One-way   Two-way
------                          -------   -------
Subnet Selector Terms                        x
Security Group Selector Terms                x
AMI Family                         x
AMI Selector Terms                           x
UserData                           x
Tags                               x
Metadata Options                   x
Block Device Mappings              x
Detailed Monitoring                x

To enable the drift feature flag, refer to the Feature Gates.

Karpenter will add the Drifted status condition on NodeClaims if the NodeClaim is drifted from its owning NodePool. Karpenter will also remove the Drifted status condition if either:

  1. The Drift feature gate is not enabled but the NodeClaim is drifted.
  2. The NodeClaim isn’t drifted but has the status condition.
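
These rules reduce to a small decision function. The sketch below is an illustration under the assumption that the condition should be present exactly when the feature gate is enabled and the NodeClaim is drifted; the function name and return values are hypothetical.

```python
def reconcile_drifted_condition(has_condition: bool, is_drifted: bool, gate_enabled: bool) -> str:
    """Return the action to take on the Drifted status condition."""
    should_have = gate_enabled and is_drifted
    if should_have and not has_condition:
        return "add"
    if has_condition and not should_have:
        return "remove"  # gate disabled, or the NodeClaim is no longer drifted
    return "noop"


print(reconcile_drifted_condition(has_condition=False, is_drifted=True, gate_enabled=True))  # add
print(reconcile_drifted_condition(has_condition=True, is_drifted=True, gate_enabled=False))  # remove
print(reconcile_drifted_condition(has_condition=True, is_drifted=False, gate_enabled=True))  # remove
```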

Interruption

If interruption-handling is enabled, Karpenter will watch for upcoming involuntary interruption events that would cause disruption to your workloads. These interruption events include:

  • Spot Interruption Warnings
  • Scheduled Change Health Events (Maintenance Events)
  • Instance Terminating Events
  • Instance Stopping Events

When Karpenter detects that one of these events will affect your nodes, it automatically cordons, drains, and terminates the node(s) ahead of the interruption event to give the maximum amount of time for workload cleanup prior to compute disruption. This helps in scenarios where the terminationGracePeriod for your workloads is long or workload cleanup is critical, and you want enough time to gracefully clean up your pods.

For Spot interruptions, the NodePool will start a new node as soon as it sees the Spot interruption warning. Spot interruptions have a 2-minute notice before Amazon EC2 reclaims the instance. Karpenter’s average node startup time means that, generally, there is sufficient time for the new node to become ready and for the pods to move to it before the instance is reclaimed.

Karpenter enables this feature by watching an SQS queue that receives critical events from AWS services that may affect your nodes. Karpenter requires that an SQS queue be provisioned and that EventBridge rules and targets be added to forward interruption events from AWS services to the SQS queue. Karpenter provides details for provisioning this infrastructure in the CloudFormation template in the Getting Started Guide.

To enable interruption handling, configure the --interruption-queue-name CLI argument with the name of the interruption queue provisioned to handle interruption events.
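
To make the event flow more concrete, the sketch below classifies a queue message body into the four interruption categories listed earlier. It is not Karpenter's parser. It assumes the messages are EventBridge events in JSON with the `source`, `detail-type`, and `detail` fields that EC2 and AWS Health events normally carry, and the category labels it returns are made up for this example.

```python
import json
from typing import Optional


def classify_interruption(message_body: str) -> Optional[str]:
    """Map an EventBridge event (JSON string) to a hypothetical category label."""
    event = json.loads(message_body)
    source = event.get("source")
    detail_type = event.get("detail-type")
    detail = event.get("detail", {})

    if source == "aws.ec2" and detail_type == "EC2 Spot Instance Interruption Warning":
        return "spot-interruption"
    if (
        source == "aws.health"
        and detail_type == "AWS Health Event"
        and detail.get("eventTypeCategory") == "scheduledChange"
    ):
        return "scheduled-change"
    if source == "aws.ec2" and detail_type == "EC2 Instance State-change Notification":
        state = detail.get("state")
        if state in ("stopping", "stopped"):
            return "instance-stopping"
        if state in ("shutting-down", "terminated"):
            return "instance-terminating"
    return None  # not an interruption event this sketch recognizes


body = json.dumps({
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
})
print(classify_interruption(body))  # spot-interruption
```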

Controls

Pod-Level Controls

You can block Karpenter from voluntarily choosing to disrupt certain pods by setting the karpenter.sh/do-not-disrupt: "true" annotation on the pod. This is useful for pods that you want to run from start to finish without disruption. By opting pods out of this disruption, you are telling Karpenter that it should not voluntarily remove a node containing this pod.

Examples of pods that you might want to opt-out of disruption include an interactive game that you don’t want to interrupt or a long batch job (such as you might have with machine learning) that would need to start over if it were interrupted.

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"

Examples of voluntary node removal that will be prevented by this annotation include:

  • Consolidation
  • Drift
  • Expiration

Node-Level Controls

Nodes can be opted out of consolidation disruption by setting the annotation karpenter.sh/do-not-disrupt: "true" on the node.

apiVersion: v1
kind: Node
metadata:
  annotations:
    karpenter.sh/do-not-disrupt: "true"
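
Taken together, the pod-level and node-level annotations amount to a simple eligibility check. The sketch below is an illustration only, working on plain annotation dictionaries rather than real Kubernetes objects. It assumes a node is off-limits for voluntary disruption if the node itself, or any pod on it, carries the annotation with the value "true"; the function names are hypothetical.

```python
from typing import Dict, List

DO_NOT_DISRUPT = "karpenter.sh/do-not-disrupt"


def opted_out(annotations: Dict[str, str]) -> bool:
    """True when the object carries the do-not-disrupt annotation set to "true"."""
    return annotations.get(DO_NOT_DISRUPT) == "true"


def can_voluntarily_disrupt(node_annotations: Dict[str, str], pods: List[Dict[str, str]]) -> bool:
    """A node is eligible only if neither it nor any of its pods has opted out."""
    if opted_out(node_annotations):
        return False
    return not any(opted_out(pod_annotations) for pod_annotations in pods)


print(can_voluntarily_disrupt({}, [{}, {DO_NOT_DISRUPT: "true"}]))  # False
print(can_voluntarily_disrupt({DO_NOT_DISRUPT: "true"}, []))  # False
print(can_voluntarily_disrupt({}, [{"team": "a"}]))  # True
```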

Example: Disable Disruption on a NodePool

NodePool .spec.template.metadata.annotations allows you to set annotations that will be applied to all nodes launched by this NodePool. By setting the annotation karpenter.sh/do-not-disrupt: "true" in the NodePool’s node template, you will prevent all nodes launched by this NodePool from being considered in consolidation calculations.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      annotations: # will be applied to all nodes
        karpenter.sh/do-not-disrupt: "true"