
How to upgrade Amazon EKS worker nodes with Karpenter Drift

[May, 2024 – This blog has been updated to reflect Karpenter v1beta1 API changes]

Introduction

Karpenter is an open-source cluster autoscaler that provisions right-sized nodes in response to unschedulable pods, based on aggregated CPU, memory, and volume requests and other Kubernetes scheduling constraints (e.g., affinities and pod topology spread constraints), which simplifies infrastructure management. With Cluster Autoscaler, an alternative autoscaler, all Kubernetes nodes in a node group must have the same capacity (vCPU and memory) for autoscaling to work effectively. This results in customers maintaining many node groups of different instance sizes, each backed by an Amazon EC2 Auto Scaling group, to meet the requirements of their workloads. As a workload continually evolves over time, the changing resource requirements mean that picking right-sized Amazon Elastic Compute Cloud (Amazon EC2) instances can be challenging. In addition, because Karpenter doesn't orchestrate capacity through external infrastructure such as node groups and Amazon EC2 Auto Scaling groups, it requires a different approach to the operational processes that keep worker node components and operating systems up to date with the latest security patches and features.

In this post, we’ll describe the mechanism for patching Kubernetes worker nodes provisioned with Karpenter through a Karpenter feature called Drift. If you have many worker nodes across multiple Amazon EKS clusters, then this mechanism can help you continuously patch at scale.

Solution overview

Karpenter node patching mechanisms

When a new Kubernetes version is supported, you can upgrade your Amazon Elastic Kubernetes Service (Amazon EKS) cluster control plane to the next version with a single API call. Upgrading the Kubernetes data plane involves updating the Amazon Machine Image (AMI) for the Kubernetes worker nodes. AWS releases AMIs for new Kubernetes versions as well as for patches and CVEs (Common Vulnerabilities and Exposures). You can choose from a wide variety of Amazon EKS-optimized AMIs, or you can use your own custom AMIs. Currently, the Karpenter EC2NodeClass resource supports the amiFamily values AL2, AL2023, Bottlerocket, Ubuntu, Windows2019, Windows2022, and Custom. When the Custom amiFamily is chosen, amiSelectorTerms must be specified to tell Karpenter which custom AMIs to use.
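
As a minimal sketch (the AMI name, owner account, and role below are placeholders, and the subnet and security group selectors are omitted for brevity), an EC2NodeClass with a Custom amiFamily might look like this:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: custom-ami
spec:
  amiFamily: Custom                  # no default AMIs, so amiSelectorTerms is required
  role: <your Karpenter node IAM role>
  amiSelectorTerms:
  - name: my-custom-ami              # placeholder AMI name
    owner: "111122223333"            # placeholder account ID
  # With a Custom amiFamily you also supply your own userData to bootstrap the node.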

Karpenter uses Drift to upgrade Kubernetes nodes through a rolling deployment. As nodes are de-provisioned, they are cordoned to prevent new pods from being scheduled, and the running pods are evicted using the Kubernetes Eviction API. The Drift mechanism works as follows:

Drift

For Kubernetes nodes provisioned by Karpenter that have drifted from their desired specification, Karpenter provisions replacement nodes first, evicts pods from the old nodes, and then terminates the old nodes. At the time of writing this post, the Drift check interval is set to 5 minutes. However, if the NodePool or EC2NodeClass is updated, the Drift check is triggered immediately. In the EC2NodeClass, amiFamily is a required field, and you can either specify your own AMI value(s) or use the Amazon EKS-optimized AMIs. Drift behaves differently in these two cases, as detailed below.
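
To see whether Karpenter currently considers any capacity drifted, you can inspect the status conditions on its NodeClaims (a quick sketch; Karpenter records drift as a Drifted status condition on the NodeClaim):

kubectl get nodeclaims
kubectl describe nodeclaims | grep -i -A2 "drifted"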

Drift with specified AMI values

You may consider this approach to control the promotion of AMIs through application environments for consistency. If you change the AMI(s) in the EC2NodeClass for a NodePool or associate a different EC2NodeClass with the NodePool, Karpenter detects that the existing worker nodes have drifted from the desired setting.

To trigger the upgrade, associate the new AMI(s) with the EC2NodeClass and Karpenter upgrades the worker nodes via a rolling deployment. AMIs can be specified explicitly by AMI ID, by AMI name, or even by specific tags. If multiple AMIs satisfy the criteria, then the latest AMI is chosen. You can track which AMIs an EC2NodeClass has discovered from the AMI value(s) under its status field, for example by running kubectl describe on the EC2NodeClass. If both the old and the new AMIs are discovered by the EC2NodeClass, the running nodes with the old AMIs are marked as drifted, de-provisioned, and replaced with worker nodes running the new AMI. To learn more about selecting AMIs in the EC2NodeClass, refer to the Karpenter documentation.

amiSelectorTerms:
- id: "ami-123"
- id: "ami-456"

Example 1 – Select AMIs by IDs

amiSelectorTerms:
- name: appA-ami
  owner: 0123456789

Example 2 – Select AMIs named appA-ami, owned by the application account 0123456789
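
As mentioned above, you can confirm which AMIs an EC2NodeClass has actually resolved by inspecting its status field (a sketch assuming the EC2NodeClass is named default):

kubectl describe ec2nodeclass default
# or print only the resolved AMIs from the status field
kubectl get ec2nodeclass default -o jsonpath='{.status.amis}'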

Drift with Amazon EKS optimized AMIs

If no amiSelectorTerms are specified in the EC2NodeClass, then Karpenter monitors the SSM parameters published for the Amazon EKS-optimized AMIs. You can specify a value of AL2, AL2023, Bottlerocket, Ubuntu, Windows2019, or Windows2022 in the amiFamily field to tell Karpenter which Amazon EKS-optimized AMI it should use. Karpenter provisions nodes with the latest Amazon EKS-optimized AMI for the specified amiFamily and for the Kubernetes version the cluster is running. Karpenter detects when a new AMI is released for the cluster's Kubernetes version and marks the existing nodes as drifted. The AMI value(s) under the EC2NodeClass status field reflect the newly discovered AMI. Those nodes are then de-provisioned and replaced with worker nodes running the latest AMI. With this approach, nodes with older AMIs are recycled automatically (e.g., when a new AMI is released or after a Kubernetes control plane upgrade). With the previous approach of using amiSelectorTerms, you have more control over when the nodes are upgraded. Consider the difference and select the approach suitable for your application. Karpenter currently doesn't support custom SSM parameters.
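
For reference, the Amazon EKS-optimized AMIs are published under public SSM parameters, which you can query yourself with the AWS CLI (a sketch; substitute your cluster's Kubernetes version and Region):

# Latest Amazon EKS-optimized Amazon Linux 2 AMI for Kubernetes 1.29
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2/recommended/image_id \
  --region us-east-1 --query "Parameter.Value" --output text

# Latest Bottlerocket AMI for Kubernetes 1.29 (x86_64)
aws ssm get-parameter \
  --name /aws/service/bottlerocket/aws-k8s-1.29/x86_64/latest/image_id \
  --region us-east-1 --query "Parameter.Value" --output text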

Walkthrough

We’ll walk through the following scenarios:

  1. Enabling the Karpenter Drift feature gate
  2. Automation of node upgrade with Drift
  3. Node upgrade with controlling promotion of AMIs

Prerequisites

You’ll need the following to complete the steps in this post:

  1. An existing Amazon EKS cluster. If you don’t have one, please follow any of the methods described here to create a cluster.
  2. An existing Karpenter deployment running a recent version. Please follow the getting started with Karpenter guide listed here to install Karpenter.

We’ll first export the Amazon EKS cluster name to proceed with the walkthrough.

export CLUSTER_NAME=<your EKS cluster name>

Step 1. Enabling the Karpenter Drift feature gate

Since Karpenter version 0.33, Drift is enabled by default. You can disable the Drift feature by passing --feature-gates Drift=false in the command line arguments to the Karpenter controller.
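
To confirm how the feature gate is configured in your installation, you can inspect the Karpenter controller deployment (a sketch assuming Karpenter runs in the karpenter namespace under the default deployment name):

kubectl -n karpenter get deploy karpenter -o yaml | grep -i -A1 "feature"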

Step 2. Automate the worker node upgrade with Drift

In this example, we’re specifying the amiFamily field with a value of AL2 to target the Amazon EKS-optimized Amazon Linux 2 AMIs.

mkdir -p ~/environment/karpenter
cd ~/environment/karpenter

cat <<EoF> basic.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
  limits:
    cpu: "1000"
  template:
    metadata:
      labels:
        team: my-team
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]
      - key: "karpenter.k8s.aws/instance-generation"
        operator: Gt
        values: ["2"]
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: karpenterNodeRole-$CLUSTER_NAME
  securityGroupSelectorTerms:
  - tags:
      alpha.eksctl.io/cluster-name: $CLUSTER_NAME
  subnetSelectorTerms:
  - tags:
      alpha.eksctl.io/cluster-name: $CLUSTER_NAME
  tags:
    intent: apps
    managed-by: karpenter
EoF

kubectl apply -f basic.yaml

Note: Select your own subnets and security groups if your Amazon EKS cluster isn’t provisioned by eksctl. Refer to this page for more details on discovering subnets and security groups with the Karpenter EC2NodeClass.
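
Before moving on, you can verify that both resources were created and that the EC2NodeClass has resolved AMIs, subnets, and security groups under its status (a quick check, assuming the names used above):

kubectl get nodepools,ec2nodeclasses
kubectl describe ec2nodeclass default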

Let’s deploy a sample deployment, named inflate to scale the worker nodes:

cd ~/environment/karpenter

cat <<EoF> sample-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1
              memory: 128Mi
            limits:
              memory: 128Mi
      nodeSelector:
        team: my-team
EoF

kubectl apply -f sample-deploy.yaml

You can check the Karpenter logs to see that Karpenter found unschedulable (i.e., provisionable) pods and created new nodes to accommodate the pending pods:

$ kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter

{"level":"INFO","time":"2024-01-28T16:01:38.547Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"a70b39e","pods":"default/inflate-67b454659d-xdsml, default/inflate-67b454659d-258ms","duration":"63.969954ms"}
{"level":"INFO","time":"2024-01-28T16:01:38.547Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"a70b39e","nodeclaims":1,"pods":2}
{"level":"INFO","time":"2024-01-28T16:01:38.568Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"a70b39e","nodepool":"default","nodeclaim":"default-w6fg6","requests":{"cpu":"2150m","memory":"256Mi","pods":"4"},"instance-types":"c6a.2xlarge, c6a.4xlarge, c6a.xlarge, c6g.2xlarge, c6g.4xlarge and 95 other(s)"}
{"level":"INFO","time":"2024-01-28T16:01:41.860Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"a70b39e","nodeclaim":"default-w6fg6","provider-id":"aws:///us-east-1b/i-0caeae25f3db6f37c","instance-type":"m6g.xlarge","zone":"us-east-1b","capacity-type":"spot","allocatable":{"cpu":"3920m","ephemeral-storage":"17Gi","memory":"14103Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}

Next, check the Kubernetes version of the newly deployed node. In this demonstration environment, the node is running a v1.28 AMI:

$ kubectl get nodes -l team=my-team

NAME                            STATUS   ROLES    AGE   VERSION
ip-192-168-40-30.ec2.internal   Ready    <none>   25s   v1.28.5-eks-5e0fdde

Now let’s check the Amazon EKS control plane version; in this environment, the worker node version matches the control plane version:

$ kubectl version | grep Server
Server Version: v1.28.5-eks-5e0fdde

We’ll now upgrade the Amazon EKS control plane and validate that the worker node(s) are automatically updated to the new version matching the control plane. You can use your own preferred way to upgrade it, but we’ll use the AWS Command Line Interface (AWS CLI) as an example here. Replace the region-code with your own, and replace 1.29 with the Amazon EKS-supported version number that you want to upgrade your cluster to. For best practices on Amazon EKS cluster upgrades, see the cluster upgrades section of the Amazon EKS best practices guide.

$ aws eks update-cluster-version --region <region-code> --name $CLUSTER_NAME --kubernetes-version 1.29

Monitor the status of your cluster update with the following command, replacing <update-id> with the update ID returned by the previous command. When a Successful status is displayed, the upgrade is complete.

$ aws eks describe-update --region <region-code> --name $CLUSTER_NAME --update-id <update-id>
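
Alternatively, you can block until the control plane upgrade finishes by using the AWS CLI waiter, which polls until the cluster status returns to ACTIVE:

aws eks wait cluster-active --region <region-code> --name $CLUSTER_NAME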

After the cluster status returns to Active, let’s check the Karpenter logs. You can see that Karpenter detected the drift and is replacing the drifted node with a new one.

kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter | grep -i drift
 
{"level":"INFO","time":"2024-01-28T16:24:15.675Z","logger":"controller.disruption","message":"disrupting via drift replace, terminating 1 candidates ip-192-168-40-30.ec2.internal/m6g.xlarge/spot and replacing with node from types m7i.metal-24xl, r6in.24xlarge, c6i.32xlarge, m6gd.4xlarge, r6a.2xlarge and 290 other(s)","commit":"a70b39e"}

Let’s check the AMI version of the node:

$ kubectl get nodes -l team=my-team

You’ll see that the existing v1.28 node’s status is Ready,SchedulingDisabled and a newly deployed v1.29 node is NotReady yet.

$ kubectl get nodes -l team=my-team

NAME                            STATUS                     ROLES    AGE   VERSION
ip-192-168-40-30.ec2.internal   Ready,SchedulingDisabled   <none>   55m   v1.28.5-eks-5e0fdde
ip-192-168-153-8.ec2.internal   NotReady                   <none>   13s   v1.29.0-eks-5e0fdde

After a few seconds, you can run kubectl get nodes -l team=my-team again to check that the new v1.29 node is Ready and the previous v1.28 node has been terminated.

$ kubectl get nodes -l team=my-team 
NAME                            STATUS   ROLES    AGE     VERSION
ip-192-168-153-8.ec2.internal   Ready    <none>   2m51s   v1.29.0-eks-5e0fdde

Note: The actual amount of time for node upgrade varies by the environment.

Step 3. Node upgrade with controlling promotion of AMIs

As we just saw, Karpenter Drift automatically upgrades the node AMI version when the Amazon EKS control plane is upgraded and an Amazon EKS-optimized Amazon Linux AMI is used. However, there are use cases (e.g., promoting AMIs through environments) where you want more control over when to initiate the AMI update and which specific AMI to use. For that, if you specify the AMI(s) in amiSelectorTerms (under EC2NodeClass), nodes are only updated when you explicitly change the AMI, not when the control plane is upgraded.
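
As a minimal sketch of that pattern (the AMI ID below is a placeholder), the EC2NodeClass pins the AMI explicitly in amiSelectorTerms, so nodes only roll when you change this value:

amiSelectorTerms:
- id: "ami-0123456789abcdef0"   # placeholder; promote a tested AMI ID through your environments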

For this example, we’re using the Bottlerocket OS. Bottlerocket is a Linux-based open-source operating system that is purpose-built by Amazon Web Services for running containers. For more details on the benefits of using Bottlerocket, please refer to https://aws.amazon.com/bottlerocket/.

Note – In the example below, we’re using the Bottlerocket amiFamily, so Karpenter automatically queries for the appropriate Amazon EKS-optimized Bottlerocket AMI via AWS Systems Manager (SSM). In the case of the Custom amiFamily, no default AMIs are defined; as a result, amiSelectorTerms must be specified to inform Karpenter which custom AMIs are to be used.

cd ~/environment/karpenter

cat << EOF > bottlerocket.yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: bottlerocket
spec:
  amiFamily: Bottlerocket
  role: karpenterNodeRole-$CLUSTER_NAME
  securityGroupSelectorTerms:
  - tags:
      alpha.eksctl.io/cluster-name: $CLUSTER_NAME
  subnetSelectorTerms:
  - tags:
      alpha.eksctl.io/cluster-name: $CLUSTER_NAME
  tags:
    managed-by: "karpenter"
    intent: "apps"
EOF

kubectl create -f bottlerocket.yaml

Note: Select your own subnets and security groups if your Amazon EKS cluster isn’t provisioned by eksctl. Refer to this page for more details on discovering subnets and security groups with the Karpenter EC2NodeClass.

Now, let’s edit the default NodePool to use this newly created EC2NodeClass, bottlerocket.

kubectl edit nodepools.karpenter.sh default

Find nodeClassRef under spec.template.spec and change the name value from default to bottlerocket:

....
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
  limits:
    cpu: "1000"
  template:
    metadata:
      labels:
        team: my-team
    spec:
      nodeClassRef:
        name: bottlerocket
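
If you prefer a non-interactive change, a one-line patch along these lines should produce the same edit (a sketch; adjust the names to your environment):

kubectl patch nodepools.karpenter.sh default --type merge \
  -p '{"spec":{"template":{"spec":{"nodeClassRef":{"name":"bottlerocket"}}}}}'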

Let’s check the Karpenter logs. You can see that Karpenter detected the drift and is de-provisioning the node via a drift replacement with a new node:

kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter | grep -i drift

{"level":"INFO","time":"2024-01-31T18:53:52.084Z","logger":"controller.disruption","message":"disrupting via drift replace, terminating 1 candidates ip-192-168-37-223.ec2.internal/c6i.xlarge/spot and replacing with node from types r7iz.16xlarge, c6a.24xlarge, c6a.16xlarge, c7gn.16xlarge, r7a.xlarge and 290 other(s)","commit":"a70b39e"}
{"level":"INFO","time":"2024-01-31T18:53:52.153Z","logger":"controller.disruption","message":"created nodeclaim","commit":"a70b39e","nodepool":"default","nodeclaim":"default-wq2lm","requests":{"cpu":"2150m","memory":"256Mi","pods":"4"},"instance-types":"c6a.2xlarge, c6a.4xlarge, c6a.xlarge, c6g.2xlarge, c6g.4xlarge and 95 other(s)"}
{"level":"INFO","time":"2024-01-31T18:53:55.622Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"a70b39e","nodeclaim":"default-wq2lm","provider-id":"aws:///us-east-1b/i-0f49d524524fbf2a8","instance-type":"m6g.xlarge","zone":"us-east-1b","capacity-type":"spot","allocatable":{"cpu":"3920m","ephemeral-storage":"17Gi","memory":"14103Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
{"level":"INFO","time":"2024-01-31T18:54:25.979Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"a70b39e","nodeclaim":"default-wq2lm","provider-id":"aws:///us-east-1b/i-0f49d524524fbf2a8","node":"ip-192-168-150-253.ec2.internal"}
{"level":"INFO","time":"2024-01-31T18:54:27.932Z","logger":"controller.node.termination","message":"tainted node","commit":"a70b39e","node":"ip-192-168-37-223.ec2.internal"}
{"level":"INFO","time":"2024-01-31T18:54:29.620Z","logger":"controller.node.termination","message":"deleted node","commit":"a70b39e","node":"ip-192-168-37-223.ec2.internal"}
{"level":"INFO","time":"2024-01-31T18:54:29.978Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"a70b39e","nodeclaim":"default-mjxrt","node":"ip-192-168-37-223.ec2.internal","provider-id":"aws:///us-east-1b/i-0d4eed64ea7d3e090"}

Let’s check the AMI version of the node:

$ kubectl get nodes -l team=my-team

You’ll see that the existing Amazon EKS-optimized Linux node (v1.29.0-eks-5e0fdde) has a status of Ready,SchedulingDisabled and the newly deployed Bottlerocket v1.29 node (v1.29.0-eks-a5ec690) is NotReady yet.

$ kubectl get nodes -l team=my-team

NAME                              STATUS                     ROLES    AGE   VERSION
ip-192-168-153-8.ec2.internal     Ready,SchedulingDisabled   <none>   65m   v1.29.0-eks-5e0fdde
ip-192-168-150-253.ec2.internal   NotReady                   <none>   14s   v1.29.0-eks-a5ec690

After a few seconds, you can check that the new Bottlerocket v1.29 node is Ready and the previous Amazon EKS-optimized Linux v1.29 node has been terminated.

$ kubectl get nodes -l team=my-team

NAME                             STATUS      ROLES    AGE    VERSION
ip-192-168-150-253.ec2.internal   Ready      <none>   30s    v1.29.0-eks-a5ec690

To verify the AMI of the new node, run:

$ kubectl describe node <NAME from previous command output> | grep -i "OS Image"
OS Image:                   Bottlerocket OS 1.18.0 (aws-k8s-1.29)

When using Karpenter, there are some additional design considerations that can help you achieve continuous operations:

  • Use Pod Topology Spread Constraints to spread workloads across fault domains for high availability – Similar to pod anti-affinity rules, pod topology spread constraints allow you to make your application available across different failure (or topology) domains like hosts or availability zones.
  • Consider Pod Readiness Gates – For workloads that receive traffic through an Elastic Load Balancer (ELB), consider using Pod readiness gates to validate that new pods are registered to the target groups before the old pods are terminated. See the Amazon EKS best practices guide for more information.
  • Consider Pod Disruptions Budgets – Use Pod disruption budgets to control the termination of pods during voluntary disruptions. Karpenter respects Pod disruption budgets (PDBs) by using a backoff retry eviction strategy.
  • Consider whether automatic AMI selection is the right approach – The latest Amazon EKS-optimized AMIs are generally recommended; however, if you would like to control the roll-out of AMIs across environments, decide whether to let Karpenter pick the latest AMI or to specify your own AMI. By specifying your own AMI, you can control the promotion of AMIs through application environments.
  • Consider setting karpenter.sh/do-not-disrupt: "true" – For workloads that might not be interruptible (e.g., long-running batch jobs without checkpointing), consider annotating pods with the do-not-disrupt annotation. By opting pods out of disruption, you are telling Karpenter that it shouldn't voluntarily remove nodes containing those pods. You can also set the karpenter.sh/do-not-disrupt annotation on the node, which prevents disruption actions on that node. See the sketch after this list.
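
A minimal sketch of opting a workload out of voluntary disruption is shown below; the deployment name and image are illustrative, and the annotation is the relevant part:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker               # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
      annotations:
        karpenter.sh/do-not-disrupt: "true"   # Karpenter won't voluntarily disrupt nodes running this pod
    spec:
      containers:
      - name: worker
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.7   # placeholder image
        resources:
          requests:
            cpu: 100m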

Cleaning up

To clean up the resources created, you can execute the following steps:

  1. Delete the Karpenter NodePool to de-provision nodes, and clean up the EC2NodeClasses and the sample application:
    1. kubectl delete -f basic.yaml
    2. kubectl delete -f bottlerocket.yaml
    3. kubectl delete -f sample-deploy.yaml
  2. If you created a new Amazon EKS cluster for this walkthrough, then don't forget to delete it and any associated resources, or you will continue to incur costs.

Conclusion

For customers with many Kubernetes clusters and node groups, adopting Karpenter simplifies infrastructure management. In this post, we described approaches for upgrading and patching Kubernetes nodes using the Karpenter feature called Drift. These patching strategies can reduce your undifferentiated heavy lifting and help you patch worker nodes at scale by moving from a point-in-time strategy to a continuous mechanism. The Karpenter Drift feature is still evolving; for the most up-to-date information, check out the Karpenter documentation.

If you would like to learn more, then come and discuss Karpenter in the #karpenter channel in the Kubernetes Slack or join the Karpenter working group calls.

To get hands-on experience, check out the Karpenter workshop.

Rajdeep Saha

Rajdeep Saha is a Principal Solutions Architect for Serverless and Containers at Amazon Web Services (AWS). He helps customers design scalable and secure applications on AWS. Rajdeep is passionate about helping and teaching newcomers about cloud computing. He is based out of New York City.

Ratnopam Chakrabarti

Ratnopam Chakrabarti is a Specialist Solutions Architect for Containers and Infrastructure modernization at Amazon Web Services (AWS). In his current role, Ratnopam helps AWS customers accelerate their cloud adoption and run scalable, secure and optimized container workloads at scale. You can connect with him on LinkedIn at https://www.linkedin.com/in/ratnopam-chakrabarti/.

Chance Lee

Chance Lee is a Sr. Container Specialist Solutions Architect at AWS based in the Bay Area. He helps customers architect highly scalable and secure container workloads with AWS container services and various ecosystem solutions. Prior to joining AWS, Chance was an IBM Lab Services consultant.

Robert Northard

Robert Northard is a Sr. Containers Specialist Solutions Architect at AWS. He has expertise in Container Technologies and DevOps practices.