Containers

Customizing scheduling on Amazon EKS

Interest in Kubernetes spiked in the fall of 2019, according to Google Trends. The US Department of Defense’s announcement that it had deployed Kubernetes on an F-16 may have contributed to the surge. Today, Kubernetes is found in virtually every industry: from blockchain networks to 5G networks, customers use it to accelerate innovation and build the foundation of tomorrow’s Internet.

Kubernetes continues to grow in this ever-changing landscape because it is flexible enough to adapt to a wide array of use cases, and its extensibility lets you tune it to the needs of your business. In this post, we demonstrate a proof of concept that makes it easy to customize the way Kubernetes schedules your workloads.

Workload scheduling in Kubernetes

The Kubernetes scheduler (kube-scheduler) is a control plane component that assigns pods to nodes. When you create a pod, kube-scheduler picks the most suitable node in your cluster and schedules the pod onto it. It filters and ranks candidate nodes based on the pod’s configuration, taking into consideration resource requests, affinity rules, topology spread constraints, and more. Kube-scheduler’s default behavior is to spread pods across nodes, but some situations require finer-grained control over pod placement.

For example, many Amazon Elastic Kubernetes Service (Amazon EKS) customers want to run their workloads on Amazon EC2 Spot Instances to save money, but they also want to run a small percentage of pods on EC2 On-Demand Instances to account for a potential widespread Spot interruption. Other customers want to place more pods in one Availability Zone (AZ) than in others for specific use cases.

Kube-scheduler doesn’t currently support scheduling pods in arbitrary ratios across nodes with different labels. The solution proposed in this post is a mutating admission webhook that lets you configure, right in your deployment manifest, how pods are distributed across groups of nodes.

Custom pod scheduling using a mutating admission webhook

Kubernetes API request lifecycle with mutating and validating admission webhooks

In Kubernetes, an admission controller is a piece of code that intercepts requests to the Kubernetes API server before the objects are persisted in the etcd key-value store. A mutating admission controller can change a resource’s properties before it is persisted. For example, you can use a mutating admission controller to add labels to pods or inject a sidecar into them before they are created.

The solution proposed in this post uses a mutating admission webhook to intercept pod creation requests and assign pods to nodes. It allows you to define a custom pod scheduling strategy that utilizes node labels to schedule pods proportionally. Below is an example of a custom scheduling strategy:

  annotations:
    custom-pod-schedule-strategy: 'label1Key=label1Value,base=1,weight=0:label2Key=label2Value,weight=1:label3Key=label3Value,weight=1'

When the custom-pod-schedule-strategy annotation is specified on a deployment, the webhook uses it to distribute pods across the listed node labels. Each node label can carry base and weight parameters: base is the minimum number of pods to schedule on nodes with that label, and weight controls the relative distribution of the remaining pods across the labels. Note that only one node label should have the base parameter configured.
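To make the format concrete, here is a minimal Go sketch of how the annotation value could be parsed into per-label base and weight settings. This is not the webhook’s actual source; the nodeLabelStrategy type and parseStrategy function are illustrative names, and label values containing ':' or ',' are not handled.

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nodeLabelStrategy holds the scheduling settings for one node label.
type nodeLabelStrategy struct {
	Label  string // e.g. "karpenter.sh/capacity-type=on-demand"
	Base   int    // minimum number of pods for this label (optional, one label only)
	Weight int    // relative share of the remaining pods
}

func parseStrategy(annotation string) ([]nodeLabelStrategy, error) {
	var strategies []nodeLabelStrategy
	// Node-label entries are separated by ':'.
	for _, entry := range strings.Split(annotation, ":") {
		fields := strings.Split(entry, ",")
		s := nodeLabelStrategy{Label: fields[0]}
		// The remaining fields are base=<n> and/or weight=<n>.
		for _, f := range fields[1:] {
			kv := strings.SplitN(f, "=", 2)
			if len(kv) != 2 {
				return nil, fmt.Errorf("malformed field %q in %q", f, entry)
			}
			n, err := strconv.Atoi(kv[1])
			if err != nil {
				return nil, err
			}
			switch kv[0] {
			case "base":
				s.Base = n
			case "weight":
				s.Weight = n
			}
		}
		strategies = append(strategies, s)
	}
	return strategies, nil
}

func main() {
	s, err := parseStrategy("karpenter.sh/capacity-type=on-demand,base=2,weight=1:karpenter.sh/capacity-type=spot,weight=3")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", s)
	// [{Label:karpenter.sh/capacity-type=on-demand Base:2 Weight:1} {Label:karpenter.sh/capacity-type=spot Base:0 Weight:3}]
}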

To understand this better, let’s define two groups of nodes, each with its own label (label 1 and label 2). The webhook assigns pods to these nodes using node selectors.

D = number of pod replicas in the deployment
N1 = desired number of pods out of D, to be scheduled on nodes with label 1
N2 = desired number of pods out of D, to be scheduled on nodes with label 2

D = N1 + N2

M1 = current number of pods (either in running or pending state) already using nodeSelector with label 1
M2 = current number of pods (either in running or pending state) already using nodeSelector with label 2

Let’s take a look at an example of a Kubernetes deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: test
  annotations:
    custom-pod-schedule-strategy: 'label1Key=label1Value,base=2,weight=1:label2Key=label2Value,weight=3'
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 10
  ....

In the example above, nodes with the label1Key=label1Value label have a base of 2 and a weight of 1.

Let’s assume nodes with the label1Key=label1Value label are On-Demand nodes, whereas Spot nodes are labeled as label2Key=label2Value and have weight set to 3.

Applying the calculation,
D = 10 (total replicas in the deployment)
N1 = number of pods on nodes with the label1Key=label1Value label
N1 = base + (D – base) x (individual weight / total weight)
N1 = 2 + (10 – 2) x ( 1 / 4) = 2 + 8/4 = 4

N2 = number of pods on nodes with the label2Key=label2Value label
N2 = (D – base) x (individual weight / total weight)
N2 = (10 – 2) x ( 3 / 4) = 8 x 3/4 = 6

D (in this case 10) = N1 (in this case 4) + N2 (in this case 6)

Therefore, if we create 10 pods, the webhook assigns the first 2 pods to On-Demand nodes (because base is set to 2), and the remaining 8 pods are spread in a 1:3 ratio (configured through weights) between On-Demand and Spot instances. As a result, On-Demand nodes run 4 pods (2 + 2) while Spot instances run 6.
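The same calculation can be expressed in a few lines of Go. This is a minimal sketch under the assumptions above (a single label carries the base, listed first; the desiredPods function name is illustrative), not the webhook’s actual code:

package main

import "fmt"

// desiredPods computes how many of the deployment's replicas should land on
// each node label, following the formula above: the base pods go to the label
// configured with base, and the remaining (D - base) pods are split by weight.
func desiredPods(replicas, base int, weights []int) []float64 {
	totalWeight := 0
	for _, w := range weights {
		totalWeight += w
	}
	remaining := float64(replicas - base)
	desired := make([]float64, len(weights))
	for i, w := range weights {
		desired[i] = remaining * float64(w) / float64(totalWeight)
	}
	desired[0] += float64(base) // the first label is the one configured with base
	return desired
}

func main() {
	// D = 10 replicas, base = 2 on the On-Demand label, weights 1 (On-Demand) and 3 (Spot).
	fmt.Println(desiredPods(10, 2, []int{1, 3})) // prints [4 6]
}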

The proposed solution comprises one component, PodToNodeAllocator.

PodToNodeAllocator

The PodToNodeAllocator assigns pods proportionally as they are created or scaled. It is implemented as a mutating admission webhook that intercepts incoming pod creation requests. As the cluster receives these requests, the webhook evaluates the custom-pod-schedule-strategy annotation and assigns each pod to a group of nodes by adding a nodeSelector field to its PodSpec. Note that it only guarantees the distribution ratio at launch, that is, when pods are initially scheduled as the deployment is created or scaled.

For every new pod, the PodToNodeAllocator runs the following steps against each node label specified in the deployment’s custom scheduling annotation (a simplified Go sketch of the resulting mutation follows the list):

  • for every new pod (P) created on the API Server
    • for every node label (L)
      • M = get the current number of pods (either in running or pending state) already assigned to this node label
      • N = calculate the desired number of pods out of D, to be scheduled on nodes with this label
      • if M >= N, ignore this label L
      • else if M < N, update the pod P’s specification with nodeSelector for label L
        • After the pod’s specification is updated with nodeSelector, the Kubernetes scheduler assigns the pod to the node with the specified node label.
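The mutation itself amounts to returning a JSON patch that adds a nodeSelector to the pod. Below is a simplified Go sketch, assuming the Kubernetes admission/v1 API; the nodeSelectorPatch helper is an illustrative name rather than the webhook’s actual source, but it produces a patch of the same form that appears in the webhook logs later in this post.

package main

import (
	"encoding/json"
	"fmt"

	admissionv1 "k8s.io/api/admission/v1"
	"k8s.io/apimachinery/pkg/types"
)

// patchOperation models a single RFC 6902 JSON patch operation.
type patchOperation struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// nodeSelectorPatch builds an admission response whose patch pins the pod to
// the chosen node label, e.g. karpenter.sh/capacity-type=on-demand.
func nodeSelectorPatch(uid types.UID, labelKey, labelValue string) (*admissionv1.AdmissionResponse, error) {
	patch := []patchOperation{{
		Op:    "add",
		Path:  "/spec/nodeSelector",
		Value: map[string]string{labelKey: labelValue},
	}}
	patchBytes, err := json.Marshal(patch)
	if err != nil {
		return nil, err
	}
	patchType := admissionv1.PatchTypeJSONPatch
	return &admissionv1.AdmissionResponse{
		UID:       uid,
		Allowed:   true,
		Patch:     patchBytes,
		PatchType: &patchType,
	}, nil
}

func main() {
	resp, _ := nodeSelectorPatch("example-uid", "karpenter.sh/capacity-type", "on-demand")
	fmt.Println(string(resp.Patch))
	// [{"op":"add","path":"/spec/nodeSelector","value":{"karpenter.sh/capacity-type":"on-demand"}}]
}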

PodToNodeAllocator assigns pods proportionally as they are created or scaled

Proof of concept walkthrough
Prerequisites

You will need the following to complete the tutorial:

  • An AWS account and the AWS CLI
  • eksctl and kubectl
  • Docker, git, jq, make, and envsubst

Note: We have tested the CLI steps in this post on Amazon Linux 2.

Let’s start by setting a few environment variables:

export AWS_REGION=us-east-1 
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
export CLUSTER_NAME=eks-custom-pod-schedule  # or set to the name of an existing EKS cluster
export ECR_REPO=custom-kube-scheduler-webhook
export SERVICE=custom-kube-scheduler-webhook
export NAMESPACE=custom-kube-scheduler-webhook
export SECRET=custom-kube-scheduler-webhook-certs

Create an EKS cluster

You can skip this step if you would like to use an existing cluster.

Create an EKS cluster using the eksctl command line tool:

eksctl create cluster \
  --name $CLUSTER_NAME \
  --region $AWS_REGION \
  --version 1.21  \
  --managed

Once the cluster is created successfully, proceed to install Karpenter, which is an open-source node provisioning project built for Kubernetes. Its goal is to improve the efficiency and cost of running workloads on Kubernetes clusters. See this blog post to get started.

If you would like to reuse an existing cluster, ensure that the CLUSTER_NAME environment variable is set to match your EKS cluster name.

Deploy the custom scheduling webhook

Once your EKS cluster is available, clone the GitHub repository that contains the source code and deployment files for the admission webhook:

git clone https://github.com/aws-samples/containers-blog-maelstrom.git
cd containers-blog-maelstrom/custom-kubernetes-scheduler

Create an Amazon Elastic Container Registry (Amazon ECR) repository to store the container image for the webhook. The following commands create a new repository if one does not already exist:

IMAGE_REPO="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
IMAGE_NAME=${ECR_REPO}
export ECR_REPO_URI=$(aws ecr describe-repositories --repository-names ${IMAGE_NAME} --region $AWS_REGION 2> /dev/null | jq -r '.repositories[0].repositoryUri')
if [ -z "$ECR_REPO_URI" ]
then
      echo "${IMAGE_REPO}/${IMAGE_NAME} does not exist. So creating it..."
      ECR_REPO_URI=$(aws ecr create-repository \
        --repository-name $IMAGE_NAME \
        --region $AWS_REGION \
        --query 'repository.repositoryUri' \
        --output text)
      echo "ECR_REPO_URI=$ECR_REPO_URI"
else
      echo "${IMAGE_REPO}/${IMAGE_NAME} already exists..."
fi

Build a container image that contains the Go application and push it to Amazon ECR:

make

The output from the make command looks like the following:

Building the custom-kube-scheduler-webhook binary for Docker (linux)...
Building the docker image: custom-kube-scheduler-webhook:latest...
Sending build context to Docker daemon  262.5MB
Step 1/6 : FROM alpine:latest
....
Successfully built bc4560ae8770
Successfully tagged XXXXXXX.dkr.ecr.us-east-1.amazonaws.com/custom-kube-scheduler-webhook:latest
Pushing the docker image for XXXXXXX.dkr.ecr.us-east-1.amazonaws.com/custom-kube-scheduler-webhook:latest ...
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin XXXXXXX.dkr.ecr.us-east-1.amazonaws.com/custom-kube-scheduler-webhook
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
The push refers to repository [XXXXXXX.dkr.ecr.us-east-1.amazonaws.com/custom-kube-scheduler-webhook]
d7eaac728432: Pushed 
b2d5eeeaba3a: Layer already exists 
latest: digest: sha256:5d65e7f5578d95221e7691cdb8415a8a80b1c9c553684bea3011df39262cbe4d size: 740

Create a Kubernetes namespace

Create the custom-kube-scheduler-webhook namespace in which the mutating pod webhook will be deployed:

kubectl create ns $NAMESPACE

Create the certificate and secrets

Create a signed certificate and store it in a Kubernetes secret that will be consumed by the mutating pod webhook deployment:

./deploy/webhook-create-signed-cert.sh \
    --service $SERVICE \
    --secret $SECRET \
    --namespace $NAMESPACE

Verify that the Kubernetes secret has been successfully created:

kubectl get secret $SECRET -n $NAMESPACE -o json

Deploy the webhook

Create and apply a MutatingWebhookConfiguration:

export WEBHOOK_CONFIG="deploy/custom-kube-scheduler-webhook-config.yaml"
cat deploy/custom-kube-scheduler-webhook-config-template.yaml | \
    deploy/webhook-patch-ca-bundle.sh >  $WEBHOOK_CONFIG
kubectl apply -f $WEBHOOK_CONFIG
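The generated file contains the cluster-specific configuration; for orientation, the following is a simplified sketch of what such a MutatingWebhookConfiguration typically looks like. The webhook path, service name, and namespaceSelector shown here are assumptions based on the resources created in this walkthrough, not a copy of the repository’s template.

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: custom-kube-scheduler-webhook
webhooks:
  - name: custom-kube-scheduler-webhook.custom-kube-scheduler-webhook.svc
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: custom-kube-scheduler-webhook
        namespace: custom-kube-scheduler-webhook
        path: /mutate                     # assumed path served by the webhook
      caBundle: "<CA bundle injected by webhook-patch-ca-bundle.sh>"
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    # Only namespaces labeled custom-kube-scheduler-webhook=enabled are mutated.
    namespaceSelector:
      matchLabels:
        custom-kube-scheduler-webhook: enabled

The namespaceSelector is why the test namespace is labeled custom-kube-scheduler-webhook: enabled in the next section.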

Deploy the webhook:

export WEBHOOK_CONTROLLER="deploy/custom-kube-scheduler-webhook-controller.yaml"
envsubst < deploy/custom-kube-scheduler-webhook-controller-template.yaml > $WEBHOOK_CONTROLLER
kubectl apply -f $WEBHOOK_CONTROLLER

Test with sample deployment

Let’s verify that the solution works. First, create a namespace and label it so that the webhook monitors it for new pods:

kubectl apply -f - <<EOF
---
apiVersion: v1
kind: Namespace
metadata:
  name: test
  labels:
    custom-kube-scheduler-webhook: enabled 
---
EOF

Next, create a sample deployment with annotations for custom pod scheduling:

kubectl apply -f - <<EOF
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: test
  annotations:
    custom-pod-schedule-strategy: 'karpenter.sh/capacity-type=on-demand,base=2,weight=1:karpenter.sh/capacity-type=spot,weight=3'
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 10
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: public.ecr.aws/nginx/nginx:latest
        imagePullPolicy: Always
        name: nginx
        resources:
          limits:
            cpu:  400m
            memory: 1600Mi
          requests:
            cpu: 400m
            memory: 1600Mi        
        ports:
        - name: http
          containerPort: 80
---
EOF

In a different terminal, run this command to see the logs from the webhook:

kubectl logs -f -lapp=custom-kube-scheduler-webhook  -n custom-kube-scheduler-webhook         
I0110 13:05:45.187535       1 webhook.go:373] flow=CREATE serviceInstanceNum=67 Found a deployment nginx in namespace test with total replicas 10 and strategy=karpenter.sh/capacity-type=on-demand,base=2,weight=1:karpenter.sh/capacity-type=spot,weight=3
I0110 13:05:45.187561       1 webhook.go:378] flow=CREATE serviceInstanceNum=67 nodeLabelStrategyList=[{karpenter.sh/capacity-type=on-demand 4 1} {karpenter.sh/capacity-type=spot 6 3}]
I0110 13:05:45.195088       1 webhook.go:389] flow=CREATE serviceInstanceNum=67 nodeLabel=karpenter.sh/capacity-type=on-demand currently runs 0 pods
I0110 13:05:45.195107       1 webhook.go:393] flow=CREATE serviceInstanceNum=67 Currently running 0 pods is less than expected 4, scheduling pod on nodeLabel karpenter.sh/capacity-type=on-demand
I0110 13:05:45.195130       1 webhook.go:232] serviceInstanceNum=67 AdmissionResponse: patch=[{"op":"add","path":"/spec/nodeSelector","value":{"karpenter.sh/capacity-type":"on-demand"}}]
I0110 13:05:45.195150       1 webhook.go:311] Ready to write reponse ...
....
I0110 13:05:45.315759       1 webhook.go:378] flow=CREATE serviceInstanceNum=73 nodeLabelStrategyList=[{karpenter.sh/capacity-type=on-demand 4 1} {karpenter.sh/capacity-type=spot 6 3}]
I0110 13:05:45.339624       1 webhook.go:393] flow=CREATE serviceInstanceNum=73 Currently running 0 pods is less than expected 6, scheduling pod on nodeLabel karpenter.sh/capacity-type=spot
I0110 13:05:45.339643       1 webhook.go:232] serviceInstanceNum=73 AdmissionResponse: patch=[{"op":"add","path":"/spec/nodeSelector","value":{"karpenter.sh/capacity-type":"spot"}}]
I0110 13:05:45.339661       1 webhook.go:311] Ready to write reponse ...

The project includes a helper script that shows you the pod distribution across node types:

./check_pod_spread.sh
NAME                     READY   STATUS    RESTARTS   AGE
nginx-6b6769fd96-fwwtk   1/1     Running   0          2m58s
.....
Number of Pods in namespace test is 10
Number of Pods for on-demand is 4
Number of Pods for spot is 6

Now, scale the sample deployment from 10 replicas to 20:

kubectl scale deployment nginx  --replicas=20 -n test

Check the pod distribution to verify proportionality:

./check_pod_spread.sh
NAME                     READY   STATUS    RESTARTS   AGE
nginx-6b6769fd96-2qt9h   1/1     Running   0          19s
...
Number of Pods in namespace test is 20
Number of Pods for on-demand is 6
Number of Pods for spot is 14

As you can see, when we double the replica count to 20, the desired counts become N1 = 2 + (20 – 2) x (1 / 4) = 6.5 and N2 = (20 – 2) x (3 / 4) = 13.5, which the webhook resolves to 6 On-Demand and 14 Spot pods. New pods are scheduled proportionally.

Cleanup

Use these commands to delete the resources created during this post:

kubectl delete deployment nginx -n test
kubectl delete ns test
kubectl delete -f $WEBHOOK_CONTROLLER
kubectl delete -f $WEBHOOK_CONFIG
kubectl delete secret $SECRET -n $NAMESPACE 
kubectl delete ns $NAMESPACE
# if you created a new EKS cluster, then delete the cluster
eksctl delete cluster  --name $CLUSTER_NAME  --region $AWS_REGION

Conclusion

This post demonstrates how you can use a mutating pod admission webhook to customize pod scheduling across nodes. You can use this solution for a variety of use cases such as prioritizing nodes in an Availability Zone to reduce data transfer costs, spreading workloads across Availability Zones, or running workloads across On-Demand and Spot instances.

Jayaprakash Alawala

Jayaprakash Alawala is a Sr Container Specialist Solutions Architect at AWS. He helps customers with application modernization and building large-scale applications leveraging various AWS services. He has expertise in containers, microservices, DevOps, security, cost optimization (including EC2 Spot), and technical training. Outside of work, he loves spending time reading and traveling. You can reach him on Twitter @JP_Alawala

Re Alvarez-Parmar

In his role as a Containers Specialist Solutions Architect at Amazon Web Services, Re advises engineering teams on modernizing and building distributed services in the cloud. Prior to joining AWS, he spent more than 15 years as an Enterprise and Software Architect. He is based out of Seattle. Connect on LinkedIn at: linkedin.com/in/realvarez/

Brandon Wagner

Brandon is a Senior Software Engineer on the EKS team working primarily on Karpenter. Previously, Brandon worked on Route 53, AWS Marketplace, and EC2.