Kubernetes Pod Eviction and Preemption: How do they work?

In Kubernetes best practices, setting resource quotas and limits per namespace addresses part of the problem of cluster overload.

As an anecdote, I once ran into the following problem: for non-technical reasons, I couldn't enforce quotas, impose limits, or anything of the sort in various namespaces.

So I delved into the mechanics of pod scheduling and eviction to see how I could give certain pods higher priority than others.

We will revisit together the logic Kubernetes uses for eviction (killing running pods due to insufficient resources) and for preemption (making room to schedule higher-priority pods), in order to improve the resilience of your workloads.

Kubernetes pod eviction

When your cluster is saturated, you can of course enable auto-scaling for your nodes and pods, but this isn't instantaneous, especially for nodes, and it can be even more challenging in on-premises environments.

When your cluster reaches saturation, Kubernetes must make decisions to ensure that priority pods are not evicted from nodes under pressure. If you see pods with an "Evicted" status, this should pique your interest.

Kubernetes assigns a Quality of Service (QoS) class to each pod based on its requests and limits:

Guaranteed: This QoS is the safest and gives you the strongest assurance of not being evicted. Every container in the pod must have CPU and memory requests equal to its limits.

Burstable: This QoS is assigned when at least one container has a CPU or memory request or limit, but the pod does not meet the Guaranteed criteria.

BestEffort: This QoS is the first to be evicted when the node is under pressure. It is assigned when no container in the pod sets any requests or limits.

Understanding the importance of setting limits and requests for your workloads is essential to better anticipate their behavior during peak loads.
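For example, here is a minimal pod spec (the name and image are only placeholders) that lands in the Guaranteed class, because its single container has requests strictly equal to its limits:

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:        # requests == limits for every container => Guaranteed QoS
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 100m
          memory: 128Mi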

Kubernetes pod preemption

We've just seen what happens with pod eviction, which concerns pods that are already running. To recap, Quality of Service (QoS) only protects pods during eviction, not during scheduling. So if a pod fails and needs to be rescheduled while the cluster is at 100% capacity, what comes into play is pod preemption.

In other words, it's a priority assigned to pods that the scheduler takes into account. Out of the box, two PriorityClasses are reserved for system components:

$ kubectl get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            7m56s
system-node-critical      2000001000   false            7m56s
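
For reference, a pod (or a Deployment's pod template) opts into a priority simply by referencing a class in its spec; a minimal sketch, here using the custom class we will create later in this article:

apiVersion: v1
kind: Pod
metadata:
  name: priority-demo
spec:
  priorityClassName: high-priority-class   # resolved by the admission controller into .spec.priority
  containers:
    - name: app
      image: nginx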

Let's take a closer look at how this works on GKE (Google Kubernetes Engine):

kubectl get pods --all-namespaces -o=custom-columns=NAME:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName --sort-by='.metadata.creationTimestamp' | column -t

NAME                                                 PRIORITY_CLASS
kube-dns-7d5998784c-kw98d                            system-cluster-critical
kube-dns-autoscaler-9f89698b6-shzj8                  system-cluster-critical
event-exporter-gke-857959888b-kwpwm                  <none>
konnectivity-agent-67c68ff74f-kwfhl                  system-cluster-critical
konnectivity-agent-autoscaler-bd45744cc-49bbf        system-cluster-critical
l7-default-backend-6dc845c45d-4jt95                  <none>
gke-metrics-agent-hj942                              system-node-critical
fluentbit-gke-qmmbl                                  system-node-critical
pdcsi-node-x5fbb                                     system-node-critical
pdcsi-node-6n2tl                                     system-node-critical
gke-metrics-agent-gxm2s                              system-node-critical
fluentbit-gke-4bp9n                                  system-node-critical
fluentbit-gke-pd4dh                                  system-node-critical
gke-metrics-agent-wpqft                              system-node-critical
pdcsi-node-k2z6v                                     system-node-critical
kube-proxy-gke-cluster-1-default-pool-aaab76fd-85qj  system-node-critical
konnectivity-agent-67c68ff74f-5bxbk                  system-cluster-critical
kube-dns-7d5998784c-9qfnm                            system-cluster-critical
konnectivity-agent-67c68ff74f-hj2b4                  system-cluster-critical
metrics-server-v0.5.2-6bf845b67f-gslvk               system-cluster-critical
kube-proxy-gke-cluster-1-default-pool-aaab76fd-p2tk  system-node-critical
kube-proxy-gke-cluster-1-default-pool-aaab76fd-7mzj  system-node-critical

We can see that all the pods except two have high PriorityClasses and can therefore still be rescheduled even under pressure.
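
If you want to check the numeric priority actually resolved for a given pod, it is stored in its spec; for example, for one of the pods listed above (this should return 2000000000, the value of system-cluster-critical):

kubectl get pod kube-dns-7d5998784c-kw98d -n kube-system -o jsonpath='{.spec.priority}'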

Use case: Kyverno

Let's imagine that we want to install Kyverno. Here are the commands to install it:

helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno -n kyverno --create-namespace --set replicaCount=3

By default, these pods don't have any PriorityClass, yet I consider this application very important: it should always be up and running despite possible overload. We can see below that Kubernetes assigns neither the Guaranteed QoS class nor a PriorityClass.

vincn_ledan@cloudshell:~ (medium-article-377208)$ kubectl get pods -n kyverno -o custom-columns="NAME:.metadata.name, QOS:.status.qosClass, REQUESTS:.spec.containers[*].resources.requests, LIMITS:.spec.containers[*].resources.limits, PRIORITY_CLASS:.spec.priorityClassName"
NAME                                          QOS         REQUESTS                     LIMITS              PRIORITY_CLASS
kyverno-5f69449d46-9gcfc                     Burstable   map[cpu:100m memory:128Mi]   map[memory:384Mi]   <none>
kyverno-5f69449d46-c96w6                     Burstable   map[cpu:100m memory:128Mi]   map[memory:384Mi]   <none>
kyverno-5f69449d46-f8dgb                     Burstable   map[cpu:100m memory:128Mi]   map[memory:384Mi]   <none>
kyverno-cleanup-controller-8d8cbd588-8tlrh   Burstable   map[cpu:100m memory:64Mi]    map[memory:128Mi]   <none>
kyverno-cleanup-controller-8d8cbd588-gcl6b   Burstable   map[cpu:100m memory:64Mi]    map[memory:128Mi]   <none>
kyverno-cleanup-controller-8d8cbd588-ghtf8   Burstable   map[cpu:100m memory:64Mi]    map[memory:128Mi]   <none>

Now let's make sure these pods get the Guaranteed QoS class and a dedicated PriorityClass. We will assign the value 999999999, which is just below the maximum allowed for user-defined classes (one billion) and well below the values of the system PriorityClasses.

Let's create the priority.yaml file and apply it:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-class
value: 999999999
globalDefault: false
description: "high priority"
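
Then apply it and check that the class exists:

kubectl apply -f priority.yaml
kubectl get priorityclass high-priority-class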

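Since the release from the first installation is still present, remove it before reinstalling (assuming the release name and namespace used above):

helm uninstall kyverno -n kyverno
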
Let's reinstall Kyverno with the correct parameters:

helm install kyverno kyverno/kyverno \
  --namespace kyverno \
  --create-namespace \
  --set replicaCount=3 \
  --set test.resources.limits.cpu=100m \
  --set test.resources.limits.memory=128Mi \
  --set test.resources.requests.cpu=100m \
  --set test.resources.requests.memory=128Mi \
  --set resources.requests.cpu=100m \
  --set resources.requests.memory=128Mi \
  --set resources.limits.cpu=100m \
  --set resources.limits.memory=128Mi \
  --set initResources.limits.memory=64Mi \
  --set initResources.limits.cpu=10m \
  --set initResources.requests.memory=64Mi \
  --set initResources.requests.cpu=10m \
  --set cleanupController.resources.limits.memory=64Mi \
  --set cleanupController.resources.limits.cpu=100m \
  --set cleanupController.resources.requests.memory=64Mi \
  --set cleanupController.resources.requests.cpu=100m \
  --set priorityClassName=high-priority-class

Now we can see that the pods have the correct QoS and the correct PriorityClass.

vincn_ledan@cloudshell:~ (medium-article)$ kubectl get pods -n kyverno -o custom-columns="NAME:.metadata.name, QOS:.status.qosClass, REQUESTS:.spec.containers[*].resources.requests, LIMITS:.spec.containers[*].resources.limits, PRIORITY_CLASS:.spec.priorityClassName"
NAME                                          QOS          REQUESTS                     LIMITS                       PRIORITY_CLASS
kyverno-6fd6db997c-ptc8n                     Guaranteed   map[cpu:100m memory:256Mi]   map[cpu:100m memory:256Mi]   high-priority-class
kyverno-6fd6db997c-q7hwj                     Guaranteed   map[cpu:100m memory:256Mi]   map[cpu:100m memory:256Mi]   high-priority-class
kyverno-6fd6db997c-zwj5m                     Guaranteed   map[cpu:100m memory:256Mi]   map[cpu:100m memory:256Mi]   high-priority-class

First, we are going to stress the cluster.

kubectl apply -f https://raw.githubusercontent.com/giantswarm/kube-stresscheck/master/examples/cluster.yaml
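
If the metrics server is available on your cluster, you can watch the pressure build up on the nodes:

kubectl top nodes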

Then deploy a simple nginx pod. Notice that it cannot be scheduled because no resources are available:

kubectl create deploy nginxtest --image=nginx
kubectl get pods --all-namespaces | grep nginxtest
default       nginxtest-699448454d-8kmpb                            0/1     Pending                  0             11s
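
To confirm the reason, describe the pending pod (the generated name will differ in your cluster); the Events section should show a FailedScheduling message mentioning insufficient CPU or memory:

kubectl describe pod nginxtest-699448454d-8kmpb -n default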

Now let's force-delete one of the Kyverno pods; its replacement should be scheduled immediately thanks to its high priority, while the lower-priority nginx pod remains pending.

kubectl delete pods kyverno-6fd6db997c-q7hwj -n kyverno --force
kubectl get pods -n kyverno
NAME                                         READY   STATUS                   RESTARTS   AGE
kyverno-6fd6db997c-ptc8n                     1/1     Running                  0          16m
kyverno-6fd6db997c-xc226                     1/1     Running                  0          9m32s
kyverno-6fd6db997c-zwj5m                     1/1     Running                  0          16m
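
Meanwhile, the nginx pod should still be Pending: with the default priority of 0 it cannot preempt anything, whereas the replacement Kyverno pod was placed straight away thanks to high-priority-class. Assuming it was created in the default namespace, as above:

kubectl get pods -n default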

Advice

It is important to be able to classify your applications so you can assign each one the right QoS class and PriorityClass, and thereby better control the cluster's behavior under pressure.
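
As a starting point, you could define a small set of priority tiers and map every application onto one of them; a sketch (the names and values here are only an example):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 900000000
globalDefault: false
description: "Security and platform components that must always run"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: standard
value: 1000
globalDefault: false
description: "Regular stateless workloads"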