Resource Management in Kubernetes


If your cluster is small, it might be fine to just trust that every workload behaves nicely and gets the resources it needs.

After experiencing some performance issues, with greedy components eating resources and even affecting the stability of the cluster itself, I decided it was time to dive into how to manage resources a bit better.

Tuning a small node for reliability

There’s a bunch of timeouts here and there in Kubernetes components. They all assume that your central components are healthy and have enough resources, and that if something doesn’t respond within a few seconds, it’s best to just restart it to make it healthy again. On a small home node, this is not always the best strategy.

When k3s restarts, it doesn’t take the containers with it, so a restart or two now and again is fine. However, when you have issues, components can get into restart loops, and even your containers might start crashing. If the system itself is struggling, they might stay down for a while until things are healthy enough that they get restarted.

There are a few things you can tune to make k3s less quick on the restart trigger. Here are my current settings; all of these timeouts have been increased quite a bit.

I have also reserved a few CPUs for the system. They’re not as exclusively reserved as I’d like, but it’s at least a big step in the right direction…

kube-controller-manager-arg:
  - leader-elect=true
  - leader-elect-lease-duration=30s
  - leader-elect-renew-deadline=20s
  - leader-elect-retry-period=10s
kube-scheduler-arg:
  - leader-elect=true
  - leader-elect-lease-duration=30s
  - leader-elect-renew-deadline=20s
  - leader-elect-retry-period=10s
kube-apiserver-arg:
  - request-timeout=120s
  - min-request-timeout=60
  - default-not-ready-toleration-seconds=60
  - default-unreachable-toleration-seconds=60
kubelet-arg:
  - "system-reserved=cpu=1000m,memory=5000Mi"
  - "kube-reserved=cpu=1500m,memory=3000Mi"
  - "cgroup-driver=systemd"
  - "cpu-manager-policy=static"
  - "reserved-cpus=0,1"
etcd-arg:
  - quota-backend-bytes=8589934592 # 8GiB (default is 2GiB), so etcd doesn't hit its space quota
  - election-timeout=5000 # default is 1000ms, more lenient
  - heartbeat-interval=250 # default is 100ms

cpu-manager-policy=static is needed to give important PODs exclusive CPUs, which as a side effect also limits their ability to interfere with system components…

I have also tuned the k3s process a bit, pinning the system components (and burstable PODs, it seems) to CPUs 0-3, by adding this to /etc/systemd/system/k3s.service:

...
[Service]
...
CPUAffinity=0 1 2 3
# Increase K3s priority slightly
Nice=-5

It also gives k3s a bit of priority with the Nice value.
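After editing the unit (a drop-in under /etc/systemd/system/k3s.service.d/ would be the cleaner way to do this), systemd needs to be told about the change:

sudo systemctl daemon-reload
sudo systemctl restart k3s

As noted above, restarting k3s leaves the containers running, so this is safe to do on a live node.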

There are also some other tips and tricks to be found. I especially added the --protect-kernel-defaults flag and changed the matching sysctl settings, though I’m not really sure it was needed, or even wise.
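For reference, this is roughly what that looks like. The sysctl values below are the set the Kubernetes/CIS hardening guides usually pair with the flag, so treat them as a starting point rather than gospel. The kubelet-arg goes in the same k3s config file as above:

kubelet-arg:
  - "protect-kernel-defaults=true"

and, in a file like /etc/sysctl.d/90-kubelet.conf:

vm.overcommit_memory=1
vm.panic_on_oom=0
kernel.panic=10
kernel.panic_on_oops=1

Load them with sudo sysctl --system. With the flag set, the kubelet refuses to start if these don’t match what it expects.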

PriorityClass

If k3s needs to shut down workloads because the node is out of resources, PriorityClass can influence which workloads are killed first, and which workloads get what little is left if it can’t restart them all.

So, I defined this as my normal workload PriorityClass:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: normal-priority
value: 10000
globalDefault: true
description: "Normal priority"

Higher value means higher priority. globalDefault: true means it’s the default class for pods that don’t specify one.
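Assuming you’ve saved it as normal-priority.yaml, it’s applied like any other resource:

kubectl apply -f normal-priority.yaml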

There are also a few built-in priority classes, and some components will create their own as you install them with Helm.

In addition, I created the classes low-priority, normal-plus-priority and infra-critical-priority for the workloads I define and control.
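For example, low-priority looks roughly like this (the description text is made up here, but the value and preemptionPolicy match the kubectl output further down; Never means pods in this class wait for free resources instead of evicting others):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: -1000
preemptionPolicy: Never
description: "Low priority, never preempts other pods"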

I have put Plex in low-priority as the only one. In normal-priority I have my normal things like nextcloud, wordpress and other components. In normal-plus-priority, I have more central components, like Keycloak, Traefik and my monitoring. The jury is still out on whether I want monitoring there, but it works for now.
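Assigning a workload to a class is a single field in the pod spec. The relevant fragment of a hypothetical Deployment:

spec:
  template:
    spec:
      priorityClassName: normal-plus-priority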

With the ones created before, I now have the following priority classes:

NAME                                      VALUE        GLOBAL-DEFAULT   AGE     PREEMPTIONPOLICY
infra-critical-priority                   50000        false            6d22h   PreemptLowerPriority
low-priority                              -1000        false            6d7h    Never
normal-plus-priority                      20000        false            6d22h   PreemptLowerPriority
normal-priority                           10000        true             6d23h   PreemptLowerPriority
openebs-zfs-zfs-csi-controller-critical   900000000    false            97d     PreemptLowerPriority
openebs-zfs-zfs-csi-node-critical         900001000    false            97d     PreemptLowerPriority
system-cluster-critical                   2000000000   false            97d     PreemptLowerPriority
system-node-critical                      2000001000   false            97d     PreemptLowerPriority

But these don’t do anything at runtime by themselves, so we need more to make sure workloads get the resources they need, and don’t eat more than their share.

Resource Limits

On pods, you can specify what resources you want to give them, which in turn also determines what QoS class they get. If a pod’s needs are static and you want to guarantee it gets its resources, you can put it into the Guaranteed QoS class.

The resources setting has two sections, requests and limits; the first is the minimum the pod is guaranteed, the second is the cap. In each of these you can specify cpu and memory. It’s customary to specify cpu in thousandths of a CPU, or m (milli); a value of 1 is one (virtual) CPU.

Example: Traefik

resources:
  requests:
    cpu: 100m
    memory: 100Mi
  limits:
    cpu: 1000m
    memory: 1Gi

This gives Traefik 100 millicpus guaranteed, and it starts out with 100Mi of memory. It will cap CPU at 1000 millicpus and memory at 1Gi. If it tries to allocate more memory than that, it gets OOM-killed! For CPU, it is simply throttled and won’t get more.

You don’t have to set all of these numbers; it’s perfectly fine to omit the memory settings, or even the limit for the cpu. But giving it some limits makes sure it doesn’t go haywire. In fact, right now I am running a Traefik plugin that has a memory leak! Since I run 2 Traefik pods, they’ll just be restarted whenever they have grown to 1Gi in size.
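If you want to confirm that it’s really the memory limit doing the killing, the pod’s last state will say so (pod name and namespace here are made up):

kubectl describe pod traefik-7d9c6b5f4-x2x8p -n kube-system

    Last State:  Terminated
      Reason:    OOMKilled
      Exit Code: 137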

Setting some requests will put your pod into the Burstable class, as opposed to BestEffort, which it gets when it has no resource specification at all.
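You don’t have to guess which class a pod landed in; it’s recorded in the pod status (pod name is hypothetical):

kubectl get pod nextcloud-0 -o jsonpath='{.status.qosClass}'
# Burstable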

Guaranteed QoS class

If you specify both requests and limits and the values are equal, your pod lands in the Guaranteed QoS class, and Kubernetes sets the resources aside for it. If, in addition, the cpu value is a whole number of at least 1 (and the CPU manager policy is static, like above), the pod gets pinned to its own exclusive CPU(s)! And this is what I did with Plex: whenever I was transcoding, Plex would just grab whatever resources it could, affecting my other workloads and even the stability of Kubernetes itself.

So, I have pinned it to a CPU, with the following:

resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

It’s a bit wasteful, since it doesn’t normally need a whole CPU, but right now giving it a dedicated virtual CPU is not a problem. I’ll probably experiment with giving it less, but I saw much better results when I just let it keep one CPU where it can’t affect anything else.
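To verify the pinning actually happened, the kubelet records its CPU assignments in a state file on the node; with the static policy you should see the pinned CPU removed from the defaultCpuSet. The output below is illustrative and abbreviated, and the exact path may depend on your kubelet root dir:

sudo cat /var/lib/kubelet/cpu_manager_state

{"policyName":"static","defaultCpuSet":"0-3,5-7","entries":{...}}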

Keep an eye on your totals

In the k3s configuration, I reserved 2 virtual CPUs, but Burstable and BestEffort PODs may still run on them if there’s capacity. Guaranteed PODs with pinned CPUs will not; they get their own CPUs from the remaining set.

Since I have 8 virtual CPUs in total, that means I have 6 CPUs to give out in requests. I gave Plex a full CPU, so there are only 5 left for the others, which is why I’ll at some point try a different way to tame Plex.
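A quick way to keep track of the totals is to ask the node itself; this shows what has already been promised away in requests, against what’s allocatable (node name is whatever kubectl get nodes gives you):

kubectl describe node mynode | grep -A8 "Allocated resources"

If the cpu requests creep close to the allocatable total, new Guaranteed pods will simply sit in Pending.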

You typically don’t guarantee a POD the maximum it might ever ask for, but if it’s part of the workload you care about, you probably want to guarantee it something.

One idea is to use Prometheus metrics and Grafana dashboards to find the typical CPU usage for a POD, and then just give it that. Some PODs will generally use hardly any CPU at all, but if they are important, you should still give them some; setting requests as low as 10m is totally fine, as long as the POD can still do its job when it needs to.
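If the metrics are in Prometheus (the kubelet’s cAdvisor metrics give you container_cpu_usage_seconds_total), something along these lines gives a 95th-percentile view of a workload’s CPU over the last week; the namespace is hypothetical:

quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{namespace="nextcloud", container!=""}[5m])[7d:5m]
)

Round that up a bit and use it as the request.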

Limits can be overcommitted, though, and hopefully not all PODs will try to use the full CPU and memory they are allowed at the same time. If they do, the PODs should at least get the values you have set in requests. In theory, anyway; it’s never a good idea to run a system at full capacity all the time…

Summary

With all of this in place, your workloads should be able to do their job, and hopefully the stability of Kubernetes itself will benefit too. In my case, it did.

With good CPU management and requests and limits set, your important workloads like wordpress and nextcloud will be able to do their job even if other components are competing for CPU.

This is a much larger topic than what I have played with here: there are kernel boot settings I haven’t experimented with, and the OS scheduler of course affects all of this in a major way. If the OS scheduler is crap, there’s not a whole lot Kubernetes can do to help it! Please let me know in the comments if you have found some ideal settings, especially for Linux! (but of course also for other OSes)

