
cgroup-based security in kubernetes

cgroups are the backbone of container resource isolation in kubernetes. understanding how they work, and where they fail, is essential for building secure clusters.

every container in your kubernetes cluster is just a linux process. same kernel, same hardware, same global pid table as everything else on the node. the thing that stops one container from eating all available memory and taking down the entire node? cgroups.

most people know cgroups exist. fewer people configure them properly. and almost nobody monitors them. that’s a problem, because misconfigured cgroups are behind most of the “my node just died and i don’t know why” incidents we get called in to debug.

the short version

cgroups - control groups - let you put processes into groups and cap what resources each group can use: cpu, memory, disk i/o, pid count. when kubernetes runs a container, it creates a cgroup for it and translates your pod spec’s resources.requests and resources.limits into real kernel constraints.

set limits.memory: "512Mi" and the kernel will oom-kill the process if it crosses that line. set limits.cpu: "2" and cfs bandwidth control throttles it once it has used 200ms of cpu time in each 100ms period - two cores’ worth. these aren’t suggestions - they’re hard enforcement at the kernel level.
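concretely, here’s roughly what the kubelet writes on a cgroup v2 node for a pod with those numbers - a sketch, assuming the default 100ms cfs period, with made-up pod and container names:

apiVersion: v1
kind: Pod
metadata:
  name: demo                  # hypothetical name, for illustration only
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "500m"         # informs scheduling and the cgroup's cpu.weight
          memory: "256Mi"
        limits:
          cpu: "2"            # → cpu.max: "200000 100000" (200ms quota per 100ms period)
          memory: "512Mi"     # → memory.max: 536870912 bytes (oom kill beyond this)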

the catch is what happens when you don’t set them.

what goes wrong without limits

a container without memory limits has no memory.max in its cgroup. it can allocate freely until the node runs out of memory entirely. at that point the kernel’s oom killer picks something to terminate - and it might pick kubelet. once kubelet is gone, the node is unmanageable from the control plane.

we’ve seen this happen in production more than once. a memory leak in a sidecar container, running besteffort because someone forgot the resource block. it ate 90% of node memory over six hours. when the oom killer finally fired, it took out the container runtime instead of the leaking process. every pod on the node died simultaneously.

the same pattern plays out with pids. kubelet’s --pod-max-pids defaults to unlimited. a fork bomb - intentional or accidental - fills the pid table and suddenly nothing on the node can spawn new processes. health checks fail, graceful shutdowns fail, and you’re ssh-ing into the node to manually kill things.
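you can see both gaps directly on the node. for a pod with no limits, the relevant cgroup knobs just sit at the kernel defaults (path shortened, pod uid elided):

/sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<uid>.slice/
├── memory.max → max    (no memory ceiling at all)
└── pids.max   → max    (no pid ceiling unless podPidsLimit is set)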

qos classes and the eviction order

kubernetes groups pods into three quality-of-service classes based on how you configure resources, and this directly determines which cgroup subtree they land in:

/sys/fs/cgroup/kubepods.slice/
├── kubepods-pod<uid>.slice/        (guaranteed pods sit directly under the root slice)
├── kubepods-burstable.slice/
└── kubepods-besteffort.slice/

guaranteed means requests equal limits for both cpu and memory. these pods get the strongest protection - they’re last to be evicted, and with the cgroup v2 memory qos feature enabled, their memory requests are shielded from reclaim via memory.min.

besteffort means no resource spec at all. these pods have zero protection. under memory pressure, they die first. and since they have no memory.max, they can be the cause of that pressure while having no defense against the consequences.

the implication: if you run any besteffort pods alongside anything important, you’ve created a situation where the unprotected workload can kill the protected one’s node.
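for reference, the difference between the two extremes is entirely in the resources block - two hypothetical container fragments:

# guaranteed: every container sets limits, and requests equal limits
resources:
  requests: { cpu: "1", memory: "512Mi" }
  limits:   { cpu: "1", memory: "512Mi" }

# besteffort: no requests, no limits, on any container in the pod
resources: {}

everything in between - requests below limits, or limits on only some containers - lands in burstable.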

hardening

the kubelet configuration that matters most:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd              # must match the container runtime's cgroup driver
enforceNodeAllocatable: [pods, system-reserved, kube-reserved]
# enforcing system-reserved and kube-reserved also requires systemReservedCgroup
# and kubeReservedCgroup to be set to existing cgroups
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
podPidsLimit: 4096                 # per-pod pid ceiling - closes the fork-bomb gap
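with those reservations in place, what’s left for pods follows the node allocatable formula. a rough sketch for a hypothetical node with 8Gi of memory and the default 100Mi hard eviction threshold:

allocatable memory = capacity - systemReserved - kubeReserved - evictionHard
                   = 8Gi - 1Gi - 1Gi - 100Mi
                   ≈ 5.9Gi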

enforceNodeAllocatable is the one people miss. the pods entry caps the aggregate pod cgroup at node allocatable, which is the boundary that actually keeps pod resource pressure away from kubelet and the container runtime; the system-reserved and kube-reserved entries additionally cap the system and kubelet cgroups at their reservations. systemReserved and kubeReserved carve out the space pods can never touch. at the namespace level, always set a LimitRange:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - default:              # applied as limits when a container specifies none
        cpu: "1"
        memory: "512Mi"
      defaultRequest:       # applied as requests when a container specifies none
        cpu: "100m"
        memory: "128Mi"
      type: Container

this catches every container that ships without a resource block. it won’t fix bad limits, but it eliminates the “no limits at all” failure mode.
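as a concrete check, a container submitted with no resources block in that namespace comes out of admission with the defaults filled in - which also means it lands in burstable rather than besteffort:

resources:
  requests:
    cpu: "100m"        # from defaultRequest
    memory: "128Mi"
  limits:
    cpu: "1"           # from default
    memory: "512Mi"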

for monitoring, cgroup v2’s pressure stall information (psi) is the signal to watch. cadvisor exposes it as prometheus metrics. a rising some value on cpu means at least one task in the cgroup is stalled waiting for cpu - throttling or contention. a non-zero full value on memory means every task in the cgroup is stalled at once, usually because the kernel is reclaiming its pages. these show up before the crash, which is the whole point.
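if you want to eyeball the raw signal on a node before wiring up dashboards, each cgroup exposes psi as plain files next to the other knobs - memory.pressure, for example, looks like this (numbers are illustrative; avg10/avg60/avg300 are the percentage of time stalled over those windows, total is cumulative stall time in microseconds):

/sys/fs/cgroup/kubepods.slice/.../memory.pressure
some avg10=1.25 avg60=0.80 avg300=0.30 total=48291023
full avg10=0.16 avg60=0.05 avg300=0.01 total=2039481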

what cgroups don’t cover

cgroups handle resource exhaustion. that’s it. they don’t stop a process from exploiting a kernel vulnerability, scanning the network, or reading data it shouldn’t have access to. for those you need namespaces, seccomp profiles, network policies, and pod security standards.

but resource exhaustion is still the most common way a container takes down a node. get cgroups right and you eliminate the cheapest, most frequent class of incidents.

