When a node is low on resources—such as CPU, memory, or storage—workloads may suffer failures, degraded performance, and eviction.
If you want your cluster to run smoothly, you need to identify the root causes of node resource exhaustion and take proactive steps to mitigate them before things get out of hand.
A Kubernetes node is a worker machine that runs containerized applications in a Kubernetes cluster.
A node can be a physical or virtual machine, depending on where the cluster is deployed. A cluster contains one or more nodes, and every cluster has a control plane that schedules workloads to balance performance across them. This scheduling directly affects application deployment and the reliability of the Kubernetes infrastructure.
Cluster health suffers when a node is not functioning properly, and the most common cause of node failure is resource contention or exhaustion.
Now that you understand the importance of Kubernetes nodes, let's discuss the common triggers of node resource exhaustion.
One of the main causes of node resource exhaustion is over-provisioned or misconfigured workloads. When applications try to consume too much CPU or memory, contention for system resources follows, which in turn leads to performance problems. Other applications may have memory leaks or fail to release resources they no longer need.
High resource consumption by system daemons also adds to the problem. The kubelet, the container runtime, and monitoring agents are examples of critical components that consume node resources. In addition, logging and security agents can generate excessive data, which, if not properly controlled, can exhaust storage.
Compounding this, poor workload scheduling leaves some nodes heavily loaded while others sit almost idle, and a poorly scheduled cluster inevitably performs badly. Moreover, disk-pressure conditions—such as excessive log files or leftover container images filling persistent volumes (PVs) and node disks—can exhaust disk space to the point where the cluster becomes unstable.
The following are industry-proven strategies for preventing node resource exhaustion that will save you both time and money:
Appropriately defined CPU and memory requests allow Kubernetes to allocate pods optimally and avoid excessive resource utilization by individual pods, which can be detrimental to other workloads' performance. Setting resource limits also helps enforce fair allocation by preventing monopolization of node resources.
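A minimal sketch of what this looks like in a pod spec (the pod name, container name, and values shown are illustrative, not a recommendation for your workload):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-app
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      # requests: what the scheduler reserves for this container
      requests:
        cpu: "250m"
        memory: "128Mi"
      # limits: the hard ceiling the container may not exceed
      limits:
        cpu: "500m"
        memory: "256Mi"
```

Requests drive scheduling decisions, while limits are enforced at runtime: exceeding the memory limit gets the container OOM-killed, while exceeding the CPU limit results in throttling.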
Kubernetes can adapt resource allocation to demand through autoscaling. As workloads grow or shrink, the cluster autoscaler adds or removes nodes to keep resources available.
The Horizontal Pod Autoscaler (HPA) adjusts the number of pod replicas based on observed metrics such as CPU utilization, while the Vertical Pod Autoscaler (VPA) adjusts resource consumption by modifying the CPU and memory requests of individual pods.
Scale your deployment with the following command:
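For example (the deployment name `web-app` is illustrative):

```shell
# Manually scale a deployment to five replicas:
kubectl scale deployment web-app --replicas=5

# Or let an HPA manage replicas between 2 and 10,
# targeting 70% average CPU utilization:
kubectl autoscale deployment web-app --min=2 --max=10 --cpu-percent=70
```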
Tracking the resource usage of system daemons is essential to maintaining node efficiency. Optimize background processes like monitoring agents, logging tools, and security components to consume minimal resources. Tools like Site24x7 Kubernetes monitoring help identify excessive resource consumption by system daemons, enabling fine-tuned optimizations, and also suggest best practices that help avoid over- or under-utilization.
By ensuring a balanced workload distribution throughout the cluster, node affinity lowers the risk of overloading particular nodes. Use taints and tolerations to keep critical workloads from being scheduled on overloaded nodes.
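A minimal node affinity sketch, assuming nodes labeled `disktype=ssd` (the label, pod name, and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: only schedule onto nodes with the matching label
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: app
    image: nginx:1.25
```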
The following command adds a taint to a node that prevents any pod without a matching toleration from being scheduled on it:
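For example (the node name `node1` is illustrative):

```shell
# Taint node1 so that only pods tolerating type=production are scheduled there
kubectl taint nodes node1 type=production:NoSchedule
```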
The following is the toleration for the above taint:
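A matching pod spec might look like this (the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: production-app
spec:
  tolerations:
  # Matches the type=production:NoSchedule taint applied above
  - key: "type"
    operator: "Equal"
    value: "production"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx:1.25
```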
This configuration means the pod can be scheduled on nodes carrying the type=production taint.
Efficient storage management is crucial for fine-grained control of node resources. Regular log rotation prevents excessive disk usage, and setting size limits on emptyDir volumes ensures that temporary storage doesn't overwhelm nodes. Pruning unused container images and temporary files also improves storage efficiency.
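Here is a minimal sketch of capping an emptyDir volume (the pod name, mount path, and 1Gi limit are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo
spec:
  containers:
  - name: app
    image: nginx:1.25
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      # Evicts the pod if usage of this volume exceeds 1Gi,
      # protecting the node's disk from runaway temporary data
      sizeLimit: 1Gi
```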
Persistent Volumes (PVs) help manage storage resources separately from pods.
And Storage Classes allow dynamic provisioning of storage based on defined policies.
Consider this example of a Storage Class:
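The sketch below assumes an AWS cluster with the EBS CSI driver installed; the class name, provisioner, and parameters will differ on other platforms:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
# Provisioner is provider-specific; this one is the AWS EBS CSI driver
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Delete
allowVolumeExpansion: true
# Delay volume binding until a pod using the claim is scheduled
volumeBindingMode: WaitForFirstConsumer
```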
Use Persistent Volume Claims (PVCs) to request storage from a Storage Class:
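A minimal PVC sketch, assuming a StorageClass named `fast-ssd` exists in the cluster (the claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi
```

Referencing `data-claim` in a pod's `volumes` section then triggers dynamic provisioning according to the StorageClass policy.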
By ensuring that workloads are dispersed uniformly, Topology Spread Constraints help to avoid overburdening particular nodes.
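A minimal sketch of a spread constraint (the pod name and `app: web` label are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-demo
  labels:
    app: web
spec:
  topologySpreadConstraints:
  # Keep the count of matching pods per node within 1 of each other
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: app
    image: nginx:1.25
```

Setting `whenUnsatisfiable: DoNotSchedule` instead makes the constraint hard rather than best-effort.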
Benchmarking resource usage across nodes also provides useful data for making intelligent scheduling decisions. Following these guidelines improves both the reliability and the performance of the cluster.
Make use of active, real-time monitoring tools, such as Site24x7 Kubernetes monitoring, to get insight into how much CPU, memory, and storage are being used. Setting alerts on resource usage thresholds ensures that issues can be tackled immediately. By staying proactive, teams can prevent resource exhaustion and maintain a high-performance Kubernetes environment.
By now, you know that Kubernetes node resource exhaustion can lead to application downtime and degraded cluster performance. To tackle it, implement resource requests and limits, enable autoscaling, manage storage, and optimize workload scheduling. This ensures the high availability and efficiency of your Kubernetes environment.
Leveraging monitoring tools like Site24x7 Kubernetes monitoring will allow you to detect and resolve resource issues before they escalate, keeping your Kubernetes clusters healthy and resilient.