Secure Chaos Engineering on Kubernetes clusters without being a noisy neighbor

Kubernetes is a powerful open-source platform for building scalable, reliable systems, designed to be extensible and customizable for many use cases. It lets you scale individual pods, swap out container runtimes, and control access to objects using namespaces. These features bring higher utilization, more customizability, and better security, but they also add complexity.

In order to maximize the benefits of these features, we rely on testing to ensure they are utilized correctly and that our application will perform as intended. 55% of Gremlin customers use Kubernetes today and rely on Gremlin’s Kubernetes object autodiscovery to precisely inject failure into individual pods, deployment sets, or nodes, even with ephemeral objects, in order to improve the reliability of their workloads. We've made this capability more precise, accessible, and secure with three major updates to our platform:

Gremlin now isolates its resource attacks to a single container. Customers can now confidently take advantage of the scalability and cost savings of horizontal pod autoscaling (HPA) and the noisy neighbor prevention of resource limits.
We now support the containerd and CRI-O container runtimes. Customers can now run experiments on the latest versions of Amazon EKS and OpenShift, the second and fourth most popular managed Kubernetes platforms.
We’ve added fine-grained namespace access control. This restricts users’ ability to attack objects on a namespace-by-namespace basis, without access to the underlying node, providing more control in shared cluster environments.

Gremlin makes Chaos Engineering easy and seamless. For us, it’s cut down the amount of time involved in designing and executing the chaos experiments, particularly for our Microservices and Kubernetes.

Chaitanya Krant

Engineering Manager at National Australia Bank

Confidently take advantage of higher resource utilization

Kubernetes allows for packing multiple pods onto a single node and scaling out each pod individually without impacting neighboring pods. Horizontal Pod Autoscaling (HPA) helps squeeze more utilization out of your infrastructure by scaling out only the pods whose resource utilization exceeds its target, saving costs versus scaling out entire applications. Resource limits prevent containers from over-utilizing resources and disrupting other services that share a node. However, if applications aren’t tested against HPA and resource limits, it’s difficult to determine whether your application is decoupled enough to scale out pods independently, or whether noisy neighbors can still break services sharing the same node.
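To make the scaling behavior concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: the desired count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The function name and the example numbers are illustrative assumptions. The sketch also leaves out parts of the real controller, such as the min/max replica bounds and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation (hypothetical helper).

    desired = ceil(current_replicas * current_metric / target_metric)
    The controller skips scaling when the ratio is within a tolerance
    of 1.0 (0.1 by default in Kubernetes).
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 3 pods averaging 90% CPU utilization against a 50% target
print(desired_replicas(3, 90, 50))  # prints 6
```

A resource attack that pushes one container’s utilization above the target should raise this number for that workload only, which is the behavior the experiment is meant to confirm.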

Testing these limits requires isolating a resource attack’s impact to a single container, without directly impacting other containers on the host. Using cgroups, Gremlin can now do exactly that, which lets you test your HPA and resource limit policies. Setting up HPA and resource limits can be difficult: improper setups cause unnecessary throttling or scale-downs that happen too quickly. To test these features, run a resource attack as you have before, then select a pod and the containers in that pod you want to target.
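For context on what a container-level limit looks like from inside a cgroup, the sketch below parses the contents of a cgroup v2 `cpu.max` file, which holds a quota and a period in microseconds (or `max` when there is no limit). It assumes cgroup v2. It does not describe Gremlin’s internals, which the text above does not detail, and the function name is hypothetical.

```python
from typing import Optional


def cpu_limit_from_cpu_max(contents: str) -> Optional[float]:
    """Convert cgroup v2 cpu.max contents ("<quota> <period>") to a CPU count.

    Returns None when the quota is "max", meaning no CPU limit is set.
    """
    quota, period = contents.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


# A 500m CPU limit in a pod spec typically appears as "50000 100000".
print(cpu_limit_from_cpu_max("50000 100000"))  # prints 0.5
print(cpu_limit_from_cpu_max("max 100000"))    # prints None
```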

During the experiment, watch what happens to your pods when Gremlin competes for the same resources defined in your pod spec limits.

Check if your HPA is working as expected. Does Kubernetes scale out additional pods of the single, highly utilized service, or does it scale out multiple services unexpectedly?
Do end users experience increased latency or error rates during the scale-out?
Do pods scale down gracefully when resources are no longer stressed?
Watch how other pods handle a noisy neighbor. Do they start to increase error rates or latency? Do pods get moved to other nodes?
Verify that your pod level monitoring systems do what they should. Are alerts triggered at the resource thresholds you expect?
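One way to answer the first question in this checklist is to record replica counts before and after the experiment and flag any workload other than the target whose count changed. The sketch below assumes you have already collected those counts into dictionaries keyed by workload name. The function and the workload names are hypothetical.

```python
from typing import Dict, Tuple


def unexpected_scaling(before: Dict[str, int],
                       after: Dict[str, int],
                       target: str) -> Dict[str, Tuple[int, int]]:
    """Return non-target workloads whose replica count changed.

    Each value is a (before, after) pair. A workload missing from one
    snapshot is treated as having 0 replicas in that snapshot.
    """
    changed = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name, 0), after.get(name, 0)
        if name != target and old != new:
            changed[name] = (old, new)
    return changed


before = {"checkout": 2, "cart": 2}
after = {"checkout": 5, "cart": 4}
print(unexpected_scaling(before, after, target="checkout"))  # prints {'cart': (2, 4)}
```

An empty result suggests only the attacked service scaled. Anything else is worth investigating as a sign of coupling between services.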

Chaos Engineering available on more platforms

Gremlin enables customers using major managed Kubernetes providers to improve the reliability of their systems. We have supported earlier versions of Amazon EKS and OpenShift for years using Docker, but to support the latest versions, we added containerd and CRI-O support. containerd is a container runtime that uses runc to run containers. In fact, Docker is an abstraction layer on top of containerd. CRI-O is a lightweight alternative to containerd, designed for faster, stable workloads. Gremlin now supports all three popular runtimes and all of the most popular managed Kubernetes providers.

By supporting these additional runtimes, customers can now run attacks across their environment, even if it’s mixed, using a single UI and API. This makes testing heterogeneous environments even easier. As with our Docker support, Gremlin provides automatic object detection and blast radius control, so you can precisely target the parts of your containerd- and CRI-O-based deployments that you want to include in chaos experiments.
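If you are not sure how mixed your environment is, each Kubernetes node reports its runtime in the `status.nodeInfo.containerRuntimeVersion` field, in the form `<runtime>://<version>`. The sketch below tallies runtimes from a list of such strings. How you collect the strings is left out, and the helper names and sample values are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable


def runtime_name(container_runtime_version: str) -> str:
    """Extract the runtime name, e.g. 'containerd://1.6.8' -> 'containerd'."""
    name, sep, _version = container_runtime_version.partition("://")
    if not sep:
        raise ValueError(f"unexpected format: {container_runtime_version!r}")
    return name


def count_runtimes(versions: Iterable[str]) -> Counter:
    """Count how many nodes report each container runtime."""
    return Counter(runtime_name(v) for v in versions)


nodes = ["docker://20.10.7", "containerd://1.6.8", "cri-o://1.24.1"]
print(sorted(count_runtimes(nodes)))  # prints ['containerd', 'cri-o', 'docker']
```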

Added security with fine-grained target control

Many of our customers run large-scale clusters in multi-tenant environments. As a part of our commitment to security, Gremlin added namespace access control. Namespaces in Kubernetes isolate objects across teams: an object can exist in only one namespace, and access to each namespace can be controlled. Gremlin now provides controls to limit who can run chaos experiments on clusters or on pods running in their namespaces. With fine-grained access control focused on teams, administrators can maintain the logical separation that namespaces provide, even among the users who are performing attacks.
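Conceptually, this kind of control is a deny-by-default check between a team and the namespace of each target. The sketch below illustrates the idea only. It is not Gremlin’s implementation or API, and the data model (a mapping from team name to allowed namespaces) and all names in it are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def can_attack(grants: Dict[str, Set[str]], team: str, namespace: str) -> bool:
    """Deny by default: a team may only target namespaces granted to it."""
    return namespace in grants.get(team, set())


def allowed_targets(grants: Dict[str, Set[str]],
                    team: str,
                    pods: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Filter (namespace, pod_name) pairs down to those the team may attack."""
    return [(ns, pod) for ns, pod in pods if can_attack(grants, team, ns)]


grants = {"payments": {"payments-prod"}}
pods = [("payments-prod", "api-0"), ("search-prod", "indexer-0")]
print(allowed_targets(grants, "payments", pods))  # prints [('payments-prod', 'api-0')]
```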

Making Kubernetes more reliable. Securely.

Kubernetes provides many tools to build resilient systems. It’s up to us to tune them to extract the most value out of our infrastructure while remaining reliable. We can’t just assume that our applications are designed to handle the unique problems that services managed by Kubernetes face while scaling to meet demand or handling shared resources and clusters. We need to test with Chaos Engineering to find the configurations that work best for our workloads.

These three updates are a part of our efforts to ensure that no matter what Kubernetes platform you use, Gremlin will help you simply, safely, and securely prepare it for failure. Request a demo for Gremlin and try Chaos Engineering on your Kubernetes workloads today!
