Release Roundup Sept 2023: Measurably improve reliability

It’s been another busy few months here at Gremlin. Our team has focused on features that help you measurably improve the reliability of your systems: broadening platform support so you can run Gremlin in more places, making it easier to identify reliability risks, and improving reporting so you can manage reliability programs effectively at enterprise scale. Here’s a summary of what’s new.

New features and UI updates

Find hidden reliability risks without fault injection

The headline feature this month is the introduction of Detected Risks. This new capability automatically detects high-priority reliability concerns in a Kubernetes environment—without running any reliability tests or chaos experiments. You can look forward to dozens more risks being added by the end of this year.

A demo showing Gremlin's Detected Risks feature

Run experiments on serverless workloads

We also launched the beta release of Failure Flags, Gremlin's new framework for running Chaos Engineering experiments on fully managed platforms and workloads, such as AWS Lambda functions, other serverless workloads, and containers. Teams can now run chaos experiments where access to the underlying infrastructure is limited, or simulate failures at the application layer that aren’t possible at the infrastructure layer. It also means Gremlin can now run across your entire stack, even the parts that are managed for you.
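To make the idea of application-layer fault injection more concrete, here is a minimal sketch in Python. It is not the Failure Flags SDK: the `FlagRegistry` class, its `invoke` method, and the flag name `http-ingress` are hypothetical names invented for this illustration. It only shows the general pattern, where a named hook in application code does nothing unless an experiment targeting it is active.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Experiment:
    """A hypothetical experiment: add latency and/or raise an error."""
    latency_seconds: float = 0.0
    error: Optional[Exception] = None


@dataclass
class FlagRegistry:
    """Hypothetical in-process registry mapping flag names to experiments."""
    active: Dict[str, Experiment] = field(default_factory=dict)

    def invoke(self, name: str) -> bool:
        """Apply the experiment for `name` if one is active.

        Returns True if latency was injected without an error,
        False if no experiment targets this flag.
        """
        experiment = self.active.get(name)
        if experiment is None:
            return False  # no experiment: the hook is a no-op
        if experiment.latency_seconds > 0:
            time.sleep(experiment.latency_seconds)
        if experiment.error is not None:
            raise experiment.error
        return True


registry = FlagRegistry()


def handle_request(payload: dict) -> dict:
    # The flag sits at the point in the code where a failure should be simulated.
    registry.invoke("http-ingress")
    return {"status": 200, "echo": payload}
```

Because the hook lives in the application itself, this style of experiment needs no access to the host or container runtime, which is the situation described above for managed platforms.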

Get a clearer view of reliability with better reporting

Also this month, we improved Company Summary reports (previously called the Dashboard). You can now see summaries of both your Detected Risks and your Reliability Scores, so you can get a sense of your reliability posture in one place. As part of this change, plan usage details have moved to Company Settings.

Additional improvements

In other news, we’ve made a number of general improvements:

Gremlin now supports delegation of Namespaces to a Team for both manual and automatic service creation. Teams can more confidently run experiments without accidentally impacting other teams' resources.
We’ve added service annotations, which let you automatically register your Kubernetes services in Gremlin by adding a simple annotation. This speeds up service creation significantly: any service with an annotation simply appears in the Gremlin Service Catalog, ready for you to manage and test.
We’ve added web app support for managing multiple services simultaneously. This lets you add Health Checks to multiple services with a single click and start testing within seconds. The Service Catalog has been reworked to reflect this change.
Scenarios can now be deleted as well as archived, so your list shows only the Scenarios most relevant to you.
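As a rough illustration of how annotation-driven registration can work, the sketch below scans workload manifests (represented as plain Python dictionaries) and collects those that carry a given annotation. The annotation key `example.com/service-id` and the function name `annotated_services` are placeholders made up for this example; refer to the Gremlin documentation for the actual annotation.

```python
# Hypothetical annotation key used only for this sketch; check the Gremlin
# documentation for the real key.
ANNOTATION_KEY = "example.com/service-id"


def annotated_services(workloads):
    """Return {service_id: workload_name} for workloads carrying the annotation."""
    found = {}
    for workload in workloads:
        metadata = workload.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        service_id = annotations.get(ANNOTATION_KEY)
        if service_id:
            found[service_id] = metadata.get("name", "<unnamed>")
    return found


# Two made-up workloads: only the first one is annotated.
workloads = [
    {
        "kind": "Deployment",
        "metadata": {
            "name": "checkout",
            "annotations": {ANNOTATION_KEY: "checkout-svc"},
        },
    },
    {"kind": "Deployment", "metadata": {"name": "batch-job"}},
]

print(annotated_services(workloads))
```

Only the annotated workload is picked up; anything without the annotation is left alone, which is what makes the approach opt-in.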

Agent updates

Better performance for Linux agents

We’ve made two significant improvements to the Linux agent, both of which reduce network overhead and improve overall performance.

First, Gremlin now uploads discovered process data at a slower rate, reducing network overhead.

Second, gremlind now batches process data over 15-minute intervals, deduplicating all network and process data detected during each interval. Previously, gremlind emitted snapshots of process and socket data to Gremlin's control plane at two-minute intervals.
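The effect of batching with deduplication can be sketched in a few lines of Python. This is not gremlind's implementation: the `Observation` and `Batcher` names and the shape of the data are assumptions made for illustration. The sketch collects observations in a set, so repeats within an interval collapse to one entry, and emits a single batch when the interval has elapsed.

```python
from dataclasses import dataclass, field
from typing import Optional

BATCH_INTERVAL_SECONDS = 15 * 60  # the 15-minute interval described above


@dataclass(frozen=True)
class Observation:
    """One observed process/socket pair (hypothetical shape)."""
    process: str
    remote_address: str


@dataclass
class Batcher:
    """Collects observations and emits each distinct one once per interval."""
    interval: float = BATCH_INTERVAL_SECONDS
    window_start: Optional[float] = None
    seen: set = field(default_factory=set)

    def record(self, observation, now):
        """Add an observation.

        Returns the previous interval's deduplicated batch if that interval
        has ended, otherwise None.
        """
        if self.window_start is None:
            self.window_start = now
        batch = None
        if now - self.window_start >= self.interval:
            batch = self.flush(now)
        self.seen.add(observation)
        return batch

    def flush(self, now):
        """Return the distinct observations so far and start a new interval."""
        batch = sorted(self.seen, key=lambda o: (o.process, o.remote_address))
        self.seen = set()
        self.window_start = now
        return batch
```

A long-lived connection that would have appeared in every two-minute snapshot shows up only once per 15-minute batch here, which is where the reduction in network traffic comes from.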

Enabling Detected Risks

As noted above, Gremlin can now detect specific reliability risks without fault injection. To support this functionality, the Chao Kubernetes agent now sends the imageID of each container, which enables Gremlin to identify services running multiple container versions simultaneously (a common reliability risk). You can learn more about Detected Risks here.
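The check that imageID makes possible can be sketched as follows. This is an illustrative Python example, not Gremlin's detection logic: the function name, the record layout, and the sample image IDs are all made up. It groups containers by service and reports any service whose containers run more than one distinct image.

```python
from collections import defaultdict


def services_with_mixed_images(containers):
    """Map each service to its image IDs, keeping only services with more than one."""
    images_by_service = defaultdict(set)
    for container in containers:
        images_by_service[container["service"]].add(container["imageID"])
    return {
        service: sorted(image_ids)
        for service, image_ids in images_by_service.items()
        if len(image_ids) > 1
    }


# Made-up sample data: "cart" runs two different images, "search" runs one.
containers = [
    {"service": "cart", "imageID": "registry.example.com/cart@sha256:aaaa"},
    {"service": "cart", "imageID": "registry.example.com/cart@sha256:bbbb"},
    {"service": "search", "imageID": "registry.example.com/search@sha256:cccc"},
    {"service": "search", "imageID": "registry.example.com/search@sha256:cccc"},
]

print(services_with_mixed_images(containers))
```

Comparing image IDs rather than tags matters because a mutable tag such as `latest` can point to different images on different nodes.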

Security improvements

We continue to build out enterprise-grade security capabilities trusted by some of the world’s largest and most regulated companies, and this month we’ve made two updates.

First, when installed directly on the host and launched with systemd, the Gremlin agent now runs with ambient capabilities (see capabilities(7)) rather than file capabilities. Ambient capabilities are preserved across execve(2) for programs that are not privileged, so the agent keeps the permissions it needs at runtime without those permissions being attached to the binary files themselves, making it more flexible and secure in a Linux environment.
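If you want to see which ambient capabilities a running process holds, Linux exposes them as the hexadecimal `CapAmb` mask in `/proc/<pid>/status`. The Python sketch below decodes that mask. The helper names are our own, the capability table is deliberately partial, and the sample status text in the usage note is invented; it does not show the capabilities the Gremlin agent actually requests.

```python
# Bit positions from <linux/capability.h>; only a few are listed here.
CAPABILITY_BITS = {
    5: "CAP_KILL",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    21: "CAP_SYS_ADMIN",
    25: "CAP_SYS_TIME",
}


def ambient_capabilities(status_text):
    """Return the names (or bit numbers) of capabilities set in the CapAmb mask."""
    for line in status_text.splitlines():
        if line.startswith("CapAmb:"):
            mask = int(line.split(":", 1)[1].strip(), 16)
            return [
                CAPABILITY_BITS.get(bit, f"bit {bit}")
                for bit in range(64)
                if mask & (1 << bit)
            ]
    return []  # no CapAmb line (kernels older than 4.3)


def read_status(pid="self"):
    """Read /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return handle.read()
```

For example, `ambient_capabilities("CapAmb:\t0000000000003000\n")` decodes bits 12 and 13, and on a Linux host `ambient_capabilities(read_status(pid))` inspects a live process.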

Second, when installed directly on the host, the setuid bit is no longer set on the installed binaries /usr/bin/gremlin and /usr/sbin/gremlind. The setuid bit lets any user run a binary as if it were run by the binary’s owner, so removing it narrows what an unprivileged user can do. Additionally, these binaries are now owned by root rather than the gremlin Linux user, which further improves security.
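One way to verify this on a host is to inspect the file mode and owner of the two binaries. The Python sketch below uses only the standard library; the function name `describe_binary` is our own, and the two paths are the ones named above.

```python
import os
import stat


def describe_binary(path):
    """Report whether `path` has the setuid bit set and which UID owns it."""
    info = os.stat(path)
    return {
        "path": path,
        "setuid": bool(info.st_mode & stat.S_ISUID),
        "owner_uid": info.st_uid,
        "owned_by_root": info.st_uid == 0,
    }


if __name__ == "__main__":
    for binary in ("/usr/bin/gremlin", "/usr/sbin/gremlind"):
        if os.path.exists(binary):
            print(describe_binary(binary))
```

On an up-to-date host install, you would expect `setuid` to be false and `owned_by_root` to be true for both files.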

Certificate Expiry test improvements

When you run a Certificate Expiry experiment against a CIDR value (e.g., 10.0.0.0/24), Gremlin now makes several attempts to find an active IP address in use by the target system and evaluate its certificate expiration, all within the duration specified by the --length argument.
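The general shape of this search, retrying over a CIDR range under a time limit, can be sketched in Python with the standard `ipaddress` module. This is an assumption-laden illustration rather than Gremlin's algorithm: the function name, the `attempts` default of 3, and the injected `is_active` probe are all hypothetical, since the way the agent decides an address is active is not described here.

```python
import ipaddress
import time


def find_active_host(cidr, is_active, length_seconds, attempts=3, clock=time.monotonic):
    """Return the first address in `cidr` that `is_active` accepts, or None.

    Makes up to `attempts` passes over the range and gives up once
    `length_seconds` have elapsed.
    """
    deadline = clock() + length_seconds
    network = ipaddress.ip_network(cidr, strict=False)
    for _ in range(attempts):
        for host in network.hosts():
            if clock() >= deadline:
                return None  # out of time, as bounded by the length argument
            if is_active(str(host)):
                return str(host)
    return None


# Usage with a stand-in probe: pretend only 10.0.0.7 responds.
responding = {"10.0.0.7"}
print(find_active_host("10.0.0.0/24", lambda ip: ip in responding, length_seconds=5))
```

Passing the probe in as a function keeps the sketch testable without network access; a real probe might attempt a TLS connection to each address.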

Improved labeling

With Helm, you can now add labels to the deployed Gremlin Pods using the chao.podLabels and gremlin.podLabels parameters. Labels make it easier to filter, sort, or select pods for tests and experimentation in Gremlin. See the Chart documentation for details.
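To illustrate why labels help with selection, the following Python sketch filters a list of pods using equality-based matching, in the spirit of Kubernetes label selectors. The pod names and label values are invented for the example, and `select_pods` is our own helper, not part of Gremlin or Kubernetes.

```python
def select_pods(pods, selector):
    """Return pods whose labels contain every key/value pair in `selector`."""
    return [
        pod
        for pod in pods
        if all(
            pod.get("labels", {}).get(key) == value
            for key, value in selector.items()
        )
    ]


# Made-up pods, as if labels had been applied through the Helm parameters above.
pods = [
    {"name": "gremlin-abc12", "labels": {"team": "platform", "env": "prod"}},
    {"name": "gremlin-def34", "labels": {"team": "platform", "env": "staging"}},
    {"name": "chao-xyz89", "labels": {"team": "platform", "env": "prod"}},
]

prod_pods = select_pods(pods, {"env": "prod"})
print([pod["name"] for pod in prod_pods])
```

A selector with several keys narrows the match further, since every pair must be present on the pod.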

Try it for yourself

If you already have a Gremlin account, everything noted here is already available to you, as long as you have the latest agent installed.

If not, sign up for a free trial to start understanding and improving your reliability posture in minutes.

December 19, 2023 - 5 min read
