We're excited to introduce a new enhancement to help teams build more reliable software: Detected Risks.
Available today, Detected Risks helps you find and fix the most common causes of infrastructure outages and incidents in minutes—without running Chaos Engineering experiments or reliability tests.
Reliability Risks Are Everywhere
Our digital infrastructure is as important as our physical infrastructure. Government, healthcare, transportation, communication and finance all rely on this digital foundation. However, these systems are fraught with vulnerabilities that threaten uptime and availability–and ultimately disrupt users' lives.
Working closely with our customers, we’ve found critical reliability risks present in nearly every organization, such as lack of zone redundancy and misconfigured autoscaling. For example, our data shows 26% of deployments have zero zone redundancy and 80% fail to utilize more than two zones (a best practice recommended in the AWS Well-Architected Framework). These prevalent misconfigurations, despite often being trivial to fix once they are known, can have serious impacts on system reliability. In this case, if one zone is unavailable, a common occurrence, a full quarter of deployments would be offline.
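To make the zone redundancy risk concrete, here is a minimal sketch of the underlying check. It is not how Gremlin implements detection. The `pods` inventory and the function names are hypothetical; the only real identifier is `topology.kubernetes.io/zone`, the well-known Kubernetes node label for a node's zone.

```python
from collections import defaultdict

# Well-known Kubernetes node label that records which zone a node is in.
ZONE_LABEL = "topology.kubernetes.io/zone"


def zones_per_deployment(pods):
    """Map each deployment name to the set of zones its pods run in.

    `pods` is a hypothetical, simplified inventory: a list of dicts holding
    a "deployment" name and the labels of the node the pod is scheduled on.
    """
    zones = defaultdict(set)
    for pod in pods:
        deployment_zones = zones[pod["deployment"]]
        zone = pod["node_labels"].get(ZONE_LABEL)
        if zone:
            deployment_zones.add(zone)
    return dict(zones)


def offline_if_zone_fails(pods, failed_zone):
    """Return the deployments whose pods all live in `failed_zone`."""
    return sorted(
        name
        for name, deployment_zones in zones_per_deployment(pods).items()
        if deployment_zones == {failed_zone}
    )


# Illustrative data only: "checkout" has no zone redundancy, "search" does.
pods = [
    {"deployment": "checkout", "node_labels": {ZONE_LABEL: "us-east-1a"}},
    {"deployment": "checkout", "node_labels": {ZONE_LABEL: "us-east-1a"}},
    {"deployment": "search", "node_labels": {ZONE_LABEL: "us-east-1a"}},
    {"deployment": "search", "node_labels": {ZONE_LABEL: "us-east-1b"}},
]

print(offline_if_zone_fails(pods, "us-east-1a"))  # ['checkout']
```

In this example, both deployments have two replicas, but only the one spread across two zones survives the loss of `us-east-1a`.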
Testing for These Risks Can Be Challenging
We’ve long advocated for testing that identifies these risks before users experience them as incidents. Typically, this testing has required teams to perform Chaos Engineering experiments and fault injection. This testing, and the improvements that result from it, is one of the highest-impact activities an organization can undertake to improve system availability. But we’ve seen that teams looking to do this work often face significant headwinds getting started in risk-averse organizations.
So we asked, “What if you could find the risks without the headwinds?”
That’s Detected Risks.
Detected Risks Removes the Barriers to Find and Fix Risks
Our industry has many bright SREs working hard to personally mitigate these issues, but that approach doesn't scale. We are solving this problem by building something easy to use that provides valuable insight across thousands of real-world applications. Giving engineering leadership visibility into existing risks helps them prioritize and accomplish this important work so that they can continue to protect the customer experience and build high-quality software.
In other words, Detected Risks gives teams a way to start making real reliability improvements faster than ever. Here’s what we’re most excited about:
Immediate Risk Insights
Starting on day one, Detected Risks identifies potential failure points and guides teams toward remedies in minutes with minimal configuration. There are just three steps:
- Install the Gremlin agent
- Add your services to Gremlin, either automatically with annotations or manually
- Gremlin will detect your risks and show them in the UI
You’ll be able to see which risks are already present in your system, and track the ones you remediate to show progress to your organization.
An Extensive (and Expanding) Risk Library
With a core set of six Detected Risks for Kubernetes available at launch, and 20 more before year-end, Detected Risks will continually evolve to find and fix the reliability risks hidden in configurations.
Here’s what’s available today:
- CPU Requests
- Liveness Probes
- Availability Zone Redundancy
- Memory Requests
- Memory Limits
- Application Version Uniformity
Read the docs for more details on each risk.
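As a rough illustration of what several of these risks look like in a Kubernetes manifest, the sketch below checks a Deployment (loaded as a Python dict) for missing CPU requests, memory requests, memory limits, and liveness probes, and checks running replicas for mixed image versions. It is an assumption-laden example, not Gremlin's detection logic: `detect_risks` and `has_mixed_versions` are hypothetical names, while `resources.requests`, `resources.limits`, and `livenessProbe` are standard Kubernetes container fields. Availability zone redundancy is left out because it depends on where pods are scheduled rather than on the manifest alone.

```python
def detect_risks(deployment):
    """Return a list of risk descriptions found in a Deployment manifest dict."""
    risks = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        name = container["name"]
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}

        if "cpu" not in requests:
            risks.append(f"{name}: no CPU request")
        if "memory" not in requests:
            risks.append(f"{name}: no memory request")
        if "memory" not in limits:
            risks.append(f"{name}: no memory limit")
        if "livenessProbe" not in container:
            risks.append(f"{name}: no liveness probe")
    return risks


def has_mixed_versions(running_images):
    """Return True if the replicas of a service run more than one image."""
    return len(set(running_images)) > 1


# Illustrative manifest: a CPU request and memory limit are set, but the
# memory request and liveness probe are missing.
manifest = {
    "spec": {"template": {"spec": {"containers": [
        {
            "name": "web",
            "image": "example/web:1.4.2",
            "resources": {
                "requests": {"cpu": "250m"},
                "limits": {"memory": "256Mi"},
            },
        }
    ]}}}
}

print(detect_risks(manifest))
# ['web: no memory request', 'web: no liveness probe']

print(has_mixed_versions(["example/web:1.4.2", "example/web:1.4.1"]))  # True
```

Each finding here maps to a small manifest change, which is why these risks tend to be quick to fix once they are visible.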
A Strong Foundation for Further Testing
By finding and fixing low-hanging risks early, teams using Gremlin can demonstrate progress to their organizations. These early wins are critical steps on the path toward more extensive testing and proactive reliability management practices across their organizations.
After finding and fixing your initial set of risks, it’s easy to roll Gremlin out to additional services, improve your reliability score on each service, and take advantage of Gremlin’s entire suite of tools to modernize reliability at enterprise scale.
A Step Toward a More Reliable Future
Our mission at Gremlin is to enable every business to build more reliable software, and Detected Risks represents a significant step toward fulfilling that promise. We're leveraging our experience working with many of the largest companies in the world to deliver a product that meets the most pressing reliability needs from the moment you install Gremlin.
We’ll be introducing new Detected Risks regularly. You can expect our initial suite to expand throughout the year to identify the most pressing risks across your stack.
Of course, we’re only at the start of delivering on this mission: look for more announcements on how Gremlin can help your organization modernize your reliability practices in the coming months, or get a demo of what we’ve been working on recently, including standardized reliability tests and scoring, automated service and dependency discovery, and more.
Getting Started with Detected Risks
Detected Risks is available today for all Gremlin users, even those on our free trial–no credit card required.
You can sign up for a free account, install the Gremlin agent, configure your services, and see your detected risks in minutes.
For more details:
- Watch a 2-minute demo of Detected Risks in action below
- Follow the tutorial to set it up in your Kubernetes environment
- Read the docs