State of Chaos Engineering 2021

In 2016, Matthew Fornaciari and Kolton Andrus co-founded Gremlin with a simple mission: Build a more reliable internet. We are ecstatic to see how far the practice of Chaos Engineering has come, and are proud to share the results of the inaugural State of Chaos Engineering report that emphasizes the importance of the practice in improving availability.

Key findings

Increased availability and decreased MTTR are the two most common benefits of Chaos Engineering

Teams who frequently run Chaos Engineering experiments have >99.9% availability

23% of teams had a mean time to resolution (MTTR) of under 1 hour and 60% under 12 hours

Network attacks are the most commonly run experiments, in line with the top failures reported

While still an emerging practice, the majority of respondents (60%) have run at least one Chaos Engineering attack

34% of respondents run Chaos Engineering experiments in production

Things break

From the survey, the top 20% of respondents had services with an availability of more than four nines, an impressive level. 23% of teams had a mean time to resolution (MTTR) of under an hour, with 60% having an MTTR of under 12 hours.
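
To put these availability figures in perspective, the arithmetic below converts an availability percentage into the downtime it permits. This is a minimal sketch rather than anything taken from the survey: the function name `allowed_downtime_minutes` and the 30-day default period are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime (in minutes) that an availability target permits.

    Both the function name and the 30-day default are illustrative assumptions.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The unavailable fraction of the period is (1 - availability).
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows {minutes:.1f} minutes of downtime per 30 days")
```

Under these assumptions, four nines leaves only a few minutes of downtime per month, so a single one-hour incident would consume more than a year's worth of that budget.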

What is the average availability of your service(s)?

<=99%
99.5%-99.9%
>=99.99%

Average number of high severity incidents (Sev 0&1) per month

1-10
10-20

What is your mean time to resolution (MTTR)?

<1 hour
1 hour - 12 hours
12 hours - 1 day
1 day - 1 week
> 1 week
I don't know

When things do break, the most common causes were bad code pushes and dependency issues. These are not mutually exclusive: a bad code push from one team can cause a service outage for another. In modern systems where teams own independent services, it’s important to test all services for resiliency to failure. Running network-based chaos experiments, such as latency and blackhole attacks, verifies that systems are decoupled and can fail independently, which minimizes the impact of a service outage.
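
As a rough illustration of what a latency or blackhole experiment is meant to verify, the sketch below shows a caller that enforces a timeout on a dependency and degrades to a fallback instead of failing with it. Everything here is hypothetical: `call_dependency`, `get_recommendations`, and the return values are invented for the example, and `time.sleep` merely stands in for latency that a real experiment would inject at the network level.

```python
import concurrent.futures
import time


def call_dependency(injected_latency_s: float = 0.0) -> str:
    """Hypothetical downstream call; the sleep stands in for injected network latency."""
    time.sleep(injected_latency_s)
    return "live recommendations"


def get_recommendations(injected_latency_s: float, timeout_s: float = 0.2) -> str:
    """Call the dependency, but fall back if it does not answer within timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call_dependency, injected_latency_s)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The dependency is slow or unreachable: serve a degraded response.
        return "cached recommendations"
    finally:
        # Do not make the caller wait for the slow call to finish.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    print(get_recommendations(injected_latency_s=0.0))
    print(get_recommendations(injected_latency_s=0.5, timeout_s=0.05))
```

If the caller had no timeout, the injected latency would propagate upstream, which is the kind of hidden coupling these experiments are intended to surface.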

What percent of your incidents (SEV0&1) have been caused by:

Legend (share of incidents attributed to the cause): <20%, 21-40%, 41-60%, 61-80%, >80%. Values are the chart's segment labels in the order they appear; some causes show only four labeled segments.

Bad code deploy (e.g., a bug in code deployed to production leading to an incident): 39%, 24%, 19%, 11.8%, 6.1%
Internal dependency issues (non-DB) (e.g., a service operated by your company had an outage): 41%, 25%, 20%, 10.1%, 3.7%
Configuration error (e.g., wrong settings in your cloud infrastructure or container orchestrator causing an incident): 48%, 23%, 14%, 10.1%, 5.2%
Networking issues (e.g., ISP or DNS outage): 50%, 19%, 13%, 15.7%, 1.7%
3rd party dependency issues (non-DB) (e.g., lost connection to a payment processor): 48%, 23%, 13%, 14.3%, 1.7%
Managed service provider issues (e.g., cloud provider AZ outage): 61%, 14%, 12.5%, 3.9%
Machine/infrastructure failure (on-prem) (e.g., a power outage): 64%, 14%, 12%, 4.4%
Database, messaging, or cache issues (e.g., lost DB node leading to an incident): 58%, 18%, 5.2%, 1.2%
Unknown: 66%, 10%, 15%, 7.4%, 1%

Who finds out

Monitoring for availability varies by company. For example, Netflix’s traffic is so consistent that they can use server-side video starts per second to spot an outage: any deviation from the projected pattern signals a problem. Google uses Real User Monitoring mixed with windowing to determine whether a single outage had a large impact or multiple small incidents are impacting a service, which leads to deeper analysis of the cause of the incident(s). Few companies have consistent traffic patterns and sophisticated statistical models like Netflix and Google. That’s why standard uptime over total time, measured with synthetic monitoring, is the most popular way to monitor the uptime of services, although many organizations use multiple methods and metrics. We were pleasantly surprised that all of the respondents are monitoring availability. This is often the first step teams take to get proactive about improving customer experiences in applications.

What metric do you use to define availability?

Error Rate (Failed requests/total requests)
Latency
Orders/transactions vs historical prediction
Successful requests/total requests
Uptime/total time period
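
The ratio-based definitions in this list can be computed directly from counters collected over a measurement window. The following is a minimal sketch under assumed inputs: the `Window` type, its field names, and the sample numbers are hypothetical and do not come from the survey.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Counters for one measurement window (all field names are illustrative)."""
    total_requests: int
    failed_requests: int
    up_seconds: float
    total_seconds: float


def error_rate(w: Window) -> float:
    """Failed requests / total requests."""
    if w.total_requests == 0:
        raise ValueError("no requests in window")
    return w.failed_requests / w.total_requests


def success_ratio(w: Window) -> float:
    """Successful requests / total requests."""
    if w.total_requests == 0:
        raise ValueError("no requests in window")
    return (w.total_requests - w.failed_requests) / w.total_requests


def uptime_ratio(w: Window) -> float:
    """Uptime / total time period."""
    if w.total_seconds <= 0:
        raise ValueError("window must have a positive duration")
    return w.up_seconds / w.total_seconds


if __name__ == "__main__":
    day = Window(total_requests=10_000, failed_requests=25,
                 up_seconds=86_100, total_seconds=86_400)
    print(f"error rate:    {error_rate(day):.4%}")
    print(f"success ratio: {success_ratio(day):.4%}")
    print(f"uptime ratio:  {uptime_ratio(day):.4%}")
```

Request-based and time-based ratios can disagree for the same window, which is one reason an organization might track more than one of them.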

How do you monitor availability?

Real user monitoring
Health checks / synthetics
Server-side responses

When looking at who receives reports about availability and performance, it was no surprise that the closer a person is to operating applications, the more likely they are to receive reports. We believe the trend of DevOps bringing Operations and Development closer together is bringing the developer in line with Ops as the mindset of build and operate becomes pervasive in organizations. We also believe that as digitization increases and online user experience becomes paramount, we’ll see an increase in the percentage of C-level staff who receive availability and performance reports.

Who monitors or receives reports on availability?

CEO
CFO or VP of Finance
CTO
VP
Managers
Ops
Developers
Other

Who monitors or receives reports on performance?

CEO
CFO or VP of Finance
CTO
VP
Managers
Ops
Developers
Other

Top performers

Top performers had 99.99%+ availability and an MTTR of under one hour (highlighted above). To understand how teams achieve these impressive numbers, we looked into what tooling they used. Notably, autoscaling, load balancers, backups, select rollouts of deployments, and monitoring with health checks were all more common in the top availability group. Some of these, such as multi-zone deployments, are expensive, while others, such as circuit breakers and select rollouts, are a matter of time and engineering expertise.
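
To make one of these tools concrete, here is a minimal circuit breaker sketch. It is an illustration under stated assumptions, not the implementation used by any respondent or library: the class name, the thresholds, and the injectable `clock` parameter are all hypothetical choices made to keep the example small and testable.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down period has passed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("dependency call skipped; circuit is open")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(lambda: fetch_profile(user_id))` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to serve a fallback.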

Teams who consistently run chaos experiments have higher levels of availability than those who have never performed an experiment, or do so ad-hoc. But ad-hoc experiments are an important part of the practice, and teams with >99.9% availability are performing more ad-hoc experiments.

Frequency of Chaos Engineering experiments by availability

Never performed an attack
Performed ad-hoc attacks
Quarterly attacks
Monthly attacks
Weekly attacks
Daily or more frequent attacks

Tool use by availability (each tool compared across three availability groups: <99%, 99%-99.9%, >99.9%)

Autoscaling
DNS failover/elastic IPs
Load balancers
Active-active multi-region, AZ, or DC
Active-passive multi-region, AZ, or DC
Circuit breakers
Backups
DB replication
Retry logic
Select rollouts of deployments (Blue/Green, Canary, feature flags)
Cached static pages when dynamic unavailable
Monitoring with health checks

Evolution of Chaos Engineering

In 2010, Netflix introduced Chaos Monkey into their systems. This pseudo-random failure of nodes was a response to instances and servers failing at random. Netflix wanted teams prepared for these failure modes, so they accelerated the process to demand resiliency to instance outages. It created both a test for reliability mechanisms and forced developers to build with failure in mind. Based on the success of the project, Netflix open sourced Chaos Monkey and created a Chaos Engineer role. Chaos Engineering has evolved since then to follow the scientific process, and experiments have expanded beyond host failure to test for failures up and down the stack.

Google searches for "Chaos Engineering", 2016-2020

For every dollar spent in failure, learn a dollar’s worth of lessons

“Master of disaster”

Jesse Robbins

In 2020, Chaos Engineering went mainstream and made headlines in Politico and Bloomberg. Gremlin hosted the largest Chaos Engineering event ever, with over 3,500 registrants. GitHub has over 200 Chaos Engineering-related projects with 16K+ stars. And most recently, AWS announced their own public Chaos Engineering offering, AWS Fault Injection Simulator, coming later this year.

Chaos Engineering today

Chaos Engineering is becoming more popular and more mature: 60% of respondents said they have run a Chaos Engineering attack. Netflix and Amazon, the creators of Chaos Engineering, are cutting-edge, large organizations, but we’re also seeing adoption from more established organizations and smaller teams. The diversity of teams using Chaos Engineering is also growing. What began as an engineering practice was quickly adopted by Site Reliability Engineering (SRE) teams, and now many platform, infrastructure, operations, and application development teams are adopting the practice to improve the reliability of their applications. Host failure, which we categorize as a State type attack, is far less popular than network and resource attacks. We’ve seen an uptick in simulating lost connections to a dependency or a spike in demand for a service. We’re also seeing many more organizations moving their experimentation to production, although this is still in its early days.

459,548 attacks using the Gremlin platform

68% of customers using K8s attacks

How frequently does your organization practice Chaos Engineering? (by company size)

Company size: >10,000 employees; 5,001-10,000 employees; 1,001-5,000 employees; 100-1,000 employees; <100 employees

Frequency: Daily or more frequent attacks; Weekly attacks; Monthly attacks; Quarterly attacks; Performed ad-hoc attacks; Never performed an attack

What teams are involved in conducting chaos experiments?

Application Developers
C-level
Infrastructure
Managers
Operations
Platform or Architecture
SRE
VPs

What percentage of your organization uses Chaos Engineering?

76%+
51-75%
26-50%
<25%

What environment have you performed chaos experiments on?

Dev/Test
Staging
Production

Percent of attacks by type

Network
Resource
State
Application

Percent of attacks by target type

Host
Container
Application

Results of chaos experiments

One of the most exciting and rewarding aspects of Chaos Engineering is discovering or verifying a bug. The practice makes it easier to uncover unknown issues before they impact customers and to identify the real cause of an incident, which speeds up the patching process. Another major benefit that showed up in the write-in responses to our survey was a better understanding of architectures. Running chaos experiments helps identify tight coupling and unknown dependencies, which adversely affect our applications and often remove many of the benefits of building microservices. From our own product data, we found that customers were frequently identifying incidents, mitigating the issue, and verifying the fixes with Chaos Engineering. Our survey respondents frequently found that their applications increased in availability while their MTTR decreased.

After using Chaos Engineering, what benefits have you experienced?

Increased availability
Reduced mean time to resolution (MTTR)
Reduced mean time to detection (MTTD)
Reduced # of bugs shipped to production
Reduced # of outages
Reduced # of pages

Future of Chaos Engineering

What is the biggest inhibitor to adopting/expanding Chaos Engineering?

Lack of awareness
Other priorities
Lack of experience
Lack of time
Security concerns
Fear something might go wrong

The biggest inhibitors to adopting Chaos Engineering are a lack of awareness and a lack of experience. These are followed closely by ‘other priorities’, but interestingly, more than 10% said that the fear that something might go wrong was also an inhibitor. It’s true that in practicing Chaos Engineering we are injecting failure into systems, but by using modern methods that follow scientific principles, and by methodically isolating experiments to a single service, we can be intentional about the practice and not disrupt customer experiences.
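
One way to picture an intentional, isolated experiment is as a loop that targets a single service, increases the injected failure step by step, and halts as soon as an abort condition is met. The sketch below assumes this structure for illustration only; `Experiment`, `run_experiment`, and the `inject`, `rollback`, and `measure_error_rate` callables are hypothetical stand-ins for whatever tooling a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    hypothesis: str          # what we expect to remain true during the attack
    target_service: str      # blast radius is limited to this one service
    magnitudes: List[int]    # e.g. injected latency in ms, smallest first
    max_error_rate: float    # abort condition for the steady-state metric


def run_experiment(exp: Experiment,
                   inject: Callable[[str, int], None],
                   rollback: Callable[[str], None],
                   measure_error_rate: Callable[[str], float]) -> dict:
    """Increase the injected failure stepwise; halt when the abort condition is met."""
    completed: List[int] = []
    try:
        for magnitude in exp.magnitudes:
            inject(exp.target_service, magnitude)
            observed = measure_error_rate(exp.target_service)
            if observed > exp.max_error_rate:
                # Hypothesis disproved at this magnitude: stop before widening impact.
                return {"hypothesis_held": False, "halted_at": magnitude,
                        "completed": completed}
            completed.append(magnitude)
        return {"hypothesis_held": True, "halted_at": None, "completed": completed}
    finally:
        # Always remove the injected failure, whether the run finished or halted.
        rollback(exp.target_service)
```

Starting with the smallest magnitude and rolling back unconditionally keeps a failed hypothesis from turning into a customer-facing incident.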

We believe the next stage of Chaos Engineering involves opening up this important testing process to a broader audience and making it easier to safely experiment in more environments. As the practice matures and tooling evolves, we expect it to become more accessible and faster for engineers and operators to design and run experiments that improve the reliability of their systems across environments. Today, 30% of respondents are running chaos experiments in production. We believe that chaos experiments will become more targeted and automated, while also becoming more commonplace and frequent.

We’re excited about the future of Chaos Engineering and its role in making systems more reliable.

Demographics

The data sources for this report include a comprehensive survey with 400+ responses and Gremlin’s product data. Survey respondents are from a range of company sizes and industries, primarily in Software and Services. Adoption of Chaos Engineering has hit the enterprise, with nearly 50% of respondents working for companies with more than 1,000 employees, and nearly 20% working for companies with more than 10,000 employees.

The survey highlighted a tipping point in cloud computing, where nearly 60% of respondents ran a majority of their workloads in the cloud and used a CI/CD pipeline. Containers and Kubernetes are reaching a similar level of maturity, but the survey confirmed that service meshes are still in their early days. The most common cloud platform is AWS at nearly 40%, with GCP, Azure, and on-premises each following at around 11-12%.

400+ qualified respondents

How many employees work at your company?

>10,000
5,001-10,000
1,001-5,000
100-1,000
<100

How old is your company?

Over 25 years old
10 to 25 years old
2 to 10 years old
Less than 2 years old

What industry is your company in?

Software & Services
Banks, Insurance & Financial Services
Energy Equipment & Services
Retail & eCommerce
Technology Hardware, Semiconductors, & Related Equipment

What is your job title?

Software Engineer
SRE
Engineering Manager
System Administrator
Non-technical Executive (ex: CEO, COO, CMO, CRO)
Technical Executive (ex: CTO, CISO, CIO)

What percent of production workloads are in the cloud?

>75%
51-75%
25-50%
<25%

What percent of production workloads are deployed using a CI/CD pipeline?

>75%
51-75%
25-50%
<25%

What percent of production workloads use containers?

>75%
51-75%
25-50%
<25%

What percent of production workloads use Kubernetes (or another container orchestrator)?

>75%
51-75%
25-50%
<25%

What percent of production environment routes leverage service mesh?

>75%
51-75%
25-50%
<25%

In addition to examining the survey results, we also aggregated information about the technical environments of Gremlin users to understand what specific tools and layers of the stack are most often targets of Chaos Engineering experiments. Those findings are below.

What is your cloud provider?

Amazon Web Services
Google Cloud Platform
Microsoft Azure
Oracle
Private Cloud (On Premises)

What is your container orchestrator?

Amazon Elastic Container Service
Amazon Elastic Kubernetes Service
Custom Kubernetes
Google Kubernetes Engine
OpenShift

What is your messaging provider?

ActiveMQ
AWS SQS
Kafka
IBM MQ
RabbitMQ

What is your monitoring tool?

Amazon CloudWatch
Datadog
Grafana
New Relic
Prometheus

What is your database?

Cassandra
DynamoDB
MongoDB
MySQL
Postgres

Contributors

Dynatrace provides software intelligence to simplify cloud complexity and accelerate digital transformation. With automatic and intelligent observability at scale, our all-in-one platform delivers precise answers about the performance and security of applications, the underlying infrastructure, and the experience of all users to enable organizations to innovate faster, collaborate more efficiently, and deliver more value with dramatically less effort.


Epsagon enables teams to instantly visualize, understand and optimize their microservices architecture. With our unique lightweight auto-instrumentation, gaps in data and manual work associated with other APM solutions are eliminated, providing significant reductions in issue detection, root cause analysis and resolution times.


Grafana Labs provides an open and composable monitoring and observability platform built around Grafana, the leading open source technology for dashboards and visualization. More than 1,000 customers such as Bloomberg, JP Morgan Chase, eBay, PayPal, and Sony use Grafana Labs, with more than 600,000 active installations of Grafana around the globe. Commercial products include Grafana Cloud, a managed stack that includes Prometheus & Graphite (metrics), Loki (logs), and Tempo (traces) with Grafana; Grafana Enterprise, an enhanced version of Grafana with enterprise features, plugins, and support; and Grafana Metrics Enterprise, which enables Prometheus-as-a-service for large organizations running at scale.


Founded in 2014 by Edith Harbaugh and John Kodumal, LaunchDarkly is the feature management platform that software teams use to build better software, faster, with less risk. Development teams use feature management as a best practice to separate code deployments from feature releases. With LaunchDarkly, teams control their entire feature lifecycles from concept to launch to value. Serving over 1 trillion feature flags a day, LaunchDarkly is used by teams at Atlassian, Microsoft, and CircleCI.


PagerDuty, Inc. (NYSE:PD) is a leader in digital operations management. In an always-on world, organizations of all sizes trust PagerDuty to help them deliver a perfect digital experience to their customers, every time. Teams use PagerDuty to identify issues and opportunities in real time and bring together the right people to fix problems faster and prevent them in the future. Notable customers include GE, Cisco, Genentech, Electronic Arts, Cox Automotive, Netflix, Shopify, Zoom, DoorDash, Lululemon and more.
