State of Chaos Engineering
Key findings
Things break
- <=99%
- 99.5%-99.9%
- >=99.99%
- 1-10
- 10-20
- <1 hour
- 1 hour - 12 hours
- 12 hours - 1 day
- 1 day - 1 week
- > 1 week
- I don't know
- <20%
- 21-40%
- 41-60%
- 61-80%
- >80%
Who finds out
- Error Rate (Failed requests/total requests)
- Latency
- Orders/transactions vs historical prediction
- Successful requests/total requests
- Uptime/total time period
- Real user monitoring
- Health checks / synthetics
- Server-side responses
- CEO
- CFO or VP of Finance
- CTO
- VP
- Managers
- Ops
- Developers
- Other
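The ratio-based metrics listed in this section (error rate, successful requests over total requests, and uptime over the total time period) can be sketched in a few lines. This is a minimal illustration, not the survey's methodology: the function names and the sample numbers are hypothetical.

```python
def error_rate(failed_requests: int, total_requests: int) -> float:
    """Error rate = failed requests / total requests."""
    if total_requests == 0:
        return 0.0  # assumption: no traffic means no observed errors
    return failed_requests / total_requests


def success_ratio(successful_requests: int, total_requests: int) -> float:
    """Request-based availability = successful requests / total requests."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return successful_requests / total_requests


def uptime_ratio(uptime_seconds: float, period_seconds: float) -> float:
    """Time-based availability = uptime / total time period."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return uptime_seconds / period_seconds


# Hypothetical sample: 1,000,000 requests with 500 failures over 30 days,
# during which the service was down for 259.2 seconds in total.
total = 1_000_000
failed = 500
period = 30 * 24 * 60 * 60

print(f"error rate:    {error_rate(failed, total):.4%}")
print(f"success ratio: {success_ratio(total - failed, total):.4%}")
print(f"uptime ratio:  {uptime_ratio(period - 259.2, period):.4%}")
```

Request-based and time-based ratios can disagree for the same period, which is one reason teams often track more than one of these metrics.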
Top performers
Top performers had 99.99%+ availability and an MTTR of under one hour (highlighted above). To understand how they achieve these numbers, we looked at the tooling teams used. Notably, autoscaling, load balancers, backups, select rollouts of deployments, and monitoring with health checks were all more common in the top availability group. Some of these, such as multi-zone deployments, are expensive, while others, such as circuit breakers and select rollouts, mainly demand time and engineering expertise.
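To put the availability tiers in perspective, an availability percentage can be converted into the downtime it allows over a period. The sketch below is plain arithmetic; the 30-day window and the function name are assumptions chosen for illustration, not figures from the survey.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_percent: float, period_days: float = 30) -> float:
    """Return the minutes of downtime allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * MINUTES_PER_DAY


# Compare the availability levels that appear in this report.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% over 30 days allows about {budget:.2f} minutes of downtime")
```

Under these assumptions, 99.99% leaves roughly 4.3 minutes of downtime per 30 days, compared with about 43 minutes at 99.9% and about 7.2 hours at 99%.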
Teams that run chaos experiments consistently have higher levels of availability than those that have never performed an experiment or do so only ad hoc. Ad-hoc experiments are still an important part of the practice, however, and teams with >99.9% availability perform more ad-hoc experiments.
- Never performed an attack
- Performed ad-hoc attacks
- Quarterly attacks
- Monthly attacks
- Weekly attacks
- Daily or more frequent attacks
- <99%
- 99%-99.9%
- >99.9%
Evolution of Chaos Engineering
- 2016
- 2017
- 2018
- 2019
- 2020
Chaos Engineering today
- Daily or more frequent attacks
- Weekly attacks
- Monthly attacks
- Quarterly attacks
- Performed ad-hoc attacks
- Never performed an attack
- Application Developers
- C-level
- Infrastructure
- Managers
- Operations
- Platform or Architecture
- SRE
- VPs
- 76%+
- 51-75%
- 26-50%
- <25%
- Dev/Test
- Staging
- Production
- Network
- Resource
- State
- Application
- Host
- Container
- Application
Results of chaos experiments
- Increased availability
- Reduced mean time to resolution (MTTR)
- Reduced mean time to detection (MTTD)
- Reduced # of bugs shipped to production
- Reduced # of outages
- Reduced # of pages
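Two of the outcomes listed above, MTTD and MTTR, are averages over incident timelines. The following sketch shows one way to compute them. It assumes both are measured from the moment the incident started (some teams measure resolution time from detection instead), and the `Incident` record and sample timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the failure began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mean_time_to_detection(incidents: list[Incident]) -> timedelta:
    """Average of (detected - started) across incidents."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    """Average of (resolved - started) across incidents (assumed definition)."""
    seconds = [(i.resolved - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


# Hypothetical incident log with two entries.
incidents = [
    Incident(datetime(2021, 1, 5, 10, 0), datetime(2021, 1, 5, 10, 5), datetime(2021, 1, 5, 10, 45)),
    Incident(datetime(2021, 1, 9, 14, 0), datetime(2021, 1, 9, 14, 15), datetime(2021, 1, 9, 15, 15)),
]

print("MTTD:", mean_time_to_detection(incidents))
print("MTTR:", mean_time_to_resolution(incidents))
```

Tracking these two numbers before and after a series of experiments is one way to check whether the practice is reducing them.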
Future of Chaos Engineering
- Lack of awareness
- Other priorities
- Lack of experience
- Lack of time
- Security concerns
- Fear something might go wrong
The biggest inhibitors to adopting Chaos Engineering are a lack of awareness and a lack of experience. These are followed closely by ‘other priorities’, but interestingly, more than 10% of respondents also cited the fear that something might go wrong. It’s true that practicing Chaos Engineering means injecting failure into systems, but by using modern methods that follow scientific principles and methodically isolating experiments to a single service, we can be intentional about the practice and avoid disrupting customer experiences.
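The approach described above (state a hypothesis, limit the experiment to a single service, and stop if the system degrades) can be sketched as a small experiment loop. This is a simulated illustration under stated assumptions, not Gremlin's API or any real fault-injection tool: `Experiment`, `run_experiment`, and the fake fault and metric functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str        # what we expect to remain true during the fault
    target_service: str    # blast radius: exactly one service
    max_error_rate: float  # abort threshold for the steady-state metric


def run_experiment(
    experiment: Experiment,
    inject_fault: Callable[[str], None],
    remove_fault: Callable[[str], None],
    measure_error_rate: Callable[[], float],
    checks: int = 5,
) -> str:
    # Only start if the system is already within its steady state.
    if measure_error_rate() > experiment.max_error_rate:
        return "skipped: system not in steady state"
    inject_fault(experiment.target_service)
    try:
        for _ in range(checks):
            # Abort as soon as the steady-state threshold is exceeded.
            if measure_error_rate() > experiment.max_error_rate:
                return "aborted: hypothesis rejected"
        return "passed: hypothesis held"
    finally:
        # Always roll back the fault, whether we passed or aborted.
        remove_fault(experiment.target_service)


# Simulated system: a fault on "checkout" raises the error rate to 2%.
active_faults: set[str] = set()


def inject_fault(service: str) -> None:
    active_faults.add(service)


def remove_fault(service: str) -> None:
    active_faults.discard(service)


def measure_error_rate() -> float:
    return 0.02 if "checkout" in active_faults else 0.001


experiment = Experiment(
    hypothesis="Checkout errors stay below 5% while the fault is active",
    target_service="checkout",
    max_error_rate=0.05,
)
result = run_experiment(experiment, inject_fault, remove_fault, measure_error_rate)
print(result)
```

The `finally` block is the key design choice here: the fault is removed on every exit path, so an aborted experiment does not leave the target service degraded.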
We believe the next stage of Chaos Engineering involves opening up this important testing process to a broader audience and making it easier to experiment safely in more environments. As the practice matures and tooling evolves, we expect it to become easier and faster for engineers and operators to design and run experiments that improve the reliability of their systems across environments. Today, 30% of respondents are running chaos experiments in production. We believe that chaos experiments will become more targeted and automated, while also becoming more commonplace and frequent.
Demographics
The data sources for this report include a comprehensive survey with 400+ responses and Gremlin’s product data. Survey respondents are from a range of company sizes and industries, primarily in Software and Services. Adoption of Chaos Engineering has hit the enterprise, with nearly 50% of respondents working for companies with more than 1,000 employees, and nearly 20% working for companies with more than 10,000 employees.
The survey highlighted a tipping point in cloud computing: nearly 60% of respondents ran a majority of their workloads in the cloud and used a CI/CD pipeline. Containers and Kubernetes are reaching a similar level of maturity, but the survey confirmed that service meshes are still in their early days. The most common cloud platform is AWS at nearly 40%, with GCP, Azure, and on-premises each following at around 11-12%.
- >10,000
- 5,001-10,000
- 1,001-5,000
- 100-1,000
- <100
- Over 25 years old
- 10 to 25 years old
- 2 to 10 years old
- Less than 2 years old
- Software & Services
- Banks, Insurance & Financial Services
- Energy Equipment & Services
- Retail & eCommerce
- Technology Hardware, Semiconductors, & Related Equipment
- Software Engineer
- SRE
- Engineering Manager
- System Administrator
- Non-technical Executive (ex: CEO, COO, CMO, CRO)
- Technical Executive (ex: CTO, CISO, CIO)
- >75%
- 51-75%
- 25-50%
- <25%
- Amazon Web Services
- Google Cloud Platform
- Microsoft Azure
- Oracle
- Private Cloud (On Premises)
- Amazon Elastic Container Service
- Amazon Elastic Kubernetes Service
- Custom Kubernetes
- Google Kubernetes Engine
- OpenShift
- ActiveMQ
- AWS SQS
- Kafka
- IBM MQ
- RabbitMQ
- Amazon CloudWatch
- Datadog
- Grafana
- New Relic
- Prometheus
- Cassandra
- DynamoDB
- MongoDB
- MySQL
- Postgres
Contributors
Dynatrace provides software intelligence to simplify cloud complexity and accelerate digital transformation. With automatic and intelligent observability at scale, our all-in-one platform delivers precise answers about the performance and security of applications, the underlying infrastructure, and the experience of all users to enable organizations to innovate faster, collaborate more efficiently, and deliver more value with dramatically less effort.
Epsagon enables teams to instantly visualize, understand and optimize their microservices architecture. With our unique lightweight auto-instrumentation, gaps in data and manual work associated with other APM solutions are eliminated, providing significant reductions in issue detection, root cause analysis and resolution times.
Grafana Labs provides an open and composable monitoring and observability platform built around Grafana, the leading open source technology for dashboards and visualization. More than 1,000 customers such as Bloomberg, JP Morgan Chase, eBay, PayPal, and Sony use Grafana Labs, with more than 600,000 active installations of Grafana around the globe. Commercial products include Grafana Cloud, a managed stack that integrates Prometheus & Graphite (metrics), Loki (logs), and Tempo (traces) with Grafana; Grafana Enterprise, an enhanced version of Grafana with enterprise features, plugins, and support; and Grafana Metrics Enterprise, which enables Prometheus-as-a-service for large organizations running at scale.
Founded in 2014 by Edith Harbaugh and John Kodumal, LaunchDarkly is the feature management platform that software teams use to build better software, faster with less risk. Development teams use feature management as a best practice to separate code deployments from feature releases. With LaunchDarkly, teams control their entire feature lifecycles from concept to launch to value. Serving over 1 trillion feature flags a day, LaunchDarkly is used by teams at Atlassian, Microsoft, and CircleCI.
PagerDuty, Inc. (NYSE:PD) is a leader in digital operations management. In an always-on world, organizations of all sizes trust PagerDuty to help them deliver a perfect digital experience to their customers, every time. Teams use PagerDuty to identify issues and opportunities in real time and bring together the right people to fix problems faster and prevent them in the future. Notable customers include GE, Cisco, Genentech, Electronic Arts, Cox Automotive, Netflix, Shopify, Zoom, DoorDash, Lululemon and more.