The world relies on complex, distributed computing systems and the engineers who maintain them. These systems form the backbone of countless businesses and industries, including financial networks, airlines, retailers, and software as a service (SaaS) providers. As the importance of these systems increases, so do customer expectations for availability. Ensuring that these systems are reliable is a critical but often overlooked or under-prioritized task.
This series of articles will explain what reliability means, why it’s important, and how teams can make reliability a priority at their organizations. You’ll learn how to integrate reliability practices into your organization’s culture, how to reduce your risk of incidents and outages, and how to identify and address failure modes within your systems. We’ll also provide links to guides, tutorials, videos, and webinars so you can learn more.
Reliability is the ability of a system to remain available over a period of time. Reliable systems are those that can continuously perform their core functions without service disruptions, errors, or significant reductions in performance. However, there are many different ways a system can fail, especially as it becomes larger, more dynamic, and more complex. Our systems, and the people operating them, must be able to recover from these failures. This recoverability is called resilience. To maximize availability, systems must be both reliable and resilient.
Reliability isn’t just a matter of enabling replication on a database or having redundant systems running in a separate region. Reliability is an alignment of technology, engineering practices, and organizational culture. Engineering teams must not only understand the importance of improving reliability, but also have the knowledge, tools, practices, and organizational support to make reliability a part of their everyday workflow. For site reliability engineering (SRE) teams, this means prioritizing redundancy, fault tolerance, automated healing, and similar mechanisms. And when failures do occur, incident management teams must have the knowledge and practice to respond to and fix them quickly.
Additionally, reliability isn’t a short-term, one-and-done project. It requires us to build a reliability practice in our organization, train our teams to develop and test for reliability, and address failure modes in our systems. Our systems also change over time as new employees get onboarded, new code gets deployed, and unanticipated behaviors emerge. Just as QA testing is an ongoing process that reveals defects and regressions in code, reliability testing is an ongoing process that reveals defects in systems.
Throughout this process, teams must also navigate obstacles and risks to their reliability efforts, including:
For many organizations, these obstacles can become deterrents to reliability. Nonetheless, a reliability initiative has significant benefits that offset these challenges.
The most obvious risk of not prioritizing reliability is an increase in outages, which are interruptions in a system’s normal operations. Outages have significant real-world consequences. According to Gartner, companies lose an average of $336,000 per hour of downtime, with top ecommerce sites risking up to $13 million in lost sales per hour of downtime. This doesn’t include other harmful long-term effects, such as losses in customer trust, decreased company valuation, and the indirect costs of fixing outages (engineers’ salaries, restoring from backups, spinning up additional infrastructure, etc.). For companies that provide critical services, outages can have a direct impact on customers’ lives.
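To make the cost of downtime concrete, here is a minimal sketch that converts an availability level into expected hours of downtime per year and a rough annual cost. It uses the $336,000 per hour average cited above as a default. The function names, the availability levels shown, and the assumption that cost scales linearly with downtime are illustrative only and not part of the original figures.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Hours of downtime per year implied by an availability fraction (e.g. 0.999)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability: float, cost_per_hour: float = 336_000) -> float:
    """Rough yearly cost of downtime, assuming cost scales linearly with hours down.

    The default cost_per_hour is the average cited in the text; a real
    organization would substitute its own estimate.
    """
    return annual_downtime_hours(availability) * cost_per_hour


if __name__ == "__main__":
    # Illustrative availability levels, chosen only for comparison.
    for level in (0.99, 0.999, 0.9999):
        hours = annual_downtime_hours(level)
        cost = annual_downtime_cost(level)
        print(f"{level:.2%} available: {hours:.2f} h/year down, about ${cost:,.0f}")
```

Real outages rarely cost the same amount every hour, so treat the result as an order-of-magnitude estimate rather than a forecast.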
There are other benefits too. From an operational perspective, improving reliability lets teams spend less time mitigating the damage caused by outages, and more time focusing on their core competencies. According to the 2021 State of Chaos Engineering report, organizations that prioritize reliability have:
Improving reliability also improves the customer experience, builds trust in your brand, and gives you a competitive advantage. Ultimately, reliability isn’t just a technology investment, but an investment in the growth and success of the organization.
For every dollar spent in failure, learn a dollar’s worth of lessons
As mentioned earlier, reliability is a journey that involves multiple facets of the business. We can distill this journey into three main elements: culture, observability, and technology. To see how each element contributes to reliability, click on the links below.
When we think about reliability, we typically think of it in terms of systems. The reality is that reliability starts with people. By encouraging site reliability engineers (SREs), incident responders, application developers, and other team members to proactively think about reliability, we can be better prepared to identify and fix failure modes.
In this section, we’ll explain what a culture of reliability is, how to foster and develop a culture of reliability, and how it helps improve the reliability of our processes and systems.