The guide to reliability in distributed systems

What reliability means for modern organizations and distributed systems, the challenges in improving reliability, and how organizations can overcome these challenges.

The world relies on complex, distributed computing systems and the engineers who maintain them. These systems form the backbone of countless industries, including financial networks, airlines, retailers, and software as a service (SaaS) providers. As the importance of these systems increases, so do customer expectations for availability. Ensuring that these systems are reliable is a critical but often overlooked or under-prioritized task.

This series of articles will explain what reliability means, why it’s important, and how teams can make reliability a priority at their organizations. You’ll learn how to integrate reliability practices into your organization’s culture, how to reduce your risk of incidents and outages, and how to identify and address failure modes within your systems. We’ll also provide links to guides, tutorials, videos, and webinars so you can learn more.

What is reliability?

Reliability is the ability of a system to remain available over a period of time. Reliable systems are those that can continuously perform their core functions without service disruptions, errors, or significant reductions in performance. However, there are many different ways a system can fail, especially as a system becomes larger, more dynamic, and more complex. Our systems, and the people operating them, must be able to recover from these failures. This recoverability is called resilience. In order to maximize availability, systems must be both reliable and resilient.
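To make "available over a period of time" concrete, here is a minimal sketch that expresses availability as the fraction of a period in which a system was up. The function name and the sample figures (a 30-day month with 43.2 minutes of downtime) are illustrative assumptions, not measurements from any real system.

```python
def availability(total_seconds: float, downtime_seconds: float) -> float:
    """Return the fraction of the period during which the system was available."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime_seconds must be between 0 and total_seconds")
    return (total_seconds - downtime_seconds) / total_seconds


# Hypothetical example: a 30-day month with 43.2 minutes of total downtime.
period_seconds = 30 * 24 * 60 * 60
downtime_seconds = 43.2 * 60

monthly_availability = availability(period_seconds, downtime_seconds)
print(f"{monthly_availability:.3%}")  # prints 99.900%
```

In this sketch, less than three quarters of an hour of downtime in a month is already enough to drop availability to 99.9%.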

Why is reliability challenging?

Reliability isn’t just a matter of enabling replication on a database or having redundant systems running in a separate region. Reliability is an alignment of technology, engineering practices, and organizational culture. Engineering teams must not only understand the importance of improving reliability, but also have the knowledge, tools, practices, and organizational support to make reliability a part of their everyday workflow. For site reliability engineering (SRE) teams, this means prioritizing redundancy, fault tolerance, automated healing, and similar mechanisms. And when failures do occur, incident management teams must have the knowledge and practice to respond to and fix them quickly.

Additionally, reliability isn’t a short-term, one-and-done project. It requires us to build a reliability practice in our organization, train our teams to develop and test for reliability, and address failure modes in our systems. Our systems also change over time as new employees get onboarded, new code gets deployed, and unanticipated behaviors emerge. Just as QA testing is an ongoing process that reveals defects and regressions in code, reliability testing is an ongoing process that reveals defects in systems.

Learn how QA is changing by reading our white paper: The new QA: How modern applications are changing traditional testing.

Throughout this process, teams must also navigate obstacles and risks to their reliability efforts, including:

Continuing to build and maintain applications.
Responding to incidents and outages.
Completing major technical initiatives, such as migrating from on-premises data centers to the cloud (16% of organizations cite performance and reliability as a top challenge to overall cloud adoption).
Adopting new technologies, such as containers and Kubernetes.

For many organizations, these obstacles can become deterrents to reliability. Nonetheless, a reliability initiative has significant benefits that offset these challenges.

Why should you make reliability a priority?

The most obvious risk of not prioritizing reliability is an increase in outages, which are interruptions in a system’s normal operations. Outages have significant real-world consequences. According to Gartner, companies lose an average of $336,000 per hour of downtime, with top ecommerce sites risking up to $13 million in lost sales per hour of downtime. This doesn’t include other harmful long-term effects, such as losses in customer trust, decreased company valuation, and the indirect costs of fixing outages (engineers’ salaries, restoring from backups, spinning up additional infrastructure, etc.). For companies that provide critical services, outages can have a direct impact on customers’ lives.
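As a rough illustration of how these hourly figures add up, the sketch below scales a per-hour cost by the length of an outage. The function name and the outage durations are hypothetical. The per-hour figures are the ones cited above, and the linear scaling is a simplifying assumption that ignores the long-term and indirect costs just described.

```python
AVERAGE_COST_PER_HOUR = 336_000  # average hourly downtime cost cited in the text


def estimated_downtime_cost(downtime_minutes: float,
                            cost_per_hour: float = AVERAGE_COST_PER_HOUR) -> float:
    """Estimate the direct cost of an outage, assuming cost scales linearly with time."""
    if downtime_minutes < 0:
        raise ValueError("downtime_minutes must not be negative")
    return downtime_minutes / 60 * cost_per_hour


# Hypothetical 45-minute outage at the average hourly cost.
print(estimated_downtime_cost(45))              # prints 252000.0

# Hypothetical 90-minute outage at the upper ecommerce figure ($13 million/hour).
print(estimated_downtime_cost(90, 13_000_000))  # prints 19500000.0
```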

There are other benefits too. From an operational perspective, improving reliability lets teams spend less time mitigating the damage caused by outages, and more time focusing on their core competencies. According to the 2021 State of Chaos Engineering report, organizations that prioritize reliability have:

Fewer high-severity incidents per month.
Lower incident resolution times (measured by the mean time to resolution, or MTTR).
Fewer bugs shipping into production.
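The MTTR metric mentioned in the list above can be sketched as a simple average over incident records. The record shape (a pair of detection and resolution timestamps) and the sample incidents are assumptions made for illustration. Real incident trackers store this data in their own formats.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the time between detection and resolution across incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents lasting 30, 60, and 90 minutes.
sample_incidents = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 9, 30)),
    (datetime(2021, 3, 8, 14, 0), datetime(2021, 3, 8, 15, 0)),
    (datetime(2021, 3, 20, 22, 0), datetime(2021, 3, 20, 23, 30)),
]

mttr = mean_time_to_resolution(sample_incidents)
print(mttr)  # prints 1:00:00
```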

Improving reliability also improves the customer experience, builds trust in your brand, and gives you a competitive advantage. Ultimately, reliability isn’t just a technology investment, but an investment in the growth and success of the organization.

“For every dollar spent in failure, learn a dollar’s worth of lessons.”

Jesse Robbins, Master of disaster

How do you start improving reliability?

As mentioned earlier, reliability is a journey that involves multiple facets of the business. We can distill this journey into three main elements: culture, observability, and technology. Each element contributes to reliability as described below.

Culture: the organization’s approach to reliability.

Educating teams on what reliability is, why it’s important, and how they can contribute.
Training and engaging teams in reliability practices.
Creating strategies for anticipating and responding to incidents.

Observability: the understanding of how systems operate and behave.

Using observability and monitoring to measure the state of your systems.
Using alerts to monitor for and detect failures.
Setting reliability goals and tracking progress towards greater technical reliability.

Technology: the systems that your organization operates or utilizes to provide your services, including cloud systems and third-party dependencies.

Finding and mitigating potential failure modes in distributed systems.
Troubleshooting and addressing failures as quickly as possible when they happen.
Implementing automated mechanisms to quickly detect and recover from failures.
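One common example of an automated recovery mechanism is retrying a failed operation with an increasing delay between attempts. The sketch below is a minimal illustration under stated assumptions: `call_with_retries` and `flaky_operation` are hypothetical names, the failure is simulated, and a production implementation would typically retry only on specific error types and add jitter to the delays.

```python
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1):
    """Call operation(), retrying with exponential backoff if it raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))


# Simulated dependency that fails twice before succeeding.
attempts = {"count": 0}


def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = call_with_retries(flaky_operation, max_attempts=5, base_delay=0.01)
print(result, attempts["count"])  # prints: ok 3
```

Re-raising the final exception keeps persistent failures visible, so alerts and incident responders can still pick them up.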

Creating a culture of reliability

How a culture of reliability helps teams build more reliable systems and processes.

When we think about reliability, we typically think of it in terms of systems. The reality is that reliability starts with people. By encouraging site reliability engineers (SREs), incident responders, application developers, and other team members to proactively think about reliability, we can be better prepared to identify and fix failure modes.

In this section, we’ll explain what a culture of reliability is, how to foster and develop a culture of reliability, and how it helps improve the reliability of our processes and systems.
