When we think about reliability, we typically think of it in terms of systems. The reality is that reliability starts with people. By encouraging site reliability engineers (SREs), incident responders, application developers, and other team members to proactively think about reliability, we can be better prepared to identify and fix failure modes.
In this section, we’ll explain what a culture of reliability is, how to foster and develop a culture of reliability, and how it helps improve the reliability of our processes and systems.
A culture of reliability is one where each member of an organization works towards the shared goal of maximizing the availability of their services, processes, and people. Team members are focused on improving the availability and performance of their services, reducing the risk of outages, and responding to incidents as quickly as possible in order to reduce downtime.
Traditionally, software engineering teams treated reliability the same way they treated testing: as a distinct stage in the development lifecycle. This resulted in reliability becoming the sole responsibility of quality assurance (QA) and operations teams. As systems became more complex and development velocity increased, reliability became a shared responsibility between not just testers and operations teams, but also the developers building applications, the engineering managers leading these teams, the product managers solving for customer pain points, and even the executives in charge of budgeting and launching company-wide initiatives. All of these teams must be aligned on the goal of making the organization’s services more reliable in order to better serve its customers, and we call this organization-wide focus a “culture of reliability.”
But why do we need an entire culture focused on reliability? Couldn’t we just write more automated test cases, or plug a tool into our CI/CD pipeline to test our applications for us? For one, reliability is impacted by all stages of the software development lifecycle (SDLC), from design all the way up to deployment. Defects and failure modes are more expensive to fix later in the SDLC, especially if they end up causing production incidents.
Second, modern applications and systems are more complex and have more interconnected parts. While traditional testing is good at testing individual components, it’s inadequate at testing an entire system holistically. Improving reliability means testing and strengthening these complex interactions to prevent failures in one component from bringing down the entire system.
Lastly, organizations tend to prioritize other initiatives over reliability, such as shortening development cycles and quickly releasing new features. This isn’t because reliability isn’t important, but because it isn’t a top priority for many teams. Without a strong incentive coming from the organization, efforts and initiatives to improve reliability are less likely to maintain momentum. A push for faster feature development can even work against these efforts by introducing change, and therefore risk, faster than teams can address it.
Reliability culture ultimately centers on a single goal: providing the best possible customer experience. This singular focus on customers guides all other aspects of reliability, from developing more resilient applications and systems, to training SREs to respond to incidents more effectively. When there’s a clear correlation between customer satisfaction and reliability, organizations are more motivated to invest the time, energy, and budget needed to make systems and processes more reliable. It also ties reliability work directly to the company’s core mission, further cementing it as an important practice.
[The answer to ‘why do we need to be reliable’] is a single word: trust! Trust is the most important thing that we can deliver. For our platform to be viable, our customers have to trust that we will be available, and in order for us to earn the trust of our customers, we have to be reliable.
The amount of time and effort needed to build a culture of reliability scales with the size of the organization. Even in startups, where individuals are used to pivoting quickly, making sure everyone is aligned on the same goals is challenging. When building a culture of reliability, we need to consider our objective.
The main objective of improving reliability is to keep our systems and services available. Frequent outages result in lost revenue, lost customer trust, and engineering time spent responding to incidents instead of improving our product or service. But while this is a vital objective across all organizations, it’s not always a compelling driver of organizational culture. So what is?
The answer to this question should be closely tied to the organization’s mission statement. If we don’t have a mission statement or objective, we should start by focusing on our customers. How can we provide the best possible customer experience, and how does this translate to the day-to-day work done by our organization?
This question should be top of mind across the organization, especially in product teams, engineering teams, customer support teams, and executive teams. Each team should be aware of how their role and responsibilities contribute towards the customer experience. For example, if an engineer writes poorly optimized code, this could result in slower performance and increased latency that causes a customer to abandon the product. By framing reliability around customers, we can more easily start thinking about how different teams can impact reliability goals.
To get teams aligned, we should repeat our mission statement frequently. Highlight it during meetings, employee onboardings, and when planning new initiatives. If doubt arises about what our reliability objectives or goals are, we should always relate them back to the customer.
It’s common for organizational changes to get some pushback. You might hear arguments that reliability testing is too complicated, that it would take valuable time away from feature development, or that you’re already too busy with incident management. While improving reliability does require an upfront investment, the benefits it brings greatly outweigh the costs.
Teams often don’t think about reliability until late in the software development lifecycle. Traditionally, engineering teams left reliability testing to QA. Given how complex modern applications are and how rapidly they’re developed, this approach is no longer scalable or fully effective. Not only does it create a roadblock to production and slow down release cycles, but it also fails to catch the unexpected and unique failure modes present in modern systems.
The solution is to shift left so that reliability testing happens throughout the entire SDLC, not just at the end. When planning a new feature or service, we start planning the customer experience we want to provide as early as the requirements-gathering stage. Product managers set expectations for service quality before development starts, SREs and application developers define metrics to measure and track compliance with those requirements, and we continuously test our ability to meet those requirements throughout development.
By prioritizing reliability early, the focus on improving reliability naturally carries across each team involved in the development process. This gives us an early start on finding and addressing defects, encourages good development practices, and reduces the risk of issues making their way into production. This also has financial benefits, as bugs are more expensive to fix later in the SDLC.
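To make the “define metrics and track compliance” step above concrete, here is a minimal sketch of tracking a service-level target in code. The ServiceLevelObjective class, the checkout example, and the numbers are illustrative assumptions, not any particular monitoring tool’s API.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevelObjective:
    """A service quality target agreed on before development starts."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    def error_budget(self, total_requests: int) -> int:
        """How many failed requests we can tolerate in a given window."""
        return round(total_requests * (1 - self.target))

    def is_met(self, total_requests: int, failed_requests: int) -> bool:
        """Did we stay within the error budget for this window?"""
        return failed_requests <= self.error_budget(total_requests)

checkout_slo = ServiceLevelObjective(name="checkout-availability", target=0.999)

# Example window: 1,000,000 requests with 800 failures.
print(checkout_slo.error_budget(1_000_000))  # 1000 failures allowed
print(checkout_slo.is_met(1_000_000, 800))   # True: within budget
```

In practice these targets usually live in a monitoring platform, but the underlying arithmetic, a target and the error budget it implies, stays the same.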
Culture is an important step, but we also need tools to help us put culture into practice. With reliability, we need a way to ensure that our response teams are prepared to handle incidents, our systems are resilient to technical failures, and our reliability practices have a clear return on investment (ROI) for the business. The way we do this is with Chaos Engineering.
Chaos Engineering is the practice of deliberately injecting failure into a system, observing how the system responds, using these observations to improve its reliability, and validating that our resilience mechanisms work. While “system” most commonly refers to technical systems (particularly distributed systems), we can use Chaos Engineering to validate organizational systems and processes too. This includes incident management and response, disaster recovery, and troubleshooting processes.
Chaos Engineering helps teams proactively test for threats to reliability and address them early in the development process, reducing the risk of incidents or outages. This includes testing incident response plans, validating that systems can fail over to a redundant or backup system, and many other scenarios.
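As a rough sketch of the inject-observe-validate loop described above, the example below checks a steady state, injects latency, observes the result, and rolls back. The functions get_p99_latency_ms, inject_latency, and remove_latency are hypothetical stand-ins for your own monitoring and fault-injection tooling, and the thresholds are made up for illustration.

```python
import random
import time

def get_p99_latency_ms() -> float:
    """Stand-in for querying your monitoring system."""
    return random.uniform(80, 120)

def inject_latency(ms: int) -> None:
    """Stand-in for your fault-injection tooling."""
    print(f"Injecting {ms} ms of latency into a downstream dependency")

def remove_latency() -> None:
    print("Rolling back the injected latency")

def check_steady_state(slo_ms: float = 250.0) -> bool:
    """Steady state: p99 latency stays within the agreed target."""
    return get_p99_latency_ms() <= slo_ms

def run_experiment() -> bool:
    # 1. Confirm the system is healthy before touching anything.
    if not check_steady_state():
        raise RuntimeError("System already unhealthy; aborting the experiment")
    # 2. Inject a small, controlled failure.
    inject_latency(ms=300)
    try:
        time.sleep(5)  # give the failure time to propagate
        # 3. Observe: did timeouts, retries, or failover preserve the steady state?
        return check_steady_state()
    finally:
        # 4. Always roll back, even if the observation step fails.
        remove_latency()

if __name__ == "__main__":
    print("Hypothesis held" if run_experiment() else "Found a weakness to fix")
```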
One of the biggest challenges in adopting a culture of reliability is maintaining the practice. Reliability isn’t something that can be achieved once: it has to be maintained and validated on a regular basis. The best way to do this is by regularly and proactively testing systems and processes by using Chaos Engineering. In fact, teams who consistently run chaos experiments have higher levels of availability than teams who have never performed an experiment, or who run ad-hoc experiments.
How does Chaos Engineering help build a reliability culture? It does so by helping teams test their assumptions about their systems, actively seek out ways to improve reliability, and ensure their systems are resilient to production conditions. A common strategy is the GameDay: a deliberately planned incident where the engineers who own an application or service, along with other stakeholders such as team leads and product managers, come together to run a chaos experiment on that service. The team runs the experiment, observes how the service responds, and uses these insights to improve the service’s resiliency. They then automate the experiment, add it to their library of experiments, and run these experiments continuously to verify that their systems remain resilient.
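As a hedged sketch of what such an experiment library might look like, the example below uses a simple registry pattern; the scenario names and stub implementations are illustrative, not a specific product’s API.

```python
from typing import Callable, Dict

# A team's growing library of GameDay scenarios, keyed by name.
EXPERIMENTS: Dict[str, Callable[[], bool]] = {}

def experiment(name: str):
    """Register a scenario so it can be re-run automatically later."""
    def register(fn: Callable[[], bool]) -> Callable[[], bool]:
        EXPERIMENTS[name] = fn
        return fn
    return register

@experiment("payment-dependency-latency")
def payment_latency() -> bool:
    # In practice: inject latency and verify the steady state,
    # as in the earlier experiment sketch.
    return True

@experiment("cache-node-failure")
def cache_node_failure() -> bool:
    # In practice: stop a cache node and verify that requests fail over.
    return True

def run_all() -> None:
    """Run every registered experiment; wire this into CI or a scheduled job."""
    for name, fn in EXPERIMENTS.items():
        print(f"{name}: {'resilient' if fn() else 'needs follow-up'}")

if __name__ == "__main__":
    run_all()
```

Keeping experiments in a shared, versioned library like this is what turns a one-off GameDay into a continuous verification practice.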
A typical GameDay runs 2–4 hours and involves the team that owns the service, along with other stakeholders such as team leads and product managers.
To help build a reliability practice, run GameDays on a regular schedule. Teams with high availability tend to run experiments weekly, monthly, or quarterly. In general, running more frequent GameDays can help you achieve your reliability targets faster, so consider scheduling weekly or bi-weekly GameDays.
Once your teams are comfortable running planned incidents, consider adding unplanned incidents. These are called FireDrills. Like a GameDay, a FireDrill uses Chaos Engineering to simulate failure in a system. The difference is that the teams responding to a FireDrill don’t know it’s a drill. This makes them react more realistically, as they would during an actual incident, while the organizers retain the ability to stop and roll back the incident if necessary.
FireDrills are effective at helping teams practice and refine their incident response processes under realistic conditions.
We recommend running FireDrills weekly or bi-weekly, but only after your team has had practice running GameDays. Designate a leader who can coordinate the FireDrill, preferably an engineering team lead who understands the systems being attacked. This ensures someone is always available to respond quickly if something unexpected happens, or if the FireDrill needs to be cancelled.
Incidents will happen, and that’s fine. No system is perfect. When something goes wrong, take corrective action and address the root cause as quickly as possible. Then, once your systems are operational again, do a deep investigation and evaluation of your response, called a post-mortem. Investigate the cause of the problem, the steps the team took to resolve it, the metrics and other observability data that aided the response, and what the team did to prevent the problem from recurring.
Failure is okay: Chaos is going to happen, and we should be seeking out failure just to learn. Those uncomfortable points are where we learn the most.
Incidents are an emotionally stressful time for engineers, especially if they feel they were the ones who caused them. When running a post-mortem, don’t focus on assigning blame. Instead, focus on the processes that enabled the incident to occur in the first place. This is called a blameless post-mortem. For example, if a team member pushed bad code to production, maybe the solution is a more controlled deployment pipeline, more thorough automated testing, or more stringent peer reviews. “Pointing fingers” only discourages engineers from sharing their experiences and insights out of fear of punishment or retribution.
Incidents are an opportunity to learn and grow. As you resolve the root causes of incidents, use Chaos Engineering to validate that your fixes work, and automate this process to ensure your systems remain resilient to the same failure. Practice this not only with your own outages, but also with incidents experienced and documented by other teams and organizations.
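One way to automate that validation is to encode a past incident’s failure mode as a regression experiment that runs alongside your other checks. The sketch below assumes hypothetical helpers (simulate_cache_outage, checkout_still_works, and so on) standing in for your own fault-injection tooling and health checks.

```python
def simulate_cache_outage() -> None:
    """Reproduce the failure mode from the original incident."""
    print("Blocking traffic to the cache")

def restore_cache() -> None:
    print("Restoring traffic to the cache")

def checkout_still_works() -> bool:
    """Probe the user-facing workflow that broke during the incident."""
    return True

def test_cache_outage_no_longer_causes_downtime() -> None:
    simulate_cache_outage()
    try:
        assert checkout_still_works(), "Fix regressed: checkout failed during the cache outage"
    finally:
        restore_cache()

if __name__ == "__main__":
    test_cache_outage_no_longer_causes_downtime()
    print("The fix for the past failure mode still holds")
```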
To improve the reliability of distributed systems, we need to understand how they’re behaving. This doesn’t just mean collecting metrics, but having the ability to answer specific questions about how a system is operating and whether it’s at risk of failing. This is even more important for the large, complex, distributed systems that we build and maintain today. Observability helps us answer these questions.
In this section, we’ll explain what observability is, how it helps solve complex questions about our environments, and how it contributes towards improving the reliability of our systems and processes.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.