When people think about reliability, it’s easy to focus on incident response and moving fast to fix outages. This reactive approach to reliability can very quickly lead to burnout as you bounce from incident to incident.
But that’s not the only way to think about reliability. In the webinar More Reliability, Less Firefighting: How to Build a Proactive Reliability Program, Principal Engineer and reliability expert Jeff Nickoloff laid out the four pillars of best-in-class reliability programs and the 18 actions you can take to build your own.
Based on Jeff’s experience at companies like Amazon and PayPal, along with Gremlin’s work with reliability leaders at Fortune 100 companies, the pillars represent a shift in how to approach reliability—one that can lead to a significant reduction in unplanned downtime, improved customer experience, and more satisfied engineering teams.
Check out these five key ways to think differently about reliability, straight from Jeff.
1. Proactive reliability goals are essential
Incident response is essential for maintaining the reliability of your systems. But when you measure reliability only by reactive, incident-based metrics, you’re measuring what your system’s reliability was, not what it is right now.
Proactive reliability goals put the emphasis on future reliability. They look at the reliability risks present in your system right now and give you a path to address them before they lead to incidents. Instead of waiting for the next incident, you quantify those risks month over month and demonstrate the work being done to reduce them over time.
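To make "month over month" concrete, here is a minimal Python sketch of what tracking open risks could look like. The `RiskSnapshot` record, its field names, and the sample numbers are all hypothetical and only there for illustration; in practice the counts would come from wherever your team records reliability risks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskSnapshot:
    """Hypothetical monthly record of known reliability risks."""
    month: str            # e.g. "2024-01"
    open_risks: int       # risks still unaddressed at month end
    resolved_risks: int   # risks fixed during the month


def month_over_month(snapshots: list[RiskSnapshot]) -> list[tuple[str, int, Optional[int]]]:
    """Return (month, open_risks, change vs. previous month) for each snapshot."""
    rows = []
    previous = None
    for snap in snapshots:
        # The first month has nothing to compare against, so its change is None.
        change = None if previous is None else snap.open_risks - previous.open_risks
        rows.append((snap.month, snap.open_risks, change))
        previous = snap
    return rows


# Illustrative data only, not real measurements.
snapshots = [
    RiskSnapshot("2024-01", open_risks=40, resolved_risks=0),
    RiskSnapshot("2024-02", open_risks=34, resolved_risks=8),
    RiskSnapshot("2024-03", open_risks=27, resolved_risks=9),
]

trend = month_over_month(snapshots)
for month, open_risks, change in trend:
    label = "baseline" if change is None else f"{change:+d}"
    print(f"{month}: {open_risks} open risks ({label})")
```

A negative change means fewer open risks than the month before, which is the kind of trend a proactive goal can be written against.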
2. Tooling needs to be operationalized
Like the art supplies sitting in the closet or the dusty table saw in the garage, tooling without a plan to use it quickly becomes shelfware.
Operationalization is essential for effective reliability results. If you really want to improve your organization’s reliability posture, then make it a priority to pair the right reliability and observability tooling with an operationalized practice that includes regular testing, measured results, and processes to work with other teams to repair reliability vulnerabilities.
This doesn’t have to be complex—especially for teams just getting started. A simple framework, such as Gremlin’s free Reliability Tracker spreadsheet, is a good place to begin mapping out your landscape to record, measure, and report your ongoing reliability. The Navigating the Reliability Minefield whitepaper goes over how to create one and includes a template you can use.
3. Reliability is an ongoing program
You wouldn’t ship a product and then stop working on the next feature, right? Or address a bunch of security vulnerabilities and get rid of your security team? And yet plenty of organizations will have an outage, fix it, run a post-mortem, and then stop paying attention to reliability until the next incident.
If your system is operating and needs to be reliable, then you need to invest in reliability on a continuous basis. With an ongoing program, you’re steadily testing and improving your reliability posture. As a result, you can reduce the number (and frequency) of costly incidents, along with the whiplash and burnout that come with frantic incident response.
4. Your program needs an owner—and leadership support
This is everyone’s problem and no one’s problem simultaneously in most organizations. But who is going to operationalize this thing? Who’s the single person who’s going to be responsible for making sure that this checklist is operationalized and followed up on regularly? Everyone needs to know who that person is. And that person needs to know who the stakeholders are and who is participating in the program.
Okay, so you’ve got regular reliability testing set up, a program funded, and you’ve found some reliability risks. Now what? Unfortunately, if you don’t have clear ownership and accountability, the odds are very low that the risks are ever going to get addressed. (At least, until they cause an outage. Then they’ll get addressed very quickly and very expensively!)
When you document the person responsible for operationalizing the reliability program, all of the owners of the individual services or products, and the leadership who will hold people accountable, suddenly everyone knows who to turn to when there’s a risk.
Everyone has a role to play in improving a system’s reliability. By recognizing, recording, and prioritizing these roles, you make it possible for them to have a demonstrable and valuable impact.
5. Reliability metrics need to connect to what matters for the business
You want to be able to show, ‘Hey, we had this issue. We tested it and reproduced it. We went off, made some changes, tested it again, and now we can show, yes, we are not going to have this issue again.’
You want to be able to show how this impacts customers. Don't just talk about the number of incidents. Talk about orders not dropped, or deals closed, or streams served. Things that matter to the customer—things that matter to your board of directors.
It’s impossible to show something that didn’t happen. This has long been the challenge of paying down tech debt and fixing reliability risks: how do you take credit for preventing an outage that never occurred? But if you set up the right reliability metrics (such as Reliability Scores), you can demonstrate that your system now passes reliability tests it previously failed.
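As a rough illustration of "passes tests it previously failed," the sketch below compares two rounds of reliability test results. It uses a plain pass rate as a stand-in score; this is an assumption for the example, not Gremlin’s actual Reliability Score calculation, and the test names and results are made up.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Percentage of reliability tests that passed (a simple stand-in score)."""
    if not results:
        raise ValueError("no test results to score")
    return 100 * sum(results.values()) / len(results)


def newly_passing(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Tests that failed in the earlier run and pass in the later one."""
    return sorted(
        name
        for name, passed in after.items()
        if passed and name in before and not before[name]
    )


# Hypothetical test names and outcomes, for illustration only.
before = {
    "zone_failover": False,
    "dependency_latency": False,
    "cpu_scaling": True,
    "host_restart": True,
}
after = {
    "zone_failover": True,
    "dependency_latency": False,
    "cpu_scaling": True,
    "host_restart": True,
}

print(f"Score before: {pass_rate(before):.0f}%")
print(f"Score after:  {pass_rate(after):.0f}%")
print("Now passing:", ", ".join(newly_passing(before, after)))
```

The list of newly passing tests is the evidence for a fixed risk: a failure you reproduced, addressed, and can show no longer occurs.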
But that’s still meaningless if you don’t have a quantified business impact for your reliability. And make no mistake: reliability has a very quantifiable business impact. (According to New Relic’s 2023 Observability Forecast, the median annual cost of high-business-impact outages was $7.75 million.)
You need to adjust your mindset (and your metrics) to make that connection between outages and business impact. You should know roughly how much an outage will cost your company in engineering time and in potential lost revenue from lost sales or missed SLAs. Then, by preventing that outage, you also know how much value your reliability efforts created.
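A back-of-the-envelope version of that estimate might look like the Python sketch below. The function name, its parameters, and every number in the example are assumptions chosen for illustration; substitute your own figures for engineering cost, revenue, and SLA penalties.

```python
def estimate_outage_cost(
    duration_hours: float,
    engineers_responding: int,
    engineer_hourly_cost: float,
    revenue_per_hour: float,
    sla_penalties: float = 0.0,
) -> float:
    """Rough outage cost: engineering time + lost revenue + SLA penalties."""
    engineering_time = duration_hours * engineers_responding * engineer_hourly_cost
    lost_revenue = duration_hours * revenue_per_hour
    return engineering_time + lost_revenue + sla_penalties


# Illustrative inputs only: a 2-hour outage with 5 responders.
cost = estimate_outage_cost(
    duration_hours=2.0,
    engineers_responding=5,
    engineer_hourly_cost=100.0,
    revenue_per_hour=20_000.0,
    sla_penalties=5_000.0,
)
print(f"Estimated cost of this outage: ${cost:,.0f}")
```

With these made-up inputs, the estimate is $1,000 of engineering time, $40,000 of lost revenue, and $5,000 in penalties. Preventing one such outage can then be reported as roughly that much value created.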
Conclusion
Improving reliability doesn’t have to mean frantically reacting to pages, caffeine-fueled incident response, and burnout. With the right reliability program, your organization can make steady improvements to reliability that have a demonstrable impact on your business.
It all starts with shifting how you think about reliability.
Next steps:
- Learn about the Four pillars of a best-in-class reliability program based on Gremlin’s work with hundreds of companies leading the charge to improved reliability.
- Download the How to Build a Best-in-Class Reliability Program checklist and build your own best-in-class reliability program.
- Watch the full More Reliability, Less Firefighting: How to Build a Proactive Reliability Program webinar with Jeff Nickoloff for more expert insights.