Four pillars of a best-in-class reliability program

Reliability impacts every organization, whether you plan for it or not. Leading companies take matters into their own hands and get ahead of incidents by building reliability programs. But since many of these programs are still nascent, how do you know what good looks like?

Of course, tools and technology that help your team uncover reliability risks before they impact users play an important role. But improving reliability goes beyond technology. All the tooling in the world is useless unless you operationalize its use.

A best-in-class reliability program extends across teams to improve the resiliency and availability of your systems. At the same time, it enables engineering teams to spend less time fighting fires and resolving incidents so they can focus on vital work like new features or innovations.

These are organizational hurdles, which means they require organizational solutions.

Gremlin has worked with reliability program leaders at Fortune 100 companies to identify the traits of successful programs. Reliability programs built around these four pillars and 18 actions align organizations, get crucial buy-in, and achieve real, measurable improvements to the reliability of their systems.

Four Pillars of a Best-in-Class Reliability Program: Leadership & Strategy, Ownership & Handoffs, Measurement & Metrics, and Processes & Policies

1. Leadership & Strategy

Any program needs clear strategies and goals to be effective—and that holds true for reliability programs. It can be tempting to go with an over-simplified goal along the lines of, “Have less downtime,” but you’ll be much better served by a goal like, “Double the reliability of mission-critical customer-facing applications over the next two quarters.” That’s something you can build a strategy around.

Clear goals and strategies keep people aligned, help you get leadership buy-in, and set expectations for everyone involved. The other three pillars spread out from this one, with ownership telling you where to commit resources, metrics telling you whether you're succeeding, and processes helping you achieve your goals.

Make sure these actions are part of building your reliability program strategy:

Define clear, specific missions and goals. Know what you’re working towards, which services you’re targeting, milestone timelines, and the reliability levels you aspire to.
Identify initial timelines or mission-critical dates. Know how your timelines and resources are informed by business events. Set minimum reliability policies and target dates for compliance.
Focus on goals that are proactive, not reactive or driven by incident response. Reliability programs are about getting ahead of incidents to prevent them, rather than reacting to them. The strategy should have a long-term, policy-based focus on desired reliability levels.
Define clear accountability stakes and secure visible interest from leadership. Everyone involved should understand how they’ll be held accountable for the part of the program they’re responsible for. Everyone should see and feel leadership’s interest in the program.
Establish milestone-based review and celebration. Celebrate and perform retrospectives as the program achieves specific milestones, services hit improvement goals, and service reliability is reflected in the owner’s performance. Milestones are defined as both a target date and a reliability target. Don’t celebrate until you’ve met the reliability target.

2. Clear Ownership & Handoffs

A reliability program needs to drive action to be effective, which means that you need to have clear ownership and ownership handoffs. And that clarity extends beyond ownership within the program itself. When your efforts uncover reliability risks in services (and they will!), you need to know exactly who can own and resolve that risk.

You’ll want to make sure you:

Identify your program owner. Know who is taking responsibility for the program.
Centralize ownership for baselines, testing, and reporting. Define who is measuring your progress and what you’re being measured against.
Decentralize ownership for system improvements. Know who is responsible for making improvements to each service.
Create ownership handoff processes. Track ownership and continuously onboard new owners when service transfer events occur due to management changes or other factors.

3. Measurement & Metrics

As the old business saying goes, “If you didn’t measure it, then it didn’t happen.” If you can’t measure your reliability, then how can you know if your reliability strategy is working? But as you determine metrics, remember that reliability isn’t binary. On the surface, it may seem like there’s simple uptime and downtime, but there are hundreds of factors influencing your system’s reliability at any given time. Things like brownouts, lags, small outages under specific use cases, and other variables make it hard to measure with binary metrics. For example, if there’s a lag in processing a purchase, a customer might just leave and the sale will be lost. The system technically wasn’t down, but there was definitely a business impact.
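To make the difference concrete, here is a minimal Python sketch that compares a binary availability figure with a measure that also counts slow requests as failures. The `Request` record, the sample data, and the 2,000 ms threshold are all assumptions made for illustration, not values from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    succeeded: bool     # did the request complete without an error?
    latency_ms: float   # how long the customer waited


def availability(requests):
    """Binary view: share of requests that did not fail outright."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.succeeded for r in requests) / len(requests)


def good_request_ratio(requests, max_latency_ms):
    """Share of requests that succeeded AND finished within the latency limit."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(r.succeeded and r.latency_ms <= max_latency_ms for r in requests)
    return good / len(requests)


# Hypothetical sample of ten checkout requests:
# seven fast successes, two slow successes, one outright failure.
checkout_requests = (
    [Request(True, 200.0)] * 7
    + [Request(True, 4500.0)] * 2
    + [Request(False, 120.0)]
)

binary_view = availability(checkout_requests)
customer_view = good_request_ratio(checkout_requests, max_latency_ms=2000.0)
print(f"availability: {binary_view:.0%}, good requests: {customer_view:.0%}")
```

In this made-up sample, the binary view reports 90% availability, while only 70% of requests were both successful and fast enough for a customer to plausibly complete a purchase.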

A successful reliability program needs metrics that establish a baseline of reliability as it relates to business value and show increases (or decreases) in reliability over time. It also needs to share those metrics with the broader organization. To get this, you’ll need to:

Define the background behind the program. Document why you’re doing this and quantify the impact of downtime. All relevant parties should have access to this background and have reviewed it together.
Set up consistent and regular reliability measurement and normalized scoring. Understand how reliability is measured and compare reliability fairly between all of your services.
Record your progress against your goals. Build a regularly reviewed progress report, with historical data kept as a reverse-chronological journal in a well-known location for quick review.
Tie high-value golden signals to business metrics. Specify golden signals that are mission critical for protecting your company and/or creating demonstrable value against business metrics.
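One way to picture normalized scoring is a score that means the same thing for every service, no matter how many checks each one runs. The sketch below is one possible approach under stated assumptions: the `CheckResult` type, the check names, and the weights are hypothetical, and the score is simply the weighted share of checks passed on a 0 to 100 scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    weight: float = 1.0  # hypothetical: relative importance of this check


def reliability_score(results):
    """Return a 0-100 score: the weighted share of checks that passed."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("need at least one check with positive weight")
    passed_weight = sum(r.weight for r in results if r.passed)
    return 100.0 * passed_weight / total_weight


# Two hypothetical services with different numbers of checks.
checkout_checks = [
    CheckResult("survives zone failover", True, weight=2.0),
    CheckResult("tolerates dependency latency", False, weight=1.0),
    CheckResult("scales under CPU pressure", True, weight=1.0),
]
search_checks = [
    CheckResult("survives zone failover", True, weight=2.0),
    CheckResult("tolerates dependency latency", True, weight=1.0),
]

scores = {
    "checkout": reliability_score(checkout_checks),
    "search": reliability_score(search_checks),
}
print(scores)
```

Because both services land on the same scale, a service with three checks can be compared fairly with one that has two, and the same number can be tracked over time in your progress report.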

4. Processes & Policies

Reliability isn’t a one-time switch that you flick and suddenly everything works perfectly. It takes a sustained effort from people across the organization, which is where processes and policies are essential to its success. By pairing the right processes and policies with clear ownership, you create accountability, which, in turn, creates results.

Follow these steps to create processes and policies that help you pair accountability with the ability to make informed decisions:

Build a catalog of services, their owners, and the impact of disruption. Make sure the program subjects are well-defined and their criticality well-understood.
Establish biweekly progress reviews. Set regular meetings with program owners, leadership, and service owners to review recent changes and unexpected spikes or dips in progress toward reliability goals.
Document new service onboarding. Build and regularly exercise new service onboarding processes and audit those processes.
Define response to services falling out of compliance. Document how you detect, respond to, and correct services that fall out of compliance with your reliability policies. This definition includes when you review, how you reach out to service owners, what information you collect about the regression, and reasonable timelines for correction.
Define response to services coming into compliance. Document how you recognize and celebrate services coming into compliance with your reliability policies.
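A service catalog and a compliance policy can start out very small. The following Python sketch shows one way to represent them and to list the services that need follow-up. Everything in it is hypothetical: the `Service` fields, the team names, the criticality tiers, and the minimum scores are placeholders for whatever your own policies define.

```python
from dataclasses import dataclass

# Hypothetical policy: minimum reliability score (0-100) per criticality tier.
MINIMUM_SCORE = {"critical": 90.0, "standard": 75.0}


@dataclass(frozen=True)
class Service:
    name: str
    owner: str               # team accountable for improvements
    criticality: str         # must be a key of MINIMUM_SCORE
    disruption_impact: str   # what happens to the business if it fails
    score: float             # latest reliability score, 0-100


def compliance_gap(service):
    """How far the service sits below its required minimum (<= 0 means compliant)."""
    return MINIMUM_SCORE[service.criticality] - service.score


def out_of_compliance(catalog):
    """Return services below their minimum score, largest gap first."""
    failing = [s for s in catalog if compliance_gap(s) > 0]
    return sorted(failing, key=compliance_gap, reverse=True)


catalog = [
    Service("checkout", "payments-team", "critical",
            "customers cannot complete purchases", 82.0),
    Service("search", "discovery-team", "standard",
            "customers cannot find products", 91.0),
    Service("recommendations", "growth-team", "standard",
            "home page shows generic content", 60.0),
]

needs_follow_up = out_of_compliance(catalog)
for service in needs_follow_up:
    print(f"{service.name}: contact {service.owner}, "
          f"{compliance_gap(service):.0f} points below policy")
```

A list like this can feed the biweekly progress review directly: each entry already names the owner to contact and how large the regression is.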

Conclusion

Successful reliability programs are as much about organizational and interpersonal coordination as they are about getting the technology to work right. The four pillars above are essential for building the organizational support necessary to have a real impact on reliability.

And when you combine that program with the right reliability management tools, you’ll have a best-in-class reliability program capable of demonstrably improving the reliability, resiliency, and availability of your systems.

Ready to set up your own reliability program? Watch the How to Build a Proactive Reliability Program webinar and download the How to Build a Best-in-Class Reliability Program checklist to ensure your reliability program checks all the right boxes to make it successful.

December 19, 2023 - 5 min read
