In this tutorial, we'll show you how to use Gremlin's Reliability Tracker. The Reliability Tracker is a framework that helps you find and fix reliability risks before they become disruptive outages. Designed by reliability and Chaos Engineering experts, it lets you map out your services, most likely failure scenarios, and the outcome of reliability testing in the form of an easy-to-read spreadsheet. By combining this spreadsheet with reliability testing, you’ll be able to test your systems, find reliability risks, and know what will happen if they fail—then prioritize your engineering efforts to stop disruptive outages before they happen.
This tutorial will show you how to:
Before starting this tutorial, you’ll need the following:
Note: You don't need a Gremlin account for this tutorial, but if you'd like to sign up for a free 30-day trial, you can do so here.
Before we can start tracking reliability, we need something to track. Gremlin uses services, which are sets of functionality provided by one or more systems within your environment. A service performs some specific function in your application, such as authentication, load balancing, data processing, etc. A service could be:
We recommend focusing on services for several reasons:
We'll start our Reliability Tracker by listing our services. Open a new spreadsheet in your preferred spreadsheet tool. In cell A1, enter "Service". In the column below that, list each service that you want to track. For example:
Service |
---|
AdService |
Authentication |
UserManagement |
EmailService |
It's also important to give each service a tier. A tier is your ranking of how important a service is to the business. For example, a Tier 1 service is mission critical and can't go down for any reason, while a Tier 3 service can go down without significantly impacting the business. If you're not sure what tier to assign a service, think about it this way: if this service went offline right now, how quickly would you need to get it back up? If the answer is "immediately," it's Tier 1. If it's "when I'm done with my current task," it's probably Tier 3. If it's somewhere in between, it's Tier 2.
In column B, enter "Tier" for the heading and assign each service a tier:
Service | Tier |
---|---|
AdService | 3 |
Authentication | 1 |
UserManagement | 1 |
EmailService | 2 |
Next, we need to identify the different ways our services can fail. There are a few things we should keep in mind:
For example, we'll start with Gremlin's reliability tests, which include:
This leaves us with the following spreadsheet (limited for conciseness):
Service | Tier | CPU | Memory | Host | Zone |
---|---|---|---|---|---|
AdService | 3 | | | | |
Authentication | 1 | | | | |
UserManagement | 1 | | | | |
EmailService | 2 | | | | |
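If you'd rather generate the starting sheet than type it by hand, here is a minimal Python sketch that writes the same layout as CSV text, which most spreadsheet tools can import. The `SERVICES` and `TESTS` lists and the `blank_tracker_csv` helper are hypothetical names used for illustration; they aren't part of Gremlin or of the downloadable tracker.

```python
import csv
import io

# Hypothetical service list and tiers, copied from the example table.
SERVICES = [
    ("AdService", 3),
    ("Authentication", 1),
    ("UserManagement", 1),
    ("EmailService", 2),
]
# One column per test type; extend this list as you add failure scenarios.
TESTS = ["CPU", "Memory", "Host", "Zone"]


def blank_tracker_csv(services, tests):
    """Return CSV text: a header row, then one row per service with empty results."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Service", "Tier"] + list(tests))
    for name, tier in services:
        # Result cells start out empty and are filled in after testing.
        writer.writerow([name, tier] + [""] * len(tests))
    return buffer.getvalue()


if __name__ == "__main__":
    print(blank_tracker_csv(SERVICES, TESTS), end="")
```

Save the printed text to a `.csv` file and import it into your spreadsheet tool to get the empty tracker.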
Now you're ready to baseline your services. This means running an initial set of reliability tests—before doing any reliability work—to see where your services stand. There are many tools you can use to do this, but we, of course, recommend using Gremlin. You can use our suite of pre-built reliability tests, or our ever-growing library of Chaos Engineering experiments, by signing up for a free 30-day trial.
If you do decide to use Gremlin, we recommend setting up your services and running reliability tests. Follow our Gremlin Reliability Management quick start guide to set up Gremlin, add your services, and connect your monitors. The tests used in this tutorial have direct equivalents in Gremlin, so you don't need to worry about setting up tests from scratch. Gremlin also gives each service a reliability score, which will help you in the next step. To run the full set of tests (after adding a service and linking at least one Health Check), just find your service, click on it, then click Run All at the top of the page.
Once each test finishes and you have your results, record them in each cell. We recommend using the following key:
Symbol | Description |
---|---|
OK | Test passed |
? | Not tested |
X | Test failed |
N/A | Test not applicable |
What you'll end up with is a table that looks something like this (with different results, of course):
Service | Tier | CPU | Memory | Host | Zone |
---|---|---|---|---|---|
AdService | 3 | OK | OK | X | N/A |
Authentication | 1 | OK | OK | OK | X |
UserManagement | 1 | ? | X | OK | OK |
EmailService | 2 | ? | ? | X | N/A |
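Once the results are recorded, one way to decide what to fix first is to list the failed tests with the most critical tier at the top. The sketch below does this in Python using the example results. Sorting by tier is our assumption about how you might prioritize, and `TRACKER` and `failures_by_tier` are hypothetical names, not part of any Gremlin tooling.

```python
# Results copied from the example table.
TRACKER = [
    {"service": "AdService", "tier": 3,
     "results": {"CPU": "OK", "Memory": "OK", "Host": "X", "Zone": "N/A"}},
    {"service": "Authentication", "tier": 1,
     "results": {"CPU": "OK", "Memory": "OK", "Host": "OK", "Zone": "X"}},
    {"service": "UserManagement", "tier": 1,
     "results": {"CPU": "?", "Memory": "X", "Host": "OK", "Zone": "OK"}},
    {"service": "EmailService", "tier": 2,
     "results": {"CPU": "?", "Memory": "?", "Host": "X", "Zone": "N/A"}},
]


def failures_by_tier(tracker):
    """Return (tier, service, test) tuples for every failed test, Tier 1 first."""
    failed = [
        (row["tier"], row["service"], test)
        for row in tracker
        for test, result in row["results"].items()
        if result == "X"
    ]
    # Tuples sort by tier, then service name, then test name.
    return sorted(failed)


if __name__ == "__main__":
    for tier, service, test in failures_by_tier(TRACKER):
        print(f"Tier {tier}: {service} failed the {test} test")
```

With the example data, the two Tier 1 failures (Authentication's Zone test and UserManagement's Memory test) come out ahead of the Tier 2 and Tier 3 failures.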
Another way to calculate a reliability score is by adding a formula to the spreadsheet. Add a new column to the end of the sheet and name it "Coverage". This column will contain a formula that automatically calculates a score for each service based on its test results. This isn't exactly the same as Gremlin's reliability score, but it does indicate how much testing has been done on each service.
The exact formula will vary depending on how you set up your spreadsheet, but here's an example of the formula for the spreadsheet we just created (note that the downloadable tracker spreadsheet already has this built in):
=IF(COUNTA(C2:F2)=0, "", COUNTIF(C2:F2,"=OK")/COUNTIF(C2:F2,"<>N/A"))
Let's break this down:
- `=IF(COUNTA(C2:F2)=0, "",`: This displays an empty score if no results have been recorded for the service yet (every result cell is blank).
- `COUNTIF(C2:F2,"=OK")`: This counts the number of tests that have passed (i.e. have a result of `OK`).
- `COUNTIF(C2:F2,"<>N/A")`: This counts the total number of applicable tests (i.e. tests that are not marked `N/A`). Tests marked `?` still count here, so untested scenarios lower the score.

The score is the number of passed tests divided by the number of applicable tests. The result is an at-a-glance indicator of which services are verified to be reliable, and which ones need additional testing.
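If you keep your results outside a spreadsheet, the same calculation can be sketched in a few lines of Python. This is an illustrative equivalent of the formula, not Gremlin's reliability score. The `coverage` function name is hypothetical, and returning `None` for a row where every test is `N/A` is our assumption (the spreadsheet formula would show a division error in that case).

```python
def coverage(results):
    """Mirror the spreadsheet formula for one row of result cells.

    Returns None where the sheet would show an empty cell.
    """
    # COUNTIF ignores case, so normalize before comparing.
    cells = [cell.upper() for cell in results]
    # COUNTA(C2:F2)=0: nothing has been recorded for this service yet.
    if not any(cells):
        return None
    # COUNTIF(C2:F2,"=OK"): tests that passed.
    passed = sum(1 for cell in cells if cell == "OK")
    # COUNTIF(C2:F2,"<>N/A"): every cell that is not N/A, including "?".
    applicable = sum(1 for cell in cells if cell != "N/A")
    # Assumption: report no score when every test is marked N/A.
    if applicable == 0:
        return None
    return passed / applicable


if __name__ == "__main__":
    example = {
        "AdService": ["OK", "OK", "X", "N/A"],
        "Authentication": ["OK", "OK", "OK", "X"],
        "UserManagement": ["?", "X", "OK", "OK"],
        "EmailService": ["?", "?", "X", "N/A"],
    }
    for service, row in example.items():
        print(f"{service}: {coverage(row):.0%}")
```

For the example table, this gives 67% for AdService (2 of 3 applicable tests passed), 75% for Authentication, 50% for UserManagement, and 0% for EmailService.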
Congratulations, you've just tested the reliability of your services! This first spreadsheet is your baseline: it represents your starting point on your reliability journey. As you perform additional assessments, you can compare your new scores to your baseline scores to see just how much progress you've made, and hopefully show some improvements!
Systems and services are changing constantly: new code gets integrated, infrastructure gets provisioned, and one-off changes or hotfixes get deployed. Any change like this can impact reliability, so you must repeatedly test your services to see if their reliability has decreased. Repeat testing also tracks improvements to reliability. If you make a change that makes a service more reliable, you won't know for sure until you test it again.
Before starting a new round of testing, duplicate the current page of your spreadsheet. Clear out the test results in the cells so that the new page is blank. This is your new results table for this next round of testing.
Make sure to add new services and remove services that are no longer relevant. Now repeat step 3 to fill in the new table with your new (hopefully improved) results.
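If you track results in code rather than in a spreadsheet, starting a new round can be sketched the same way: copy the previous round, blank out every result, and apply any service changes. The data layout, the `start_new_round` function, and the `SearchService` entry below are hypothetical and only illustrate the idea.

```python
TESTS = ["CPU", "Memory", "Host", "Zone"]

# A shortened, hypothetical baseline round keyed by service name.
BASELINE = {
    "Authentication": {"tier": 1,
                       "results": {"CPU": "OK", "Memory": "OK", "Host": "OK", "Zone": "X"}},
    "EmailService": {"tier": 2,
                     "results": {"CPU": "?", "Memory": "?", "Host": "X", "Zone": "N/A"}},
}


def start_new_round(previous, tests, added=None, removed=()):
    """Build a blank results table for the next round of testing.

    previous: the last round's table (left unmodified)
    added:    mapping of new service name -> tier
    removed:  names of services that are no longer relevant
    """
    new_round = {}
    for service, row in previous.items():
        if service in removed:
            continue
        # Keep the tier, but clear every result so the new page starts blank.
        new_round[service] = {"tier": row["tier"],
                              "results": {test: "" for test in tests}}
    for service, tier in (added or {}).items():
        new_round[service] = {"tier": tier,
                              "results": {test: "" for test in tests}}
    return new_round


if __name__ == "__main__":
    round_two = start_new_round(
        BASELINE, TESTS,
        added={"SearchService": 2},   # hypothetical new service
        removed={"EmailService"},     # hypothetical retired service
    )
    for service, row in round_two.items():
        print(service, row["tier"], row["results"])
```

Because the previous round is copied rather than overwritten, the baseline stays available for comparison after the new results are filled in.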
Congratulations, you now have a living record of your reliability work and improvements! You can keep working off this spreadsheet you've created, or you can download our pre-made Reliability Tracker spreadsheet and modify it to fit your needs.
If you'd like a more automated and scalable way of tracking reliability, give Gremlin a try. Gremlin tracks all of your services and their dependencies, provides pre-built reliability tests that you can run with one click, and tracks your full testing history. You can easily see each service's reliability posture in the form of a reliability score. You can also track service reliability over time to quickly identify services that may have reliability risks, or recognize services that have become more reliable. Start with a free 30-day trial and see how easy it is to measure and manage reliability.