In a recent blog post, we explained how every application has a critical path: the set of components that are essential to the application’s operation. A failure along the critical path makes your application unavailable, which means unhappy customers, reduced revenue, and a hit to your company’s reputation. For these reasons, we need to focus on making our critical path as reliable as possible. But to do that, we first need to know which components are part of it.
In this tutorial, we’ll show you how to identify the parts of your critical path using Chaos Engineering. We’ll identify each of the components that make up our application, run chaos experiments to determine how essential they are, observe how our systems respond, and use this information to map out our path. This way, we can determine where to focus our efforts on improving reliability while learning more about how our applications work.
Before starting this tutorial, you’ll need:

- A Kubernetes cluster with Online Boutique deployed, plus kubectl access to it.
- A Gremlin account, with the Gremlin agent installed on the cluster.
First, we need to identify the different components, services, and dependencies that make up our application. If you already have an architecture diagram, great! If not, draft up a high-level diagram showing these components and how they’re related. Draw a visible line between services that are linked or networked together, as this indicates a dependency. For simplicity, only focus on application components and services, not infrastructure or network topology. For example, Online Boutique provides the following diagram:
Based on this diagram alone, we can make some assumptions about the critical path. Remember that the critical path is the set of components that must be up and running for our application to perform its core function. For Online Boutique—as with any e-commerce site—this means letting customers browse products, add to cart, and place orders.
We can assume that the Frontend service is part of this path because it’s the point of entry for customer traffic. We can also assume that the ProductCatalogService, CheckoutService, PaymentService, CartService, and Redis are part of the critical path because they handle product interactions, checkout, payment processing, and shopping cart functionality respectively. If we highlight these services, our diagram now looks like this:
This is a good starting point, but how do we know that we’ve identified every critical service? We can assume that the AdService and EmailService aren’t required for customers to place orders, but how do we know this for sure? To test our assumptions, we’ll use Gremlin to simulate an outage in one of these services, try performing our application’s core function, and use our observations to determine whether the service is critical. By repeating this process with different services, we can create a clearly defined map of our critical path.
Next, we need to choose which service to test. We could pick a service that we’re confident is part of our critical path, like the Frontend, but for this experiment let’s pick one that we’re not so sure about, like the EmailService.
EmailService is a backend service that gets called by the CheckoutService whenever a customer places an order. The CheckoutService sends over the order details, and the EmailService emails a confirmation to the customer. Ideally, this process should happen asynchronously: customers should be able to complete their orders without having to wait for the confirmation email to be sent. We’ll design a chaos experiment around this assumption, then use Gremlin to test it.
Our hypothesis is this: if the EmailService is down, customers can still place orders without noticing any change in application performance or latency. We’ll simulate an outage by using a blackhole attack to block all network traffic between the CheckoutService and the EmailService. For safety, we’ll abort the test if the attack causes orders to fail, since that would tell us the EmailService is part of the critical path. Our chaos experiment looks like this:

- Hypothesis: customers can place orders while the EmailService is unreachable, with no noticeable change in performance or latency.
- Attack: a blackhole attack that drops all network traffic between the CheckoutService and the EmailService.
- Abort conditions: orders start failing, which would mean the EmailService is on the critical path.
Now that we’ve defined our experiment, it’s time to run the attack. Before starting it, open the application in a web browser so you can directly observe the impact by placing orders while the attack is running.
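To make the impact easier to spot, you can also tail the CheckoutService’s logs in a second terminal while you place orders (this assumes the Deployment name used by the default Online Boutique manifests):

```
kubectl logs -f deploy/checkoutservice
```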
While the attack is running, open your browser and place an order. When you click “Place Order,” the page will get stuck in a loading state, and the order will eventually complete after 20–30 seconds. If we check the logs for the CheckoutService, we’ll see connection errors related to the EmailService:
```
kubectl logs deploy/checkoutservice
{
  "message": "failed to send order confirmation to \"someone@example.com\": rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.43.247.203:5000: i/o timeout\"",
  "severity": "warning",
  "timestamp": "2020-09-15T19:48:40.701358275Z"
}
```
If we look at the source code for the CheckoutService, we see that it makes an RPC call to the EmailService. This call is synchronous, meaning it will block execution until it receives a response. And while there is a timeout set, it’s long enough for customers to become frustrated while waiting for the site to load. EmailService is effectively part of our critical path even though we didn’t intend for it to be.
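To make the failure mode concrete, here’s a simplified sketch of the pattern rather than the exact Online Boutique source: a synchronous gRPC call that holds up the checkout flow until the EmailService responds or the context times out. The import path, struct fields, and timeout value are assumptions standing in for the generated protobuf client and the real service code.

```go
package checkout

import (
	"context"
	"time"

	// Hypothetical path standing in for the generated protobuf/gRPC package.
	pb "example.com/onlineboutique/genproto"
)

// checkoutService holds a gRPC client stub for the EmailService (field name assumed).
type checkoutService struct {
	emailClient pb.EmailServiceClient
}

// sendOrderConfirmation blocks until the EmailService responds or the context
// deadline expires; the checkout flow cannot finish until it returns.
func (cs *checkoutService) sendOrderConfirmation(ctx context.Context, email string, order *pb.OrderResult) error {
	// A generous timeout means a blackholed EmailService keeps the customer
	// waiting for the full duration before the call gives up.
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	_, err := cs.emailClient.SendOrderConfirmation(ctx, &pb.SendOrderConfirmationRequest{
		Email: email,
		Order: order,
	})
	return err
}
```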
How do we fix this? Order confirmation emails are important, but they don’t need to be delivered immediately. We can make the call to the EmailService asynchronous so that the CheckoutService returns control to the customer as soon as it dispatches the RPC, instead of waiting for a response. The problem with this approach is that if the EmailService is down, we’d need to build our own retry mechanism or risk never emailing the customer.
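Continuing the sketch above (same assumed package and types, plus the standard library log package), one way to do this in Go is to fire the RPC from a goroutine so the checkout flow no longer waits on it. The trade-off is visible in the code: a failed send is only logged, not retried.

```go
// completeOrder finishes the checkout first, then hands the confirmation email
// off to a goroutine so the customer isn't kept waiting on the EmailService.
// (Hypothetical method name; the real service does more work than shown here.)
func (cs *checkoutService) completeOrder(email string, order *pb.OrderResult) {
	// ... charge the card, create the shipment, return the result to the customer ...

	go func() {
		// Use a fresh context: the original request context is canceled as soon
		// as the checkout RPC returns to the caller.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()

		if err := cs.sendOrderConfirmation(ctx, email, order); err != nil {
			// Fire-and-forget: if the EmailService is down, the email is lost
			// unless we add our own retry or queueing logic.
			log.Printf("failed to send order confirmation: %v", err)
		}
	}()
}
```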
A more advanced—but more effective—solution would be to use a message pipeline like Apache Kafka to broker messages between the two services. This decouples the EmailService from the CheckoutService, allowing both services to send and consume data at their own speed. This adds another layer of complexity to our stack, but it will let us tolerate dependency failures, guarantee message delivery, and ensure a great customer experience.
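As a rough illustration of that design, the CheckoutService could publish an “order placed” event to a Kafka topic and return immediately, while the EmailService consumes the topic at its own pace and retries failed sends. The sketch below uses the third-party github.com/segmentio/kafka-go client with an assumed broker address, topic name, and event schema; Online Boutique doesn’t ship with Kafka, so treat this as one possible approach rather than the project’s implementation.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

// orderPlaced is a hypothetical event payload for the EmailService to consume.
type orderPlaced struct {
	OrderID string `json:"order_id"`
	Email   string `json:"email"`
}

func main() {
	// The broker address and topic name are assumptions for this sketch.
	writer := &kafka.Writer{
		Addr:         kafka.TCP("kafka:9092"),
		Topic:        "order-confirmations",
		Balancer:     &kafka.LeastBytes{},
		RequiredAcks: kafka.RequireAll, // wait for the brokers to persist the event
	}
	defer writer.Close()

	payload, err := json.Marshal(orderPlaced{OrderID: "b6f9b26e", Email: "someone@example.com"})
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Publishing succeeds even if the EmailService is down; the event simply
	// waits in the topic until a consumer is available to send the email.
	if err := writer.WriteMessages(ctx, kafka.Message{
		Key:   []byte("b6f9b26e"),
		Value: payload,
	}); err != nil {
		log.Fatal(err)
	}
}
```

Because the event is durable, the EmailService can restart or stay down for a while without losing confirmations, which is exactly the decoupling our experiment showed we were missing.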
We found a hidden dependency in one service, but what about other services, like the ShippingService, RecommendationService, or AdService? What about third-party dependencies like mailing services, managed databases, and external payment processors? We can repeat this same experiment across each of those services to test whether they’re also part of our critical path. We can also test services that we assume are on the critical path, such as the CurrencyService, to confirm whether they really are critical or whether we can tolerate them failing. Using Gremlin, we can test all of these scenarios across our entire stack quickly and safely.
Additionally, dependencies can fail in more ways than simply becoming unavailable. Latency, packet corruption, and elevated resource consumption can cause problems that are just as bad as, if not worse than, a complete outage, and these failure modes will vary depending on the type of dependency and how our application interacts with it. For example, high latency between the CartService and its Redis cache could slow down every page that displays the cart, even though both services are technically still up.
While we’d like to guarantee four or five nines of uptime for every service we manage, we often don’t have the time or manpower. Not all of our services are equally critical to the user experience. Reliability is an incremental process, and by focusing on the applications and services that are most essential to our business, we can greatly reduce the risk of an outage taking down our core operations.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.