The Challenge
Remind is a communication platform that allows schools and districts to reach and engage with their communities. With a presence in over 70% of US public schools, Remind experiences a pattern of cyclical usage: The number of active users skyrockets every year in August and September, when teachers and students return to school. In 2017, during the company's biggest back-to-school season to date, Remind passed 27 million active users and spent two weeks at the top of Apple's App Store.
To prepare for the annual surge, Remind adopted robust Chaos Engineering tools to test for potential failures—and prevent service disruptions that could impact the ability of teachers, students, and parents to communicate during one of the most important times of the school year.
The Solution
Remind ran chaos experiments with Gremlin's Latency Gremlin to prevent failures during the crucial period when millions of teachers, students, and parents were downloading and using the app. This allowed the team to introduce latency, a delay in network communication, to their Redis cache, which writes to DynamoDB.
It's important to run these experiments frequently, not just once, because latency can swing widely under different circumstances. For example, it's useful to see whether a service can handle 100ms of latency across the board. As confidence grows that lower amounts of latency won't negatively affect the system, expand the blast radius and run subsequent experiments with higher latency to find the point at which the application fails. Another useful experiment tests how the service responds when only a single host has high latency.
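The ramp described above can be sketched as a small loop. This is a minimal illustration, not Remind's or Gremlin's tooling: `run_latency_ramp`, `breaking_point`, the `modeled_service` response-time model, and the 1000 ms budget are all hypothetical names and numbers chosen for the example. In a real experiment the observed response time would come from monitoring while latency is injected, rather than from a formula.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class ExperimentResult:
    injected_ms: int      # latency added to each datastore call
    response_ms: float    # observed end-to-end response time
    healthy: bool         # True if the response stayed within budget


def run_latency_ramp(
    response_time_ms: Callable[[int], float],
    steps_ms: Iterable[int],
    budget_ms: float,
) -> List[ExperimentResult]:
    """Run one experiment per latency step, smallest first, halting on failure."""
    results = []
    for injected in steps_ms:
        observed = response_time_ms(injected)
        healthy = observed <= budget_ms
        results.append(ExperimentResult(injected, observed, healthy))
        if not healthy:
            break  # stop ramping once the budget is exceeded
    return results


def breaking_point(results: List[ExperimentResult]) -> Optional[int]:
    """Return the first injected latency that broke the budget, if any."""
    for result in results:
        if not result.healthy:
            return result.injected_ms
    return None


def modeled_service(injected_ms: int) -> float:
    # Assumed model: 40 ms of local work plus three sequential datastore calls.
    return 40 + 3 * injected_ms


results = run_latency_ramp(modeled_service, [100, 200, 400, 800], budget_ms=1000)
print(breaking_point(results))
```

Starting small and halting at the first unhealthy step keeps each experiment's impact limited while still locating the latency at which the modeled service stops meeting its budget.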
In this case, Remind ran latency experiments to see how slow DynamoDB could get before users started feeling the effects. Identifying potential issues in advance would allow the team to prepare for them more effectively.
As Remind's Head of Infrastructure Engineering Michael Barrett put it, "Chaos Engineering allows us to have 'pre-mortems' instead of post-mortems."
Results
The results of these experiments revealed where Remind could prepare additional technical resources during peak usage. Even if Redis became slow, the team could ensure that upstream services would remain stable.
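The case study does not say how Remind kept upstream services stable when Redis slowed down. One common approach is to put a deadline on each cache read and return a fallback value when the deadline passes. The sketch below assumes that approach; `read_with_deadline`, `slow_cache_get`, and `fast_cache_get` are hypothetical names, and the sleeps stand in for a real cache client.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool so callers never block on the cache directly.
_pool = ThreadPoolExecutor(max_workers=4)


def read_with_deadline(fetch, key, timeout_s, fallback):
    """Call fetch(key), but give up and return fallback after timeout_s seconds."""
    future = _pool.submit(fetch, key)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only prevents a call that has not started yet; a call that
        # is already running finishes in its worker thread and is ignored.
        future.cancel()
        return fallback


def slow_cache_get(key):
    time.sleep(0.8)  # stand-in for a cache suffering injected latency
    return "fresh:" + key


def fast_cache_get(key):
    return "fresh:" + key  # stand-in for a healthy cache


print(read_with_deadline(slow_cache_get, "greeting", 0.2, "stale"))
print(read_with_deadline(fast_cache_get, "greeting", 0.2, "stale"))
```

With a deadline in place, a slow cache costs the caller at most the timeout, and a latency experiment can confirm that the fallback path actually triggers.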
The team was also able to prioritize resource allocation during peak traffic to make sure that the most important features always remained functional. This way, Remind could spend two weeks as #1 in the App Store without disrupting service for teachers, students, and parents.
Preparing for failure means maximizing user experience, and that can involve both design and technical solutions. The results of the chaos experiments also allowed Remind’s engineers to collaborate with their design team on new fallback experiences—so that no matter what, their users had a consistent experience on the platform.
What Comes Next
Chaos Engineering is now part of the engineering culture at Remind, and service owners are empowered to run chaos experiments with Gremlin. This includes incorporating chaos experiments into the normal deployment pipeline and automating them alongside unit and integration testing.
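As a rough illustration of a latency check that runs alongside unit tests, the sketch below uses Python's standard `unittest` module. It is an in-process stand-in, not the Gremlin integration: Gremlin injects latency at the network level, whereas the hypothetical `SlowedClient` here simply sleeps before each read. `load_profile`, the 0.05 s delay, and the 0.5 s budget are likewise assumptions made for the example.

```python
import time
import unittest


class SlowedClient:
    """Hypothetical test double: wraps a client and delays every get()."""

    def __init__(self, inner, delay_s):
        self._inner = inner
        self._delay_s = delay_s

    def get(self, key):
        time.sleep(self._delay_s)  # simulated datastore latency
        return self._inner.get(key)


def load_profile(client, user_id):
    """Hypothetical code under test: reads one record from a datastore client."""
    return client.get(f"profile:{user_id}") or {"name": "unknown"}


class LatencyToleranceTest(unittest.TestCase):
    BUDGET_S = 0.5  # assumed response-time budget for this call

    def test_profile_load_stays_within_budget(self):
        store = {"profile:42": {"name": "Ada"}}  # a dict serves as the client
        client = SlowedClient(store, delay_s=0.05)

        started = time.monotonic()
        profile = load_profile(client, 42)
        elapsed = time.monotonic() - started

        self.assertEqual(profile, {"name": "Ada"})
        self.assertLess(elapsed, self.BUDGET_S)
```

Saved as a test module, this runs with `python -m unittest` like any other test, so a pipeline that already runs unit tests would pick it up without extra wiring.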
Gremlins Used
Latency Gremlin - Injects latency into all matching egress network traffic.
The Latency Gremlin is one of four Network gremlins that allow you to see the impact of lost or delayed traffic to your application.