For an enterprise organization with 30,000+ employees around the world, there can be no debate about the reliability of the platforms it delivers, which provide customers with datasets, solutions, and expert analysis. And when half the company is made up of developers, with single teams boasting 50 or more, there’s a very slim margin for error.
S&P Global Market Intelligence’s Reliability and Innovation team identifies and introduces best practices and tools for the entire S&P Global Marketplace, which provides organizations in a range of industries with research and consulting reports.
The team explored various tools and methodologies, landing on chaos engineering as the best approach for improving product reliability. Lead Developer Mani Shanmugavel and his team bumped up against limitations with several tools they tested.
Mani was spurred on by the experience of his colleague Ravi Venigalla, who was already familiar with Gremlin from a previous role in site reliability engineering (SRE). For S&P, Gremlin’s technology-agnostic, experiment-friendly features turned out to be the right fit for combining its chaos engineering efforts with event-driven automation (EDA).
The Reliability and Innovation team sits within S&P Global Market Intelligence, which is a division of S&P Global, a leading provider of credit ratings, benchmarks, analytics and workflow solutions in the global capital, commodity and automotive markets. “We have different teams taking care of different development processes,” Mani explains. The team’s objective is platform reliability.
S&P needed to measure its reliability initiative to confirm that it delivered concrete improvement. Service level objectives (SLOs) and service level indicators (SLIs) are written into published reports so that managers can keep a tight rein on meeting service level agreements (SLAs).
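As a rough illustration of how an SLI can be checked against an SLO, the sketch below computes an availability ratio and compares it with a target. The function names, the event counts, and the 99.9% target are assumptions chosen for the example; they are not figures from S&P.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of events that were good.

    An empty measurement window is treated as fully available (an assumption).
    """
    if total_events == 0:
        return 1.0
    return good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """Return True when the measured indicator is at or above the objective."""
    return sli >= slo_target


# Hypothetical numbers for one reporting window.
sli = availability_sli(good_events=99_950, total_events=100_000)
print(f"SLI={sli:.4f}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

A report built this way gives managers a single pass/fail signal per window alongside the raw indicator.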
While analyzing different platform and product reliability approaches, they hit upon chaos engineering. “We were enlisting different tool sets,” Mani says, “open source, like Chaos Monkey and Chaos Toolkit.”
Unfortunately for the team, they quickly hit roadblocks.
Chaos Monkey offers limited functionality at the infrastructure level, which did not match their need for broader experimentation. “Some attacks at the network level aren’t possible in Chaos Monkey or with other open source tools,” Mani says.
There were compatibility issues too. “Our current systems are all deployed on AWS,” Ravi says.
“We have applications running in SpringBoot, .NET and React,” Mani adds. “We wanted a tool that is agnostic to all these different technologies. That's one of the reasons we looked at Gremlin.”
“Automation is one of our key objectives,” Ravi says. “We need to break the system to make sure that everything is working as expected, before even going into the production environment.” Chaos engineering is about preempting incidents and outages by intentionally breaking systems in a controlled way, so that issues can be remediated before they impact customers.
Deploying to the staging environment, within the release pipelines, ideally entails performing automated chaos tests. S&P does test in production, “but with very limited scope,” Ravi says. “We only do it for major releases, not minor releases. In the load environments, we do it for all releases.”
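The release policy Ravi describes can be sketched as a small pipeline gate. This is a minimal sketch, not S&P's actual pipeline code: the `Release` type, the environment labels, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    environment: str  # hypothetical labels: "load", "staging", "production"
    kind: str         # "major" or "minor"


def should_run_chaos_tests(release: Release) -> bool:
    """Decide whether a pipeline run should include chaos tests.

    Production is tested only for major releases; every other
    environment is tested for all releases.
    """
    if release.environment == "production":
        return release.kind == "major"
    return True


for candidate in (
    Release("load", "minor"),
    Release("production", "minor"),
    Release("production", "major"),
):
    print(candidate, should_run_chaos_tests(candidate))
```

Keeping the rule in one function makes the limited production scope explicit and easy to review.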
The team needed a tool with the functionality to test various reliability risks to their systems, such as scaling, redundancy, and dependency loss and latency. Additionally, they needed a tool that would be compatible with various applications. Gremlin not only offered this combination, but provided automated testing solutions on top.
A key environment for automated testing is disaster recovery (DR). “To simulate the DR, we use Gremlin,” says Mani. Attacks initiated in a pipeline cause failures in the current production region, and the workload then fails over to the DR environment automatically.
Ravi explains that their team used Gremlin for black hole, CPU, and memory attacks. In fact, Ravi was already familiar with Gremlin from his time spent on SRE in his previous role, which made for comfortable onboarding within S&P. “We’re in the process of combining event driven automation with chaos engineering,” he says. “We disrupt systems using chaos engineering, and then we try to fix them with EDA.”
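The disrupt-then-fix loop Ravi describes can be illustrated with a toy, in-memory simulation. Everything here is hypothetical: the `Service` and `EventBus` classes, the `service_unhealthy` event name, and the restart handler stand in for a real chaos attack, a monitoring event, and an EDA remediation job. No Gremlin or EDA product API is used.

```python
from typing import Callable, Dict, List


class Service:
    """A toy service with a single health flag."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True


class EventBus:
    """Minimal publish/subscribe bus standing in for an EDA platform."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[Service], None]]] = {}

    def subscribe(self, event_type: str, handler: Callable[[Service], None]) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type: str, service: Service) -> None:
        for handler in self._handlers.get(event_type, []):
            handler(service)


def inject_failure(service: Service, bus: EventBus) -> None:
    """Stand-in for a chaos attack: break the service, then emit the
    event that monitoring would raise."""
    service.healthy = False
    bus.publish("service_unhealthy", service)


def restart_service(service: Service) -> None:
    """Stand-in for an automated remediation action."""
    service.healthy = True


bus = EventBus()
bus.subscribe("service_unhealthy", restart_service)

svc = Service("example-api")
inject_failure(svc, bus)
print(f"{svc.name} healthy after remediation: {svc.healthy}")
```

If no handler is subscribed, the service stays unhealthy, which is exactly the gap such an experiment is meant to expose.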
“Most of [our applications] are running in Kubernetes Clusters, hosted in AWS,” Mani adds. “We installed the Gremlin agent in the EKS clusters and initiated attacks.” They then assess whether the affected workloads autoscale as expected.
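One way to phrase the autoscaling assessment is as a hypothesis check on replica counts sampled before and during an attack. The sketch below is an assumption about how such a check could look; the function name and sample values are hypothetical, and in practice the counts would come from the cluster rather than a hard-coded list.

```python
from typing import Sequence


def autoscaling_observed(baseline_replicas: int,
                         replicas_during_attack: Sequence[int]) -> bool:
    """Return True if any sample taken during the attack shows more
    replicas than the baseline."""
    return any(count > baseline_replicas for count in replicas_during_attack)


# Hypothetical samples: the workload starts at 3 replicas and grows under load.
print(autoscaling_observed(3, [3, 3, 4, 5]))
# Hypothetical samples where the workload never scales.
print(autoscaling_observed(3, [3, 3, 3]))
```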
This testing didn’t just involve their team, either. Developers, infrastructure, quality assurance and operations all got on board with reliability for what was very much an organizational effort.
The team didn’t stop there. They wanted to “bake reliability into the systems development life cycle (SDLC) process.” It was ultimately just as much about “cultural transmission” as it was about maintaining reliability of their key platforms.
This is ultimately how reliability engineering became part of the SDLC process for Market Intelligence.
One of the ways in which they encouraged this cultural transmission across the whole organization was through hackathons. “As part of our API cycle, we ran hackathons where we did a ‘Chaos Day’ kind of thing,” Ravi says.
“It was to let the developers know that there is a process called chaos engineering and there is a tool called Gremlin that you should bake into your development cycle,” Mani explains.
The results were powerful: hackathons helped amplify both what the team was doing and the tools they were using to do it.
The hackathons mirrored the team’s own experimentation with Gremlin. Ravi and Mani gave developers access to Gremlin and their development clusters, as well as the attack parameters. Another challenge was to automate the attacks through a pipeline, and the developers involved in the hackathon were provided with some guidance on what that entailed and granted access to Gremlin’s APIs.
“We got some amazing results,” Mani highlights. “The hackathons were a great success.”
After initial reluctance from the developer community during the introduction of chaos engineering, the hackathons encouraged wide scale adoption across the Market Intelligence division. “It's baked into [S&P’s] development now,” Ravi says.
Other teams within the division now run their own tests without even needing to consult the reliability team. “Once they performed those tests, they understood the value,” he says. If a team’s hypothesis proves incorrect, there is room to fix the underlying issue prior to production.
The fun comes from bringing down their own applications, assessing behavior, and thinking through solutions. “There’s a gaming factor to this,” Mani says. They challenged people to break the system in minutes, with rewards offered for success. “People got really excited about it — they started writing all these use cases to break their systems.”
Baking reliability into the development process on a division level and transforming culture in the process is just one aspect of the Reliability and Innovation team’s success.
Another benefit that came with the introduction of Gremlin is that non-technical colleagues now know why failures happen. Gone are the ambiguous on-screen messages with uncertain timeframes for a fix. They have been replaced by meaningful messages about controlled outages that state what, specifically, is happening and how long it is going to take: The system is down, EDA kicks into action, and everything should be back up and running in a few minutes.
S&P has been using Gremlin for over a year now. What started off in one team in Market Intelligence has now ballooned out to the rest of the organization. Mani noted interest from the S&P Global Ratings group, “asking for access to Gremlin” so that they could begin their own reliability testing.
The success with Gremlin in Market Intelligence extends further than one division. Different CTOs and QA teams in different divisions talk. “That's how the interest gets generated,” he says.
“Gremlin is one of our key tools for reliability,” Ravi says. It has empowered this Market Intelligence team to work proactively and “practically identify issues,” Mani explains.
The team’s goal had been to move from reacting to alerts after the fact to detecting issues in earlier cycles. “Gremlin really helped us with the transformation,” Ravi says.
“It’s given us the confidence to dare to fail,” Mani says. Their team expects S&P to roll Gremlin out across the organization. “It’s simply grown bigger,” he says. “We want to take it to every developer, QA team, and division within S&P.”
The organization can expect more reliable applications and improved customer experience as a result. “Gremlin is going to be part of that process.”
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.