Automate reliability testing in your CI/CD pipeline using the Gremlin API

For many software engineering teams, most testing is done in their CI/CD pipeline. New deployments run through a gauntlet of unit tests, integration tests, and even performance tests to ensure quality. However, there's one key test type that's excluded from this list, and it's one that can have a critical impact on your application and your organization: reliability tests.

As software changes, reliability risks get introduced. Without thorough testing, these risks can find their way into production and result in incidents, outages, and downtime. To avoid costly failures, teams need to run automated reliability tests alongside their existing tests.

In this blog, we'll show how to do this by integrating Gremlin with your CI/CD pipeline.

How Gremlin works

Gremlin is a Chaos Engineering and reliability testing platform. It enables software engineers to run experiments on their systems and validate their resilience to failures such as node failures, networking errors, and cloud availability zone outages. Gremlin provides over a dozen experiment types to test these different failures, and a suite of pre-built tests designed to test your services against industry best practices. For Kubernetes-based services, Gremlin also automatically detects reliability risks and highlights them for you.

In addition to a web application, Gremlin also provides a comprehensive REST API for managing services, experiments, and tests. We recommend using the API to integrate with CI/CD tools, and the examples in this blog use it throughout.

How does Gremlin fit into CI/CD?

CI/CD—short for Continuous Integration and Continuous Delivery—is a set of practices and processes for managing code changes all the way from development to production. Continuous Integration (CI) is the practice of managing source code changes so that multiple developers can contribute to the same codebase. Continuous Delivery (CD) is the process of taking this codebase and deploying it to a live system.

The primary goal of CI/CD is to automate and streamline the deployment of code changes into production while maintaining high software quality standards. Platforms like Jenkins, GitLab, and GitHub Actions enable this by providing ways to build code, deploy artifacts, and integrate with other tools.

CI/CD is also a place where teams enforce software quality standards. This usually means running a series of unit, integration, and performance tests to see whether the new code negatively impacts functionality or performance. But one aspect that's often missing from this is reliability testing. Integration tests are good for finding errors in code, but finding reliability risks is equally, if not more, important. Unreliable code can cause production outages, which cost companies time, revenue, and customer goodwill.

To avoid this, teams need to integrate reliability testing into their CI/CD pipelines, and Gremlin provides the means to do so.

Methods of integrating reliability testing into CI/CD

In this blog, we'll look at two ways teams integrate Gremlin into their CI/CD pipeline:

Using Gremlin to run reliability tests similar to unit or integration tests.
Using Gremlin to generate reliability scores, which then become a gating function for releases.

In both cases, we'll assume you have a test or staging environment that you can deploy new code to for testing. You'll also need to deploy the Gremlin agent to this environment, as the agent is required for running tests.

Method 1: Running a test suite similar to QA testing

This method involves using Gremlin to run ad-hoc reliability tests during the CI/CD pipeline alongside other tests. Gremlin provides a REST API for starting experiments, querying their status, and determining the outcome.

You can run one-off experiments using the attacks endpoint, but we recommend using Scenarios via the scenarios endpoint. Scenarios let you run multiple experiments sequentially, as well as monitor the state of the system being tested using Health Checks. Health Checks integrate directly with your observability tool of choice so Gremlin can immediately determine if a test is negatively impacting your service.

For example, you can configure Gremlin to run a series of experiments to consume increasing amounts of CPU capacity on a service, while simultaneously watching a Datadog monitor. If the Datadog monitor indicates the service has entered an unhealthy or undesirable state, Gremlin will automatically stop the Scenario and mark it as a failure. Based on this outcome, you can then stop the build or notify an engineer to review the outcome.

Here's what this looks like in practice. After you've deployed your code to a test environment, add a step in your CI/CD platform to make a REST API call.

Tip

You can get a fully formatted version of this API call—including the bearer token—by opening the Scenario in the Gremlin web app and clicking "Gremlin API Examples" at the bottom of the page.


HTTP Method	`POST`
Endpoint URL	https://api.gremlin.com/v1/scenarios/[scenario_ID]/runs?teamId=[your_team_ID]
HTTP Headers	Content-Type: application/json Authorization: Bearer [Your bearer token]
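
As a rough illustration, the call described in the table above could be assembled in Python with only the standard library. This is a sketch, not Gremlin's official client: the function name build_run_request is hypothetical, the request is assumed to need no body, and the IDs and token are placeholders you would supply from your CI/CD platform's secret store.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.gremlin.com/v1"


def build_run_request(scenario_id: str, team_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a Scenario run.

    Hypothetical helper: the URL shape and headers mirror the table above;
    sending no request body is an assumption.
    """
    url = (
        f"{API_BASE}/scenarios/{urllib.parse.quote(scenario_id, safe='')}"
        f"/runs?teamId={urllib.parse.quote(team_id, safe='')}"
    )
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# In a pipeline step, the request would be sent with something like:
#   with urllib.request.urlopen(build_run_request(sid, tid, tok), timeout=30) as resp:
#       print(resp.status)
```

Keeping request construction separate from sending makes the step easy to check in isolation before it runs against a real environment.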

We can then check the results using the getRuns API endpoint. This returns all of the runs for the Scenario as a JSON array, including the most recent (you can specify a start date or end date to filter for a specific run). Each run in the array contains the property stage_info.stage, which tells us the current Scenario stage (Pending, Successful, Failed, etc.). If this property has the value Running, the Scenario is still in progress. Successful means the Scenario finished without any problems, and therefore the test passed. Any other value (such as Halted) could indicate something went wrong. You can get additional metadata about the outcome from the run's stage_info.stageMetadata.haltReason property.
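
The stage logic above might be turned into a pass/fail decision along these lines. This is a minimal sketch that takes one run object (already parsed from the JSON array) and reads only the two properties named in the text. The function name evaluate_run is hypothetical, and treating Pending as "not finished yet" is an assumption rather than documented behavior.

```python
from typing import Optional, Tuple

# Stages that mean the Scenario has not finished.
# "Running" comes from the text; including "Pending" is an assumption.
IN_PROGRESS_STAGES = {"Pending", "Running"}


def evaluate_run(run: dict) -> Tuple[str, Optional[str]]:
    """Classify one Scenario run as 'running', 'passed', or 'failed'.

    Returns (verdict, halt_reason). halt_reason is only filled in for
    failed runs, and only when the response includes one.
    """
    stage_info = run.get("stage_info") or {}
    stage = stage_info.get("stage")

    if stage in IN_PROGRESS_STAGES:
        return "running", None
    if stage == "Successful":
        return "passed", None

    # Any other value (for example Halted) is treated as a failure.
    metadata = stage_info.get("stageMetadata") or {}
    return "failed", metadata.get("haltReason")
```

A pipeline step could fail the build whenever the verdict is "failed" and print the halt reason for whoever reviews the run.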

Benefits of running a test suite

Running tests as part of your CI/CD process means you get immediate feedback on reliability. Since tests run on each new build, you're actively ensuring that all changes are free of reliability risks. This fits reliability testing nicely alongside your integration tests, load tests, and other tests.

Challenges to running a test suite

The main challenge in running a Scenario during CI/CD is time. Some experiments can take a long time to run, and if the goal is to prevent a build from releasing to production until it can be verified, this could significantly slow down your deployment pipeline. You could reduce this time by overlapping experiments with other tests, but then you risk false positives caused by overlapping test cases. For example, if you run a CPU test during a load test, how would that impact the system's CPU? Both tests could fail, resulting in a declined build.

Also, determining whether an experiment passed or failed requires you to either check in periodically using the Gremlin API, or use webhooks to have Gremlin notify your CI/CD tool when the experiment finishes. Both cases add complexity and require additional work for teams to implement and test, and may vary depending on the CI/CD platform.
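
For the "check in periodically" option, one possible shape is a small polling loop with a timeout. The sketch below is generic Python, not part of Gremlin: wait_for_stage is a hypothetical helper, and get_stage stands in for whatever function your pipeline uses to read the latest stage from the API. The default interval and timeout are arbitrary placeholders.

```python
import time
from typing import Callable


def wait_for_stage(
    get_stage: Callable[[], str],
    interval_s: float = 30.0,
    timeout_s: float = 1800.0,
) -> str:
    """Poll get_stage() until it returns a finished stage, then return it.

    Raises TimeoutError if the stage is still Pending or Running once
    timeout_s seconds have elapsed. Treating Pending as unfinished is
    an assumption.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        stage = get_stage()
        if stage not in ("Pending", "Running"):
            return stage
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Scenario still {stage} after {timeout_s} seconds")
        time.sleep(interval_s)
```

The timeout matters as much as the loop itself: without it, a stuck Scenario would hold the pipeline open indefinitely.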

Method 2: Using reliability scores as a gating function

The second method uses reliability scores. A reliability score is a value between 0 and 100 that represents how reliable a service is. Gremlin calculates and assigns a reliability score to each of your services based on the results of running a suite of experiments called reliability tests.

Note

To learn more about reliability scores and how they're calculated, see our blog: How Gremlin's reliability score works.

Because reliability scores offer a clear indicator of reliability, we can use them as a gating function. For example, if a service's reliability score is under 80, that means the service is failing a lot of tests and has a greater chance of failing in production. Whoever owns the service should review the failures and fix the issues, or manually allow the change to go through.

To use reliability scores, we first need to add each of our services to Gremlin, then run the suite of reliability tests. Each test contributes to the service's reliability score, which we can then access using the getServiceDetails API:


HTTP Method	`POST`
Endpoint URL	https://api.gremlin.com/v1/reliability-management/[service-id]
HTTP Headers	Content-Type: application/json Authorization: Bearer [Your bearer token]

The score property in the response contains the numeric score:

```json
{
  "serviceId": "e7a374f5-ddfc-4314-a374-f5ddfc631469",
  "serviceName": "frontend",
  "score": 84
}
```
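
A gating step built on this response could look roughly like the following. It is a sketch that assumes the response body has the shape shown above. The function name passes_gate and the default threshold of 80 are illustrative choices, not Gremlin defaults.

```python
import json


def passes_gate(response_body: str, minimum_score: int = 80) -> bool:
    """Return True when the service's reliability score meets the minimum.

    Expects the JSON body shown above, with a numeric "score" property.
    """
    details = json.loads(response_body)
    return details["score"] >= minimum_score


# The sample response from this section, used as input.
sample = '{"serviceId": "e7a374f5-ddfc-4314-a374-f5ddfc631469", "serviceName": "frontend", "score": 84}'

if not passes_gate(sample):
    # A non-zero exit code is what most CI/CD tools treat as a failed step.
    raise SystemExit("Reliability score below the required minimum")
```

Setting the threshold per service, rather than hard-coding one value, lets teams raise the bar gradually as their scores improve.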

Benefits of using a reliability score

A reliability score is an objective representation of reliability. It acts as a single standard for reliability across all services, so teams can easily set minimum score requirements. It can also be generated in advance, which means you don't need to run a full suite of tests on each push to production. Instead, you can schedule the tests to run weekly or on a custom schedule via the REST API.

Gremlin also tracks changes to your reliability score—and the results of each test run—so you know exactly which test failed, when it failed, and why it failed. Conversely, you can identify teams that consistently achieve reliability targets. For example, if a team maintains high reliability scores, you could pass their code through without waiting for tests to finish running, allowing them to push to production faster. This creates an additional incentive for teams to improve their scores.

Challenges of using a reliability score

This method removes the need to run reliability tests on each new build. However, because the score is calculated in advance, it will only reflect the service's reliability at the time of testing, not at the time of deployment. This makes the score a lagging indicator, and any new reliability issues introduced with the change won't get caught until the next suite of reliability tests is run.

You could still run a set of reliability tests for each build, as in Method 1. Gremlin's service baseline API lets you run the full suite of reliability tests on a service using a single API call. However, this means adding logic to your CI/CD pipeline to wait for the tests to finish running and then check for the updated score.

Conclusion

These are just two approaches to integrating reliability testing into your CI/CD pipeline. If you'd like to see how this works in a real-world environment, we spoke with a resiliency expert from UKG, a cloud-native HCM solution leader serving millions of active users. Learn how their resiliency team integrated Gremlin in their CI/CD pipeline in our on-demand webinar: Automate reliability in your CI/CD pipeline.

December 19, 2023 - 5 min read
