Gremlin is a safe, simple, and secure way to run Chaos Engineering experiments on your systems to improve their reliability and tune your monitoring. Grafana is an open source, highly flexible analytics and visualization platform that can ingest data from many data sources and provide powerful dashboards, reports, and alerts. Used together, these two tools help ensure that your dashboards and alerts are actionable and useful for reducing mean time to detection (MTTD).
The list of options for a data source to hold Gremlin events is long, but this tutorial provides an example of writing directly to the Grafana API in Graphite format. Writing directly to Grafana’s database can impact performance if you are running hundreds of attacks per day across many applications, but for most use cases, this won’t have an impact.
If you're curious to see a demo of running chaos experiments using Gremlin and Grafana, watch our on-demand webinar.
You first need to generate an API key that will allow you to securely send data to Grafana. In your Grafana UI, head over to “Configuration” (gear icon) -> “API Keys.”
Then, click “Add API Key”, add a “Key name” such as Gremlin Key, set the “Role” to Editor, and click “Add”.
Keep your API key handy; you'll need it in Step 3.
Next, log into Gremlin, where you can add the Grafana webhook to your account. If you don’t have an account, you can request a free trial.
Once logged in, you need to add a webhook to send attack data over to Grafana. In Gremlin, go to “Settings” (people icon next to the “Halt All Attacks” button) -> “Team Settings” -> “Webhooks”. Click “New Webhook”.
Enter the Name and Description of your webhook. In the “Request URL” field add your Grafana endpoint.
Grafana Cloud:
https://{grafana_URL}/api/annotations/graphite
Personal Grafana deployment:
https://{IP_address or URL}:{port}/api/annotations/graphite
Replace {grafana_URL} with the URL of your Grafana Cloud instance, or replace {IP_address or URL} and {port} with the address and port of your own Grafana deployment (port 3000 is the default, but your implementation may be different).
Add a header with Authorization: Bearer {Grafana_API_Key}, using the API key you generated in Step 1. In the “Payload” section, update the format to include a “what” key-value pair (required) and tags. Leave off the “when” key-value pair so that Grafana adds its own timestamp. Below is the JSON template:
{
  "what": "Gremlin Attack",
  "tags": [
    "${TEAM_ID}",
    "${ATTACK_ID}",
    "${STATUS}",
    "${STAGE}",
    "${SOURCE}",
    "${ATTACK_TYPE}",
    "GremlinAttack"
  ],
  "data": "Gremlin attack ${STATUS}"
}
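To make the webhook configuration above concrete, here is a minimal Python sketch of the request it describes: a POST to the /api/annotations/graphite endpoint with a Bearer token and a payload containing “what”, “tags”, and “data” but no “when”. The function name build_annotation_request, the placeholder URL and key, and the Content-Type header are assumptions added for illustration; Gremlin itself fills in the ${...} template variables when it fires the webhook.

```python
import json
import urllib.request


def build_annotation_request(grafana_base_url, api_key, status, tags):
    """Build (but do not send) a POST to Grafana's Graphite-format annotation endpoint."""
    payload = {
        "what": "Gremlin Attack",                # required event title
        "tags": list(tags) + ["GremlinAttack"],  # constant tag used later for filtering
        "data": f"Gremlin attack {status}",
        # "when" is left out on purpose so Grafana applies its own timestamp
    }
    return urllib.request.Request(
        url=f"{grafana_base_url.rstrip('/')}/api/annotations/graphite",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumption: not listed in the tutorial
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values: substitute your own Grafana URL and API key.
    req = build_annotation_request(
        "http://localhost:3000", "YOUR_GRAFANA_API_KEY", "Running", ["Running"]
    )
    print(req.full_url)
    # To actually send the annotation: urllib.request.urlopen(req)
```

Sending a request like this by hand can be a quick way to confirm that the API key and endpoint work before relying on the Gremlin webhook.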
Now you need a way to visualize the Chaos Experiments. This tutorial uses CloudWatch Metrics, but you can use any metrics tool and data source you want, such as Prometheus or InfluxDB. If you don’t have CloudWatch set up as a data source, check out the Grafana docs to add CloudWatch.
In your Grafana instance, go to “Create” (plus sign) -> “Dashboard” and click “Add new panel”. Select CloudWatch as the source, choose the “Region” your EC2 instance is in, and set “Namespace” to AWS/EC2, “Metric Name” to CPUUtilization, “Stats” to Average, and “Dimensions” to InstanceId = {your_EC2_instanceId}. On the right-hand side, change the “Panel title” to CPU Utilization, and under “Axis” -> “Left Y” change “Unit” -> “Misc” -> percent (0-100). Click “Apply.”
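If you want to check the same metric outside Grafana, the panel settings above map onto the parameters of CloudWatch's GetMetricStatistics call. The sketch below only builds those parameters; the function name cpu_utilization_query, the 30-minute window, and the 300-second period are illustrative assumptions, and the commented usage assumes boto3 is installed and AWS credentials are configured.

```python
from datetime import datetime, timedelta, timezone


def cpu_utilization_query(instance_id, minutes=30, period_seconds=300, now=None):
    """Return keyword arguments for GetMetricStatistics that mirror the panel settings."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistics": ["Average"],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period_seconds,  # seconds per datapoint
    }


# Hypothetical usage (region and instance ID are placeholders):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
#   response = cloudwatch.get_metric_statistics(
#       **cpu_utilization_query("i-0123456789abcdef0")
#   )
#   for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
#       print(point["Timestamp"], point["Average"])
```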
Annotations mark events, like attacks starting and stopping, as vertical bars on your charts. You’ll create two annotations: one for when attacks begin running and another for when attacks finish. You can filter on any of the tags you included in the webhook, such as the team ID or source, but this tutorial shows all GremlinAttack annotations.
In your dashboard, click “Dashboard settings” (gear icon) and select “Annotations.”
Click “New.” Fill in the “Name” with Gremlin Attack Running, leave the “Data source” as Grafana, change the “Tags” to GremlinAttack and Running, then click “Add.”
Then click “New” again, fill in the “Name” with Gremlin Attack Finished, set the “Tags” to GremlinAttack and Finished, and click “Add.”
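The same tag filter can be checked against Grafana's HTTP API, which is handy for confirming that webhook events are arriving before you look for them on a chart. This sketch builds the query URL for the annotations endpoint with repeated “tags” parameters; the helper name annotations_query_url and the base URL are assumptions, and the request would need the same Authorization: Bearer header as the webhook.

```python
from urllib.parse import urlencode


def annotations_query_url(grafana_base_url, tags, limit=100):
    """Build a GET URL for Grafana annotations carrying all of the given tags."""
    # Each tag is sent as its own "tags" parameter.
    params = [("tags", tag) for tag in tags] + [("limit", limit)]
    return f"{grafana_base_url.rstrip('/')}/api/annotations?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder base URL: substitute your own Grafana instance.
    print(annotations_query_url("http://localhost:3000", ["GremlinAttack", "Running"]))
```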
Finally, you need to test out the new integration. CPU attacks are a great first attack to run to ensure that your monitoring tools are picking up the increased load and to check your autoscaling policies. Go to the Gremlin app and click “Create Attack”. Select a target host that has metrics being sent to Grafana. Click “Choose a Gremlin” and go to “Resources” -> “CPU.” Set the Length to 300 seconds, CPU Capacity to 80%, and select All Cores, then click “Unleash Gremlin.”
The chart in Grafana will show the increase and decrease in CPU load over time, along with the annotations for the attack running and finishing.
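What the chart shows visually can also be checked numerically. The following sketch, under stated assumptions, takes annotations shaped like the ones Grafana returns (a “time” in epoch milliseconds and a list of “tags”) plus a list of (timestamp, CPU %) samples, and compares average CPU inside and outside the attack window. The function names and all sample values are made up for illustration; they are not output from a real attack.

```python
from statistics import mean


def attack_window(annotations):
    """Return (start_ms, end_ms) from annotations tagged Running and Finished."""
    start = min(a["time"] for a in annotations if "Running" in a["tags"])
    end = max(a["time"] for a in annotations if "Finished" in a["tags"])
    return start, end


def cpu_inside_and_outside(datapoints, start_ms, end_ms):
    """Average CPU (%) for samples inside vs. outside the attack window."""
    inside = [v for t, v in datapoints if start_ms <= t <= end_ms]
    outside = [v for t, v in datapoints if not (start_ms <= t <= end_ms)]
    return mean(inside), mean(outside)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    annotations = [
        {"time": 1_000_000, "tags": ["GremlinAttack", "Running"]},
        {"time": 1_300_000, "tags": ["GremlinAttack", "Finished"]},
    ]
    datapoints = [(940_000, 5.0), (1_060_000, 80.0), (1_240_000, 82.0), (1_360_000, 6.0)]
    start, end = attack_window(annotations)
    print(cpu_inside_and_outside(datapoints, start, end))
```

A large gap between the two averages suggests the attack registered in your metrics; a small gap may mean the target host is not the one being graphed or the metric period is too coarse.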
This was just one example of using a Gremlin attack to correlate Chaos Engineering experiments with their impact inside Grafana. You can expand from here to other attacks and see how resilient your system is according to the charts that you follow in your monitoring tool. A great place to start is recreating a previous incident and checking how your updated systems handle the attack, whether your team can track the impact inside Grafana, and whether their recovery time improves compared to the previous incident.