Chaos Engineering is a disciplined approach to finding failures before they become outages. You literally "break things on purpose" to learn how to build more resilient systems.
If you're curious to try Chaos Engineering for yourself, but want to practice in a demo environment first, this tutorial is for you.
In this tutorial, we'll walk through three chaos experiments to test the reliability of our demo app. We'll do this using Gremlin, a simple, safe, and secure SaaS platform for performing Chaos Engineering experiments.
After completing this tutorial, you'll have hands-on experience running chaos experiments in a demo environment and be able to run them with confidence on your own infrastructure.
To successfully complete this tutorial, you’ll need:
Chaos Engineering is a disciplined approach to identifying failures before they become outages. By testing how a system responds under stress, we can proactively fix vulnerabilities and make our systems more reliable.
So how does this work in practice? The best way to start is by creating a thoughtful, planned chaos experiment to validate expected behavior. First, ask yourself, "What could go wrong?" (For example, what happens when one of the third-party services we rely on goes down?) Then, use the scientific method to create a hypothesis, run a controlled experiment simulating the failure, and measure the impact.
After running your experiment, you'll have one of two outcomes. Either you’ve verified that your system is resilient to the failure you introduced, or you’ve found a problem you need to fix. Both of these are good outcomes. In the first case, you’ve increased your confidence in the system and its behavior. In the second, you can fix the problem before it causes an outage.
Will our demo app be resilient in the face of failure or will we experience an outage? Let’s find out with Chaos Engineering.
We’ll use Chaos Engineering to test how our demo app handles the following failure scenarios:
When we inject these failures into the demo app, we'll be able to see if the system maintains its functionality or if its services degrade under stress.
We'll use this open-source microservices application as our demo environment. The demo app has eCommerce functionality. We'll refer to it as the Sock Shop going forward.
Follow instructions here to configure AWS CLI, kubectl and eksctl.
In your respective region, launch an EKS cluster. For example:
eksctl create cluster --name sockshop-eks-cluster --version 1.15 --region us-west-2 --nodegroup-name standard-workers --node-type t3.medium --nodes 3 --nodes-min 1 --nodes-max 4
Using kubectl, deploy Sock Shop.
Clone the repo below and go into the deploy/kubernetes folder.
git clone https://github.com/microservices-demo/microservices-demo
cd microservices-demo/deploy/kubernetes
kubectl create namespace sock-shop
kubectl apply -f complete-demo.yaml
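Before moving on, it can help to confirm that the Sock Shop pods actually came up. The following is a minimal sketch, not part of the original tutorial: the function name `wait_for_sock_shop`, the `KUBECTL` override variable, and the 600-second timeout are all illustrative assumptions.

```shell
# Block until every pod in the sock-shop namespace reports Ready,
# or give up after the timeout. KUBECTL is a hypothetical override
# that lets you point at a different kubectl binary.
wait_for_sock_shop() {
  "${KUBECTL:-kubectl}" wait --namespace sock-shop \
    --for=condition=Ready pods --all --timeout=600s
}
```

Calling `wait_for_sock_shop` after applying the manifest returns once all pods are ready, and exits non-zero if the timeout is reached first.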
If you haven't already, request a free trial of Gremlin.
Activate your account using the link sent to your email.
To install the Gremlin Kubernetes agent, you will need your Gremlin Team ID and Secret Key. If you already know what those are, you can skip ahead to installing the agent. If you don’t know what your Team ID and Secret Key are, you can get them from the Gremlin web app.
In the Gremlin web app, open Configuration. Make a note of your Team ID.
If you need a new Secret Key, click the Reset button. You’ll get a popup reminding you that any running agents using the current Secret Key will need to be configured with the new key. Hit Continue. Next you’ll see a popup screen that will show you the new Secret Key. Make a note of it.
(Skip this step if you are using secret-based authentication.)
Download the Gremlin certificates (you need at least team manager access).
Unzip certificates.zip.
Rename the files in the certificates folder: Team Name.pub_cert.pem becomes gremlin.cert, and Team Name.priv_key.pem becomes gremlin.key.
Create a gremlin namespace: kubectl create namespace gremlin
Create a Kubernetes secret by running the following:
kubectl -n gremlin create secret generic gremlin-team-cert --from-file=/path/to/gremlin.cert --from-file=/path/to/gremlin.key
Download the Gremlin configuration manifest by running the following:
wget https://k8s.gremlin.com/resources/gremlin-conf.yaml
Open the file and update the following:
Replace the following line with your team ID: "YOUR TEAM ID GOES HERE"
Replace the following line with your team secret: "YOUR TEAM SECRET GOES HERE"
(If you are using certificate-based authentication, remove this line.)
Replace the following line with a string that you will use to identify your cluster: "YOUR UNIQUE CLUSTER NAME GOES HERE"
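If you would rather script these placeholder edits than make them by hand, something like the following could work. It is a sketch under stated assumptions: the function name `fill_gremlin_conf` is hypothetical, it relies only on the three placeholder strings quoted above, and it assumes your values contain no `|` or `&` characters, which `sed` would treat specially.

```shell
# Print a copy of the manifest with the three placeholders filled in.
# $1 = path to gremlin-conf.yaml, $2 = Team ID,
# $3 = team secret, $4 = cluster name
fill_gremlin_conf() {
  sed \
    -e "s|YOUR TEAM ID GOES HERE|$2|" \
    -e "s|YOUR TEAM SECRET GOES HERE|$3|" \
    -e "s|YOUR UNIQUE CLUSTER NAME GOES HERE|$4|" \
    "$1"
}
```

For example, `fill_gremlin_conf gremlin-conf.yaml "$TEAM_ID" "$TEAM_SECRET" sockshop-eks-cluster > gremlin-conf.filled.yaml` writes the result to a new file and leaves the downloaded manifest untouched. With certificate-based authentication you would still remove the secret line afterwards.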
Apply the manifest with this command:
kubectl apply -f /path/to/gremlin-conf.yaml
If you are using certificate-based authentication:
Download and apply the Gremlin agent manifest for your Kubernetes cluster by running the following:
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-client.yaml
If you are using secret-based authentication:
Download and apply the Gremlin agent manifest for your Kubernetes cluster by running the following:
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-client-secret.yaml
Most Kubernetes deployments configure master nodes with the node-role.kubernetes.io/master:NoSchedule taint. You can run the following command to see if any of your nodes have this taint:
kubectl get no -o=custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME      TAINTS
kube-01   [map[effect:NoSchedule key:node-role.kubernetes.io/master]]
kube-02   <none>
If you wish to install Gremlin on a Kubernetes master that has been tainted, add a tolerations section to the PodSpec of the Gremlin Agent Manifest.
tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
You will need to reapply the Gremlin agent manifest after making this change.
If you are using certificate-based authentication:
Download and apply the k8s agent manifest by running:
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-chao.yaml
If you are using secret-based authentication:
Download and apply the k8s agent manifest by running:
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-chao-secret.yaml
For more information on using Container Insights, see this documentation.
Log in to the bastion host, and run:
kubectl get svc -o wide -n sock-shop | grep LoadBalancer
Copy the load balancer’s DNS name.
Paste this DNS name into a browser to access the Sock Shop front end.
Navigate around to get a feel for all the functions of the shop. Things to try out:
For this first experiment, we will check to see if this cluster has autoscaling policies dialed in correctly.
Go to Recommended Scenarios in Gremlin, then click View Details for the “Validate Auto Scaling” scenario.
Click Add targets and run.
Click Run Scenario.
In the AWS Console, look into Container Insights:
Go to the AWS Console and select EC2 from Services.
On the left navigation bar, select Auto Scaling Groups.
Each cluster gets its own Auto Scaling group. Select the one you need, and then at the lower navigation, select Scaling Policies.
Select Add Policy, then Create a simple scaling policy.
Give the policy a name (we will call it “Cluster-ScaleUp”) and select Create New Alarm. Create the alarm to go off when CPU utilization is greater than or equal to 13% for at least 1 minute, and name it “Cluster-ScaleUp.”
Press Create Alarm. Now you will be taken back to finish editing the policy.
Edit the values to add “1” instance and then wait 120 seconds before the next activity. Press Create when finished.
Follow the same steps as above to create a second policy, this time called “Cluster-ScaleDown,” with a new alarm. This alarm will go off when CPU utilization is less than or equal to (<=) 13% within 15 minutes.
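The scale-up steps could also be approximated from the AWS CLI instead of the console. The sketch below is an assumption-laden illustration, not the tutorial's own method: the function names and the `AWS_CLI` override variable are hypothetical, and the `Average` statistic and `AWS/EC2` CPUUtilization metric are assumed to match what the console wizard selects.

```shell
# Create a simple scaling policy that adds one instance and then
# waits 120 seconds before the next scaling activity.
# $1 = Auto Scaling group name
create_scale_up_policy() {
  "${AWS_CLI:-aws}" autoscaling put-scaling-policy \
    --auto-scaling-group-name "$1" \
    --policy-name Cluster-ScaleUp \
    --policy-type SimpleScaling \
    --adjustment-type ChangeInCapacity \
    --scaling-adjustment 1 \
    --cooldown 120
}

# Create the alarm that triggers the policy when average CPU
# utilization is >= 13% for one 60-second period.
# $1 = Auto Scaling group name, $2 = ARN of the scaling policy
create_scale_up_alarm() {
  "${AWS_CLI:-aws}" cloudwatch put-metric-alarm \
    --alarm-name Cluster-ScaleUp \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --statistic Average \
    --dimensions "Name=AutoScalingGroupName,Value=$1" \
    --period 60 \
    --evaluation-periods 1 \
    --threshold 13 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions "$2"
}
```

The first command returns the policy ARN in its output, which you would pass as the second argument to `create_scale_up_alarm`.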
For this experiment, we will test what happens when a service dependency becomes unavailable while your primary service attempts to make requests to it.
In Gremlin, create a new attack.
Select the Containers tab.
In the Search bar, look up carts-db.
Scroll down and click Choose a Gremlin.
Select Network -> Blackhole Gremlin.
Change the length of the attack to 300 seconds.
Click Unleash Gremlin.
As the attack is running, try the following:
Is there any customer impact?
Are systems recovering gracefully?
Is there any way to mitigate this?
Click Halt All Attacks to stop this attack.
Did systems recover?
What did we learn?
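To answer these questions with data rather than by clicking around, you could poll the front end while the attack runs and log what comes back. This is a minimal sketch: the function names `classify_status` and `probe_frontend` are hypothetical, and the health labels are an arbitrary classification chosen for illustration.

```shell
# Map an HTTP status code (as printed by curl) to a coarse health label.
classify_status() {
  case "$1" in
    2??|3??) echo "healthy" ;;
    000)     echo "unreachable" ;;  # curl prints 000 when it received no response
    *)       echo "degraded" ;;
  esac
}

# Probe a URL repeatedly and print a timestamped status line each time.
# $1 = URL, $2 = number of probes, $3 = seconds to sleep between probes
probe_frontend() {
  i=0
  while [ "$i" -lt "$2" ]; do
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")"
    echo "$(date +%H:%M:%S) $code $(classify_status "$code")"
    i=$((i + 1))
    sleep "$3"
  done
}
```

Running `probe_frontend` against the load balancer's DNS name with 60 probes and a 5-second interval covers roughly the 300-second attack window. Comparing the log before, during, and after the attack shows whether customers were affected and whether the system recovered.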
For this experiment, we will test what happens if a container were to fail. Sometimes, especially in a containerized environment, your orchestration can automatically recover, but it takes time to detect and fix the issue, resulting in a potential partial outage.
In Gremlin, create a new attack.
Select the Containers tab.
In the Search bar, look up carts-db.
Scroll down and click Choose a Gremlin.
Select State -> Shutdown Gremlin.
Switch off Reboot.
Click Unleash Gremlin.
As the attack is running, try the following:
Is there any customer impact?
Are systems recovering gracefully?
How might you mitigate this?
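One way to see how long the orchestrator takes to recover is to watch the affected pods while the attack runs. The sketch below is illustrative only: the function name `watch_sock_shop_pods` and the `KUBECTL` override variable are hypothetical, and the label selector is left as an argument because the exact labels depend on the manifest you deployed.

```shell
# Stream pod status changes in the sock-shop namespace for pods
# matching a label selector. Press Ctrl-C to stop watching.
# $1 = label selector
watch_sock_shop_pods() {
  "${KUBECTL:-kubectl}" get pods --namespace sock-shop \
    --selector "$1" --watch
}
```

For example, `watch_sock_shop_pods name=carts-db` would follow the cart database, assuming its pods carry a `name=carts-db` label. Changes in the STATUS and RESTARTS columns show when the failure was detected and when the container came back.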
Now that you’ve had a chance to run some pre-planned experiments, you can create your own experiment from start to finish. There is no wrong way to create an experiment, but it’s important to go through the full thought process.
How to create a chaos experiment:
Was this failure detected?
Did this failure have customer impact?
Was the impact of this failure expected? Did it match your hypothesis?
Can this failure be handled or mitigated?
While running experiments on a demo app is admittedly pretty fun, it doesn't improve the reliability of your systems. Start running experiments on your own infrastructure to test and validate your systems' response to failure and improve overall reliability.
Check out our documentation to install Gremlin anywhere, including bare-metal, on-prem, VMs, containers, serverless and Kubernetes environments.
If you'd like to try all Gremlin Attacks, including Packet Loss and Memory, request a demo and we'll set you up with a free trial of Gremlin.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.