Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news.
Chaos Engineering lets you compare what you think will happen to what actually happens in your systems. You literally “break things on purpose” to learn how to build more resilient systems.
If you would like to take a deep dive into learning more about the history and principles, check out this link.
Much like a vaccine, you inject something harmful in order to build immunity.
Principles of Chaos Engineering:
Terms to know:
Blast Radius: The number of hosts and/or containers that are targeted in an experiment.
Magnitude: The intensity of the attack you’re running.
Abort Conditions: The conditions that would cause you to halt the experiment.
Scientific Method:
Today’s demo environment: https://github.com/microservices-demo
For more information on using Container Insights, see this documentation: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-view-metrics.html
Log in to Gremlin by checking your inbox and completing your registration.
For more information on EC2 Instance Connect and other options for connecting to your bastion host, see this documentation: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-methods.html
Log in to the bastion host and run:
kubectl get svc -o wide -n sock-shop | grep LoadBalancer
Copy the load balancer’s DNS name
Paste this DNS name into a browser to access the sock shop front-end.
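If you prefer to pull the DNS name straight from the API instead of copying it out of the table, a jsonpath query works too. This is a sketch that assumes the front-end service is named front-end:

kubectl get svc front-end -n sock-shop -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'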
Navigate around to get a feel for all the functions of the shop. Things to try out:
In your bastion host, clone the Gremlin daemonset with git clone https://github.com/AnaMMedina21/gremlin.git
Then edit the daemonset with vi gremlin/daemonset.yaml. In this file we need to edit two fields: Team ID and Team Secret.
Copy your Team ID and Team Secret from Gremlin. In daemonset.yaml, find <YOUR TEAM ID> and replace it with the Team ID value that you copied from Gremlin. Then find <YOUR SECRET KEY> and replace it with the Team Secret you previously copied from Gremlin. Save your file and exit.
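Alternatively, if you would rather not open an editor at all, a quick substitution from the bastion host does the same thing. This is just a sketch and assumes the placeholders appear in the file exactly as <YOUR TEAM ID> and <YOUR SECRET KEY>:

# Swap in your own Team ID and Team Secret before running
sed -i 's/<YOUR TEAM ID>/your-team-id/' gremlin/daemonset.yaml
sed -i 's/<YOUR SECRET KEY>/your-team-secret/' gremlin/daemonset.yaml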
kubectl apply -f gremlin/daemonset.yaml
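Once the daemonset is applied, you can confirm that a Gremlin agent pod is running on each node. The grep below is a rough check that assumes the pods carry "gremlin" in their names:

kubectl get pods --all-namespaces | grep gremlin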
For this first experiment, we will check to see if this cluster has autoscaling policies dialed in correctly.
Add targets and run
Run Scenario
Go to the AWS Console and select EC2 from Services.
On the left navigation bar, select Auto Scaling Groups.
Each cluster gets its own Auto Scaling Group. Select the one you need, and then, in the lower navigation, select “Scaling Policies”.
Select “Add Policy”, then “Create a simple scaling policy”
We will be creating two of these: one to scale up and one to scale back down to the usual 3.
Give the policy a name (we will call it “Cluster-ScaleUp”) and select “Create New Alarm”. Configure the alarm to go off when CPU Utilization is greater than or equal to 13% for at least 1 minute, and name the alarm Cluster-ScaleUp.
Press “Create Alarm”
Now you will be taken back to finish editing the policy
You want to edit the values to add “1” instance and then wait 120 seconds before the next activity. Press Create when finished.
Follow the same steps as above, but name this policy “Cluster-ScaleDown” and create a new alarm. This alarm should trigger when CPUUtilization is less than or equal to (<=) 13% for 15 minutes.
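If you prefer the command line, the console steps above can be approximated with the AWS CLI. The Auto Scaling group name below is an assumption; swap in your own, and repeat the same pattern for “Cluster-ScaleDown”:

# Create the scale-up policy (the command returns a PolicyARN)
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name <your-asg-name> \
  --policy-name Cluster-ScaleUp \
  --adjustment-type ChangeInCapacity \
  --scaling-adjustment 1 \
  --cooldown 120

# Create the alarm that triggers the policy at >= 13% CPU for 1 minute
aws cloudwatch put-metric-alarm \
  --alarm-name Cluster-ScaleUp \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --statistic Average \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --threshold 13 \
  --period 60 \
  --evaluation-periods 1 \
  --dimensions Name=AutoScalingGroupName,Value=<your-asg-name> \
  --alarm-actions <PolicyARN-from-the-first-command>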
Re-run Scenario #1, and customize your scenario if you would like to see more auto scaling events kick off.
For this experiment, we will test and discover what happens when there is network degradation while your primary service makes requests to a downstream dependency.
Unleash Gremlin
As the attack is running, try the following:
Is there any customer impact?
Are systems recovering gracefully?
Is there any way to mitigate this?
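One simple way to watch for customer impact from your own terminal is to time requests to the front end while the attack runs. This sketch assumes the load balancer DNS name you copied earlier:

# Print the total response time of the front end once per second
while true; do
  curl -s -o /dev/null -w "%{time_total}s\n" http://<load-balancer-dns>/
  sleep 1
done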
In https://app.gremlin.com/attacks/infrastructure, click Halt All Attacks to stop this attack.
Did systems recover?
What did we learn?
For this experiment, we will test what happens if a container were to fail. Sometimes, especially in a containerized environment, your orchestration can automatically recover, but it takes time to detect and fix the issue, resulting in a potential partial outage.
Unleash Gremlin
As the attack is running, try the following:
Is there any customer impact?
Are systems recovering gracefully?
How might you mitigate this?
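A handy way to watch the orchestrator recover is to keep an eye on pod status and restart counts in the sock-shop namespace while the attack runs:

# -w streams updates as pods are killed and rescheduled
kubectl get pods -n sock-shop -w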
Now that you’ve had a chance to run some pre-planned experiments, you can create your own experiment from start to finish. There is no wrong way to create an experiment, but it’s important to go through the full thought process.
Use the Blank Chaos Experiment card above to start forming a scenario. Then create this attack in Gremlin!
Was this failure detected?
Did this failure have customer impact?
Was the impact of this failure expected, and did it match your hypothesis?
Can this failure be handled or mitigated?
Join over 4,000 engineers in the Chaos Engineering Community Slack: Join Us
Where should I go to get support?
To get support, head to https://gremlin.com/slack to join the community, and join the #aws-gamedaylounge channel.
Can I replay this workshop on my own?
If you want to spin up this demo environment, the CloudFormation template to do so is located here.
By default, this template must be deployed in us-east-1.
Once the deployment completes, you can replay the workshop on this site.
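If you prefer to launch the stack from the command line, something along these lines should work once you have downloaded the template. The stack name, template file name, and IAM capability flags here are assumptions; adjust them to match the template you downloaded:

# Deploy the workshop stack in us-east-1
aws cloudformation create-stack \
  --stack-name chaos-engineering-workshop \
  --region us-east-1 \
  --template-body file://<downloaded-template>.yaml \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM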
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
Get started