Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news.
Chaos Engineering lets you compare what you think will happen to what actually happens in your systems. You literally “break things on purpose” to learn how to build more resilient systems.
If you would like to take a deep dive into learning more about the history and principles, check out this link.
Much like a vaccine, you inject something harmful in order to build immunity.
Principles of Chaos Engineering:
Terms to know:
Blast Radius: The number of hosts and/or containers that are targeted in an experiment.
Magnitude: The intensity of the attack you’re running.
Abort Conditions: The conditions that would cause you to halt the experiment.
Scientific Method:
Today’s demo environment: https://github.com/microservices-demo
For more information on using Container Insights, see this documentation: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-view-metrics.html
Log in to Gremlin by checking your inbox and completing your registration.
For more information on EC2 Instance Connect and other options for connecting to your bastion host, see this documentation: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-methods.html
Log in to the bastion host and run:
kubectl get svc -o wide -n sock-shop | grep LoadBalancer
Copy the load balancer’s DNS name
Paste this DNS name into a browser to access the sock shop front-end.
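If you prefer to pull the DNS name straight from the API instead of copying it out of the table, a jsonpath query works too. This is a sketch that assumes the front-end service is named front-end:

kubectl get svc front-end -n sock-shop -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'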
Navigate around to get a feel for all the functions of the shop. Things to try out:
In your bastion host, clone the Gremlin daemonset with git clone https://github.com/AnaMMedina21/gremlin.git
Then edit the daemonset with vi gremlin/daemonset.yaml. In this file we need to edit two fields: Team ID and Team Secret.
Copy your Team ID and Team Secret from Gremlin. In daemonset.yaml, find <YOUR TEAM ID> and replace it with the Team ID value that you copied from Gremlin. Then find <YOUR SECRET KEY> and replace it with the Team Secret you previously copied from Gremlin. Save your file and exit.
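Alternatively, if you would rather not open an editor at all, a quick substitution from the bastion host does the same thing. This is just a sketch and assumes the placeholders appear in the file exactly as <YOUR TEAM ID> and <YOUR SECRET KEY>:

# Swap in your own Team ID and Team Secret before running
sed -i 's/<YOUR TEAM ID>/your-team-id/' gremlin/daemonset.yaml
sed -i 's/<YOUR SECRET KEY>/your-team-secret/' gremlin/daemonset.yaml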
kubectl apply -f gremlin/daemonset.yaml
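Once the daemonset is applied, you can confirm that a Gremlin agent pod is running on each node. The grep below is a rough check that assumes the pods carry "gremlin" in their names:

kubectl get pods --all-namespaces | grep gremlin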
For this first experiment, we will check to see if this cluster has autoscaling policies dialed in correctly.
Add targets and run
Run Scenario
Go to the AWS Console and select EC2 from Services.
On the left navigation bar, select Auto Scaling Groups.
Each cluster gets its own Auto Scaling Group. Select the one you need, and then, in the lower navigation, select “Scaling Policies”.
Select “Add Policy”, then “Create a simple scaling policy”
We will be creating two of these: one to scale up and one to scale back down to the usual 3.
Give the policy a name (we will call it “Cluster-ScaleUp”) and select “Create New Alarm”. Configure the alarm to go off when CPU Utilization is greater than or equal to 13% for at least 1 minute, and name the alarm Cluster-ScaleUp.
Press “Create Alarm”
Now you will be taken back to finish editing the policy
You want to edit the values to add “1” instance and then wait 120 seconds before the next activity. Press Create when finished.
Follow the same steps as above, but name this policy “Cluster-ScaleDown” and create a new alarm. This alarm should trigger when CPUUtilization is less than or equal to (<=) 13% for 15 minutes.
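If you prefer the command line, the console steps above can be approximated with the AWS CLI. The Auto Scaling group name below is an assumption; swap in your own, and repeat the same pattern for “Cluster-ScaleDown”:

# Create the scale-up policy (the command returns a PolicyARN)
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name <your-asg-name> \
  --policy-name Cluster-ScaleUp \
  --adjustment-type ChangeInCapacity \
  --scaling-adjustment 1 \
  --cooldown 120

# Create the alarm that triggers the policy at >= 13% CPU for 1 minute
aws cloudwatch put-metric-alarm \
  --alarm-name Cluster-ScaleUp \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --statistic Average \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --threshold 13 \
  --period 60 \
  --evaluation-periods 1 \
  --dimensions Name=AutoScalingGroupName,Value=<your-asg-name> \
  --alarm-actions <PolicyARN-from-the-first-command>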
Re-run Scenario #1, and customize your scenario if you would like to see more auto scaling events kick off.
For this experiment, we will test and discover what happens when there is network degradation while your primary service makes requests to a downstream dependency.
Unleash Gremlin
As the attack is running, try the following:
Is there any customer impact?
Are systems recovering gracefully?
Is there any way to mitigate this?
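One simple way to watch for customer impact from your own terminal is to time requests to the front end while the attack runs. This sketch assumes the load balancer DNS name you copied earlier:

# Print the total response time of the front end once per second
while true; do
  curl -s -o /dev/null -w "%{time_total}s\n" http://<load-balancer-dns>/
  sleep 1
done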
In https://app.gremlin.com/attacks/infrastructure, click Halt All Attacks to stop this attack.
Did systems recover?
What did we learn?
For this experiment, we will test what happens if a container were to fail. Sometimes, especially in a containerized environment, your orchestration can automatically recover, but it takes time to detect and fix the issue, resulting in a potential partial outage.
Unleash Gremlin
As the attack is running, try the following:
Is there any customer impact?
Are systems recovering gracefully?
How might you mitigate this?
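A handy way to watch the orchestrator recover is to keep an eye on pod status and restart counts in the sock-shop namespace while the attack runs:

# -w streams updates as pods are killed and rescheduled
kubectl get pods -n sock-shop -w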
Now that you’ve had a chance to run some pre-planned experiments, you can create your own experiment from start to finish. There is no wrong way to create an experiment, but it’s important to go through the full thought process.
Use the Blank Chaos Experiment card above to start forming a scenario. Then create this attack in Gremlin!
Was this failure detected?
Did this failure have customer impact?
Was the impact of this failure expected, and did it match your hypothesis?
Can this failure be handled or mitigated?
Join over 4,000 engineers in the Chaos Engineering Community Slack: Join Us
Where should I go to get support?
To get support, head to https://gremlin.com/slack to join the community, and join the #aws-gamedaylounge channel.
Can I replay this workshop on my own?
If you want to spin up this demo environment, the CloudFormation template to do so is located here.
By default, this template must be deployed in us-east-1.
Once the deployment completes, you can replay the workshop on this site.
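If you prefer to launch the stack from the command line, something along these lines should work once you have downloaded the template. The stack name, template file name, and IAM capability flags here are assumptions; adjust them to match the template you downloaded:

# Deploy the workshop stack in us-east-1
aws cloudformation create-stack \
  --stack-name chaos-engineering-workshop \
  --region us-east-1 \
  --template-body file://<downloaded-template>.yaml \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM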
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
Get started