Automating a Chaos Engineering Environment on AWS with Terraform


Chaos as Code (CaC) enables you to simply, safely, and securely run, schedule, and manage Chaos Engineering experiments. This tutorial will demonstrate how to use HashiCorp Terraform to automate your Chaos Engineering experiments.

HashiCorp’s Terraform is an open source tool that enables you to define infrastructure as code, increasing productivity and transparency. Terraform codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.

In this tutorial, we will demonstrate how to use Terraform to create an EC2 instance and set up Gremlin to perform Chaos Engineering experiments. You will then perform a Chaos Engineering experiment on your EC2 instance in the form of a Gremlin Latency Attack. This tutorial will help you get started with using Terraform, and give you an idea of how it can be used for Chaos as Code (CaC).

Prerequisites

  • AWS Account (see EC2 permissions in Appendix A)
  • Gremlin Account (with Team ID and Secret ready - sign up here)
  • Terraform (for automating resource creation)
  • AWS CLI (nice to have to interact with your AWS environment)

Step 0: Verify Terraform Installation

If you don’t have Terraform installed, you can download the appropriate package here. On your local machine, verify your Terraform installation by running terraform with no arguments. You should see output like this:

bash
terraform
Usage: terraform [--version] [--help] <command> [args]

The available commands for execution are listed below.
The most common, useful commands are shown first, followed by
less common or more advanced commands. If you're just getting
started with Terraform, stick with the common commands. For the
other commands, please read the help and docs before usage.

Common commands:
    apply              Builds or changes infrastructure
    console            Interactive console for Terraform interpolations
<...>

All other commands:
    debug              Debug output management (experimental)
    force-unlock       Manually unlock the terraform state
    state              Advanced state management
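
Before moving on, it can help to confirm that every command-line prerequisite is actually on your PATH. The following is a minimal sketch, not part of the original tutorial; the function name require_tools is hypothetical, and it only checks that a command exists, not which version is installed.

```shell
# Report which of the named command-line tools are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
# (require_tools is a hypothetical helper name used for illustration.)
require_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For this tutorial you would call it as require_tools terraform aws and install whatever it reports as missing.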

Step 1: Create the VPC Environment

For separation, create two directories, one for VPC specification and another for Instance specification.

On your local machine:

bash
mkdir -p ~/terraform/vpc ~/terraform/instance
cd ~/terraform/vpc

Inside the vpc directory, create the following vpc.tf file using a text editor. We'll use vim throughout this tutorial. Replace the example region/az, tags, IP space and security group as required to set these up correctly for your AWS VPC.

bash
vim vpc.tf

Enter the following information, replacing the region, name, cidr, azs, public_subnets, Owner, Environment, and description values with your own data.

bash
provider "aws" {
  region = "us-west-2"
}

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  name   = "gremlin_vpc"

  cidr = "10.10.0.0/16"

  azs            = ["us-west-2a"]
  public_subnets = ["10.10.1.0/24"]

  tags = {
    Owner       = "your_name"
    Environment = "chaos"
  }
}

module "security_group" {
  source      = "terraform-aws-modules/security-group/aws"
  name        = "ssh"
  description = "ssh from anywhere"
  # Reference to the VPC module's vpc_id output (an expression, not a quoted string)
  vpc_id      = module.vpc.vpc_id

  ingress_cidr_blocks = ["0.0.0.0/0"]
  ingress_rules       = ["ssh-tcp", "all-icmp"]
  egress_rules        = ["all-all"]
}

This vpc.tf Terraform template uses the aws provider and defines a VPC with a single public subnet in one availability zone, plus a security group in that VPC that allows SSH and ICMP traffic from any address.

Let’s run a couple of commands to stand up the underlying networking infrastructure.

On your local machine:

bash
terraform init
terraform apply

Terraform will compute the resources that need to be created, and you will then be prompted:

bash
Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value:

Enter yes, and Terraform will go ahead and create the resources. On successful completion you will see the following result:

Apply complete! Resources: 12 added, 0 changed, 0 destroyed.

Note the Security Group ID (sg-xxxxxxxx) and Subnet ID (subnet-xxxxxxxx) for later.

That’s it! With just a few commands you have created a new VPC with an internet gateway, a subnet in us-west-2a, a route table for the public subnet, and a security group allowing SSH access.

Step 2: Launch an Instance that registers to Gremlin

Now that you have the underlying networking environment prepared, let’s focus on automating the creation of an instance.

Switch to the instance directory you created in Step 1:

bash
cd ~/terraform/instance

Create the instance.tf template that defines the specification of the instance to launch. It references the userdata.sh script to install and authenticate a Gremlin agent at launch. You will create this userdata.sh file in a later step.

To populate the instance.tf template, you will need the Security Group ID and the Subnet ID from Step 1. If you do not recall them, you can retrieve them with the AWS CLI.

bash
aws ec2 describe-security-groups --filters Name=group-name,Values=ssh --query 'SecurityGroups[0].GroupId' --output text

This is an example of the result you will see: sg-91155cee

bash
aws ec2 describe-subnets --filters Name=tag:Name,Values="gremlin_vpc*" --query 'Subnets[0].SubnetId' --output text

This is an example of the result you will see: subnet-cbbd68b2
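
If you expect to repeat these lookups, the same two AWS CLI queries can be wrapped in small shell functions and captured into variables. This is a sketch under assumptions: the function names lookup_security_group_id and lookup_subnet_id are hypothetical, the AWS CLI must already be configured with credentials and a default region, and each query simply returns the first match, so an account with several matching groups or subnets may need a narrower filter.

```shell
# Print the ID of the first security group whose group-name matches $1.
# (Hypothetical helper; wraps the describe-security-groups query shown above.)
lookup_security_group_id() {
  aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=$1" \
    --query 'SecurityGroups[0].GroupId' --output text
}

# Print the ID of the first subnet whose Name tag matches the pattern in $1.
# (Hypothetical helper; wraps the describe-subnets query shown above.)
lookup_subnet_id() {
  aws ec2 describe-subnets \
    --filters "Name=tag:Name,Values=$1" \
    --query 'Subnets[0].SubnetId' --output text
}
```

For example, SG_ID=$(lookup_security_group_id ssh) and SUBNET_ID=$(lookup_subnet_id 'gremlin_vpc*') would hold the two values to paste into instance.tf.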

On your local machine in the ~/terraform/instance directory, create the instance.tf file:

bash
vim instance.tf

Note: If you are new to vim or need a refresher for vim commands, refer to this vim cheatsheet.

Enter the following content, modifying region, name, subnet_id, vpc_security_group_ids, key_name, and Owner accordingly:

bash
provider "aws" {
  region = "us-west-2"
}

# Look up the most recent Amazon Linux AMI published by Amazon
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn-ami-hvm-*-x86_64-gp2"]
  }
}

module "ec2" {
  source = "terraform-aws-modules/ec2-instance/aws"
  # instance_count is accepted by older releases of this module; newer releases
  # dropped it in favor of count/for_each on the module block
  instance_count = 1

  name                        = "gremlin-instance"
  ami                         = data.aws_ami.amazon_linux.id
  associate_public_ip_address = true
  instance_type               = "t2.micro"
  subnet_id                   = "subnet-cbbd68b2"
  vpc_security_group_ids      = ["sg-91155cee"]
  key_name                    = "changeme"
  user_data                   = file("userdata.sh")

  tags = {
    Owner       = "your_name"
    Environment = "chaos"
    DeployFrom  = "terraform"
  }
}

This instance template file defines a t2.micro EC2 instance from the latest Amazon Linux AMI, to be launched in the specified subnet, with the SSH security group created earlier.

Downloading your Gremlin agent certificates

After you have created your Gremlin account (sign up here) you will need to find your Gremlin agent credentials. Log in to the Gremlin App using your Company name and sign-on credentials. These were emailed to you when you signed up to start using Gremlin.

Next, navigate to the Team Settings page by clicking the user icon in the top-right corner of the screen (next to the Halt button) and selecting Team Settings. Click on the Configuration tab, then click the blue Download button to save your certificates to your local computer. The downloaded certificate.zip contains both a public-key certificate and a matching private key.

Unzip the downloaded certificate.zip on your local machine. Next, we will create the userdata.sh script.

On your local machine in the ~/terraform/instance directory, create the userdata.sh file:

bash
vim userdata.sh

Enter the following information, replacing GREMLIN_TEAM_ID with your Gremlin team ID, GREMLIN_CERTIFICATE with the contents of your public certificate (your pub_cert.pem file), GREMLIN_PRIVATE_KEY with the contents of your private key (your priv_key.pem file), and YOUR_NAME with your name:

bash
#!/bin/bash
yum update -y
# Add the Gremlin package repository and install the agent and daemon
curl https://rpm.gremlin.com/gremlin.repo -o /etc/yum.repos.d/gremlin.repo
yum install -y gremlin gremlind
# Read this instance's ID from the EC2 instance metadata service
export INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# Write the team certificate and private key to disk
echo 'GREMLIN_CERTIFICATE' >> /var/lib/gremlin/pub_cert.pem
echo 'GREMLIN_PRIVATE_KEY' >> /var/lib/gremlin/priv_key.pem
# Point the agent configuration at the team ID and the credential files
sed -i '/#team_id/c\team_id: GREMLIN_TEAM_ID' /etc/gremlin/config.yaml
sed -i '/#team_certificate/c\team_certificate: file:///var/lib/gremlin/pub_cert.pem' /etc/gremlin/config.yaml
sed -i '/#team_private_key/c\team_private_key: file:///var/lib/gremlin/priv_key.pem' /etc/gremlin/config.yaml
# Initialize the agent so it connects to Gremlin, tagged with the instance ID and owner
gremlin init -s autoconnect --tag instance_id="$INSTANCE_ID" --tag owner=YOUR_NAME

This script adds the Gremlin repository, installs the Gremlin agent and daemon, sets the configuration file with authentication details and instance tags, and finally starts the service to connect as an agent to Gremlin.
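
Because userdata.sh only runs at instance launch, a placeholder left unreplaced will not surface until the agent fails to appear in Gremlin. A quick local check before applying can catch that. The sketch below is an illustration rather than part of the original tutorial; check_placeholders is a hypothetical helper name, and it looks only for the four placeholder strings named above.

```shell
# Print any lines in the given file that still contain a tutorial placeholder.
# Returns 0 when no placeholders remain, 1 otherwise.
# (check_placeholders is a hypothetical helper name used for illustration.)
check_placeholders() {
  if grep -n -E 'GREMLIN_TEAM_ID|GREMLIN_CERTIFICATE|GREMLIN_PRIVATE_KEY|YOUR_NAME' "$1"; then
    echo "placeholders still present in $1" >&2
    return 1
  fi
  return 0
}
```

Running check_placeholders userdata.sh in the ~/terraform/instance directory prints the offending lines with their line numbers, or nothing if every placeholder has been replaced.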

With everything ready, let’s run these templates.

bash
terraform init
terraform apply

Again, indicate yes and Terraform will bring up an EC2 instance. A successful result will appear as below:

bash
module.ec2.aws_instance.this: Creation complete after 22s
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Now open the Agents page in the Gremlin Control Panel.

You should see your newly brought up instance as an Online agent on Gremlin. Hooray!

Step 3: Run Your First Attack in your own Chaos Environment

Prepare a new Latency Gremlin Attack targeting the newly registered instance, but do not execute the attack just yet.

  • Log into the Gremlin web app.
  • Click Attacks in the left navigation bar, then click New Attack.
  • Select the Infrastructure tab, then select your EC2 instance. An easy way to find your instance is by entering its IP address or hostname in the search box.
  • Scroll down and click on the Choose a Gremlin section. Select the Network category, then select Latency.

Before starting the attack, SSH into the instance using your key file and start pinging www.google.com. You can do this by running the following commands on your local machine (make sure to swap in your own key file and IP address in place of mykey.pem and 34.214.21.96).

bash
ssh -i mykey.pem ec2-user@34.214.21.96

[ec2-user@ip-10-10-1-88 ~]$ ping www.google.com
PING www.google.com (173.194.202.99) 56(84) bytes of data.
64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=1 ttl=37 time=14.5 ms
64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=2 ttl=37 time=14.5 ms
64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=3 ttl=37 time=14.4 ms

Switch back to the browser where you have the Gremlin web app open and click Unleash Gremlin to execute the latency attack. Once the attack enters the Running stage, switch back to the terminal where ping is running. You should see the round trip time increase by about 100 ms, similar to the output below (note the change between icmp_seq=10 and icmp_seq=11):

64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=8 ttl=37 time=14.5 ms
64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=9 ttl=37 time=14.5 ms
64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=10 ttl=37 time=14.5 ms
64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=11 ttl=37 time=114 ms
64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=12 ttl=37 time=114 ms
64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=13 ttl=37 time=114 ms
64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=14 ttl=37 time=114 ms
64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=15 ttl=37 time=114 ms
64 bytes from pf-in-f99.1e100.net (173.194.202.99): icmp_seq=16 ttl=37 time=114 ms
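
Rather than eyeballing the ping output, you can extract the round trip times and compute how far they moved. The following is a rough sketch assuming the Linux ping reply format shown above; ping_rtts and rtt_spread are hypothetical helper names, and reply lines in other formats (for example time<1 ms) are simply skipped.

```shell
# Read ping output on stdin and print "icmp_seq rtt_ms" for each reply line.
# Lines without both icmp_seq= and time= (such as the PING header) are skipped.
ping_rtts() {
  sed -n 's/.*icmp_seq=\([0-9][0-9]*\).*time=\([0-9.][0-9.]*\) ms.*/\1 \2/p'
}

# Read ping output on stdin and print the difference between the largest
# and smallest round trip time, in milliseconds, to one decimal place.
rtt_spread() {
  ping_rtts | awk '
    NR == 1      { min = $2 + 0; max = $2 + 0 }
    $2 + 0 < min { min = $2 + 0 }
    $2 + 0 > max { max = $2 + 0 }
    END          { if (NR > 0) printf "%.1f\n", max - min }
  '
}
```

For instance, ping -c 20 www.google.com | rtt_spread run across the start of the attack should print a value close to the injected latency; fed the sample replies above (14.5 ms before, 114 ms during), it prints 99.5.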

Congratulations! In a very short amount of time, you have automated the creation of a completely new environment apart from the rest of your running resources, launched an instance that connects automatically to Gremlin, and run your first attack in this environment.

If you are feeling adventurous, we highly recommend that you play around with Terraform. Create additional subnets in more availability zones. Create private subnets that talk to the internet through a NAT gateway. Increase the instance count to launch more Gremlin instances. Also take a stab at running other attacks with Gremlin in this environment.

Step 4: Cleaning Up

Let’s first terminate the instance.

On your local machine:

bash
cd ~/terraform/instance
terraform destroy

As with the creation of resources, Terraform will ask you to confirm that you really want to destroy the resources.

bash
Do you really want to destroy?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value:

Enter yes, and Terraform will go ahead and destroy the resources.

bash
Destroy complete! Resources: 1 destroyed.

Now go ahead and also destroy the VPC.

On your local machine:

bash
cd ~/terraform/vpc
terraform destroy

The next time you want to spin up the environment, simply reuse the templates you created here, and you will have your chaos environment within minutes.
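
The order matters in both directions: the VPC must exist before the instance is created, and the instance must be destroyed before the VPC. A small wrapper can make that order explicit. This is a sketch under assumptions, not part of the original tutorial: terraform_each is a hypothetical helper name, and it assumes the ~/terraform/vpc and ~/terraform/instance layout from Step 1. Terraform still prompts for confirmation in each directory.

```shell
# Run one Terraform command in each listed subdirectory of ~/terraform,
# in the order given, stopping at the first failure.
# (terraform_each is a hypothetical helper name used for illustration.)
terraform_each() {
  if [ "$#" -lt 2 ]; then
    echo "usage: terraform_each <command> <dir> [<dir>...]" >&2
    return 2
  fi
  action=$1
  shift
  for dir in "$@"; do
    (cd "$HOME/terraform/$dir" && terraform "$action") || return 1
  done
}
```

Teardown would then be terraform_each destroy instance vpc. For bring-up, note that instance.tf hard-codes the subnet and security group IDs, and a recreated VPC gets new IDs, so run terraform_each apply vpc first, update instance.tf, and then run terraform_each apply instance.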

Conclusion

By templatizing your chaos environment, you are able to quickly spin up an environment, run an attack to purposefully inject fault into the system, and return to zero footprint when you are done. Expanding on what you have achieved, if you also bring up your application within this environment, you're also able to evaluate and validate its resiliency against specific real-life operational scenarios. With the basics of running attacks down, you may want to think about running GameDays. If you need some help, here is How to Run a GameDay.

Appendix A - EC2 Permissions

You should have no issues if your user has the AdministratorAccess or AmazonEC2FullAccess policy attached. Otherwise, you will need permissions for the following API actions:

ec2:AssociateRouteTable
ec2:AttachInternetGateway
ec2:AuthorizeSecurityGroupEgress
ec2:AuthorizeSecurityGroupIngress
ec2:CreateInternetGateway
ec2:CreateRoute
ec2:CreateRouteTable
ec2:CreateSecurityGroup
ec2:CreateSubnet
ec2:CreateTags
ec2:CreateVpc
ec2:DeleteInternetGateway
ec2:DeleteRoute
ec2:DeleteRouteTable
ec2:DeleteSecurityGroup
ec2:DeleteSubnet
ec2:DeleteVpc
ec2:Describe*
ec2:DetachInternetGateway
ec2:DisassociateRouteTable
ec2:ModifySubnetAttribute
ec2:ModifyVpcAttribute
ec2:ReplaceRouteTableAssociation
ec2:RevokeSecurityGroupEgress
ec2:RevokeSecurityGroupIngress
ec2:RunInstances
ec2:TerminateInstances
