Experiments
An experiment is a method of injecting failure into a system in a simple, safe, and secure way. Gremlin provides a range of experiments that you can run against your infrastructure. These include impacting system resources, delaying or dropping network traffic, shutting down hosts, and more. In addition to running one-time experiments, you can also schedule regular or recurring experiments, create experiment templates, and view experiment reports.
Gremlin provides three categories of experiments:
- Resource experiments: test against sudden changes in consumption of computing resources
- Network experiments: test against unreliable network conditions
- State experiments: test against unexpected changes in your environment such as power outages, node failures, clock drift, or application crashes
Each experiment tests your resilience in a different way.
Resource Experiments
Resource experiments are a great starting point because they are simple to run and understand. They reveal how your service degrades when starved of CPU, memory, IO, or disk space.
Experiment | Impact |
---|---|
CPU | Generates high load for one or more CPU cores. |
Memory | Allocates a specific amount of RAM. |
IO | Puts read/write pressure on I/O devices such as hard disks. |
Disk | Writes files to disk to fill it to a specific percentage. |
State Experiments
State experiments modify the state of a target so you can test auto-correction and similar fault-tolerant mechanisms.
Experiment | Impact |
---|---|
Shutdown | Performs a shutdown (and an optional reboot) on the host operating system to test how your system behaves when losing one or more cluster machines. |
Time Travel | Changes the host's system time, which can be used to simulate adjusting to daylight saving time and other time-related events. |
Process Killer | Kills the specified process, which can be used to simulate application or dependency crashes. Note: Process experiments do not work for Process ID 1; consider a Shutdown experiment instead. |
Network Experiments
Network experiments test the impact of lost or delayed traffic to a target. Test how your service behaves when you are unable to reach one of your dependencies, internal or external. Limit the impact to only the traffic you want to test by specifying ports, hostnames, and IP addresses.
Experiment | Impact |
---|---|
Blackhole | Drops all matching network traffic. |
Certificate Expiry | Checks for expiring security certificates. |
Latency | Injects latency into all matching egress network traffic. |
Packet Loss | Induces packet loss into all matching egress network traffic. |
DNS | Blocks access to DNS servers. |
Warning: Important considerations for targeting Kubernetes Pods with Network experiments
Network host tags
You can use tags to target IP addresses where traffic should be impacted during network experiments. This is important for today's ephemeral environments where hosts live for a short time and have dynamic IP addresses. As custom tags are used to indicate where an experiment should run, the same tags can be used to indicate the hosts to which network traffic should be impacted. For example, to test latency between serviceA and serviceB, select all clients with the tag service:serviceA when choosing the Hosts to target, and select the tag service:serviceB when configuring the Network experiment. IP addresses assigned to the network interface by the container runtime are also automatically included.
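The serviceA/serviceB example can be pictured as two tag lookups: one that picks where the experiment runs, and one that picks which destination IP addresses are impacted. The following is a minimal sketch under stated assumptions: the `Host` record, the `hosts_with_tag` helper, and the inventory values are hypothetical illustrations, not Gremlin's data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical inventory record; not a Gremlin type."""
    name: str
    ip: str
    tags: dict = field(default_factory=dict)


def hosts_with_tag(hosts, key, value):
    """Return the hosts whose tag `key` equals `value`."""
    return [h for h in hosts if h.tags.get(key) == value]


# Made-up inventory for illustration only.
inventory = [
    Host("web-1", "10.0.1.10", {"service": "serviceA"}),
    Host("web-2", "10.0.1.11", {"service": "serviceA"}),
    Host("db-1", "10.0.2.20", {"service": "serviceB"}),
    Host("cache-1", "10.0.3.30", {"service": "serviceC"}),
]

# Hosts that run the experiment (tag service:serviceA).
experiment_targets = [h.name for h in hosts_with_tag(inventory, "service", "serviceA")]
# Destination IPs whose traffic is impacted (tag service:serviceB).
impacted_ips = [h.ip for h in hosts_with_tag(inventory, "service", "serviceB")]

print(experiment_targets, impacted_ips)
```

Because the lookup is by tag rather than by address, the set of impacted IPs follows the hosts as they come and go, which is the point of using tags in ephemeral environments.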
Network providers
Limit the impact of a network experiment to specific external service providers. Select one or many services and their associated region to impact. Gremlin currently supports AWS, Azure, and Datadog services. The destination network configuration is automatically updated daily using these sources: AWS discovery service, Azure service tags.
Network device selection
All network experiments accept a --device argument that refers to the network interfaces to target. Starting with Linux agent version 2.30.0 / Windows agent version 1.9.0, you can specify one or more network interfaces using either a comma-separated list or with multiple --device arguments.
When unspecified, Gremlin targets all physical network interfaces as reported by the operating system. For virtual / cloud machines, that typically includes the expected network interfaces like eth0 and eth1 for Linux and Ethernet for Windows.
Device discovery on older agents
Agents before Linux version 2.30.0 / Windows version 1.9.0 use a different strategy described here. All network experiments accept a --device argument that refers to the network interface to target. Gremlin network experiments target only one network interface at a time. When unspecified, Gremlin chooses an interface according to the following order of operations:
- Gremlin omits all loopback devices (determined by [RFC1122]).
- Gremlin selects the device with the lowest interface index that starts with eth, en, or, for Windows, Ethernet.
- If nothing was found, Gremlin selects the device with the lowest interface index that is non-private (according to [RFC1918]).
- If nothing was found, Gremlin selects the first device with the lowest interface index.
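The order of operations above can be sketched as a small selection function. This is an illustration only, not the agent's implementation: the `Interface` record is hypothetical, and treating a device as "non-private" when it has at least one address outside the RFC 1918 ranges is an assumption.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ip_network("10.0.0.0/8"),
    ip_network("172.16.0.0/12"),
    ip_network("192.168.0.0/16"),
]


@dataclass
class Interface:
    """Hypothetical description of a network device; not a Gremlin type."""
    name: str
    index: int
    is_loopback: bool = False
    addresses: list = field(default_factory=list)


def is_private(address):
    """True if the address falls inside one of the RFC 1918 ranges."""
    ip = ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def choose_device(interfaces):
    """Pick one device following the documented order of operations."""
    # Step 1: omit loopback devices, then consider the rest by interface index.
    candidates = sorted(
        (i for i in interfaces if not i.is_loopback), key=lambda i: i.index
    )
    # Step 2: lowest index whose name starts with eth, en, or Ethernet.
    for iface in candidates:
        if iface.name.startswith(("eth", "en", "Ethernet")):
            return iface
    # Step 3 (assumption): lowest index with at least one non-private address.
    for iface in candidates:
        if any(not is_private(a) for a in iface.addresses):
            return iface
    # Step 4: fall back to the lowest interface index overall.
    return candidates[0] if candidates else None


devices = [
    Interface("lo", 1, is_loopback=True, addresses=["127.0.0.1"]),
    Interface("ens5", 2, addresses=["10.0.1.15"]),
    Interface("docker0", 3, addresses=["172.17.0.1"]),
]
print(choose_device(devices).name)  # ens5
```

On newer agents this fallback logic does not apply, since all physical interfaces are targeted unless --device is given.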
Experiment stage progression
Every experiment in Gremlin is composed of one or more Executions, where each Execution is an instance of the experiment running on a specific target.
The Stage progression of an experiment is derived from the Stage progression of all of an experiment's Executions. Gremlin weighs the importance of Stages to mark an experiment with the most important Stage of its executions.
Example
An experiment with three Executions will derive its final reported stage by picking the most important stage from among its executions. So, if the three Execution Stages are TargetNotFound, Running, TargetNotFound, the resulting stage for the experiment will be Running.
You can see Stages ordered by their importance in the following section.
Stages
Stages are sorted by descending order of importance (the Running Stage holds the highest importance).
Stage | Description |
---|---|
Running | Experiment running on the host |
Halt | Experiment told to halt |
RollbackStarted | Code to roll back has started |
RollbackTriggered | Daemon started a rollback of client |
InterruptTriggered | Daemon issued an interrupt to the client |
HaltDistributed | Distributed to the host but not yet halted |
Initializing | Experiment is creating the desired impact |
Distributed | Distributed to the host but not yet running |
Pending | Created but not yet distributed |
Failed | Client reported unexpected failure |
HaltFailed | Halt on client did not complete |
InitializationFailed | Creating the impact failed |
LostCommunication | Client never reported finishing/receiving execution |
ClientAborted | Something on the client/daemon side stopped the Gremlin and it was aborted without user intervention |
UserHalted | User issued a halt, and that is now complete |
Successful | Completed running on the Host |
TargetNotFound | Experiment not scoped to any current targets |
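Deriving an experiment's stage from its Executions amounts to picking the stage that appears earliest in the table above. A minimal sketch, assuming the table order is the complete ranking; the `derive_experiment_stage` helper is hypothetical and not part of any Gremlin SDK.

```python
# Stage names in descending order of importance, copied from the table above.
STAGES_BY_IMPORTANCE = [
    "Running", "Halt", "RollbackStarted", "RollbackTriggered",
    "InterruptTriggered", "HaltDistributed", "Initializing", "Distributed",
    "Pending", "Failed", "HaltFailed", "InitializationFailed",
    "LostCommunication", "ClientAborted", "UserHalted", "Successful",
    "TargetNotFound",
]

# Lower rank means higher importance.
_RANK = {stage: rank for rank, stage in enumerate(STAGES_BY_IMPORTANCE)}


def derive_experiment_stage(execution_stages):
    """Return the most important stage among an experiment's executions."""
    if not execution_stages:
        raise ValueError("an experiment needs at least one execution")
    # An unknown stage name raises KeyError rather than being ranked silently.
    return min(execution_stages, key=lambda stage: _RANK[stage])


print(derive_experiment_stage(["TargetNotFound", "Running", "TargetNotFound"]))  # Running
```

This reproduces the earlier example: a single Running execution outranks any number of TargetNotFound executions.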
Scheduling experiments
Experiments can be run ad-hoc or scheduled, from the Web App or programmatically. You can schedule experiments to execute on certain days and within a specified time window. You can also set the maximum number of experiments a schedule can generate.
Running experiments on Kubernetes objects
Gremlin allows targeting objects within your Kubernetes clusters. After selecting a cluster, you can filter the visible set of objects by selecting a namespace. Select any of your Deployments, ReplicaSets, StatefulSets, DaemonSets, or Pods. When one object is selected, all child objects will also be targeted. For example, when selecting a DaemonSet, all of the pods within will be selected.
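Selecting a parent object and targeting everything beneath it can be pictured as a walk down the ownership tree. The sketch below is an assumption-laden illustration: the `children` mapping and the object names are made up, and a real cluster would supply this relationship through Kubernetes owner references rather than a hand-written dictionary.

```python
def pods_under(selected, children):
    """Collect every Pod reachable from the selected object."""
    stack, pods = [selected], []
    while stack:
        obj = stack.pop()
        kind = obj.split("/", 1)[0]
        if kind == "Pod":
            pods.append(obj)
        # Descend into child objects, if any.
        stack.extend(children.get(obj, []))
    return sorted(pods)


# Hypothetical ownership tree: Deployment -> ReplicaSet -> Pods.
children = {
    "Deployment/web": ["ReplicaSet/web-7d9f"],
    "ReplicaSet/web-7d9f": ["Pod/web-7d9f-abc", "Pod/web-7d9f-def"],
}

print(pods_under("Deployment/web", children))
```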
Selecting containers
For State and Resource experiment types, you can target all, any, or specific containers within a selected pod. Once you select your targets, these options will be available under Choose a Gremlin on the Experiment page. Selecting Any will target a single container within each pod at runtime. If you've selected more than one target (for example, Deployment), you can select from a list of common containers across all of these targets. When you run the experiment, the underlying containers within the objects selected will be impacted.
Targeted containers also need to be able to resolve api.gremlin.com; otherwise, the experiment will fail. Gremlin adopts all the configuration and resources of the pod it is experimenting on.
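The three container options (all, any, specific) and the list of common containers across targets can be sketched as follows. This is illustrative only: the function names and the pod layout are hypothetical, and picking the "any" container at random is an assumption, since the text only says that a single container per pod is chosen at runtime.

```python
import random


def common_containers(pods):
    """Container names that appear in every pod (candidates for 'specific')."""
    name_sets = [set(containers) for containers in pods.values()]
    return sorted(set.intersection(*name_sets)) if name_sets else []


def select_containers(pods, mode, names=(), rng=random):
    """Return {pod: [containers]} for mode 'all', 'any', or 'specific'."""
    selection = {}
    for pod, containers in pods.items():
        if mode == "all":
            chosen = list(containers)
        elif mode == "any":
            # Assumption: one container per pod, picked at random.
            chosen = [rng.choice(containers)] if containers else []
        elif mode == "specific":
            chosen = [c for c in containers if c in names]
        else:
            raise ValueError(f"unknown mode: {mode}")
        selection[pod] = chosen
    return selection


# Made-up pods and container names.
pods = {
    "web-abc": ["app", "sidecar"],
    "web-def": ["app", "metrics"],
}

print(common_containers(pods))
print(select_containers(pods, "specific", names=["app"]))
```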
Monitoring experiments in real time
You can observe your environments in real time in Gremlin for CPU or Shutdown experiments to quickly verify the effect of your experiments. For CPU experiments, you can see the statistics for CPU load; for Shutdown experiments, you can see machine uptime.
Enabling Experiment Visualizations
Company Admins and Owners can turn this feature on for their company by visiting the Company Settings, clicking Settings, and toggling Experiment Visualizations on. Only data relevant to the experiment is collected and no data is collected when experiments are not running.
Overriding Experiment Visualizations for a host
To prevent any host from sending metrics to populate experiment visualization charts, add PUSH_METRICS="0" to the configuration for 'gremlind' on that host. This will override the company preference and will prevent that particular host from sending metrics.
Parameter reference
For details on parameters supplied to individual experiments, check out the links to the individual experiment pages at the beginning of this page.
Include new targets in ongoing experiments
When selecting targets by tag, you have the option to check the Include New Targets checkbox. When checked, if Gremlin detects a new target that meets the experiment's selection criteria, it will distribute the experiment to the target. By default, new targets will not run the experiment even if they match the selection criteria.
For example, imagine you select all EC2 hosts in the AWS us-east-1 region for a CPU experiment. When you run the experiment, AWS detects the increased CPU usage and automatically provisions a new EC2 instance and installs the Gremlin agent. If Include New Targets is checked, Gremlin will add this new instance to the ongoing CPU experiment.
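The effect of the Include New Targets checkbox can be summarized as a set difference that is only evaluated when the option is on. A minimal sketch under stated assumptions: the function and the instance IDs are hypothetical and only model the behavior described above.

```python
def targets_to_add(matching_now, already_running, include_new_targets):
    """Targets that should newly receive an ongoing experiment."""
    if not include_new_targets:
        # Default behavior: new matches are ignored.
        return set()
    return set(matching_now) - set(already_running)


# Made-up instance IDs: i-0c3 was provisioned after the experiment started.
running = {"i-0a1", "i-0b2"}
matching = {"i-0a1", "i-0b2", "i-0c3"}

print(targets_to_add(matching, running, include_new_targets=True))
print(targets_to_add(matching, running, include_new_targets=False))
```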
Multiple values
Port and address options can be used multiple times in a single command.
# Run a latency experiment on both DynamoDB and database.mydomain.org
gremlin attack latency -h dynamodb.us-west-1.amazonaws.com -h database.mydomain.org
Alternatively, a comma (,) can also be used to specify multiple values.
gremlin attack latency -p 8080,443
Exclude rules
A caret (^) can be used before a port or address to exclude that argument from the set of impacted network targets.
# Slow down all ports except DNS port
gremlin attack latency -p ^53
This can be particularly useful for excluding a specific IP from a range that is otherwise impacted by the experiment.
# Blackhole all hosts in 10.0.0.0/24 except for 10.0.0.11
gremlin attack blackhole -i 10.0.0.0/24 -i ^10.0.0.11
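To reason about how repeated options, comma-separated values, and ^ exclusions combine, it can help to model them as an include list and an exclude list. The sketch below is not Gremlin's matching code: the helper names are hypothetical, and two behaviors are assumptions, namely that an exclusion always wins over an inclusion and that a rule set with only exclusions matches everything else (as the -p ^53 example suggests).

```python
from ipaddress import ip_address, ip_network


def split_rules(values):
    """Split repeated or comma-separated option values into (include, exclude)."""
    include, exclude = [], []
    for value in values:
        for item in value.split(","):
            item = item.strip()
            if not item:
                continue
            if item.startswith("^"):
                exclude.append(item[1:])  # drop the leading caret
            else:
                include.append(item)
    return include, exclude


def is_impacted(address, ip_values):
    """Decide whether traffic to `address` matches the given -i style values."""
    include, exclude = split_rules(ip_values)
    ip = ip_address(address)
    # Assumption: an exclusion always takes precedence.
    if any(ip in ip_network(rule, strict=False) for rule in exclude):
        return False
    # Assumption: with no include rules, everything not excluded matches.
    if not include:
        return True
    return any(ip in ip_network(rule, strict=False) for rule in include)


rules = ["10.0.0.0/24", "^10.0.0.11"]
print(is_impacted("10.0.0.10", rules))  # True
print(is_impacted("10.0.0.11", rules))  # False
```

The same `split_rules` helper applies to ports: `split_rules(["8080,443"])` yields two included ports, and `split_rules(["^53"])` yields a single excluded port.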