Updated January 24, 2020
Organizations planning to migrate to the cloud should embrace Chaos Engineering as a thoughtful strategy that helps avoid pain down the road.
Migrating applications to the cloud and embracing cloud computing is an intimidating prospect, and understandably so: a lot will change in your systems as you move from on-premises to the cloud, and those changes can mean instability.
How can you ensure your software will remain safe after migrating to the cloud? How do you guard against the cloud's chaotic nature while providing a reliable and stable system? By intentionally inducing Chaos well before migration begins.
It sounds counter-intuitive to perform Chaos Engineering while your team is actively migrating to the cloud. Wouldn't that add failure and slow down an already challenging process? In reality, Chaos Engineering is a great way to test how your new system will behave once you switch traffic over. By performing Chaos Experiments on the environment you are migrating into, you will identify previously unknown weaknesses while you still have time to mitigate them.
This blog post will discuss some foundational considerations for migrating to the cloud, regardless of which of the cloud providers you intend to use. We then describe a number of ways that things can go wrong and provide tutorials to run Chaos Experiments to proactively identify potential issues before they turn into production outages.
What Should I Think About When Migrating to the Cloud?
When your applications, data storage, networking, and hardware are all completely under your control in your data center, you have complete oversight and governance. You also have the headaches that can come with responsibility for maintaining infrastructure at the lowest levels.
Moving to Amazon Web Services, Google Cloud, Microsoft Azure, or even VMware can provide cost savings, because you pay only for what you actually need and use, with no infrastructure sitting idle. All of these vendors, and others, offer quality cloud solutions.
If you are careful, you can mitigate vendor lock-in as you adopt the cloud, but moving to a different cloud later is non-trivial, and if you design for that level of portability you give up many of the often powerful vendor-specific tools. It is usually best to pick a vendor you like and commit for several years, unless you have a fairly simple application that is easy to migrate to a new cloud.
There is also the option of creating a multi-cloud system, where you host some services with one provider and others elsewhere, perhaps because certain vendors offer options and tools that work great for one part of your system while others work better for other parts. This can certainly be done, but you are unlikely to get good support when things go wrong, and the inconsistency of management tools and styles across vendors is likely to cause problems.
Moving to the public cloud abstracts the platform and infrastructure headaches away from you, much as software as a service (SaaS) does at the application level. Additionally, you can use cloud services to create high availability (HA) systems with regional diversity and redundancy with greater ease and less expense than building and maintaining multiple data centers.
Most cloud migrations involve migrating applications, but they also mean migrating data, databases, software testing, workloads, and more. Thinking through all of this and doing it in a way that makes sense is hard.
What Does a Cloud Migration Look Like?
Typically, enterprises perform a system redesign as part of migrating to the cloud. You can attempt a so-called lift-and-shift, which simply rehosts what you have from where it is onto cloud hosting, but you won't get most of the major benefits of the cloud by doing only that. It can be a first step, but it should then be followed, at a minimum, by replatforming your application with a few cloud optimizations to take advantage of the benefits available. That is adequate for small organizations that have no plans for expansion or growth.
A cloud platform provides great opportunities to improve reliability, disaster recovery and mitigation, and uptime, but doing so well requires that you begin leveraging the benefits of cloud computing, starting with your cloud migration strategy.
Many of the most successful migrations follow a cloud migration process where teams break their systems up into discrete microservices that can operate decoupled from the rest of the system. This allows services to be replicated when needed for capacity expansion, or to keep serving traffic alongside a canary deployment of a new version of that service that you are testing with limited traffic in production.
Some of those services are just simple functions that are called rarely, making them great candidates for serverless deployment, which eliminates the need to think about the server, the operating system, and so on. It can also increase upgrade and deployment velocity.
There are two ways to approach this. The most common is to build an entirely new cloud-native system alongside the existing system, perhaps even doing so first in a private cloud. Then, as the original functionality becomes available in the new system (or a streamlined set of functions that better serves current business needs and strengthens the system by eliminating unneeded opportunities for failure), start shifting traffic over. That is a good migration plan.
Breaking up a monolith gradually, perhaps even using a hybrid cloud temporarily during the process, is also a great way to migrate. A real-life example of doing that in a non-computing application may be helpful to illustrate.
More than 20 years ago I bought my first house. It was old and a real fixer-upper. One of the things I did was upgrade the entire electrical system because when I bought it the wiring consisted of old cloth-covered knob and tube style wiring with a fuse box that had two 10 amp circuits.
Because the box and service were inadequate for doing anything modern with electricity, someone had placed pennies under each fuse, completing the circuit and bypassing the fuse entirely, along with the protection fuses provide. As you migrate, expect to find similar workarounds in your legacy systems, bypassing the security and reliability you think you have. It happens frequently, usually courtesy of someone long gone from the company, so there is no organizational memory of the workaround.
First, I upgraded the service panel to a modern 200 amp circuit breaker panel, which is like initiating a cloud deployment by creating the initial cloud infrastructure. Then, one by one, I began adding in new circuits to replace what existed. In essence, I was running my legacy application alongside the new infrastructure as pieces of the old were replaced bit by bit.
Little by little, I cut the wiring out of the original circuit and put new circuits in the bathroom, a couple in the kitchen, and so on, area by area across the house, each new circuit designed for future upgradability and on its own discrete 20 amp breaker. This is what we do when migrating a large application to the cloud: take one function from that application at a time and move it.
You can do this as a hybrid cloud, calling each new microservice from your existing monolith while you migrate other functions. Sometimes you must create a hybrid cloud because you have contractual or legal obligations around sensitive data, and this provides a nice path for moving what you can to the cloud to save money.
You can also make the gradual move to the cloud by performing the initial move as a temporary lift-and-shift while you build your cloud-native microservices alongside it, running as a hybrid application during your refactoring. There are some good migration tools from different vendors to help with this.
As enterprises migrate to the cloud, using Chaos Engineering to test each change along the way will provide greater trust in the reliability of each piece and in the greater whole. The rest of this article suggests some specific chaos experiments you can run to help you better understand your changing system.
Evaluating Network Reliability
Network problems are a common cause of service outages. Even architectures designed with network redundancies can experience multiple, cumulative network failures. Moreover, most modern software relies on networks you do not control, which means a network outage completely outside of your oversight could cause a failure to propagate throughout your system.
Performing a Black Hole Attack with Gremlin
A Black Hole Attack temporarily drops all traffic that matches the parameters of the attack. You can use a Black Hole Attack to test routing protocols, loss of communication to specific hosts, port-based traffic, network device failure, and much more.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
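The API calls in this post authenticate by passing your token in an Authorization header via an environment variable. A minimal setup sketch is below; the value shown is a hypothetical placeholder, so substitute the actual token from your Gremlin account.

```bash
# Export your Gremlin API token so the curl commands in this post can reference it.
# "<your-gremlin-api-token>" is a placeholder; copy the real value from your Gremlin account settings.
export GREMLIN_API_TOKEN="<your-gremlin-api-token>"
```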
A Gremlin API Black Hole Attack accepts the following arguments.
Short Flag | Long Flag | Purpose |
---|---|---|
-d | --device | Network device through which traffic should be affected. Defaults to the first device found. |
-h | --hostname | Outgoing hostnames to affect. Optionally, you can prefix a hostname with a caret (^) to exclude it. It is recommended to include ^api.gremlin.com in the exclude list. |
-i | --ipaddress | Outgoing IP addresses to affect. Optionally, you can prefix an IP with a caret (^) to exclude it. |
-l | --length | Attack duration (in seconds). |
-n | --ingress_port | Only affect ingress traffic to these destination ports. Ranges can also be specified (e.g. 8080-8085). |
-p | --egress_port | Only affect egress traffic to these destination ports. Ranges can also be specified (e.g. 8080-8085). |
-P | --ipprotocol | Only affect traffic using this protocol. |
Start by performing a test to establish a baseline. The following command tests the response time of a request to `example.com` (which has an IP address of `93.184.216.34`).

```bash
time curl -o /dev/null 93.184.216.34

# OUTPUT
real    0m0.025s
user    0m0.009s
sys     0m0.000s
```

On your local machine, create the `attacks/blackhole.json` file and paste the following JSON into it. Set your target Agent as necessary. This attack creates a 30-second black hole that drops traffic to the `93.184.216.34` IP address.

```json
{
  "command": {
    "type": "blackhole",
    "args": ["-l", "30", "-i", "93.184.216.34", "-h", "^api.gremlin.com"]
  },
  "target": {
    "type": "Exact",
    "exact": ["aws-nginx"]
  }
}
```

Execute the Black Hole Attack by passing the JSON from `attacks/blackhole.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

```bash
curl -H "Content-Type: application/json" -H "Authorization: $GREMLIN_API_TOKEN" https://api.gremlin.com/v1/attacks/new -d "@attacks/blackhole.json"
```

On the target machine, run the same timed `curl` test as before. It now hangs for approximately 30 seconds until the black hole has been terminated and a response is finally received.

```bash
time curl -o /dev/null 93.184.216.34

# OUTPUT
real    0m31.623s
user    0m0.013s
sys     0m0.000s
```

You can also create, run, and view the Attack on the Gremlin Web UI.
Troubleshooting I/O Bottlenecks
Due to the proliferation of automated monitoring and elastic scaling, I/O failure may seem like an unlikely problem within a cloud architecture. However, even when I/O failure isn't the root cause of an outage, it is often a symptom of another issue, and an I/O failure can trigger a negative cascading effect throughout other dependent systems. Moreover, because I/O failure is often considered unlikely, it is frequently overlooked as a test subject. It should not be.
Performing an I/O Attack with Gremlin
Gremlin's IO Attack performs rapid read and/or write actions on the targeted system volume.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A Gremlin API IO Attack accepts the following arguments.
Short Flag | Long Flag | Purpose |
---|---|---|
-c | --block-count | The number of blocks read or written by workers. |
-d | --dir | The directory that temporary files will be written to. |
-l | --length | Attack duration (in seconds). |
-m | --mode | Specifies if workers are in read (r), write (w), or read+write (rw) mode. |
-s | --block-size | Size of blocks (in KB) that are read or written by workers. |
-w | --workers | The number of concurrent workers. |
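(Optional) Before creating the attack, it can help to capture a rough I/O baseline on the target machine so the attack's effect is easier to see. A minimal check, assuming iotop is installed (the same command is used below to verify the attack):

```bash
# Observe accumulated per-process disk I/O before the attack begins.
sudo iotop -aoP
```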
On your local machine, create an `attacks/io.json` file and paste the following JSON into it. Change the target Agent as necessary. This IO Attack creates two workers that will perform both reads and writes during the 45-second attack.

```json
{
  "command": {
    "type": "io",
    "args": ["-l", "45", "-d", "/tmp", "-w", "2", "-m", "rw", "-s", "4", "-c", "1"]
  },
  "target": {
    "type": "Exact",
    "exact": ["aws-nginx"]
  }
}
```

Launch the IO Attack by passing the JSON from `attacks/io.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

```bash
curl -H "Content-Type: application/json" -H "Authorization: $GREMLIN_API_TOKEN" https://api.gremlin.com/v1/attacks/new -d "@attacks/io.json"
```

On the target machine, verify that the attack is running and that I/O is currently overloaded.

```bash
sudo iotop -aoP

# OUTPUT
Total DISK READ :  0.00 B/s | Total DISK WRITE :  3.92 M/s
Actual DISK READ:  0.00 B/s | Actual DISK WRITE: 15.77 M/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  323  be/3  root       0.00 B     68.00 K   0.00 %  71.28 %  [jbd2/xvda1-8]
20030  be/4  gremlin    0.00 B    112.15 M   0.00 %  17.11 %  gremlin attack io -l 45 -d /tmp -w 2 -m rw -s 4 -c 1
```

You can also create, run, and view the Attack on the Gremlin Web UI.
Managing Heavy CPU Load
An overloaded CPU can quickly create bottlenecks and cause failures within most architectures, whether cloud applications or traditional ones. In a distributed cloud environment, instability in a single system can quickly cascade into problems elsewhere down the chain. Proper CPU reliability testing helps determine which systems remain reliable despite a CPU failure, and which need to be prioritized for the upgrades and migration necessary to maintain a stable stack.
Performing a CPU Attack with Gremlin
A Gremlin CPU Attack consumes 100% of the specified CPU cores on the target system. The CPU Attack is a great way to test the stability of the targeted machine -- along with its critical dependencies -- when the CPU is overloaded.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A CPU Attack accepts the following arguments.
Short Flag | Long Flag | Purpose |
---|---|---|
-c | --cores | Number of CPU cores to attack. |
-l | --length | Attack duration (in seconds). |
Most Gremlin API calls accept a JSON body payload, which specifies critical arguments. In all the following examples you'll be creating a local `attacks/<attack-name>.json` file to store the API attack arguments. You'll then pass those arguments along to the API request.
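Because every example issues the same curl call with a different JSON file, you may want a small wrapper. Here is a minimal sketch (the run_attack.sh filename is our own choice, not part of Gremlin's tooling), assuming GREMLIN_API_TOKEN is already exported in your shell:

```bash
#!/usr/bin/env bash
# run_attack.sh: POST an attack definition file to the Gremlin API.
# Usage: ./run_attack.sh attacks/cpu.json
set -euo pipefail

curl -H "Content-Type: application/json" \
     -H "Authorization: $GREMLIN_API_TOKEN" \
     https://api.gremlin.com/v1/attacks/new \
     -d "@$1"
```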
On your local machine, start by creating the `attacks/cpu.json` file and paste the following JSON into it. This will attack a single core for 30 seconds.

```json
{
  "command": {
    "type": "cpu",
    "args": ["-c", "1", "-l", "30"]
  },
  "target": {
    "type": "Random"
  }
}
```

Create the new Attack by passing the JSON from `attacks/cpu.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

```bash
curl -H "Content-Type: application/json" -H "Authorization: $GREMLIN_API_TOKEN" https://api.gremlin.com/v1/attacks/new -d "@attacks/cpu.json"
```

On the targeted machine you'll see that one CPU core is maxed out.

```bash
htop
```

You can also create, run, and view the Attack on the Gremlin Web UI.
If you wish to attack a specific Agent, just change the `target: type` argument value to `"Exact"` and add the `target: exact` field with a list of target Agents. An Agent is identified on Gremlin by the `GREMLIN_IDENTIFIER` for the instance, which can also be specified in a local environment variable when running the `gremlin init` command.

```json
{
  "command": {
    "type": "cpu",
    "args": ["-c", "1", "-l", "30"]
  },
  "target": {
    "type": "Exact",
    "exact": ["aws-nginx"]
  }
}
```
Handling Storage Disk Limitations
Migrating to a new system frequently requires moving volumes across disks and to other cloud-based storage layers. It is vital to determine whether your new storage system can handle the increase in volume that the data migration will require. You will also want to test how the system reacts when volumes become overburdened or unavailable.
Performing a Disk Attack with Gremlin
Gremlin's Disk Attack rapidly consumes disk space on the targeted machine, allowing you to test the reliability of that machine and other related systems when unexpected disk failures occur.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A Gremlin API Disk Attack accepts the following arguments.
Short Flag | Long Flag | Purpose |
---|---|---|
-b | --block-size | The size (in kilobytes) of the blocks that are written. |
-d | --dir | The directory that temporary files will be written to. |
-l | --length | Attack duration (in seconds). |
-p | --percent | The percentage of the volume to fill. |
-w | --workers | The number of disk-write workers to run concurrently. |
On your local machine, start by creating the `attacks/disk.json` file and paste the following JSON into it. Be sure to change your target Agent. This attack will fill 95% of the volume over the course of a 60-second attack using 2 workers.

```json
{
  "command": {
    "type": "disk",
    "args": ["-d", "/tmp", "-l", "60", "-w", "2", "-b", "4", "-p", "95"]
  },
  "target": {
    "type": "Exact",
    "exact": ["aws-nginx"]
  }
}
```

(Optional) Check the current disk usage on the target machine.

```bash
df -H

# OUTPUT
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      8.3G  1.4G  6.9G  17% /
```

Create the new Disk Attack by passing the JSON from `attacks/disk.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

```bash
curl -H "Content-Type: application/json" -H "Authorization: $GREMLIN_API_TOKEN" https://api.gremlin.com/v1/attacks/new -d "@attacks/disk.json"
```

Check the attack target's current disk space, which will soon reach the specified percentage before Gremlin rolls back and returns the disk to its original state.

```bash
df -H

# OUTPUT
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      8.3G  7.9G  396M  96% /
```

You can also create, run, and view the Attack on the Gremlin Web UI.
Proper Memory Management
While most cloud platforms provide auto-balancing and scaling services, it is unwise to rely solely on these technologies and assume they alone will keep your system stable and responsive. Memory management is a crucial part of maintaining a healthy and inexpensive cloud stack. An improper configuration or poorly tested system may not necessarily cause a system failure or outage, but even a tiny memory issue can add up to thousands of dollars in extra support costs.
Performing Chaos Engineering before, during, and after cloud migration lets you test system failures when instances, containers, or nodes run out of memory. This testing helps you keep your stack active and functional when an unexpected memory leak occurs.
Performing a Memory Attack with Gremlin
A Gremlin Memory Attack consumes memory on the targeted machine, making it easy to test how that system and other dependencies behave when memory is unavailable.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A Gremlin API Memory Attack accepts the following arguments.
Short Flag | Long Flag | Purpose |
---|---|---|
-g | --gigabytes | The amount of memory (in GB) to allocate. |
-l | --length | Attack duration (in seconds). |
-m | --megabytes | The amount of memory (in MB) to allocate. |
(Optional) On the target machine check the current memory usage to establish a baseline prior to executing the attack.
```bash
htop
```

On your local machine, create an `attacks/memory.json` file and paste the following JSON into it, ensuring you change your target Agent. This attack will consume up to 0.75 GB of memory for a total of 30 seconds.

```json
{
  "command": {
    "type": "memory",
    "args": ["-l", "30", "-g", "0.75"]
  },
  "target": {
    "type": "Exact",
    "exact": ["aws-nginx"]
  }
}
```

Launch the Memory Attack by passing the JSON from `attacks/memory.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

```bash
curl -H "Content-Type: application/json" -H "Authorization: $GREMLIN_API_TOKEN" https://api.gremlin.com/v1/attacks/new -d "@attacks/memory.json"
```

That additional memory is now consumed on the target machine.

```bash
htop
```

As always, you can view the Attack within the Gremlin Web UI.
What Comes Next?
This article explored a number of common issues and outages related to failed migrations and upgrade procedures. As impactful and expensive as those outages can be, they should not dissuade you from making the move to the cloud. A distributed architecture allows you to enjoy faster release cycles and, in general, increased developer productivity.
Instead, the fact that even the biggest organizations in the industry run into migration issues illustrates the necessity of proper reliability testing. Chaos Engineering is a critical piece of that puzzle. Planning ahead and running Chaos Experiments on your systems, both prior to and during migration, will help ensure you are creating the most stable, robust, and reliable system possible.