Building more reliable financial systems with Chaos Engineering

The financial services industry has built in more capital buffers to prevent market shocks from bringing another economic collapse. In addition to these financial controls, many banks and personal trading platforms have begun building resilience against information technology shocks. Despite these precautions, outages still occur today, preventing customers from depositing and withdrawing their money, completing transactions, and executing trades during key events.

Additionally, banks need to innovate and modernize in order to improve the customer experience and defend against ever-decreasing switching costs. However, ensuring both reliability and innovation is particularly challenging in a world of heavy and increasing regulation, where many systems depend on legacy components that cannot be modernized. For example, UK and Australian regulators are cracking down on banks, requiring them to publish outage data and submit reports on how they will build more resilient systems. Legacy, mainframe-based systems become extra baggage when implementing changes to improve the user experience. Financial institutions must find a way to grow their competitive advantage without leaving reliability behind.

Chaos Engineering is the best way to test for reliability and support modernization in financial systems. It is the practice of running precise experiments on a system: injecting measured amounts of harm, observing how the system responds, and using what is learned to improve the system’s resilience. The failures injected are ones that services will inevitably face, and in the critical world of finance, we must proactively test our systems for weaknesses in order to ensure seamless operations.

We will discuss a few ways that applying these practices benefits financial institutions and improves their reliability.

Lower costs while migrating systems

Migrating applications and information systems to modern architectures like the cloud poses challenges for firms trying to remain available during the migration. Done properly, these migrations allow for drastic cost savings, higher resource utilization, and faster experimentation.

However, for many retail banks, personal trading platforms, and wealth management platforms, entire applications cannot be rewritten and migrated. This inevitably leaves these financial institutions with hybrid clouds or two parallel systems. While newer parts of their applications, like mobile trading platforms, migrate more easily, they remain reliant on legacy mainframe applications such as account systems.

Applying Chaos Engineering during the migration process allows firms to make the leap safely. Cloud tools like autoscaling increase uptime and reduce infrastructure spending, so that firms pay only for what is used. Testing how the system responds to resource constraints and system state failures allows developers to ensure that their autoscaling happens at the right time: too early costs money, and too late impacts customers. Developers can also simulate unreliable networks to see how systems respond to the inevitable degraded or lost connections to legacy or retiring systems. This proactive approach to testing for failure aligns with the need for precise engineering in banking systems to ensure that transaction and account information is consistent.
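The autoscaling timing question can be explored with a toy model before running a real experiment. The following is a minimal sketch in Python, not a real cloud or Gremlin API: the names `ScalingPolicy`, `run_load_experiment`, and `summarize`, the 90% saturation level, and the load figures are all assumptions chosen for illustration. It replays a load ramp against a threshold-based scale-out rule and counts how often the rule fires and how often customers would have hit a saturated system.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical threshold-based scale-out rule (illustrative only)."""
    scale_out_utilization: float   # utilization (0-1) that triggers one more instance
    capacity_per_instance: float   # requests/sec one instance handles at 100% utilization


def run_load_experiment(policy, load_steps, instances=1, saturation=0.9):
    """Replay a load ramp and record utilization and scaling actions per step."""
    events = []
    for step, load in enumerate(load_steps):
        utilization = load / (instances * policy.capacity_per_instance)
        # Assumption: customers notice degradation at or above `saturation`.
        customer_impact = utilization >= saturation
        scaled = utilization >= policy.scale_out_utilization
        if scaled:
            instances += 1  # new capacity is available from the next step
        events.append({
            "step": step,
            "load": load,
            "utilization": round(utilization, 2),
            "scaled": scaled,
            "instances_after": instances,
            "customer_impact": customer_impact,
        })
    return events


def summarize(events):
    """Reduce the event log to the numbers that matter for cost and impact."""
    return {
        "scale_outs": sum(e["scaled"] for e in events),
        "impacted_steps": sum(e["customer_impact"] for e in events),
        "final_instances": events[-1]["instances_after"],
    }


if __name__ == "__main__":
    ramp = [50, 80, 120, 160, 200]  # requests/sec, made-up values
    for threshold in (0.3, 0.7, 0.95):
        policy = ScalingPolicy(threshold, capacity_per_instance=100)
        print(threshold, summarize(run_load_experiment(policy, ramp)))
```

In this model a low threshold ends the ramp with more instances than needed (cost), while a high threshold leaves steps where utilization crosses the saturation level (customer impact). A real experiment would replace the model with actual resource pressure and observed scaling events.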

When it is time to move over completely to a new system, such as a new cloud deployment, dropping connections to the old system one service at a time uncovers leftover dependencies in an accelerated but safe way. This saves engineering resources without compromising uptime.
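The service-by-service approach can be sketched as a loop that cuts one service off from the old system, runs that service's health check, and restores the connection before moving on. This is an illustrative Python simulation only: `LegacyGateway`, `find_leftover_dependencies`, and the two example services are hypothetical names, and a real experiment would block traffic at the network layer rather than in application code.

```python
class LegacyUnavailable(Exception):
    """Raised when a service tries to reach the legacy system while blocked."""


class LegacyGateway:
    """Stand-in for the network path to the old system (hypothetical)."""

    def __init__(self):
        self.blocked = set()

    def call(self, service_name, operation):
        if service_name in self.blocked:
            raise LegacyUnavailable(f"{service_name} cannot reach legacy: {operation}")
        return f"legacy:{operation}"


def find_leftover_dependencies(services, gateway):
    """Block the legacy connection for one service at a time and record failures."""
    leftovers = []
    for name, health_check in services.items():
        gateway.blocked.add(name)
        try:
            health_check(gateway)
        except LegacyUnavailable:
            leftovers.append(name)        # this service still needs the old system
        finally:
            gateway.blocked.discard(name)  # always restore before the next service
    return leftovers


# Example services: one fully migrated, one with a forgotten legacy call.
def mobile_trading_check(gateway):
    return "ok"  # served entirely from the new deployment


def statements_check(gateway):
    return gateway.call("statements", "fetch_balance")


if __name__ == "__main__":
    gateway = LegacyGateway()
    services = {"mobile_trading": mobile_trading_check, "statements": statements_check}
    print(find_leftover_dependencies(services, gateway))
```

Restoring the connection in the `finally` clause keeps the blast radius to a single service at any moment, which is what makes the accelerated approach safe.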

Modernize the customer experience

Financial institutions are feeling competitive pressure from each other and from FinTechs, which are pushing the limits of customer experience while simultaneously cutting fees. Innovation is a competitive advantage in these markets, but a single poor experience can lead to lost customers. API-driven microservices architectures allow for rapid innovation by creating smaller services that can be modernized independently, but they come with added complexity. Complexity without controls leads to incidents and outages.

In order to mitigate incidents, one of the tenets of microservices architectures is that each service should be as decoupled as possible, so that it can fail independently and gracefully without impacting other services. Adding a news feed with analyst reports to a trading application creates an improved customer experience, but if that news feed fails to load, it should not prevent customers from buying and selling stocks during a market event. Likewise, an outage at an external service provider, such as a credit card processor, should not bring down our own systems.

Running network attacks on services can verify that the many services that make up our applications are decoupled. Start by adding latency to the news app, then increase it until traffic is fully blocked, monitoring throughout to make sure the trade workflow is not impacted. Add packet loss and blackhole traffic to the external payment processor and monitor for transactions being held up or lost. This ensures that the two systems can fail independently, allowing each system to be upgraded separately and at a much faster rate.
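The news feed experiment can be expressed as a small deterministic simulation. This is a sketch under stated assumptions, not a real attack tool: `NetworkFault`, `fetch_news`, `render_trade_screen`, the 200 ms timeout, and the 50 ms base latency are hypothetical values picked for illustration. It steps the injected latency upward, ends with a blackhole, and checks after each step that order entry is still available.

```python
from dataclasses import dataclass


class NewsFeedTimeout(Exception):
    """Raised when the news feed does not answer within the caller's timeout."""


@dataclass
class NetworkFault:
    """Simulated network attack on the news feed (illustrative only)."""
    added_latency_ms: int = 0
    blackhole: bool = False


def fetch_news(fault, timeout_ms=200, base_latency_ms=50):
    """Simulated news call: fails if blackholed or slower than the timeout."""
    if fault.blackhole or base_latency_ms + fault.added_latency_ms > timeout_ms:
        raise NewsFeedTimeout("news feed did not respond in time")
    return ["analyst report placeholder"]


def render_trade_screen(fault):
    """Trade workflow: the news panel is optional, order entry is not."""
    try:
        news = fetch_news(fault)
    except NewsFeedTimeout:
        news = []  # degrade gracefully: show an empty panel instead of failing
    return {"order_entry_enabled": True, "news": news}


def run_experiment(latency_steps_ms):
    """Escalate latency step by step, then blackhole; record what the user sees."""
    faults = [NetworkFault(added_latency_ms=ms) for ms in latency_steps_ms]
    faults.append(NetworkFault(blackhole=True))
    results = []
    for fault in faults:
        screen = render_trade_screen(fault)
        results.append({
            "fault": fault,
            "order_entry_enabled": screen["order_entry_enabled"],
            "news_shown": bool(screen["news"]),
        })
    return results


if __name__ == "__main__":
    for row in run_experiment([0, 100, 500]):
        print(row)
```

The property under test is that `order_entry_enabled` stays true at every step, including the blackhole. If the trade screen awaited the news call without a timeout and fallback, the same experiment would surface the coupling.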

Meet internal compliance requirements

The banking industry faces some of the strictest requirements around data availability and, recently, application availability. Risk management, which weighs the tradeoffs between risk and cost for every decision, is a critical function of any financial firm. Many firms use strategies like governance, risk, and compliance (GRC) to align business goals with these tradeoffs. Meeting service level agreements (SLAs) with customers can be difficult if the infrastructure our applications run on degrades and we haven’t prepared.

In order to ensure our applications meet the standards put in place and work as designed, it’s necessary to test against these standards frequently. For example, the IT systems of companies purchased during an acquisition should face similar scrutiny to ensure they meet internal compliance standards. Traditional QA testing isn’t enough to test for all of the failure modes that could break compliance.

Chaos Engineering provides a solution for upgrading system reliability testing. By shutting down instances or dropping connectivity to a region, firms can be sure that the redundancy they have in place is working properly and that data remains available at all times. Running periodic FireDrills to test incident management playbooks is the best way to update outdated playbooks and to train employees to improve the time it takes to recover from a business operation-impacting incident.
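A redundancy check of this kind can be sketched as: take one region down, confirm a read still succeeds, restore the region, and repeat for the next one. The Python below is a toy in-memory model, not a real database or cloud API; `Region`, `ReplicatedStore`, `region_shutdown_experiment`, and the account key are hypothetical names used only for illustration.

```python
class DataUnavailable(Exception):
    """Raised when no healthy region can serve the requested key."""


class Region:
    """One deployment region holding a copy of the data (toy model)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class ReplicatedStore:
    """Writes go to every healthy region; reads use the first healthy copy."""

    def __init__(self, regions):
        self.regions = regions

    def write(self, key, value):
        for region in self.regions:
            if region.up:
                region.data[key] = value

    def read(self, key):
        for region in self.regions:
            if region.up and key in region.data:
                return region.data[key]
        raise DataUnavailable(key)


def region_shutdown_experiment(store, key):
    """Shut down each region in turn and record whether the key stays readable."""
    availability = {}
    for region in store.regions:
        region.up = False
        try:
            store.read(key)
            availability[region.name] = True
        except DataUnavailable:
            availability[region.name] = False
        finally:
            region.up = True  # restore before testing the next region
    return availability


if __name__ == "__main__":
    store = ReplicatedStore([Region("region-a"), Region("region-b")])
    store.write("acct-1", 100)
    print(region_shutdown_experiment(store, "acct-1"))
```

A result of `False` for any region marks a single point of failure: losing that region alone makes the data unavailable.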

Conclusion

By proactively running experiments, financial institutions can be prepared for any situation and ensure that their systems, tools, and processes are designed and implemented properly to prevent or minimize outages. Systems designed but untested do not lead to reliability improvements.

These are just a few examples of the ways Chaos Engineering can help financial firms grow while lowering costs and remaining reliable. Our white paper goes into more depth and specificity on which systems to test and which chaos experiments to run.

When running chaos experiments in critical applications, such as core banking systems, it is important to be safe and secure. We built Gremlin to be the most comprehensive, simple, safe, and secure Chaos Engineering platform. This allows developers at financial institutions to run more experiments and improve the reliability of their systems faster, leaving time for innovation.

