Estimated difficulty: 💜💜💜💜🤍
Happy Friday, everyone! This week, I’m introducing something I’ve been interested in for a while – chaos engineering. If you haven’t heard of chaos engineering before, don’t worry – a lot of people haven’t, and it’s still quite new as a discipline. I first heard of chaos engineering at the AWS Summit in 2019, when Adrian Hornsby (Senior Technical Evangelist at Amazon Web Services) delivered a talk called Creating Resiliency Through Destruction. I’ll link the talk and some other resources at the bottom of this post if you’d like to check it out or read a bit more.
What is chaos engineering and why do we do it?
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” – From Principles of Chaos Engineering.
We rely on microservices and distributed software systems – and on Infrastructure as a Service, Platform as a Service, and Software as a Service (IaaS, PaaS, SaaS) – more than ever. The advent and adoption of modern public cloud infrastructure makes it easier than ever to deploy a large-scale, global enterprise network in no time at all – and that’s pretty amazing.
In our current threat landscape, one pressing concern we have is operational resilience. What sort of demand can our systems withstand? Do we have the capability to withstand terabyte-scale DDoS attacks? What happens to our ability to digitally service our customers and users, if there’s a large-scale failure of infrastructure hosted by our cloud provider?
In information security, we try to manage and mitigate the risks associated with loss of confidentiality, integrity, or availability of information systems and data. There are plenty of sub-disciplines and activities concerned with providing assurance of your security posture, ranging from audits to penetration tests to more comprehensive red team exercises that include things like social engineering campaigns and simulate real-world techniques used by adversaries. If you’re a security person, you can think of chaos engineering as something like a penetration test designed to test the availability of your systems instead of their overall security. If you’re an operations or other IT person, you can think of it as extreme stress testing of your systems. It’s a bit like purple teaming, but for operational uptime and availability instead of security.
The aim of the game is basically to try to break your systems in production, so that you can continuously re-architect and re-build applications to be more resilient, and ensure that in the event of a genuine service outage, your customers or users experience as little impact as possible.
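The shape of a chaos experiment is worth spelling out: verify a steady-state hypothesis about the system, inject a failure, then check whether the steady state survives. Here’s a minimal sketch in Python – the names (`steady_state`, `kill_one_replica`, the metrics dict) are all invented for illustration, not part of any real chaos tool:

```python
def steady_state(metrics):
    """Hypothetical steady-state check: are enough requests succeeding?"""
    return metrics["success_rate"] >= 0.99

def run_experiment(inject_failure, get_metrics):
    """Minimal chaos-experiment loop: confirm the steady state, inject a
    failure, then check whether the steady state still holds."""
    assert steady_state(get_metrics()), "system unhealthy before experiment"
    inject_failure()
    return steady_state(get_metrics())  # False means we found a weakness

# Toy stand-ins for a real deployment:
state = {"success_rate": 1.0}

def kill_one_replica():
    # Simulate losing a little capacity; a resilient system barely notices.
    state["success_rate"] -= 0.005

result = run_experiment(kill_one_replica, lambda: dict(state))
print("system withstood the failure:", result)
```

In a real experiment the metrics would come from production monitoring and the failure injection would be a tool like Chaos Monkey, but the loop itself stays this simple.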
Where does chaos engineering come from?
Werner Vogels, the CTO at AWS, says that “Everything fails, all the time,” and you should architect your systems accordingly. Chaos engineering was pioneered during the migration of Netflix services to Amazon Web Services in 2011. Greg Orzell (you can find him on GitHub here) had the idea to design a tool that would deliberately simulate failure in production to force developers and architects to build resilient systems by default. This is where the Simian Army tool suite – and specifically, Chaos Monkey (more on these below) – were created.
A little about AWS, in case you’re not familiar…
AWS was the largest public cloud provider in 2020, with around 31% market share, followed by Microsoft Azure at around 20%. Some organisations will choose a single cloud provider; others will deploy a disaster recovery environment with a secondary provider, or design a hybrid cloud solution that integrates multiple providers and potentially on-premises equipment.
AWS has a pretty sizeable catalogue of services, and it’s growing all the time. The in-built security capabilities (once you get your head around the differences between contemporary security tools and AWS) are immense. You can architect your cloud environment for “High Availability”, set up cross-region replication of content, set up autoscaling groups to help your EC2 instances (basically virtualised servers) cope with either predictable or sudden spikes in demand, and even use the AWS content delivery network (CloudFront) to serve premium or subscription content to specific users. They also have things like DDoS protection included by default – AWS Shield, which in February 2020 mitigated a 2.3 Tbps DDoS attack.
In earlier days, a lot of the security features that AWS now offer were either nonexistent or less mature, and they’ve developed over time. Simian Army provided mechanisms for monitoring security posture and alerting (a need since negated by the likes of Amazon Inspector and GuardDuty), as well as Chaos Monkey. If you check out the README and release notes on the Simian Army GitHub, Netflix stopped updating it at the end of 2018 – by which point the AWS-native security and monitoring tools had made some elements of Simian Army redundant – and split Chaos Monkey (et al.) out into a separate repository.
So how do we do this?
Looking at a few specific tools used by Netflix: there’s Chaos Monkey (GitHub repository here), Chaos Gorilla, and Chaos Kong. Chaos Monkey randomly terminates an EC2 instance, Chaos Gorilla simulates an availability zone failure, and Chaos Kong simulates a region failure. (Side note: the Netflix tools are cloud agnostic, so they should work on any Spinnaker-managed cloud deployment.) The latter two aren’t available on GitHub, probably because they could be quite damaging if used by a malicious insider against infrastructure not architected well enough to recover. You can see the basic principles by checking out the Chaos Monkey repository above.
I have a second part to this blog currently in the works, where I’ll demonstrate how Chaos Monkey works.
Who else does this?
There are other organisations who use chaos engineering to improve the resiliency of their infrastructure – Facebook’s Project Storm being one example. Facebook, you may be aware, use entirely proprietary technology, so there isn’t a great deal of detail readily available about their chaos engineering tooling, but there are a few interesting articles about resiliency at Facebook – like Meet Project Storm, Facebook’s SWAT team for disaster-proofing data centers.
From what I’ve read so far, the vast majority of organisations using chaos engineering principles are technology and media companies. I’m mostly writing this because I think chaos engineering is really cool, but there are definitely lessons other sectors and organisations can take from it. The advent of modern public cloud infrastructure, our increasing reliance on material providers of these services, and growing scrutiny of operational resilience – in an evolving threat landscape where critical national infrastructure is commonly targeted by threat actors – all highlight the need to adapt our systems to weather any storm. There are really cool examples of this in the wild, and I’d love to see chaos engineering adopted more widely outside the technology and media industries.
Cool things I’ve read/watched related to chaos engineering:
This GitHub repo has an abundance of resources on chaos engineering (seriously, it’s brilliant), if you exhaust the above and want to read/watch more or try it out yourself.
As always, thank you for reading! If you have any questions, or read anything cool on chaos engineering that you’d like to share, please let me know in the comments.