Development and staging environments are usually isolated from real-world conditions, which makes it difficult to predict and prepare for the issues that can occur in production. Addressing this challenge was critical for Netflix as it worked to make its service more scalable and reliable for customers around the world. To solve it, Netflix pioneered chaos engineering, a method of deliberately introducing random failures into any part of a system to test and improve its resilience. The key principle behind this approach is that it is better to find and fix issues in a controlled way than to wait for them to occur unexpectedly in production, which could lead to catastrophic failures.
But do we really need chaos engineering? The answer is undeniably yes. As technology becomes more complex and interdependent, the scope and potential for failures increase. Chaos engineering allows you to proactively identify and fix vulnerabilities before they cause problems for your users. It also helps build confidence in your systems by knowing that they can withstand unexpected failures.
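The core idea, injecting random failures and checking whether the system copes, can be illustrated in a few lines of Python. This is only a toy sketch, not how any of the platforms below work internally: `fetch_title`, the retry helper, and the failure rates are all hypothetical stand-ins chosen for the example.

```python
import random


class InjectedFailure(Exception):
    """Raised by the chaos wrapper in place of a real outage."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail before reaching it."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """The resilience mechanism under test: retry, then give up with None."""
    for _ in range(attempts):
        try:
            return func()
        except InjectedFailure:
            continue
    return None


def fetch_title():
    """Hypothetical dependency standing in for a real service call."""
    return "title-123"


def run_experiment(failure_rate, attempts, trials=1000, seed=42):
    """Return the fraction of requests that succeed despite injected failures."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_chaos(fetch_title, failure_rate, rng)
    successes = sum(
        call_with_retry(flaky, attempts) is not None for _ in range(trials)
    )
    return successes / trials


if __name__ == "__main__":
    for attempts in (1, 3):
        rate = run_experiment(failure_rate=0.3, attempts=attempts)
        print(f"attempts={attempts}: success rate {rate:.1%}")
```

Comparing the success rate with one attempt against the rate with three shows, in miniature, what a chaos experiment is meant to reveal: whether the safeguards you rely on actually hold up when a dependency fails.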
In this roundup, you'll learn about five popular chaos engineering platforms: Litmus, Chaos Mesh, Gremlin, ChaosBlade, and Chaos Monkey. The chosen platforms are widely used and have a strong presence in the chaos engineering community. They also offer a wide range of features that cover many use cases. For instance, some of these platforms can be used across different environments, such as cloud-native, on-premises, and hybrid. Most of these platforms are open source, and some are also available as managed, commercial offerings. This means that companies of different shapes and sizes can use these tools according to their means.
LitmusChaos
LitmusChaos is an open source chaos engineering platform for Kubernetes. This tool was built by MayaData, a company that offered specialized storage solutions for cloud-native environments. It's currently a CNCF incubating project that's been adopted across many organizations.
Litmus works by running chaos experiments in a Kubernetes cluster. A chaos experiment is a fundamental unit in LitmusChaos architecture. An experiment allows you to inject failures at different system layers, including storage, network, compute, and so on. You can find many common experiments in the ChaosHub experiment registry.
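To make this concrete, here is a rough sketch of what triggering an experiment can look like. In Litmus, an experiment is typically bound to a target application through a ChaosEngine resource. The Python below only builds such a manifest and prints it as JSON. The field names follow the ChaosEngine examples in the Litmus documentation as best recalled, so verify them against the version you install; the engine name, the `app=nginx` label, and the `pod-delete-sa` service account are hypothetical.

```python
import json


def build_chaos_engine(name, namespace, app_label, experiment,
                       duration_seconds, service_account="pod-delete-sa"):
    """Build a ChaosEngine manifest as a plain dict (assumed v1alpha1 schema)."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Which workload the experiment targets.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Hypothetical service account; it must exist in the cluster.
            "chaosServiceAccount": service_account,
            "experiments": [
                {
                    "name": experiment,
                    "spec": {
                        "components": {
                            # Experiment tunables are passed as string env values.
                            "env": [
                                {
                                    "name": "TOTAL_CHAOS_DURATION",
                                    "value": str(duration_seconds),
                                }
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    engine = build_chaos_engine(
        name="nginx-chaos",
        namespace="default",
        app_label="app=nginx",
        experiment="pod-delete",
        duration_seconds=30,
    )
    print(json.dumps(engine, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed output could be piped into `kubectl apply -f -` on a cluster where Litmus and the referenced experiment are already installed.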
One of Litmus's key features is its comprehensive set of APIs for automating chaos workflows, which allows you to easily integrate chaos testing into your continuous integration and deployment pipelines. Litmus also has a user-friendly web UI for creating and managing chaos experiments.
To learn more about LitmusChaos, head to the official documentation. You can also check out the official Litmus YouTube channel for tutorials.
Chaos Mesh
Like LitmusChaos, Chaos Mesh is also a cloud-native open source chaos engineering platform. It was created by PingCAP, the company behind the popular distributed HTAP database called TiDB. Currently, it's an incubating project at the Cloud Native Computing Foundation (CNCF).
PingCAP's experiences from building TiDB, a scalable, distributed database that handles both transactional and analytical workloads, heavily influenced the creation of Chaos Mesh. You can use Chaos Mesh to inject three broad fault categories: basic resource (e.g., DNSChaos), platform (e.g., AWSChaos), and application-layer (e.g., JVMChaos).
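Each fault type in Chaos Mesh is declared as a Kubernetes custom resource. As an illustration of the basic resource category, the sketch below builds a DNSChaos manifest in Python and prints it as JSON. The `action`, `mode`, `patterns`, and `selector` fields reflect the DNSChaos examples in the Chaos Mesh documentation as best recalled, so check them against your installed version; the resource name, namespaces, and domain pattern are hypothetical.

```python
import json

# Actions assumed from the Chaos Mesh docs: return an error for matching
# lookups, or resolve them to a random IP address.
ALLOWED_ACTIONS = ("error", "random")


def build_dns_chaos(name, namespace, target_namespace, patterns, action="error"):
    """Build a DNSChaos manifest as a plain dict (assumed v1alpha1 schema)."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action must be one of {ALLOWED_ACTIONS}, got {action!r}")
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "DNSChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": action,
            # "all" applies the fault to every pod matched by the selector.
            "mode": "all",
            # Domain patterns whose lookups should be disrupted.
            "patterns": list(patterns),
            "selector": {"namespaces": [target_namespace]},
        },
    }


if __name__ == "__main__":
    manifest = build_dns_chaos(
        name="dns-error-example",
        namespace="chaos-mesh",
        target_namespace="demo-app",
        patterns=["example.com"],
    )
    print(json.dumps(manifest, indent=2))
```

Keeping the fault definition declarative like this is what makes it straightforward to version experiments alongside application manifests and to reuse them in pipelines.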
You can create full-fledged workflows that run multiple chaos experiments and manage individual experiments and whole workflows visually using the Chaos dashboard. Chaos Mesh also integrates with GitHub Actions using chaos-mesh-action, allowing you to integrate Chaos Mesh within your customized continuous integration workflow.
There's much more to Chaos Mesh. You can learn all about it by going through the official documentation or the official blog.
Gremlin
Gremlin is a managed chaos engineering platform that lets you inject failures at various layers of your system in any kind of environment, including cloud, on-premises, and hybrid. Gremlin Fault Injection (FI), its popular fault injection library, lets you inject controlled failures into application services, hosts, and containers.
Unlike many other tools, especially open source ones, Gremlin offers an end-to-end platform to proactively improve the reliability of your system. Gremlin uses many different methods, such as golden signals, reliability tests, and reliability scores, in the pursuit of better reliability. Gremlin also lets you organize events like GameDays, which are based on the principles of chaos engineering, enabling you to work towards a more reliable system.
Gremlin offers a Python SDK, which is currently in the alpha testing phase. To read more about Gremlin, please visit the official documentation. Gremlin also offers a rich repository of community tutorials, such as the tutorial on how to run a GameDay using Gremlin.
ChaosBlade
ChaosBlade is Alibaba's take on chaos engineering. It's an open source chaos engineering toolkit that, much like Chaos Mesh and LitmusChaos, allows you to run chaos experiments and inject failures at different levels of a software system. ChaosBlade's chaos experiments are broadly divided into three categories: physical host (CPU, disk, and network); Kubernetes (pod, node, and container); and Java (JVM, Java agent, and so on).
ChaosBlade, which is a Sandbox project at CNCF, also offers integration with other tools like LitmusChaos and Chaos Mesh via its ChaosBlade-Box platform. Integration with different chaos engineering tools allows you to access more chaos experiments, recipes, and functionality. This is a big plus for ChaosBlade.
To learn more about ChaosBlade, please visit the official documentation or the official blog.
Chaos Monkey
Chaos Monkey is the tool that started it all. Netflix built it in the early days of its move to AWS and open sourced it in 2012. Its purpose drew wider attention after Christmas Eve 2012, when a large number of Netflix customers in a particular geographical region faced significant degradation of the viewing experience. Something outside of Netflix's control, something they'd not planned for, had happened: according to the post-incident summary, "data was deleted by a maintenance process that was inadvertently run against the production ELB state data." Incidents like this one made the case for deliberately testing against failure, and the practice that grew up around Chaos Monkey became a new domain of engineering called chaos engineering.
When Chaos Monkey was first released within Netflix, it wasn't appreciated much: "Netflix lore says that this was not instantly popular. There was a short period of time when ICs grumbled about Chaos Monkey. But it seemed to work, so more and more teams eventually adopted it." Now, not only has it become a full-fledged open source project, but it has also inspired several other companies to build their own chaos engineering solutions and platforms.
Over the years, it has led teams within Netflix to create several other related tools to tackle the problem of resilience. Unsurprisingly, many of these tools were initially built with AWS in mind. Since then, Chaos Monkey has inspired many other tools, including those discussed earlier in this article, that cater to different deployment environments and systems.
Netflix's approach to chaos engineering has evolved over the years. Two engineers from Netflix gave a talk at AWS re:Invent 2022 about that journey. In addition to Chaos Monkey, Netflix also uses Kayenta for canary analysis, Zipkin for tracing, and Envoy for fault injection.
To learn more about Chaos Monkey, you can read Netflix's official blog on Medium.
Conclusion
To reiterate, chaos engineering tests the resilience of your systems by purposefully injecting failures and disturbances of certain types. This methodology can be used to identify and mitigate potential shortcomings in a system before they show up in a real-world scenario. These shortcomings could be related to the system's capacity, security, quality, or any number of other properties that define it. Chaos engineering attempts to minimize the impact of such occurrences.
This roundup looked at five popular chaos engineering platforms: Litmus, Chaos Mesh, Gremlin, ChaosBlade, and Chaos Monkey. Each of these platforms has its own unique features and is suitable for different use cases. Whether you are working with Kubernetes, cloud-native systems, or large-scale distributed systems, a chaos engineering platform can help you make your systems more reliable by mimicking the unpredictability of the real world and putting them to the test.