Chaos Engineering with Chaos Mesh and vCluster: Testing Close to Production

Jubril Oyetunji

January 10, 2025

12 Minute Read

This is some text inside of a div block.

Pioneered at Netflix over a decade ago, chaos engineering is a term used to describe the organized and intentional introduction of failures into systems in pre-production or production to understand their outcomes better and plan effectively for these failures.

In a previous articles we have highlighted tools which can help you perform chaos engineering. In this article, we will go one step further and take a look at how vCluster can help you test as close to production by using Litmus.

The Importance of Testing Close to Production

Before diving deeper into the technicalities, it's essential to understand why testing close to production is critical. Traditional pre-production environments often require setting up a separate staging area, deploying the application, and running various tests such as load testing or failover simulations.

In Kubernetes environments, this typically means spinning up a separate cluster, often with limited resources to cut costs. While this setup is practical, it rarely mirrors the true production environment. As Netflix put it, "Learn with real scale, not toy models."

This is where vCluster shines. Instead of spinning up a separate environment, vCluster lets you leverage underlying host resources. You can create virtual Kubernetes clusters within a single physical cluster. This enables you to test applications at a near-production scale without the overhead of managing a full cluster.

For operations teams, this means monitoring whether Horizontal Pod Autoscaler (HPA) rules kick in as expected or validating the effectiveness of network policies. For developers, it provides insights into application behavior under real-world scenarios.

What is Chaos Mesh?

Chaos Mesh is an end-to-end chaos engineering platform for cloud-native infrastructure and applications. It is a CNCF incubating project and aims to help both developers and SREs gain meaningful insights from their applications from their applications.

Prerequisites

This tutorial assumes some familiarity with Kubernetes; additionally, you will need the following installed locally to follow along.

Create a virtual cluster

Begin by creating a virtual cluster. For this demonstration, let's consider that the infrastructure team is starting to introduce chaos mesh into your pre-prod cluster. Later in this tutorial, you will use Horizontal Pod Autoscaler (HPA). Because of this, you need to explicitly allow vCluster to sync metrics from the host cluster into your virtual cluster.

Create a custom vCluster config:

cat <<EOF > values.yaml
integrations:
  metricsServer:
    enabled: true
    nodes: true
    pods: true
EOF

Create a virtual cluster:

vcluster create pre-prod -n pre-prod -f values.yaml

Output is similar to:

11:48:27 warn There is a newer version of vcluster: v0.21.1. Run `vcluster upgrade` to upgrade to the newest version. 11:48:27 info Creating namespace pre-prod 11:48:27 info Detected local kubernetes cluster minikube. Will deploy vcluster with a NodePort & sync real nodes 11:48:27 info Create vcluster pre-prod... 11:48:27 info execute command: helm upgrade pre-prod /var/folders/gw/gd4m32rs5k5chjbf761w3zrw0000gn/T/vcluster-0.20.1.tgz-1751275844 --create-namespace --kubeconfig /var/folders/gw/gd4m32rs5k5chjbf761w3zrw0000gn/T/689082436 --namespace pre-prod --install --repository-config='' --values /var/folders/gw/gd4m32rs5k5chjbf761w3zrw0000gn/T/754764801 --values values.yaml 11:48:28 done Successfully created virtual cluster pre-prod in namespace pre-prod 11:48:30 info Waiting for vcluster to come up... 11:48:50 done vCluster is up and running 11:48:50 info Starting proxy container... 11:48:52 done Switched active kube context to vcluster_pre-prod_pre-prod_minikube - Use `vcluster disconnect` to return to your previous kube context - Use `kubectl get namespaces` to access the vcluster

Install Chaos Mesh

With a virtual cluster created, the next step is to install chaos mesh inside your virtual cluster. You can do this by using the helm chart:

helm repo add chaos-mesh https://charts.chaos-mesh.org

helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-mesh --set chaosDaemon.runtime=docker --set chaosDaemon.socketPath=/var/run/docker.sock --version 2.7.0 --create-namespace

Chaos Mesh requires access to the container runtime socket, which varies depending on your Kubernetes cluster. This tutorial was written using Minikube, where the runtime is Docker, and the socket path is /var/run/docker.sock.

Most Kubernetes clusters use Containerd by default. For such environments, you should use:

helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh \  
  --set chaosDaemon.runtime=containerd \  
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \  
  --version 2.7.0

For clusters like K3s, MicroK8s, or those using CRI-O, refer to this section of the documentation.

Verify your installation

kubectl get po -n chaos-mesh

Output is similar to:

NAME                                        READY   STATUS    RESTARTS        AGE
chaos-controller-manager-867f657d95-8pcx6   1/1     Running   0               13m
chaos-controller-manager-867f657d95-d4hbs   1/1     Running   0               13m
chaos-controller-manager-867f657d95-h9fwv   1/1     Running   0               13m
chaos-daemon-2rgcj                          1/1     Running   0               13m
chaos-dashboard-7c66c9f685-h9sq6            1/1     Running   0               13m
chaos-dns-server-69dd8595bf-zsm2l           1/1     Running   0               13m

Your First Chaos Experiment

With Chaos Mesh installed, you’re almost ready to write your first experiment. Before diving in, let’s clarify some terminology. In Chaos Mesh, an experiment refers to a test designed to introduce controlled failures into your system. Experiments are divided into two categories:

Kubernetes-level experiments: These target resources like pods, deployments, or services within your Kubernetes cluster.
Physical node-level experiments: These focus on the underlying infrastructure, such as CPU stress or network disruptions on the nodes themselves.

In this tutorial, you’ll run two experiments—one targeting Kubernetes resources and another targeting a physical node. Each experiment will be created using a Kubernetes Custom Resource (CR) based on Chaos Mesh’s Custom Resource Definitions (CRDs).

PodChaos

One of the easiest experiments to write is PodChaos. This experiment randomly terminates a pod at a predefined interval, allowing you to simulate unexpected failures in your application. This can be particularly useful when testing consensus-based applications such as databases, message brokers, or distributed storage systems.

By using the PodChaos experiment in a virtual cluster, you can evaluate how well your application maintains quorum, the minimum number of nodes required to make progress. For example, you can test if your application continues to handle reads and writes effectively when a pod is unexpectedly terminated.

Begin by creating new deployment:

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whoami
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami
    spec:
      containers:
        - name: whoami
          image: traefik/whoami
          ports:
            - containerPort: 80 
EOF

This will create a new deployment called whoami in the default namespace. Verify all pods are running correctly using:

kubectl get pods

Output is similar to:

whoami-76c9859cfc-lsxkm   1/1     Running             0          4s

Create a PodChaos experiment:

Next, apply the following manifest to create a PodChaos experiment:

kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: default
spec:
  action: pod-failure
  mode: one
  duration: '120s'
  selector:
    labelSelectors:
      app: whoami
EOF

The manifest above defines a PodChaos experiment that runs for two minutes(120s) which is specified using the duration field, the mode is also set to one which will terminate one pod at a time for the duration of the experiment. Finally using selectors, specify the target, which are pods with the label app=whoami .

You can immediately verify the experiment is active by running:


kubectl get pods -w

Adding the -w flag will watch the namespace for any events, such as pod restarts. Output is similar to:

‎Slowly but surely you should see pods start to get restarted which means the test is active.

Accessing Test Results

While you could view the test was performed through the CLI, this is by no means ideal. Chaos Mesh provides a dashboard for viewing and analyzing all test results.

To access the dashboard, you first need to generate credentials. The following manifest creates a ServiceAccount, a ClusterRole with the required permissions, and a ClusterRoleBinding to link the two. Apply the manifest using the command below:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: account-cluster-manager
  namespace: default

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: role-cluster-manager
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["chaos-mesh.org"]
  resources: ["*"]
  verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: bind-cluster-manager
subjects:
- kind: ServiceAccount
  name: account-cluster-manager
  namespace: default
roleRef:
  kind: ClusterRole
  name: role-cluster-manager
  apiGroup: rbac.authorization.k8s.io
EOF

Generating a Token

Once the manifest is applied, generate a token for the ServiceAccount using the following command:

kubectl create token account-cluster-manager -n default

This should output a token for the service account you just created. Save this to a secure location and run the following command to expose the dashboard:

kubectl port-forward svc/chaos-dashboard -n chaos-mesh 2333:2333

Navigating to http://localhost:2333 , should return a page like this:

Enter the credentials you just created, and you will be logged in.

StressChaos

The StressChaos CRD allows you to stress-test your application’s CPU or memory, simulating resource-intensive conditions. This is particularly useful for verifying metric-based scaling mechanisms, such as Kubernetes Horizontal Pod Autoscalers (HPA), to ensure they trigger under the correct conditions

Verify or Install Metrics Server

The metrics-server component is critical for collecting resource metrics like CPU and memory usage, which are required for HPA to function. To check if it’s installed, run the following command on your host cluster:

kubectl get deployment metrics-server -n kube-system

If the deployment is not found on your host cluster, run the following commands after disconnecting from your virtual cluster:

vcluster disconnect

Install metrics-server:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
rules:
- apiGroups: [""]
  resources: ["nodes/metrics", "pods", "nodes"]
  verbs: ["get", "list", "watch"]

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      labels:
        k8s-app: metrics-server
    spec:
      serviceAccountName: metrics-server
      containers:
      - name: metrics-server
        image: registry.k8s.io/metrics-server/metrics-server:v0.7.2
        args:
        - --cert-dir=/tmp
        - --secure-port=10250
        - --kubelet-insecure-tls
        - --metric-resolution=15s
EOF

Reconnect to your virtual cluster:

vcluster connect pre-prod

Configure Horizontal Pod Autoscaler

Next, apply a HPA to scale the whoami deployment based on memory usage. The following manifest scales the pods between 1 and 8 replicas when the average memory usage exceeds 300Mi:

kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: whoami-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: whoami
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 300Mi
EOF

Apply the StressChaos Experiment

Finally, create a StressChaos experiment to simulate memory pressure:

kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: mem-stress
  namespace: default
spec:
  mode: all
  selector:
    namespaces:
    - default
  stressors:
    memory:
      workers: 3
      size: 600MiB
      options: []  # Use an empty list for options
  duration: '120s'
EOF

The manifest above defines a StressChaos experiment that runs for 2 minutes (120s), as specified by the duration field. This experiment is configured in mode: all, meaning it will apply the stress test to all pods within the selected namespace(s).

Using the selector field, the experiment targets all pods in the default namespace. The stressors section specifies a memory stress test with the following configuration:

Workers: 3 — Three concurrent workers will be deployed to simulate the memory load.
Size: 600MiB — Each worker will attempt to allocate 600 MiB of memory.

As soon as you apply the manifest, you should see the HPA kick in. Verify this using:

kubectl get pods

Output is similar to:

You can also check on the HPA directly:

kubectl get hpa

Output is similar to:


NAME         REFERENCE           TARGETS                   MINPODS   MAXPODS   REPLICAS   AGE
whoami-hpa   Deployment/whoami   memory: 639655936/300Mi   1         8         1          9h

Finally, you can take a look at more detailed results of the experiments by navigating to http://localhost:2333/#/experiments , be sure to expose the dashboard if you haven’t already:

kubectl port-forward svc/chaos-dashboard -n chaos-mesh 2333:2333

Select the mem-stress experiment, and you will be greeted with a more detailed view of the experiment:‎

‎This will show information such as the duration, namespace . Kind of experiment that was run ,events that occurred during the experiment and the manifest used. For the memory stress test, it displays what namespace it was run in, along with the number of workers(6 in this case) how much memory each worker used(600MiB) and the duration 1200

Conclusion

Chaos Engineering forces us to embrace failures and treat them as inevitable instead of hoping for the best outcome. With vCluster, teams can take this further and perform these tests as they would occur in production with no additional infrastructure costs.

While this blog covered only two experiments, Chaos Mesh has an extensive library of tests that you can leverage. Here are some suggestions for more chaos engineering content.

Don’t forget to check out the vCluster documentation for more details! Join our Slack here.

vCluster