Chaos Mesh — Chaos Engineering & Resilience Testing

Chaos Mesh lets you deliberately inject failures into your cluster — kill pods, delay network traffic, fill disks, crash nodes — to verify your system actually recovers the way you designed it to.


Why Chaos Engineering

"Our system is resilient" means nothing without proof.

Without chaos testing:
→ You discover failure modes in production
→ Customers experience the downtime

With chaos testing:
→ You discover failure modes in controlled experiments
→ You fix the gaps before they become incidents

What You Can Inject

Experiment Type   Example
---------------   -------
PodChaos          Kill random pods, crash containers
NetworkChaos      Add 200ms latency, 10% packet loss, partition nodes
StressChaos       CPU spike, memory pressure on a node
IOChaos           Disk read/write delays
NodeChaos         Shut down a node (requires privileges)
TimeChaos         Skew the clock on a pod
HTTPChaos         Inject 500 errors, delay responses

Install Chaos Mesh

helm repo add chaos-mesh https://charts.chaos-mesh.org

helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/k3s/containerd/containerd.sock
# Socket path above is for k3s; stock containerd usually lives at
# /run/containerd/containerd.sock

Wait for pods:

kubectl get pods -n chaos-mesh

Access the Dashboard

kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333

Open: http://localhost:2333

Log in on first launch — the dashboard prompts for a Kubernetes service account token.
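The login token comes from a service account with RBAC permissions on Chaos Mesh resources. A minimal read-only sketch — the account name, namespace scope, and rule list here are assumptions, widen them for your cluster:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-dashboard-viewer    # hypothetical name
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-viewer
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-viewer-binding
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-viewer
subjects:
  - kind: ServiceAccount
    name: chaos-dashboard-viewer
    namespace: default
```

On Kubernetes 1.24+, `kubectl create token chaos-dashboard-viewer -n default` prints a token you can paste into the login form.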


Experiment 1 — Kill Random Pod

Test that a deployment recovers automatically when a pod dies:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-nginx-pod
  namespace: default
spec:
  action: pod-kill
  mode: one                  # kill one pod at a time
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "nginx"
  scheduler:
    cron: "@every 2m"        # kill a pod every 2 minutes

kubectl apply -f pod-kill.yaml
kubectl get pods -n default --watch
# → Pod killed → k8s creates replacement → Ready

Pass condition: the pod is replaced within 30 seconds and the service never goes down.
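That pass condition only holds if the target has redundancy. A minimal sketch of a deployment that can survive the kill — the name, image, and port are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 3                 # redundancy: losing one pod never drops the service
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx            # matches the chaos labelSelector
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          ports:
            - containerPort: 80
          readinessProbe:     # keeps not-yet-ready replacements out of the Service
            httpGet:
              path: /
              port: 80
            periodSeconds: 5
```

With a single replica and no readiness probe, the experiment will "fail" every time — which is exactly the kind of gap chaos testing is meant to surface.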


Experiment 2 — Network Latency Between Nodes

Simulate a slow network link between fast-skunk and set-hog:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: node-latency
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    nodes:
      - fast-skunk
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "25"
  direction: to
  target:
    selector:
      nodes:
        - set-hog
    mode: all
  duration: "5m"

What to observe: Does ArgoCD sync slow down? Does Prometheus miss scrapes? Do your apps time out?
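To quantify the injected delay rather than eyeball it, you can ping across the affected link and compare the average RTT before and during the experiment. A sketch — the probe pod name and target IP are assumptions:

```shell
#!/usr/bin/env bash
# Extract the average RTT (in ms) from a Linux ping summary line, e.g.:
#   rtt min/avg/max/mdev = 0.210/200.437/401.102/115.300 ms
avg_rtt() {
  awk -F'/' '/rtt min\/avg\/max/ { print $5 }'
}

# In-cluster usage (hypothetical: "ping-probe" is a pod pinned to fast-skunk,
# TARGET_IP is a pod IP on set-hog); run before and during the chaos:
#   kubectl exec -n default ping-probe -- ping -c 10 "$TARGET_IP" | avg_rtt
```

With the 200ms ± 50ms delay active, the average should land roughly 200ms above the baseline.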


Experiment 3 — Memory Pressure

Simulate a pod consuming too much memory:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
  namespace: default
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "my-app"
  stressors:
    memory:
      workers: 2
      size: "4GiB"
  duration: "3m"

What to observe: Does the OOM killer kick in? Is the pod restarted? Do resource limits protect the other workloads on the node?
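Limits are what keep the blast contained: with a memory limit set, the kernel OOM-kills only the offending container instead of starving the node. A sketch of the container-level `resources` stanza — the values are assumptions, size them for your workload:

```yaml
# Inside the target container spec (e.g. the "my-app" deployment):
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"    # StressChaos pushing past this triggers an OOMKill
```

Without a limit, the 4GiB stressor competes with every other pod on the node — a much less controlled experiment.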


Steady State Hypothesis

Before each experiment, define what "healthy" means:

Hypothesis: "When one pod dies, the service stays up"

Probe:
- HTTP GET http://10.0.0.200/api/health → must return 200
- Response time < 500ms
- Check every 5 seconds during experiment

Blast radius:
- Only pods with label app=nginx
- Only 1 pod killed at a time
- Automatic rollback if probe fails

This structure (from the Chaos Engineering book) ensures you learn something and don't cause uncontrolled damage.
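The probe above can be sketched as a small script; the URL, 5-second interval, and 500ms threshold mirror the hypothesis, and all of them are assumptions to adjust for your service:

```shell
#!/usr/bin/env bash
# Hypothetical steady-state probe for the pod-kill experiment.
URL="${PROBE_URL:-http://10.0.0.200/api/health}"

# Decide pass/fail from an HTTP status code and a latency in milliseconds.
verdict() {
  code="$1"; ms="$2"
  if [ "$code" = "200" ] && [ "$ms" -lt 500 ]; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}

# Probe every 5 seconds for the experiment's duration (default 300 s).
probe_loop() {
  end=$(( $(date +%s) + ${1:-300} ))
  while [ "$(date +%s)" -lt "$end" ]; do
    # curl prints "status_code total_time_seconds", e.g. "200 0.123"
    out=$(curl -s -o /dev/null -w '%{http_code} %{time_total}' "$URL")
    code=${out%% *}
    ms=$(awk -v s="${out#* }" 'BEGIN { printf "%d", s * 1000 }')
    echo "$(date +%T) code=$code latency=${ms}ms $(verdict "$code" "$ms")"
    sleep 5
  done
}
```

Run `probe_loop 300` in a second terminal while the experiment is active; a single FAIL line falsifies the hypothesis and is your cue to roll back.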


GameDays

Schedule regular GameDays — planned chaos sessions with the whole team:

Monthly GameDay format:
1. Choose a failure scenario
2. Define the steady state
3. Run the experiment (Chaos Mesh)
4. Observe metrics in Grafana
5. Debrief: what broke, what held, what to fix
6. Create tickets for gaps found

Done When

✔ Chaos Mesh pods Running
✔ Dashboard accessible
✔ First pod-kill experiment completed
✔ System recovered within SLO
✔ Grafana shows the anomaly and recovery