
Etcd Backup & Node Recovery

Etcd is the database of your Kubernetes control plane: it stores the entire cluster state (all deployments, services, secrets, and configmaps). If set-hog (your control plane) dies and the etcd data is lost, the cluster is gone.

This page covers automated etcd snapshots and the full recovery procedure.


k3s Etcd Snapshot

k3s has built-in etcd snapshot support — no extra tools needed.


Manual Snapshot

ssh ubuntu@10.0.0.2

sudo k3s etcd-snapshot save \
  --name manual-$(date +%Y%m%d-%H%M%S)

Snapshots are saved to:

/var/lib/rancher/k3s/server/db/snapshots/
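When you need the most recent snapshot (for example, before copying it off-node), a small helper can pick it by modification time. This is a sketch; `latest_snapshot` is a hypothetical name, and the path is the default k3s snapshot directory:

```shell
#!/usr/bin/env bash
# Print the newest file in a snapshot directory (by modification time).
# latest_snapshot is a hypothetical helper, not part of k3s itself.
latest_snapshot() {
  local dir="${1:-/var/lib/rancher/k3s/server/db/snapshots}"
  ls -t "$dir" | head -n 1
}

# Example: stage the newest snapshot in /tmp for an scp pick-up
# cp "/var/lib/rancher/k3s/server/db/snapshots/$(latest_snapshot)" /tmp/
```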

Automatic Scheduled Snapshots

Enable in k3s config:

sudo nano /etc/rancher/k3s/config.yaml

etcd-snapshot-schedule-cron: "0 */6 * * *"   # every 6 hours
etcd-snapshot-retention: 10                  # keep the last 10 snapshots
etcd-snapshot-dir: /var/lib/rancher/k3s/server/db/snapshots

Restart k3s:

sudo systemctl restart k3s

Verify snapshots are being created:

sudo k3s etcd-snapshot list
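Listing snapshots by hand is easy to forget, so a staleness check can alert when the newest snapshot is older than the schedule allows. A sketch assuming GNU `stat` and `date`; `snapshot_age_ok` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Succeed only if the newest file in the snapshot dir is younger than max_age seconds.
# snapshot_age_ok is a hypothetical helper; wire it into cron or monitoring as needed.
snapshot_age_ok() {
  local dir="$1" max_age="$2"
  local newest mtime now
  newest=$(ls -t "$dir" | head -n 1)
  [ -n "$newest" ] || return 1          # no snapshots at all
  mtime=$(stat -c %Y "$dir/$newest")    # GNU stat: mtime as epoch seconds
  now=$(date +%s)
  [ $(( now - mtime )) -le "$max_age" ]
}

# With a 6-hour schedule, alert if nothing newer than 7 hours exists:
# snapshot_age_ok /var/lib/rancher/k3s/server/db/snapshots $((7 * 3600)) || echo "etcd snapshots are stale"
```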

Copy Snapshots Off-Node

Snapshots on the node itself don't help if the node dies. Copy them to a safe location:

# From your local machine via Tailscale
rsync -avz ubuntu@10.0.0.2:/var/lib/rancher/k3s/server/db/snapshots/ \
  ~/k3s-snapshots/

# Or to MinIO (if Velero/MinIO is set up)
mc alias set minio http://10.0.0.200:9000 minioadmin minioadmin
mc mirror /var/lib/rancher/k3s/server/db/snapshots/ minio/etcd-snapshots/
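The off-node copy can be scheduled from the machine that pulls the snapshots. A crontab sketch, untested here, reusing the paths, the Tailscale IP, and the `minio` alias from above:

```shell
# crontab -e on the machine holding the off-node copies

# 02:00 daily: pull snapshots over Tailscale
0 2 * * * rsync -az ubuntu@10.0.0.2:/var/lib/rancher/k3s/server/db/snapshots/ ~/k3s-snapshots/

# 02:30 daily: mirror the local copy into MinIO (assumes the 'minio' alias is configured)
30 2 * * * mc mirror ~/k3s-snapshots/ minio/etcd-snapshots/
```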

Restore Procedure

Scenario: set-hog is dead, new node provisioned

1. Provision new control plane node via MAAS

Assign it the same IP, 10.0.0.2, or update the workers to point at the new address.

2. Copy snapshot to new node

scp ~/k3s-snapshots/latest.db ubuntu@10.0.0.2:/tmp/

3. Restore from snapshot

ssh ubuntu@10.0.0.2

# k3s must be installed but not running; --cluster-reset starts its own server process
sudo systemctl stop k3s

sudo k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/tmp/latest.db

4. Start k3s normally

sudo systemctl start k3s
sudo kubectl get nodes

Workers will reconnect automatically (they keep trying).

5. Verify

sudo kubectl get all --all-namespaces

All namespaces, deployments, and services should be restored.
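The verification step can be scripted as a quick smoke test. A sketch: `nodes_ready` is a hypothetical helper that parses `kubectl get nodes` output and fails if any node reports NotReady:

```shell
#!/usr/bin/env bash
# Smoke-test a restored cluster: fail if any node is NotReady.
# nodes_ready is a hypothetical helper reading `kubectl get nodes` output on stdin.
nodes_ready() {
  # Skip the header row; succeed only if at least one node is listed
  # and none have NotReady in the STATUS column.
  awk 'NR > 1 { n++; if ($2 ~ /NotReady/) bad++ }
       END { exit (n > 0 && !bad) ? 0 : 1 }'
}

# Usage on the restored control plane:
# sudo kubectl get nodes | nodes_ready && echo "all nodes Ready"
```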


Snapshot Schedule Recommendation

| Frequency     | Snapshot location             | Retention               |
|---------------|-------------------------------|-------------------------|
| Every 6 hours | On-node                       | 10 snapshots (2.5 days) |
| Daily         | Copied to MinIO               | 30 days                 |
| Weekly        | Copied to external drive / S3 | 12 weeks                |
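For the rsync'd copies, local retention can be enforced with `find -mtime`. A sketch: `prune_snapshots` is a hypothetical helper, and the 30-day window matches the MinIO retention above:

```shell
#!/usr/bin/env bash
# Delete snapshot copies older than N days from a local directory.
# prune_snapshots is a hypothetical helper for the off-node (rsync'd) copies.
prune_snapshots() {
  local dir="$1" days="$2"
  find "$dir" -maxdepth 1 -type f -mtime +"$days" -delete
}

# Keep 30 days of the daily copies:
# prune_snapshots ~/k3s-snapshots 30
```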

Recovery Time Objectives

| Scenario                                        | Recovery steps                   | Estimated time |
|-------------------------------------------------|----------------------------------|----------------|
| Control plane crash (data intact)               | Restart k3s                      | 2–5 min        |
| Control plane disk failure (snapshot available) | New node + restore               | 15–25 min      |
| Full cluster wipe                               | MAAS reprovision + k3s + restore | 45–60 min      |

Done When

✔ Automatic snapshots running every 6 hours
✔ Snapshots copied to MinIO daily
✔ Test restore verified on a staging node
✔ Recovery runbook documented and tested