
Etcd Backup & Node Recovery

Etcd is the database of your Kubernetes control plane: it stores the entire cluster state (all deployments, services, secrets, and configmaps). If set-hog (your control plane) dies and the etcd data is lost, the cluster is gone.

This page covers automated etcd snapshots and the full recovery procedure.


k3s Etcd Snapshot

k3s has built-in etcd snapshot support — no extra tools needed.


Manual Snapshot

ssh ubuntu@10.0.0.2

sudo k3s etcd-snapshot save \
  --name manual-$(date +%Y%m%d-%H%M%S)

Snapshots are saved to:

/var/lib/rancher/k3s/server/db/snapshots/
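When you need the most recent snapshot (for example, before copying it off-node), a small helper can pick it by modification time. This is a sketch; `latest_snapshot` is a hypothetical name, and the path is the default k3s snapshot directory:

```shell
#!/usr/bin/env bash
# Print the newest file in a snapshot directory (by modification time).
# latest_snapshot is a hypothetical helper, not part of k3s itself.
latest_snapshot() {
  local dir="${1:-/var/lib/rancher/k3s/server/db/snapshots}"
  ls -t "$dir" | head -n 1
}

# Example: stage the newest snapshot in /tmp for an scp pick-up
# cp "/var/lib/rancher/k3s/server/db/snapshots/$(latest_snapshot)" /tmp/
```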

Automatic Scheduled Snapshots

Enable in k3s config:

sudo nano /etc/rancher/k3s/config.yaml

etcd-snapshot-schedule-cron: "0 */6 * * *"   # every 6 hours
etcd-snapshot-retention: 10                  # keep the last 10 snapshots
etcd-snapshot-dir: /var/lib/rancher/k3s/server/db/snapshots

Restart k3s:

sudo systemctl restart k3s

Verify snapshots are being created:

sudo k3s etcd-snapshot list
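Listing snapshots by hand is easy to forget, so a staleness check can alert when the newest snapshot is older than the schedule allows. A sketch assuming GNU `stat` and `date`; `snapshot_age_ok` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Succeed only if the newest file in the snapshot dir is younger than max_age seconds.
# snapshot_age_ok is a hypothetical helper; wire it into cron or monitoring as needed.
snapshot_age_ok() {
  local dir="$1" max_age="$2"
  local newest mtime now
  newest=$(ls -t "$dir" | head -n 1)
  [ -n "$newest" ] || return 1          # no snapshots at all
  mtime=$(stat -c %Y "$dir/$newest")    # GNU stat: mtime as epoch seconds
  now=$(date +%s)
  [ $(( now - mtime )) -le "$max_age" ]
}

# With a 6-hour schedule, alert if nothing newer than 7 hours exists:
# snapshot_age_ok /var/lib/rancher/k3s/server/db/snapshots $((7 * 3600)) || echo "etcd snapshots are stale"
```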

Copy Snapshots Off-Node

Snapshots on the node itself don't help if the node dies. Copy them to a safe location:

# From your local machine via Tailscale
rsync -avz ubuntu@10.0.0.2:/var/lib/rancher/k3s/server/db/snapshots/ \
  ~/k3s-snapshots/

# Or to MinIO (if Velero/MinIO is set up)
mc alias set minio http://10.0.0.200:9000 minioadmin minioadmin
mc mirror /var/lib/rancher/k3s/server/db/snapshots/ minio/etcd-snapshots/
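The off-node copy can be scheduled from the machine that pulls the snapshots. A crontab sketch, untested here, reusing the paths, the Tailscale IP, and the `minio` alias from above:

```shell
# crontab -e on the machine holding the off-node copies

# 02:00 daily: pull snapshots over Tailscale
0 2 * * * rsync -az ubuntu@10.0.0.2:/var/lib/rancher/k3s/server/db/snapshots/ ~/k3s-snapshots/

# 02:30 daily: mirror the local copy into MinIO (assumes the 'minio' alias is configured)
30 2 * * * mc mirror ~/k3s-snapshots/ minio/etcd-snapshots/
```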

Restore Procedure

Scenario: set-hog is dead, new node provisioned

1. Provision new control plane node via MAAS

Assign it the same IP, 10.0.0.2, or update the workers to point at the new address.

2. Copy snapshot to new node

scp ~/k3s-snapshots/latest.db ubuntu@10.0.0.2:/tmp/

3. Restore from snapshot

ssh ubuntu@10.0.0.2

# k3s must be installed but not running; --cluster-reset starts its own server process
sudo systemctl stop k3s

sudo k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/tmp/latest.db

4. Start k3s normally

sudo systemctl start k3s
sudo kubectl get nodes

Workers will reconnect automatically (they keep trying).

5. Verify

sudo kubectl get all --all-namespaces

All namespaces, deployments, and services should be restored.
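The verification step can be scripted as a quick smoke test. A sketch: `nodes_ready` is a hypothetical helper that parses `kubectl get nodes` output and fails if any node reports NotReady:

```shell
#!/usr/bin/env bash
# Smoke-test a restored cluster: fail if any node is NotReady.
# nodes_ready is a hypothetical helper reading `kubectl get nodes` output on stdin.
nodes_ready() {
  # Skip the header row; succeed only if at least one node is listed
  # and none have NotReady in the STATUS column.
  awk 'NR > 1 { n++; if ($2 ~ /NotReady/) bad++ }
       END { exit (n > 0 && !bad) ? 0 : 1 }'
}

# Usage on the restored control plane:
# sudo kubectl get nodes | nodes_ready && echo "all nodes Ready"
```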


Snapshot Schedule Recommendation

| Frequency     | Snapshot location             | Retention               |
|---------------|-------------------------------|-------------------------|
| Every 6 hours | On-node                       | 10 snapshots (2.5 days) |
| Daily         | Copied to MinIO               | 30 days                 |
| Weekly        | Copied to external drive / S3 | 12 weeks                |
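For the rsync'd copies, local retention can be enforced with `find -mtime`. A sketch: `prune_snapshots` is a hypothetical helper, and the 30-day window matches the MinIO retention above:

```shell
#!/usr/bin/env bash
# Delete snapshot copies older than N days from a local directory.
# prune_snapshots is a hypothetical helper for the off-node (rsync'd) copies.
prune_snapshots() {
  local dir="$1" days="$2"
  find "$dir" -maxdepth 1 -type f -mtime +"$days" -delete
}

# Keep 30 days of the daily copies:
# prune_snapshots ~/k3s-snapshots 30
```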

Recovery Time Objectives

| Scenario                                        | Recovery steps                   | Estimated time |
|-------------------------------------------------|----------------------------------|----------------|
| Control plane crash (data intact)               | Restart k3s                      | 2–5 min        |
| Control plane disk failure (snapshot available) | New node + restore               | 15–25 min      |
| Full cluster wipe                               | MAAS reprovision + k3s + restore | 45–60 min      |

Done When

✔ Automatic snapshots running every 6 hours
✔ Snapshots copied to MinIO daily
✔ Test restore verified on a staging node
✔ Recovery runbook documented and tested