Lessons from Backing Up Kubernetes the Hard Way

Martin Hollingsworth

So, I learned the painful truth that “Kubernetes makes everything stateless” is only half true, and you find out which half when you lose your cluster state. Backing up etcd is absolutely essential, and not just once: automate it and store the snapshots off-cluster. For persistent volumes, use CSI snapshots or Velero with a reliable object storage target. Don’t rely on manual PVC exports; they’ll bite you later. I also found that backing up manifests from GitOps repos isn’t enough if your secrets aren’t versioned properly. Encrypt them, store the keys safely, and test your restore in a clean namespace or even a new cluster. Backups that haven’t been restored are just pretty files sitting in S3.
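For anyone wondering what "automate it" could look like, here is a minimal sketch in Python that wraps `etcdctl snapshot save` with a timestamped filename. The endpoint, the certificate filenames under `pki_dir`, and the backup directory are assumptions for illustration (they resemble a kubeadm-style layout, but check your own cluster), and the function names are made up for this example. Copying the file off-cluster is deliberately left out, since that depends on your storage target.

```python
import datetime
import subprocess
from pathlib import Path


def snapshot_path(backup_dir: Path, now: datetime.datetime) -> Path:
    """Return a timestamped snapshot filename, e.g. etcd-20240131T020000Z.db."""
    return backup_dir / f"etcd-{now.strftime('%Y%m%dT%H%M%SZ')}.db"


def snapshot_command(target: Path, endpoint: str, pki_dir: Path) -> list[str]:
    """Build the `etcdctl snapshot save` invocation without running it.

    The certificate filenames below are assumptions; adjust to your cluster.
    """
    return [
        "etcdctl", "snapshot", "save", str(target),
        f"--endpoints={endpoint}",
        f"--cacert={pki_dir / 'ca.crt'}",
        f"--cert={pki_dir / 'server.crt'}",
        f"--key={pki_dir / 'server.key'}",
    ]


def take_snapshot(backup_dir: Path, endpoint: str, pki_dir: Path) -> Path:
    """Run the snapshot and return the path of the file that was written."""
    now = datetime.datetime.now(datetime.timezone.utc)
    target = snapshot_path(backup_dir, now)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if etcdctl exits non-zero,
    # so a failed backup is loud instead of silently missing.
    subprocess.run(snapshot_command(target, endpoint, pki_dir), check=True)
    return target
```

Run `take_snapshot` from a cron job or a Kubernetes CronJob on a control plane node, then ship the resulting file to storage outside the cluster. Keeping the command builder separate from the runner makes it easy to check what would be executed before pointing it at a real etcd member.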

Alistair

Totally agree, and don’t forget to test node-level failures too. etcd might look fine on paper, but a corrupt member or a snapshot with a bad timestamp can ruin your day. A quick periodic restore test in a throwaway cluster has saved me more than once.
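On the bad-timestamp point, a cheap guard is to alert when the newest snapshot is missing, empty, or older than expected. The sketch below is one way to do that in Python; the `*.db` naming and the function names are assumptions for illustration. It only catches stale or empty files, not corruption, so it complements a real restore test rather than replacing it.

```python
import time
from pathlib import Path
from typing import Optional


def newest_snapshot(backup_dir: Path) -> Optional[Path]:
    """Return the most recently modified *.db file, or None if there is none."""
    snapshots = list(backup_dir.glob("*.db"))
    return max(snapshots, key=lambda p: p.stat().st_mtime, default=None)


def snapshot_problems(
    backup_dir: Path, max_age_s: float, now: Optional[float] = None
) -> list[str]:
    """List human-readable problems with the latest snapshot (empty list = OK)."""
    now = time.time() if now is None else now
    latest = newest_snapshot(backup_dir)
    if latest is None:
        return ["no snapshot files found"]
    problems = []
    stat = latest.stat()
    if stat.st_size == 0:
        problems.append(f"{latest.name} is empty")
    if now - stat.st_mtime > max_age_s:
        problems.append(f"{latest.name} is older than {max_age_s:.0f}s")
    return problems
```

For example, `snapshot_problems(Path("/backups"), max_age_s=6 * 3600)` would flag a backup job that has quietly stopped running for more than six hours.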

Julian Moreland

And yep — if you haven’t run a full cluster restore rehearsal, you don’t really have a backup; you just have hopes.

Damian Montrose

Seen this too many times — folks skip etcd backups and only realize it matters when the control plane won’t come back.

Malcolm

Totally agree! I learned the hard way that just backing up manifests isn’t enough. Automating etcd backups and testing restores in a fresh cluster is a lifesaver. Also, Velero + reliable object storage has saved me more than once when PVs got corrupted — can’t stress off-cluster storage enough.
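For the Velero side, a restore rehearsal into a scratch namespace can be scripted along these lines. This is a rough sketch, assuming Velero is already installed with an off-cluster object storage location configured; the backup and namespace names in the usage note are placeholders and the helper names are made up for this example.

```python
import subprocess


def velero_backup_cmd(backup_name: str, namespace: str) -> list[str]:
    """Build a `velero backup create` command limited to one namespace."""
    return [
        "velero", "backup", "create", backup_name,
        "--include-namespaces", namespace,
        "--wait",  # block until the backup finishes
    ]


def velero_restore_cmd(backup_name: str, source_ns: str, scratch_ns: str) -> list[str]:
    """Build a `velero restore create` command that remaps into a scratch namespace."""
    return [
        "velero", "restore", "create",
        "--from-backup", backup_name,
        "--namespace-mappings", f"{source_ns}:{scratch_ns}",
        "--wait",  # block until the restore finishes
    ]


def rehearse_restore(backup_name: str, source_ns: str, scratch_ns: str) -> None:
    """Take a backup, then restore it into the scratch namespace."""
    subprocess.run(velero_backup_cmd(backup_name, source_ns), check=True)
    subprocess.run(velero_restore_cmd(backup_name, source_ns, scratch_ns), check=True)
```

Something like `rehearse_restore("nightly-shop", "shop", "shop-restore-test")` restores into a separate namespace so the live workloads are left alone. You still need to check that the restored pods start and the volume data is actually there.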