
Hardening Isn't Hardware

1 June 2026 · 5 min read

A disk filled up and took my whole cluster down — because one cheap mini-PC was running everything that mattered. Hardening the homelab that followed wasn't about buying redundancy. It was about understanding the blast radius and shrinking it on the hardware I already had.

Tags: reliability, kubernetes, homelab, disaster-recovery, architecture

A disk filled up and took my whole cluster down. The disk was the trigger. The actual problem was that one cheap mini-PC was running my only Kubernetes control plane, my only etcd member, my busiest worker node, and the database — all at once. Lose that box and there's nothing to fail over to, because everything that matters is on it.

That outage was the moment I stopped thinking about uptime and started thinking about blast radius: if any single host dies, how much goes with it? The honest answer was "all of it" — and I'd built it that way without noticing, one node at a time, each addition reasonable, the whole thing quietly converging on a single point of failure.

The constraint

I run this on a handful of second-hand NUCs. I wasn't going to buy my way out with redundant hardware, and I didn't want to — the interesting problem is what you can do without it. So I set a deliberately modest bar. Not zero-downtime, not automatic failover; that's a different budget. The bar was: a single host dying should cost me a manual reschedule, not the cluster and not the data. Shrink the blast radius, accept that recovery is something I do by hand.

That reframe is the whole job. The rest is consequences.

Spreading the single points of failure

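The control plane and etcd were the obvious ones. A single etcd member is a single point of failure for the entire cluster API: lose it and the API server has no state to read or write, so nothing can be scheduled, changed, or rescheduled, even if the pods already running carry on for a while. So etcd went from one member to three, across three hosts, so that losing any one still leaves a two-member quorum.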

The non-obvious part was the order. A half-grown etcd cluster — one member becoming two becoming three — is more fragile than a single member while the new ones are catching up, not less. So the sequencing mattered: move the heavy workloads off the overloaded host first, give myself headroom, then grow etcd into the space. Do it the other way round and you're expanding your most critical component on the host that's already the problem.
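The fragility of the half-grown stages falls out of majority-quorum arithmetic, which is how etcd's Raft consensus decides whether it can still accept writes. A minimal sketch of that arithmetic (the function names are mine, not etcd's):

```python
def quorum(members: int) -> int:
    """Votes needed for a majority among `members` voting members."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


for n in (1, 2, 3):
    print(f"{n} member(s): quorum {quorum(n)}, "
          f"tolerates {failures_tolerated(n)} failure(s)")
```

Two members tolerate no more failures than one, but there are now two hosts whose loss stops the cluster instead of one. The payoff only arrives at three, which is why the time spent at two is worth keeping short and worth doing on hosts with headroom.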

The backup that wasn't offsite

Everything was backed up to one NAS — one box, in one building. And the only thing that actually lived anywhere else was the encryption key, not the data it protects. So "offsite backup" was, on inspection, "I can decrypt nothing, remotely."

I added a real offsite copy: encrypted database dumps pushed to an object store in a building I don't own. Then, and this is the only part that counts, I proved it. Spun up a throwaway machine, installed a minimal cluster, and restored the database from the offsite copy alone, with none of the original infrastructure in reach. Row counts matched. The whole thing took 2 minutes 19 seconds. That number tells me more than any green "backup completed" ever has, because the backup completing was never the thing in doubt.
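The "row counts matched" step is easy to make mechanical. A sketch, assuming the per-table counts have already been collected from the live database and from the restored one by whatever means is convenient; the table names and numbers below are hypothetical, and this is not necessarily the comparison used in the restore above:

```python
def compare_row_counts(source: dict[str, int], restored: dict[str, int]) -> list[str]:
    """Return a list of human-readable differences; empty means the counts match."""
    problems = []
    for table in sorted(source.keys() | restored.keys()):
        expected = source.get(table)
        actual = restored.get(table)
        if expected is None:
            problems.append(f"{table}: only in the restored copy ({actual} rows)")
        elif actual is None:
            problems.append(f"{table}: missing from the restored copy "
                            f"(expected {expected} rows)")
        elif expected != actual:
            problems.append(f"{table}: expected {expected} rows, restored {actual}")
    return problems


# Hypothetical counts, for illustration only.
source_counts = {"users": 120, "events": 45_310}
restored_counts = {"users": 120, "events": 45_310}

for line in compare_row_counts(source_counts, restored_counts) or ["row counts match"]:
    print(line)
```

Matching counts are a smoke test, not proof of identical contents, and a dump taken before the live counts were read will legitimately lag behind. It is still a far stronger signal than the backup job's exit code.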

The trap in the capacity numbers

An earlier audit of mine said one node was underused — sitting at about half its memory, could probably be shrunk. It was wrong, and the live review caught it: that "spare" memory isn't spare. It's the landing zone. When the busy node dies, its workloads reschedule onto exactly that headroom. Shrink it and you haven't saved anything — you've moved the single point of failure to a new node and made the failover you're relying on impossible.
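The landing-zone argument can be checked with a few lines of arithmetic. A coarse sketch, with hypothetical node names and memory figures: it asks only whether the lost node's memory requests fit into the combined free memory of the survivors, and ignores CPU, per-pod fit, and placement rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    memory_gib: float  # allocatable memory on the node
    used_gib: float    # memory requested by workloads currently on it


def survives_loss(nodes: list[Node], lost: str) -> bool:
    """True if the lost node's workloads fit in the survivors' free memory."""
    displaced = sum(n.used_gib for n in nodes if n.name == lost)
    headroom = sum(n.memory_gib - n.used_gib for n in nodes if n.name != lost)
    return displaced <= headroom


# Hypothetical cluster: "quiet" looks half empty and tempting to shrink.
nodes = [Node("busy", 16, 10), Node("quiet", 16, 8), Node("small", 8, 5)]
shrunk = [Node("busy", 16, 10), Node("quiet", 10, 8), Node("small", 8, 5)]

print("as built:", survives_loss(nodes, "busy"))
print("after shrinking 'quiet':", survives_loss(shrunk, "busy"))
```

With these made-up figures the cluster absorbs the loss of the busy node as built, and cannot once the "underused" node is shrunk. Utilisation on its own says nothing about whether memory is spare; the question is what has to land there on a bad day.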

The real lever wasn't sizing, it was placement: the busy singletons — metrics, object storage, the storage provisioner, ingress — needed to live on different hosts, so no single death takes out more than one of them.
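Placement is checkable in the same spirit. A sketch that takes a map of singleton workload to host and reports any host carrying more than one; the host names and the before/after layouts are hypothetical, and in a real cluster the map would come from wherever the pods are actually scheduled:

```python
from collections import defaultdict


def colocated_singletons(placement: dict[str, str]) -> dict[str, list[str]]:
    """Map each host to its singletons, keeping only hosts with more than one."""
    by_host: dict[str, list[str]] = defaultdict(list)
    for workload, host in placement.items():
        by_host[host].append(workload)
    return {host: sorted(ws) for host, ws in by_host.items() if len(ws) > 1}


# Hypothetical layouts, for illustration only.
before = {
    "metrics": "nuc-1",
    "object-storage": "nuc-1",
    "storage-provisioner": "nuc-1",
    "ingress": "nuc-2",
}
after = {
    "metrics": "nuc-1",
    "object-storage": "nuc-2",
    "storage-provisioner": "nuc-3",
    "ingress": "nuc-4",
}

print("before:", colocated_singletons(before))
print("after:", colocated_singletons(after))
```

An empty result means no single host death takes out more than one of the singletons. In Kubernetes the layout itself would typically be enforced with scheduling constraints such as pod anti-affinity, and a check like this only confirms the result.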

How I did it without making it worse

Re-architecting a live system is a good way to cause the exact outage you're trying to prevent. The thing that kept it safe was boring: audit first, write the plan down, then check every assumption in the plan against the running system before touching anything.

That last step earned its place every time. The doc-level plan was wrong in small, dangerous ways that only the live check caught — an offsite-storage checksum default the backend rejects, a database password that wasn't where the chart said it was, a backup job set to run in a timezone I didn't intend. Each one would have been a silent failure if I'd trusted the plan over the system.
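The shape of that live check is simple: write each assumption down as a key and an expected value, collect what the running system actually reports, and list every disagreement before changing anything. A minimal sketch; the keys and values are hypothetical placeholders, not the settings from my cluster:

```python
from typing import Optional


def failed_assumptions(
    plan: dict[str, str], observed: dict[str, str]
) -> dict[str, tuple[str, Optional[str]]]:
    """Return every planned value the running system does not confirm.

    A key missing from `observed` counts as a failure: an assumption that
    was never checked has not been verified.
    """
    return {
        key: (expected, observed.get(key))
        for key, expected in plan.items()
        if observed.get(key) != expected
    }


# Hypothetical plan and observations, for illustration only.
plan = {
    "backup_job_timezone": "intended-zone",
    "db_password_location": "where-the-chart-says",
    "etcd_member_count": "1",
}
observed = {
    "backup_job_timezone": "some-other-zone",
    "etcd_member_count": "1",
}

for key, (expected, actual) in failed_assumptions(plan, observed).items():
    print(f"{key}: plan says {expected!r}, system says {actual!r}")
```

Treating "not observed" the same as "wrong" is the design choice that matters here. The dangerous entries in a plan are the ones nobody thought to look up.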

What it actually was

None of this was new hardware. I bought nothing. It was looking honestly at what one host dying takes down, and rearranging until the answer stopped being "everything." Hardening — a homelab, or anything — isn't redundancy you buy. It's understanding your blast radius and shrinking it on the hardware you've already got, then proving the recovery works before the day you need it. The cluster will still go partly down if I lose the wrong box. It just won't take everything with it, and I've watched it come back.
