Upgrading My Main Kubernetes Cluster, Top to Bottom

I spent this weekend upgrading phoenix, the k3s cluster that runs most of what I self-host. It was overdue. I'd been putting it off for months, honestly holding out for a proper upgrade path to land in the repo I originally built the cluster from. It never came, so I stopped waiting, set aside a clear weekend, and did it myself.

It ended up being a full refresh, top to bottom: the Debian install under every node, k3s itself, the two bits of cluster plumbing that same repo set up and I'd never touched since, and Flux at the very top. Twenty nodes, all of it rolling. Cordon a node, upgrade it, wait for it to come back healthy, move to the next one. The apps stayed up the whole time. Mostly.

That "mostly" is why I'm writing this. If you hit a slow load or a blip this weekend, or something just didn't answer for a minute, that was me. Sorry. It should all be faster and healthier now than it was before I started.

What the cluster is

Phoenix is 20 k3s nodes. The name isn't the city — the boxes actually live in Houston. Partly it's the pods, which die and get reborn over and over, like the bird. But mostly it's the cluster itself. I've been trying to run Kubernetes on my own hardware for years, and every attempt before this one got torn down and rebuilt from the ashes of the last. Phoenix is the one that finally stuck, so the name earned itself.

The nodes are all VMs on my Proxmox hosts in Houston. Five run the control plane, five run storage on Longhorn, the rest are just compute. On top of that:

Flux does the GitOps. It watches the main branch of my ops repo and reconciles it into the cluster, so "deploy" just means "merge."
kube-vip keeps the control-plane API on one virtual IP, 192.168.82.20, and fails it over between the masters on its own.
MetalLB hands out real LAN IPs to every LoadBalancer service, native L2, from the 192.168.82.200-254 pool.
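
As a quick sanity check on those two address ranges, here is a minimal Python sketch. It only does arithmetic on the addresses quoted above; the function names are made up for illustration and are not part of kube-vip or MetalLB.

```python
from ipaddress import IPv4Address

def pool_size(first: str, last: str) -> int:
    """Number of addresses in an inclusive first-last range."""
    return int(IPv4Address(last)) - int(IPv4Address(first)) + 1

def in_pool(addr: str, first: str, last: str) -> bool:
    """True if addr falls inside the inclusive range."""
    return IPv4Address(first) <= IPv4Address(addr) <= IPv4Address(last)

POOL = ("192.168.82.200", "192.168.82.254")  # MetalLB L2 pool quoted above
VIP = "192.168.82.20"                        # kube-vip API VIP quoted above

print(pool_size(*POOL))     # 55
print(in_pool(VIP, *POOL))  # False
```

So the pool has room for 55 LoadBalancer services, and the API VIP sits outside it, which is what you want: MetalLB can never hand the control-plane address to a service.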

kube-vip and MetalLB are the interesting part here. Both came in with that original bootstrap, dropped in as k3s auto-deploy add-ons, and I never thought about them again. They just sat there on the versions the repo shipped, owned by nothing after that. That comes back around later.

Where this started

Like my Patroni Postgres cluster, phoenix started life from TechnoTim's k3s-ansible, which is a great way to get a cluster on its feet. There hasn't been a fresh upgrade pushed to that repo in a while, though, so getting the cluster off the versions it shipped with was on me. So I took the hard road: a bit of Claude, a stack of Ansible playbooks, and a lot of patience.

Why now

Two things forced it. k3s 1.30 had gone end-of-life, and I don't love running something that's stopped getting security fixes. On top of that, I had a Flux 2.9 bump sitting in an open Renovate MR that flat-out refused to go in until the cluster was on Kubernetes 1.34.1 or newer.

While I was in there, the nodes were also still on Debian 12. Trixie (13) has been out and stable for a while now. I wasn't going to reboot twenty nodes for k3s and then reboot them all again a month later for the OS, so I folded everything into one weekend.

One node at a time

Every piece of this followed the same rule, because every piece of it can take the cluster down if you get greedy and do two nodes at once.

Control-plane nodes go one at a time, never two. After a master reboots I wait for it to come back Ready and for /healthz/etcd to report ok before I touch the next one. That way etcd never loses more than one member at a time, and the kube-vip API VIP always has somewhere to land.

Agents only start once every server is green, and they also go one at a time.

And everything is idempotent. Each node checks its own state first and skips itself if it's already done, so if a run dies halfway through I just run it again. It picks up where it left off and won't re-drain or re-reboot a node that already made it across. The nice thing about that is a failed run is a paused run, not a broken cluster.
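
The three rules above (one at a time, health-gated, idempotent) can be sketched as a small loop. This is a Python illustration of the logic only, not the actual Ansible play; the callbacks and node names are hypothetical stand-ins for "check state", "do the upgrade", and "wait for Ready and /healthz/etcd".

```python
from typing import Callable, Iterable

def rolling_upgrade(
    nodes: Iterable[str],
    is_done: Callable[[str], bool],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Upgrade nodes strictly one at a time; return the nodes touched this run."""
    touched = []
    for node in nodes:
        if is_done(node):
            continue  # idempotent: this node already made it across, skip it
        upgrade(node)  # the cordon/upgrade/reboot work would live here
        if not is_healthy(node):  # e.g. node Ready and etcd reporting ok
            raise RuntimeError(f"{node} unhealthy after upgrade; stopping the run")
        touched.append(node)
    return touched

# Hypothetical node names and stub callbacks, just to show the rerun behavior.
upgraded = {"phx-cp-1"}  # pretend an earlier run died after this node
nodes = ["phx-cp-1", "phx-cp-2", "phx-cp-3"]
first_run = rolling_upgrade(nodes, upgraded.__contains__, upgraded.add, lambda n: True)
second_run = rolling_upgrade(nodes, upgraded.__contains__, upgraded.add, lambda n: True)
print(first_run)   # ['phx-cp-2', 'phx-cp-3']
print(second_run)  # []
```

The second run touching nothing is the "failed run is a paused run" property: rerunning is always safe because finished nodes opt themselves out.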

One thing I noticed about Longhorn

When you take a Kubernetes node down for maintenance, the reflex is to drain it first so nothing's running on it when it goes. For a Longhorn storage node, that reflex works against you, and it's worth understanding why.

Draining a phx-strg node evicts its instance-manager pod. The second that pod dies, every replica it was hosting gets marked failed, and Longhorn does the only thing it can: it starts rebuilding all of them across the storage fleet. So a one-minute reboot turns into a full rebuild storm. Nothing's actually at risk — the data's safe, every volume rebuilds and comes back healthy — but the fleet churns for a while, and anything stateful feels slow until it settles.

That behavior is the whole reason the storage-node handling looks the way it does. Draining is for taking a node out permanently; it isn't for bouncing one. So a storage node gets cordoned instead — stop new pods landing on it, but leave the running replica where it is — then reboot, then wait for Longhorn to go green again before moving on.

The playbook does this on its own now. If a node has replicas it gets cordoned and health-gated; if it doesn't, it gets a full drain; one command handles the mixed fleet. I watched the rebuild storm play out directly this weekend, which is exactly why the cordon-only path is now baked in — and it's almost certainly why anything stateful felt a little slow while the storage fleet caught its breath.
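
The decision the playbook makes per node can be sketched in a few lines of Python. This is an illustration of the rule, not the play itself; the replica counts would come from Longhorn in practice, and the node names and numbers below are made up.

```python
def maintenance_mode(replica_count: int) -> str:
    """Pick how to take a node down for a reboot, given its Longhorn replica count."""
    if replica_count < 0:
        raise ValueError("replica_count cannot be negative")
    # Replicas present: cordon only, so the instance-manager pod is not evicted.
    # No replicas: a full drain is safe and keeps workloads off the node.
    return "cordon" if replica_count > 0 else "drain"

def plan(fleet: dict[str, int]) -> dict[str, str]:
    """Map every node in a mixed fleet to its maintenance mode."""
    return {node: maintenance_mode(count) for node, count in fleet.items()}

# Hypothetical fleet: one storage node holding replicas, one plain compute node.
print(plan({"phx-strg-1": 12, "phx-comp-1": 0}))
# {'phx-strg-1': 'cordon', 'phx-comp-1': 'drain'}
```

Keying on the replica count rather than on the node's name or role means a compute node that somehow picked up a replica still gets the gentle path.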

Dealing with the frozen plumbing

Before I could climb k3s I had to sort out kube-vip and MetalLB, the two add-ons owned by nobody. They had to go first because a k3s upgrade regenerates the add-ons it manages but leaves anything it doesn't alone, so those two would have stayed frozen no matter how far I moved k3s. I also just wanted them on a proper lifecycle before I started moving the ground under them.

kube-vip went from v0.8.2 to v1.2.1. I pulled it out of the orphaned add-on slot and made it a managed manifest on the control-plane nodes — deliberately not under Flux, because it's the thing serving the API, and it can't depend on the API server it's fronting. The one gotcha on the jump to v1: they renamed the env var vip_cidr to vip_subnet. Same "32" value, new name. Miss it and the new pods come up without a working VIP. Everything else it reads was the same. The DaemonSet rolls one master at a time and the VIP fails over on its own, so you don't see anything from the outside.
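
The env var rename is mechanical enough to sketch. The Python below rewrites a container env list the way the manifest change would; it is a minimal illustration, and the "address" entry in the example is just there to show that other variables pass through untouched.

```python
def migrate_env(env: list[dict]) -> list[dict]:
    """Return a copy of a container env list with vip_cidr renamed to vip_subnet."""
    migrated = []
    for entry in env:
        entry = dict(entry)  # copy, so the caller's manifest is not mutated
        if entry.get("name") == "vip_cidr":
            entry["name"] = "vip_subnet"  # same value, new name
        migrated.append(entry)
    return migrated

old_env = [
    {"name": "address", "value": "192.168.82.20"},  # illustrative, unchanged
    {"name": "vip_cidr", "value": "32"},
]
print(migrate_env(old_env))
# [{'name': 'address', 'value': '192.168.82.20'}, {'name': 'vip_subnet', 'value': '32'}]
```

Running it twice gives the same result, so the rename is safe to leave in a play that gets rerun.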

MetalLB went from chart 0.14.8 to 0.16.1, and in the process moved under Flux as a real HelmRelease. This was the rough one. In the order things bit me:

frrk8s.enabled defaults to true in 0.16. I run plain L2, not BGP, so left alone the upgrade would have dropped in an entire frr-k8s controller and DaemonSet I don't want. Force it to false.
The chart version isn't the app version. Chart 0.16.1 ships controller image v0.16.1, a patch ahead of the app's own v0.16.0 release. Don't assume they match.
The CRDs are hand-managed and have to match the running version exactly. The controller builds a REST mapping for every MetalLB type at boot, BGP ones included, even in L2 mode, so a missing or stale CRD kills it with no matches for kind "ServiceBGPStatus" and a CrashLoopBackOff. The fix is to apply the matching CRDs straight from the tag: kubectl apply -k '…/config/crd?ref=v0.16.1'.
There's a race. Promoting the chart makes Flux run helm upgrade right away, and the new pods race the CRDs. They crashloop, the 5-minute window times out, and the HelmRelease sticks at Ready: False even after the pods sort themselves out once the CRDs show up. To recover: apply the CRDs, then flux reconcile helmrelease metallb --force.
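
One way to stay out of that race is a pre-flight gate: don't promote the chart until every kind the new controller expects is already installed. The Python below is a sketch of that check only. The function names are hypothetical, and the required set in the example is illustrative; the real list has to come from the CRD manifests for the exact tag being deployed.

```python
def missing_kinds(required: set[str], installed: set[str]) -> list[str]:
    """Kinds the new controller needs that the cluster does not have yet."""
    return sorted(required - installed)

def safe_to_promote(required: set[str], installed: set[str]) -> bool:
    """Only promote the chart once every required CRD kind is present."""
    return not missing_kinds(required, installed)

# Illustrative sets: an L2-only cluster still needs the BGP status kind at boot.
required = {"IPAddressPool", "L2Advertisement", "ServiceBGPStatus"}
installed = {"IPAddressPool", "L2Advertisement"}

print(missing_kinds(required, installed))    # ['ServiceBGPStatus']
print(safe_to_promote(required, installed))  # False
```

If the gate fails, the order is the same as the recovery above: apply the CRDs first, then let Flux reconcile.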

MetalLB sits in front of every LoadBalancer in the cluster, so the whole time I had one thing on screen: is every service keeping its external IP? It was. But it was nowhere near as clean as kube-vip.

The k3s climb: 1.30 to 1.34

k3s doesn't let you skip minor versions, so this is a climb, not a jump: 1.30, 1.31, 1.32, 1.33, 1.34, one hop per run, green cluster in between each one.
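
The sequence of minor versions can be generated rather than typed out. A minimal Python sketch follows; it only walks minor numbers and deliberately does not pick patch releases, which still have to be chosen per hop. The function name is made up.

```python
def minor_path(start: str, target: str) -> list[str]:
    """Every minor version from start to target inclusive, never skipping one."""
    start_major, start_minor = map(int, start.split("."))
    target_major, target_minor = map(int, target.split("."))
    if start_major != target_major:
        raise ValueError("this sketch only handles a climb within one major version")
    if target_minor < start_minor:
        raise ValueError("target is older than start; this is an upgrade path only")
    return [f"{start_major}.{minor}" for minor in range(start_minor, target_minor + 1)]

print(minor_path("1.30", "1.34"))
# ['1.30', '1.31', '1.32', '1.33', '1.34']
```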

The play swaps the k3s binary and restarts. It never re-runs the installer, which is the important bit — re-running the installer would rewrite the node and blow away the custom server flags from the original setup. Each hop snapshots etcd first, does the control-plane nodes one at a time behind the etcd health check, then the agents in two batches: compute drained, Longhorn storage cordon-only and health-gated, per the mistake above. If any node misbehaves, the whole run stops instead of pressing on.

I ran the hop on my little Rancher cluster first as a guinea pig. It's a single node, low stakes, and it needed 1.34 for Flux 2.9 anyway. Proving one hop end to end there before touching phoenix was worth it. Then phoenix, five hops, checking between each one that:

every node's Ready on the right version,
etcd's healthy on all the servers,
the kube-vip VIP still answers,
every MetalLB LoadBalancer kept its IP,
Flux is reconciling and nothing's stuck.
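
The first item on that checklist is easy to turn into a gate. The Python below is a sketch of just that one check, fed with data shaped like what kubectl get nodes reports; the NodeStatus type, the node names, and the placeholder old version string are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class NodeStatus:
    name: str
    ready: bool
    version: str

def not_green(nodes: list[NodeStatus], target: str) -> list[str]:
    """Names of nodes that are not Ready or not yet on the target version."""
    return [n.name for n in nodes if not n.ready or n.version != target]

TARGET = "v1.34.9+k3s1"
nodes = [
    NodeStatus("phx-cp-1", True, TARGET),
    NodeStatus("phx-cp-2", False, TARGET),           # on target but not Ready
    NodeStatus("phx-comp-1", True, "older-version"),  # Ready but left behind
]
print(not_green(nodes, TARGET))  # ['phx-cp-2', 'phx-comp-1']
```

An empty list means that check passes; anything else names the nodes to look at before starting the next hop.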

Both clusters ended up on 1.34.9+k3s1 with no failures on the climb itself. The storage-drain mess was before the climb, not during it. Twenty nodes, five hops, one command each.

Why none of this was scary

If you're sitting on that same TechnoTim 1.30 default, this is the part worth copying. Moving a 20-node cluster four versions in an afternoon was boring for a few specific reasons:

Swap the binary, don't re-run the installer. A k3s upgrade really is just a new binary and a restart. Re-running the install script rewrites the node and wipes your server flags, which is how an upgrade becomes an outage.
Snapshot etcd before every hop. It's cheap, and it's the difference between "restore and try again" and "rebuild the cluster" if a control-plane hop goes bad.
One control-plane node at a time, behind a health check. The next master doesn't get touched until the last one is Ready and etcd says ok. Quorum's never at risk.
Green cluster between hops, and no skipping versions. If hop three breaks something, you find out on hop three, not four versions later with no idea which one did it.
Keep a rollback per node. The play leaves a k3s.bak of the old binary on each node, so backing one out is stop, copy back, start.
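
The swap-and-keep-a-backup pattern is small enough to show in full. This Python sketch demonstrates it on throwaway files in a temp directory; stopping and starting the k3s service around the copy is left out, and the file names are stand-ins rather than real install paths.

```python
import shutil
import tempfile
from pathlib import Path

def swap_binary(installed: Path, new: Path) -> Path:
    """Replace installed with new, keeping the old copy beside it as <name>.bak."""
    backup = installed.with_name(installed.name + ".bak")
    shutil.copy2(installed, backup)  # keep the old binary for rollback
    shutil.copy2(new, installed)     # put the new binary in place
    return backup

def roll_back(installed: Path) -> None:
    """Copy the .bak left by swap_binary back over the installed binary."""
    backup = installed.with_name(installed.name + ".bak")
    shutil.copy2(backup, installed)

# Demo on dummy files standing in for the old and new binaries.
with tempfile.TemporaryDirectory() as tmp:
    installed = Path(tmp) / "k3s"
    new = Path(tmp) / "k3s-new"
    installed.write_text("old")
    new.write_text("new")
    swap_binary(installed, new)
    after_swap = installed.read_text()
    roll_back(installed)
    after_rollback = installed.read_text()

print(after_swap, after_rollback)  # new old
```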

Worst case at any single moment was one node, and it was always reversible. That's the whole reason I could do it with everything still live.

The Debian jump: 12 to 13

The OS upgrade is the same idea, except the payload is apt full-upgrade and a reboot instead of a binary swap, so each node is gone for minutes, not seconds. That makes the one-at-a-time rule matter even more: with five control-plane nodes, two down together leaves etcd at bare quorum with nothing to spare, and careless storage-node reboots put you right back in rebuild-storm land.
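
The quorum arithmetic behind that rule, as a minimal Python sketch. It assumes all five control-plane nodes are etcd members, which is an assumption here rather than something stated above; the formula itself is the standard majority rule.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with this many members."""
    return members // 2 + 1

def failures_tolerated(members: int) -> int:
    """How many members can be down while a majority is still up."""
    return members - quorum(members)

def margin_left(members: int, down: int) -> int:
    """Further failures survivable with `down` members already offline."""
    return failures_tolerated(members) - down

print(quorum(5))              # 3
print(failures_tolerated(5))  # 2
print(margin_left(5, 1))      # 1
print(margin_left(5, 2))      # 0
```

With one node down for its upgrade there is still room for one surprise. With two down there is none, and a node that is rebooting into a new kernel is exactly the kind of thing that produces surprises.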

A cross-release Debian upgrade isn't just apt upgrade, either. It rewrites the apt sources from bookworm to trixie, follows Debian's two-step (apt-get upgrade --without-new-pkgs, then apt-get full-upgrade) to keep the change contained, and pulls a whole new kernel. It also has to run completely non-interactive — a major release bump is exactly where apt loves to stop and ask you something and hang forever on a headless box — so the play suppresses the prompts, lets services restart on their own, and keeps existing config files when there's a conflict.
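
Those steps can be sketched as data. The Python below builds the sources rewrite and the command list without running anything. The specific knobs (DEBIAN_FRONTEND, NEEDRESTART_MODE, the two Dpkg force-conf options) are real apt, dpkg, and needrestart settings, but that the play uses exactly these is an assumption, and the plain string replace is deliberately naive.

```python
def retarget_sources(text: str, old: str = "bookworm", new: str = "trixie") -> str:
    """Point apt sources at the new release; also catches -updates and -security."""
    return text.replace(old, new)

# Assumed environment for a prompt-free run.
NONINTERACTIVE_ENV = {
    "DEBIAN_FRONTEND": "noninteractive",  # no debconf questions
    "NEEDRESTART_MODE": "a",              # needrestart restarts services itself
}

# Assumed dpkg options: take defaults, and keep the existing config on conflict.
KEEP_OLD_CONFIG = [
    "-o", "Dpkg::Options::=--force-confdef",
    "-o", "Dpkg::Options::=--force-confold",
]

def upgrade_commands() -> list[list[str]]:
    """Debian's two-step release upgrade, as argv lists ready for a runner."""
    base = ["apt-get", "-y", *KEEP_OLD_CONFIG]
    return [
        ["apt-get", "update"],
        base + ["upgrade", "--without-new-pkgs"],  # step 1: minimal upgrade
        base + ["full-upgrade"],                   # step 2: the real jump
    ]

print(retarget_sources("deb http://deb.debian.org/debian bookworm-updates main"))
# deb http://deb.debian.org/debian trixie-updates main
for cmd in upgrade_commands():
    print(" ".join(cmd))
```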

One thing to watch: trixie moves the default Python from 3.11 to 3.13. That quietly breaks anything that assumed the old one, so the interpreter has to be pinned across the jump instead of getting autoremoved out from under you.

On rollback, I'll be honest: a major OS upgrade isn't really reversible with apt. You can't cleanly downgrade a whole Debian release. The actual safety net is a Proxmox snapshot of each node taken before the run. The play tars up the old apt sources and snapshots etcd as a courtesy, but the VM snapshot is the thing that would actually save you.

Where it ended up

By Sunday night phoenix was in better shape than a cluster I'd spent two days rebooting had any right to be:

Every node on Debian 13. Current OS, current kernel, a couple more years of runway.
k3s on 1.34.9+k3s1, back in support and off the dead 1.30.
kube-vip and MetalLB both adopted. No more orphaned add-ons frozen in time; both on a real versioned lifecycle now.
Flux 2.9 merged and running, which was the thing the whole exercise was for.

The part I'm happiest about isn't really the versions, though. It's the tooling. What used to be a nervous, manual, node-by-node afternoon is now a small set of playbooks (one for the OS, one for k3s, one per add-on) that each take one command and handle a mixed compute-and-storage fleet without me having to remember which nodes hold Longhorn replicas. The next Debian bump is already just passing to_codename=forky to the same play. Next time this won't eat a weekend.

What I'd tell you, and maybe your AI

I did a lot of this with Claude in the loop — drafting the playbooks, talking through failure modes, sanity-checking the order of operations before I ran anything on a live cluster. It's genuinely good at that. It's also confidently wrong just often enough that a few of these are worth holding both of us to.

Don't drain a Longhorn node just to reboot it. Cordon it, bounce it, wait for green. Draining is for removing a node for good, and Longhorn treats it that way — it'll rebuild every replica on the node for what should have been a one-minute restart. Worth knowing before you watch it happen live.
Fix the orphaned add-ons before upgrading around them. Getting kube-vip and MetalLB onto a real lifecycle first is what made the k3s climb boring instead of scary.
Read the chart changelog, not just the version number. The frrk8s default, the chart-vs-app version gap, the CRD race — each one was a footnote that turned into a CrashLoopBackOff. An AI will happily call a minor chart bump "safe" without knowing any of them flipped.
One at a time, health-gated, idempotent. Every calm part of the weekend came from those three. Every tense part came from a moment I skipped one.
Trust the cluster over anyone's confidence — yours or the model's. When the tooling, the assistant, or your own memory disagrees with what kubectl get nodes is actually printing, the terminal wins. Every time.

The cluster's healthier than it's been in a long time, and the tooling to keep it that way is boring now, on purpose. Thanks for putting up with the blips.