Note: This post was written with AI assistance (Claude). The experience, commands, and errors are real — the prose had help.
I run a 3-node Patroni-managed PostgreSQL cluster on my homelab. Many months ago I set it up following TechnoTim's PostgreSQL High Availability guide — if you haven't read it, go there first, it's the definitive walkthrough. This post picks up where that one leaves off: what it actually takes to upgrade that cluster from PostgreSQL 17 to 18.
This is not an official upgrade guide, and it's not a criticism of TechnoTim's setup. It's just a record of what I ran into on a Tuesday evening after work, and how I got through it. If you followed his guide, the quirks documented here are exactly the ones that will bite you.
Your mileage may vary — every cluster has its own quirks depending on how it was set up, what version of Patroni you're running, and what's drifted over time. Treat this as a reference to help you get on the right track, not a step-by-step guarantee.
The Starting Point
Three nodes: postgres-db-01 (primary), postgres-db-02, postgres-db-03. Patroni managing replication, etcd for distributed consensus, PostgreSQL 17 from the PGDG apt repository. Running fine for months.
I wasn't even planning to upgrade. I ran sudo apt upgrade and it all fell apart.
Issue 1 — `apt upgrade` Left dpkg Broken
Installing Patroni 4.1.0 caused its post-install script to restart patroni.service. The service timed out, dpkg errored, and I was left with:
E: Sub-process /usr/bin/dpkg returned an error code (1)
The actual cause was buried in the logs:
Got notification message from PID 438757, but reception only permitted for main PID 438743
Patroni starts postgres as a child process. Postgres sends the sd_notify ready signal to systemd, but the unit file only allows the main Patroni PID to notify — so systemd never sees "ready", times out, and marks the service as failed. Patroni was actually running fine the whole time.
Fix: Add NotifyAccess=all on all three nodes:
sudo mkdir -p /etc/systemd/system/patroni.service.d/
sudo tee /etc/systemd/system/patroni.service.d/override.conf <<EOF
[Service]
NotifyAccess=all
TimeoutStartSec=120
EOF
sudo systemctl daemon-reload
sudo dpkg --configure -a
Do this on all three nodes before continuing — they'll all hit the same issue when Patroni restarts.
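To confirm the override actually took effect, you can ask systemd for the property directly (`NotifyAccess` is a standard unit property, so `systemctl show` will report it):

```shell
# Verify the drop-in is active — should print "NotifyAccess=all"
systemctl show patroni -p NotifyAccess
```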
The Actual Upgrade
A Patroni cluster can't just apt upgrade through a major PostgreSQL version. The data directory has to be migrated with pg_upgrade first. Here's what the process looks like, including every gotcha I hit along the way.
Step 1 — Stop Patroni on All Nodes
sudo systemctl stop patroni
Make sure all three nodes are stopped before proceeding.
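A quick way to double-check each node before moving on (a sketch — `pgrep` exits non-zero when nothing matches, so both lines should come back quiet):

```shell
# Both checks should confirm nothing is running on this node
systemctl is-active patroni || true               # expect "inactive"
pgrep -u postgres -x postgres || echo "no postgres processes"
```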
Step 2 — Install PostgreSQL 18 on All Nodes
sudo apt install -y postgresql-18
At this point both /usr/lib/postgresql/17 and /usr/lib/postgresql/18 exist on each node. Nothing is running yet.
Step 3 — Prepare for `pg_upgrade` on the Primary
pg_upgrade writes output files to the current working directory, and the running user needs write access there. Running it from /root or your home directory fails immediately:
You must have read and write access in the current directory.
Failure, exiting
The fix is to create a dedicated working directory owned by the postgres user before running anything:
sudo mkdir -p /var/lib/postgresql/pg_upgrade_tmp
sudo mkdir -p /var/lib/postgresql/data_new
sudo chown postgres:postgres /var/lib/postgresql/pg_upgrade_tmp
sudo chown postgres:postgres /var/lib/postgresql/data_new
cd /var/lib/postgresql/pg_upgrade_tmp
Always cd into that directory before running pg_upgrade. It inherits your current directory, and if that's somewhere the postgres user can't write, you'll hit the error again.
This is also the point where the old PG17 data directory gets moved aside (mine went from /var/lib/postgresql/data to /var/lib/postgresql/data_pg17_backup), so pg_upgrade can read it as the source (-d) while writing the new cluster into data_new (-D).
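A small guard before invoking pg_upgrade saves a failed run. This is a sketch: `test -w`, run as the postgres user, checks whether that user can write to the working directory:

```shell
# Fail fast if the working directory isn't writable by postgres
workdir=/var/lib/postgresql/pg_upgrade_tmp
sudo -u postgres test -w "$workdir" && echo "ok: $workdir is writable" \
  || echo "fix ownership on $workdir before running pg_upgrade"
```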
Step 4 — Fix the `pg_hba.conf` Path
This one tripped me up. The TechnoTim guide uses a custom data_dir path in Patroni. My config had data_dir: /var/lib/postgresql/data, and the postgresql.conf inside that directory had hba_file hardcoded to that same absolute path.
After renaming the directory for the upgrade, pg_upgrade briefly starts the old PG17 server to read its catalog — and it couldn't find pg_hba.conf:
FATAL: could not load /var/lib/postgresql/data/pg_hba.conf
Update the path in postgresql.conf to point at wherever the backup actually lives:
sudo sed -i "s|hba_file.*|hba_file = '/var/lib/postgresql/data_pg17_backup/pg_hba.conf'|" \
/var/lib/postgresql/data_pg17_backup/postgresql.conf
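Worth verifying the substitution landed before moving on:

```shell
# Should now show the backup path
grep hba_file /var/lib/postgresql/data_pg17_backup/postgresql.conf
```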
Step 5 — Add a Local Trust Entry to `pg_hba.conf`
Even with the path corrected, pg_upgrade still couldn't connect to the PG17 server it started:
FATAL: no pg_hba.conf entry for host "[local]", user "postgres", database "template1", no encryption
The pg_hba.conf from the TechnoTim guide only has hostssl entries for replication — there's no local trust entry for the postgres superuser. pg_upgrade needs to connect locally to read the old catalog.
Add one temporarily at the top of the file:
sudo sed -i '1s/^/local all postgres trust\n/' \
/var/lib/postgresql/data_pg17_backup/pg_hba.conf
This is temporary — it only lives in the backup directory and gets left behind once the upgrade is done.
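A quick sanity check that the entry really landed on the first line:

```shell
# Should print: local all postgres trust
head -1 /var/lib/postgresql/data_pg17_backup/pg_hba.conf
```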
Step 6 — Run `pg_upgrade`
Always do the dry run first:
sudo -u postgres /usr/lib/postgresql/18/bin/pg_upgrade \
-b /usr/lib/postgresql/17/bin \
-B /usr/lib/postgresql/18/bin \
-d /var/lib/postgresql/data_pg17_backup \
-D /var/lib/postgresql/data_new \
--check
When you see *Clusters are compatible*, run it for real:
sudo -u postgres /usr/lib/postgresql/18/bin/pg_upgrade \
-b /usr/lib/postgresql/17/bin \
-B /usr/lib/postgresql/18/bin \
-d /var/lib/postgresql/data_pg17_backup \
-D /var/lib/postgresql/data_new
*Upgrade Complete* is what you're after. The output walks through each step as it goes — it takes a minute or two depending on database size.
Step 7 — Swap Data Directories on the Primary
sudo -u postgres mv /var/lib/postgresql/data /var/lib/postgresql/data_empty
sudo -u postgres mv /var/lib/postgresql/data_new /var/lib/postgresql/data
The upgraded PG18 data is now in the expected location.
Step 8 — Update `bin_dir` in Patroni Config on All Nodes
On all three nodes, update the binary path in config.yml:
sudo sed -i 's|postgresql/17/bin|postgresql/18/bin|g' /etc/patroni/config.yml
grep bin_dir /etc/patroni/config.yml # verify
Step 9 — Wipe Replica Data Directories
Patroni will re-clone the replicas from the upgraded primary automatically. You just need to give it a clean directory to work with. On postgres-db-02 and postgres-db-03:
sudo rm -rf /var/lib/postgresql/data
sudo mkdir -p /var/lib/postgresql/data
sudo chmod 700 /var/lib/postgresql/data
sudo chown postgres:postgres /var/lib/postgresql/data
The chmod 700 matters — I forgot it the first time and got this:
FATAL: data directory "/var/lib/postgresql/data" has invalid permissions
DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
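A one-liner catches this before Patroni does (`stat -c '%a %U'` prints the octal mode and owner on GNU coreutils):

```shell
# Expect: 700 postgres
stat -c '%a %U' /var/lib/postgresql/data
```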
Issue 2 — System ID Mismatch After Upgrade
With everything in place, starting Patroni on the primary failed immediately:
CRITICAL: system ID mismatch, node postgres-db-01 belongs to a different cluster:
3829147562094871203 != 6104938271650349817
Here's what happened: pg_upgrade runs initdb to initialize the new PG18 cluster, which generates a new system identifier. Etcd still had the old ID registered from the PG17 cluster. Patroni sees the mismatch and refuses to start — it's doing the right thing, but you have to clear the stale state out of etcd manually.
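You can see the mismatch for yourself by comparing the identifiers of the old and new data directories with pg_controldata (paths as used in the steps above):

```shell
# The two identifiers will differ — pg_upgrade's initdb generated a fresh one
sudo -u postgres /usr/lib/postgresql/17/bin/pg_controldata /var/lib/postgresql/data_pg17_backup | grep 'system identifier'
sudo -u postgres /usr/lib/postgresql/18/bin/pg_controldata /var/lib/postgresql/data | grep 'system identifier'
```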
Stop Patroni on all nodes first, then clear all cluster keys:
sudo etcdctl \
--endpoints=https://10.0.0.11:2379,https://10.0.0.12:2379,https://10.0.0.13:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/etcd.crt \
--key=/etc/etcd/ssl/etcd.key \
del /service/postgresql-cluster --prefix
Verify it returns nothing before continuing:
sudo etcdctl \
--endpoints=https://10.0.0.11:2379,https://10.0.0.12:2379,https://10.0.0.13:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/etcd.crt \
--key=/etc/etcd/ssl/etcd.key \
get /service/postgresql-cluster --prefix --keys-only
An empty response means you're clear to proceed.
Bringing the Cluster Back Up
Start the primary first and wait for it to elect itself leader:
# On postgres-db-01:
sudo systemctl start patroni
patronictl -c /etc/patroni/config.yml --insecure list
Wait until you see Leader in the Role column. Don't start the replicas until the primary is stable — they need something to clone from.
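If you're scripting this rather than eyeballing it, a small polling loop (a sketch, not part of my original run) avoids starting the replicas too early:

```shell
# Poll until patronictl reports a Leader, then it's safe to start replicas
until patronictl -c /etc/patroni/config.yml --insecure list | grep -q 'Leader'; do
  echo "waiting for leader election..."
  sleep 5
done
echo "leader elected"
```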
Once the primary is up, start the replicas. Patroni will handle cloning automatically:
# On postgres-db-02 and postgres-db-03:
sudo systemctl start patroni
sudo journalctl -fu patroni.service # watch the clone progress
You'll see the base backup streaming in the journal. Give it a minute, then verify the cluster is healthy:
patronictl -c /etc/patroni/config.yml --insecure list
Issue 3 — Apps Failing After Upgrade
With the cluster back up and PG18 running, my Node.js apps started throwing connection errors. The issue was that pg_hba.conf only had hostssl entries — no plain host entries for app connections that don't use SSL.
The natural instinct is to edit the bootstrap.pg_hba section of config.yml, but that section is only applied during initial cluster creation. Editing it on a running cluster does nothing.
The right tool is patronictl edit-config, which stores config in etcd and applies it cluster-wide:
patronictl -c /etc/patroni/config.yml --insecure edit-config
Add your pg_hba entries there. Also: always use --insecure with patronictl when you're using self-signed certificates — otherwise every command fails with:
SSLCertVerificationError: certificate verify failed: self-signed certificate
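For reference, the shape of what goes into edit-config looks roughly like this. Patroni merges the postgresql.pg_hba list into the live pg_hba.conf; the subnet, auth method, and replicator user here are placeholders for whatever your network and setup actually use:

```yaml
postgresql:
  pg_hba:
    - local all postgres peer
    - host all all 10.0.0.0/24 scram-sha-256          # plain-TCP app connections (adjust subnet)
    - hostssl replication replicator 10.0.0.0/24 scram-sha-256
```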
Post-Upgrade Cleanup
Run the recommended post-upgrade stats commands:
sudo -u postgres /usr/lib/postgresql/18/bin/vacuumdb --all --analyze-in-stages --missing-stats-only
sudo -u postgres /usr/lib/postgresql/18/bin/vacuumdb --all --analyze-only
Then clean up the leftover files:
# On postgres-db-01:
cd /var/lib/postgresql/pg_upgrade_tmp
sudo -u postgres ./delete_old_cluster.sh
sudo rm -rf /var/lib/postgresql/data_pg17_backup
sudo rm -rf /var/lib/postgresql/data_empty
sudo rm -rf /var/lib/postgresql/pg_upgrade_tmp
# On all nodes:
sudo apt remove postgresql-17
sudo apt autoremove
Summary of Gotchas
| # | Problem | Fix |
|---|---------|-----|
| 1 | apt upgrade breaks dpkg — Patroni timeout | Add NotifyAccess=all systemd override on all nodes |
| 2 | pg_upgrade needs writable working directory | cd to a postgres-owned dir first |
| 3 | pg_hba.conf path hardcoded in postgresql.conf | Update hba_file to point at the backup directory |
| 4 | No local trust entry for postgres user | Add local all postgres trust at top of pg_hba.conf |
| 5 | System ID mismatch — etcd has old cluster ID | Clear all etcd keys under /service/postgresql-cluster |
| 6 | Replica data directory permissions too open | chmod 700 /var/lib/postgresql/data on replicas |
| 7 | Apps rejected after upgrade — no non-SSL pg_hba entries | Use patronictl edit-config, not bootstrap.pg_hba |
The cluster runs PG18 without issue now — all three nodes healthy, replicas cloned cleanly, apps reconnected. A Tuesday evening of troubleshooting, start to finish. Hopefully this saves you most of that.