Note: This post was written with AI assistance (Claude). The experience, commands, and errors are real — the prose had help.


I run a 3-node Patroni-managed PostgreSQL cluster on my homelab. Many months ago I set it up following TechnoTim's PostgreSQL High Availability guide — if you haven't read it, go there first, it's the definitive walkthrough. This post picks up where that one leaves off: what it actually takes to upgrade that cluster from PostgreSQL 17 to 18.

This is not an official upgrade guide, and it's not a criticism of TechnoTim's setup. It's just a record of what I ran into on a Tuesday evening after work, and how I got through it. If you followed his guide, the specific quirks of that configuration are exactly what will bite you here.

Your mileage may vary — every cluster has its own quirks depending on how it was set up, what version of Patroni you're running, and what's drifted over time. Treat this as a reference to help you get on the right track, not a step-by-step guarantee.


The Starting Point

Three nodes: postgres-db-01 (primary), postgres-db-02, postgres-db-03. Patroni managing replication, etcd for distributed consensus, PostgreSQL 17 from the PGDG apt repository. Running fine for months.

I wasn't even planning to upgrade. I ran sudo apt upgrade and it all fell apart.


Issue 1 — `apt upgrade` Left dpkg Broken

Installing Patroni 4.1.0 caused its post-install script to restart patroni.service. The service timed out, dpkg errored, and I was left with:

E: Sub-process /usr/bin/dpkg returned an error code (1)

The actual cause was buried in the logs:

Got notification message from PID 438757, but reception only permitted for main PID 438743

Patroni starts postgres as a child process. Postgres sends the sd_notify ready signal to systemd, but the unit file only allows the main Patroni PID to notify — so systemd never sees "ready", times out, and marks the service as failed. Patroni was actually running fine the whole time.

Fix: Add NotifyAccess=all (plus a longer start timeout, since Patroni can be slow to signal ready) via a systemd override on all three nodes:

sudo mkdir -p /etc/systemd/system/patroni.service.d/
sudo tee /etc/systemd/system/patroni.service.d/override.conf <<EOF
[Service]
NotifyAccess=all
TimeoutStartSec=120
EOF
sudo systemctl daemon-reload
sudo dpkg --configure -a

Do this on all three nodes before continuing — they'll all hit the same issue when Patroni restarts.
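To confirm the override actually took effect after the daemon-reload, you can ask systemd directly:

```shell
# Should report NotifyAccess=all once the drop-in is loaded
systemctl show patroni -p NotifyAccess
```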


The Actual Upgrade

A Patroni cluster can't just apt upgrade through a major PostgreSQL version. The data directory has to be migrated with pg_upgrade first. Here's what the process looks like, including every gotcha I hit along the way.


Step 1 — Stop Patroni on All Nodes

sudo systemctl stop patroni

Make sure all three nodes are stopped before proceeding.
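A quick way to confirm each node is actually down before proceeding (the pgrep check is just a sanity sketch):

```shell
sudo systemctl is-active patroni              # prints "inactive" once stopped
pgrep -u postgres postgres || echo "no postgres processes running"
```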


Step 2 — Install PostgreSQL 18 on All Nodes

sudo apt install -y postgresql-18

At this point both /usr/lib/postgresql/17 and /usr/lib/postgresql/18 exist on each node. Nothing is running yet.


Step 3 — Prepare for `pg_upgrade` on the Primary

pg_upgrade writes output files to the current working directory, and the running user needs write access there. Running it from /root or your home directory fails immediately:

You must have read and write access in the current directory.
Failure, exiting

The fix is to create a dedicated working directory owned by the postgres user before running anything:

sudo mkdir -p /var/lib/postgresql/pg_upgrade_tmp
sudo mkdir -p /var/lib/postgresql/data_new
sudo chown postgres:postgres /var/lib/postgresql/pg_upgrade_tmp
sudo chown postgres:postgres /var/lib/postgresql/data_new
cd /var/lib/postgresql/pg_upgrade_tmp

Always cd into that directory before running pg_upgrade. It inherits your current directory, and if that's somewhere the postgres user can't write, you'll hit the error again.
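One step worth making explicit: the old PG17 data directory has to be moved aside to the backup path the later commands reference. In this walkthrough that path is /var/lib/postgresql/data_pg17_backup, so the move looks like:

```shell
# Move the PG17 data directory to the backup location pg_upgrade reads from
sudo -u postgres mv /var/lib/postgresql/data /var/lib/postgresql/data_pg17_backup
```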


Step 4 — Fix the `pg_hba.conf` Path

This one tripped me up. The TechnoTim guide uses a custom data_dir path in Patroni. My config had data_dir: /var/lib/postgresql/data, and the postgresql.conf inside that directory had hba_file hardcoded to a pg_hba.conf under that same path.

After the data directory was renamed to data_pg17_backup for the upgrade, pg_upgrade briefly started the old PG17 server to read its catalog — and the server couldn't find pg_hba.conf:

FATAL:  could not load /var/lib/postgresql/data/pg_hba.conf

Update the path in postgresql.conf to point at wherever the backup actually lives:

sudo sed -i "s|hba_file.*|hba_file = '/var/lib/postgresql/data_pg17_backup/pg_hba.conf'|" \
  /var/lib/postgresql/data_pg17_backup/postgresql.conf

Step 5 — Add a Local Trust Entry to `pg_hba.conf`

Even with the path corrected, pg_upgrade still couldn't connect to the PG17 server it started:

FATAL:  no pg_hba.conf entry for host "[local]", user "postgres", database "template1", no encryption

The pg_hba.conf from the TechnoTim guide only has hostssl entries for replication — there's no local trust entry for the postgres superuser. pg_upgrade needs to connect locally to read the old catalog.

Add one temporarily at the top of the file:

sudo sed -i '1s/^/local   all             postgres                                trust\n/' \
  /var/lib/postgresql/data_pg17_backup/pg_hba.conf

This is temporary — it only lives in the backup directory and gets left behind once the upgrade is done.
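If you'd rather sanity-check that sed expression before touching the real file, it behaves like this on a scratch copy (the demo path and contents here are made up):

```shell
# Hypothetical scratch file standing in for pg_hba.conf
printf 'hostssl all all 10.0.0.0/24 scram-sha-256\n' > /tmp/pg_hba_demo.conf
sed -i '1s/^/local   all             postgres                                trust\n/' /tmp/pg_hba_demo.conf
head -n1 /tmp/pg_hba_demo.conf   # the trust entry is now the first line
```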


Step 6 — Run `pg_upgrade`

Always do the dry run first:

sudo -u postgres /usr/lib/postgresql/18/bin/pg_upgrade \
  -b /usr/lib/postgresql/17/bin \
  -B /usr/lib/postgresql/18/bin \
  -d /var/lib/postgresql/data_pg17_backup \
  -D /var/lib/postgresql/data_new \
  --check

When you see *Clusters are compatible*, run it for real:

sudo -u postgres /usr/lib/postgresql/18/bin/pg_upgrade \
  -b /usr/lib/postgresql/17/bin \
  -B /usr/lib/postgresql/18/bin \
  -d /var/lib/postgresql/data_pg17_backup \
  -D /var/lib/postgresql/data_new

*Upgrade Complete* is what you're after. The output walks through each step as it goes — it takes a minute or two depending on database size.


Step 7 — Swap Data Directories on the Primary

sudo -u postgres mv /var/lib/postgresql/data /var/lib/postgresql/data_empty
sudo -u postgres mv /var/lib/postgresql/data_new /var/lib/postgresql/data

The upgraded PG18 data is now in the expected location.
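A one-line sanity check that the swap worked — PG_VERSION is a small marker file initdb writes into the data directory:

```shell
# Should print 18 now that the upgraded directory is in place
sudo -u postgres cat /var/lib/postgresql/data/PG_VERSION
```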


Step 8 — Update `bin_dir` in Patroni Config on All Nodes

On all three nodes, update the binary path in config.yml:

sudo sed -i 's|postgresql/17/bin|postgresql/18/bin|g' /etc/patroni/config.yml
grep bin_dir /etc/patroni/config.yml  # verify

Step 9 — Wipe Replica Data Directories

Patroni will re-clone the replicas from the upgraded primary automatically. You just need to give it a clean directory to work with. On postgres-db-02 and postgres-db-03:

sudo rm -rf /var/lib/postgresql/data
sudo mkdir -p /var/lib/postgresql/data
sudo chmod 700 /var/lib/postgresql/data
sudo chown postgres:postgres /var/lib/postgresql/data

The chmod 700 matters — I forgot it the first time and got this:

FATAL: data directory "/var/lib/postgresql/data" has invalid permissions
DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).

Issue 2 — System ID Mismatch After Upgrade

With everything in place, starting Patroni on the primary failed immediately:

CRITICAL: system ID mismatch, node postgres-db-01 belongs to a different cluster:
3829147562094871203 != 6104938271650349817

Here's what happened: pg_upgrade runs initdb to initialize the new PG18 cluster, which generates a new system identifier. Etcd still had the old ID registered from the PG17 cluster. Patroni sees the mismatch and refuses to start — it's doing the right thing, but you have to clear the stale state out of etcd manually.
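If you want to see the identifier Patroni is comparing against etcd, pg_controldata prints it (the binary path assumes the PGDG layout used throughout this post):

```shell
sudo -u postgres /usr/lib/postgresql/18/bin/pg_controldata /var/lib/postgresql/data \
  | grep 'system identifier'
```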

Stop Patroni on all nodes first, then clear all cluster keys:

sudo etcdctl \
  --endpoints=https://10.0.0.11:2379,https://10.0.0.12:2379,https://10.0.0.13:2379 \
  --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/etcd.crt \
  --key=/etc/etcd/ssl/etcd.key \
  del /service/postgresql-cluster --prefix

Verify it returns nothing before continuing:

sudo etcdctl \
  --endpoints=https://10.0.0.11:2379,https://10.0.0.12:2379,https://10.0.0.13:2379 \
  --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/etcd.crt \
  --key=/etc/etcd/ssl/etcd.key \
  get /service/postgresql-cluster --prefix --keys-only

An empty response means you're clear to proceed.


Bringing the Cluster Back Up

Start the primary first and wait for it to elect itself leader:

# On postgres-db-01:
sudo systemctl start patroni
patronictl -c /etc/patroni/config.yml --insecure list

Wait until you see Leader in the Role column. Don't start the replicas until the primary is stable — they need something to clone from.

Once the primary is up, start the replicas. Patroni will handle cloning automatically:

# On postgres-db-02 and postgres-db-03:
sudo systemctl start patroni
sudo journalctl -fu patroni.service  # watch the clone progress

You'll see the base backup streaming in the journal. Give it a minute, then verify the cluster is healthy:

patronictl -c /etc/patroni/config.yml --insecure list
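Healthy output looks roughly like this — the exact columns vary by Patroni version, and the TL and lag values will differ:

```
+ Cluster: postgresql-cluster ------+---------+-----------+----+-----------+
| Member         | Host      | Role    | State     | TL | Lag in MB |
+----------------+-----------+---------+-----------+----+-----------+
| postgres-db-01 | 10.0.0.11 | Leader  | running   |  2 |           |
| postgres-db-02 | 10.0.0.12 | Replica | streaming |  2 |         0 |
| postgres-db-03 | 10.0.0.13 | Replica | streaming |  2 |         0 |
+----------------+-----------+---------+-----------+----+-----------+
```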

Issue 3 — Apps Failing After Upgrade

With the cluster back up and PG18 running, my Node.js apps started throwing connection errors. The issue was that pg_hba.conf only had hostssl entries — no plain host entries for app connections that don't use SSL.

The natural instinct is to edit the bootstrap.pg_hba section of config.yml, but that section is only applied during initial cluster creation. Editing it on a running cluster does nothing.

The right tool is patronictl edit-config, which stores config in etcd and applies it cluster-wide:

patronictl -c /etc/patroni/config.yml --insecure edit-config

Add your pg_hba entries there. Also: always use --insecure with patronictl when you're using self-signed certificates — otherwise every command fails with:

SSLCertVerificationError: certificate verify failed: self-signed certificate
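For reference, the pg_hba list you add through edit-config lives under the postgresql key of the dynamic configuration. Something like this — the CIDR, users, and auth methods here are illustrative and need to match your own setup, and note that this list replaces the active pg_hba.conf wholesale, so the replication entries must be included too:

```yaml
postgresql:
  pg_hba:
    - local   all         postgres                   trust
    - host    all         all        10.0.0.0/24     scram-sha-256
    - hostssl replication replicator 10.0.0.0/24     scram-sha-256
```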

Post-Upgrade Cleanup

Run the recommended post-upgrade stats commands:

sudo -u postgres /usr/lib/postgresql/18/bin/vacuumdb --all --analyze-in-stages --missing-stats-only
sudo -u postgres /usr/lib/postgresql/18/bin/vacuumdb --all --analyze-only

Then clean up the leftover files:

# On postgres-db-01:
cd /var/lib/postgresql/pg_upgrade_tmp
sudo -u postgres ./delete_old_cluster.sh
sudo rm -rf /var/lib/postgresql/data_pg17_backup
sudo rm -rf /var/lib/postgresql/data_empty
sudo rm -rf /var/lib/postgresql/pg_upgrade_tmp

# On all nodes:
sudo apt remove postgresql-17
sudo apt autoremove

Summary of Gotchas

| # | Problem | Fix |
|---|---------|-----|
| 1 | `apt upgrade` breaks dpkg — Patroni timeout | Add `NotifyAccess=all` systemd override on all nodes |
| 2 | `pg_upgrade` needs a writable working directory | `cd` to a postgres-owned dir first |
| 3 | `pg_hba.conf` path hardcoded in `postgresql.conf` | Update `hba_file` to point at the backup directory |
| 4 | No local trust entry for `postgres` user | Add `local all postgres trust` at top of `pg_hba.conf` |
| 5 | System ID mismatch — etcd has old cluster ID | Clear all etcd keys under `/service/postgresql-cluster` |
| 6 | Replica data directory permissions too open | `chmod 700 /var/lib/postgresql/data` on replicas |
| 7 | Apps rejected after upgrade — no non-SSL `pg_hba.conf` entries | Use `patronictl edit-config`, not `bootstrap.pg_hba` |

The cluster runs PG18 without issue now — all three nodes healthy, replicas cloned cleanly, apps reconnected. A Tuesday evening of troubleshooting, start to finish. Hopefully this saves you most of that.