When iowait Lies: A Proxmox 9, QEMU 11, and Pinned-Kernel Postmortem

I first noticed it during a routine check on my largest Proxmox host — the same colocated Dell PowerEdge that runs my Tesla P4 LLM stack, dual E5-2650 v2 Xeons and 256 GB of RAM hosting around 25 VMs. The dashboard was sitting at 40-plus percent iowait. Nothing on the box actually felt slow, though, so my first guess was the boring one: a RAID patrol read — the controller quietly scrubbing the array in the background. Annoying, harmless, self-resolving. I left it alone.

It didn't resolve. Over the next few days the number stayed pinned, and the patrol-read theory started to wear thin — a scrub is supposed to finish. I rebooted the host at one point, half-expecting that to clear it; the number came right back. I wanted to confirm the theory one way or the other, but I'd never installed the controller tooling to actually read the patrol-read state, and with nothing visibly broken I let it ride. Probably still a patrol read. Probably.

After about a week of it sitting there, I'd run out of "probably." Something was wrong, and it was time to actually investigate. The fastest way I've found to think through a problem like this is to talk it out with an AI assistant — far less friction than wrestling the right question into a search engine — so I started by describing the setup.

The backstory: a RAID card that won't TRIM

The PERC H710 is a 2012-era hardware RAID controller — an LSI 2208 under Dell's badge — and the ten Samsung 850 PRO SSDs in this box all hang off it, presented as a single ~6.7 TB virtual disk with an LVM-thin pool on top. It was excellent in its day and still does its job, but it predates any notion that the disks behind it might be SSDs that care about being told when blocks are freed: it does not pass TRIM/UNMAP through to them. Whatever the guest filesystems discard, the controller swallows, and the drives never hear about it.

That matters because of how an SSD stays fast. To keep writes snappy, the drive sets aside some of its free space as a high-speed write cache — quick blocks it can absorb incoming writes into and tidy up afterward. That only works while it has free blocks to spare, and TRIM is what keeps them coming: when you delete a file, TRIM is the signal that tells the drive "this space is free again," topping up that pool of fast, ready-to-use blocks.

Without TRIM, the drive never hears that anything was freed. Every logical block it has ever been asked to store stays snagged: counted as live data until that same address is overwritten, even after whatever lived there is long gone as far as the filesystem is concerned. As more of the drive gets snagged, there's less and less free space left to feed that fast cache, and writes that used to land in it start falling through to the slow path instead. Nothing breaks; the drive just gets quietly slower as it fills up. On consumer drives like the 850 PRO, behind a controller that eats TRIM, it's a slow-motion problem: fine for a long time, then gradually not.

So a while back I'd built a workaround: over-provisioning, by hand. When the array was built I left roughly 100 GB on every drive unconfigured — a slice of each SSD the controller was told to simply never touch. Because the drives are never handed that space, each one always counts it as free: a permanent per-drive reserve to keep feeding that fast cache from, TRIM or no TRIM, so no drive ever runs completely dry. The one rule that makes it work behind a controller that won't pass UNMAP: the reserve has to stay genuinely untouched. A drive can never be told a block was freed, so the only blocks it knows are free are the ones it was never handed in the first place. Leave that 100 GB per drive carved out and empty, and every SSD always has somewhere to breathe.

Which is exactly why, when the iowait number climbed and wouldn't leave, the no-TRIM wall was such a believable suspect. This wasn't a failure mode I'd only read about — it was the specific weakness I'd already engineered around. The obvious question was whether the 100 GB I'd left on each drive had finally stopped being enough.

It turned out to be neither. The disks were fine, the controller was fine, and the cause was a kernel I'd pinned during a major upgrade and never moved off. Here's the full walk — each step that cleared a layer is a useful reminder of how to read a system that looks sick but isn't.

Rule one: iowait is not "disk busy"

The reflex with high iowait is to look at disk throughput. That's the trap. %iowait is not a measure of how busy your disk is — it's idle CPU time that happened to coincide with an outstanding I/O. It is, definitionally, a subset of idle. So the first command that matters isn't about megabytes per second; it's about latency:

iostat -x 1

Device   r/s    w/s   r_await  w_await  aqu-sz  %util
sda      582    578   0.77     0.94     1.04    30.4

Sub-millisecond write latency. Average queue depth around one. Thirty percent utilized. That is a coasting array, not a saturated one. Whatever was generating 40% iowait, it was not a slow or busy disk. The lesson is worth internalizing: read await (how long each I/O takes) and aqu-sz (how deep the queue is), and on a multi-disk array treat %util with suspicion. It only reports the fraction of time at least one request was in flight, so a device that services requests in parallel can read close to 100% while it still has plenty of headroom.
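Reading await and aqu-sz first is easy to turn into a habit with a small filter. What follows is a minimal sketch, not part of the original investigation: `iostat_latency` is a hypothetical helper name, and it assumes a sysstat version that labels the columns `r_await`, `w_await` and `aqu-sz` (older releases used different names, in which case it prints nothing).

```shell
# iostat_latency: read `iostat -x` output on stdin and print, per device,
# the figures worth reading first: await (latency) and aqu-sz (queue depth).
# Columns are located by header name because their positions differ between
# sysstat versions. Assumes the r_await / w_await / aqu-sz header names.
iostat_latency() {
    awk '
        $1 ~ /^Device:?$/ {
            split("", col)                      # forget any previous header
            for (i = 1; i <= NF; i++) col[$i] = i
            in_dev = 1
            next
        }
        NF == 0 { in_dev = 0; next }            # blank line ends the device table
        in_dev && ("r_await" in col) && ("w_await" in col) && ("aqu-sz" in col) {
            printf "%s r_await=%s w_await=%s aqu-sz=%s\n",
                $1, $(col["r_await"]), $(col["w_await"]), $(col["aqu-sz"])
        }
    '
}

# Usage on a live host (three one-second samples):
#   iostat -x 1 3 | iostat_latency
```

Looking columns up by name rather than position is a deliberate choice here: the column order of `iostat -x` has shifted between sysstat releases, so a hard-coded field number can silently report the wrong metric.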

Ruling out the hardware, one layer at a time

If the array is fast on average, maybe a single dying drive is dragging the stripe. SMART, read straight through the megaraid driver since the controller hides the disks:

for i in 0 1 2 3 4 5 6 7 8 9; do smartctl -a -d megaraid,$i /dev/sda; done

Device Model:           Samsung SSD 850 PRO 1TB
Reallocated_Sector_Ct   0
Wear_Leveling_Count     080      # ~20% of MLC endurance used
CRC_Error_Count         0
Airflow_Temperature     27 C

All ten drives reported the same clean bill: zero reallocated sectors, zero CRC errors, roughly 20% of their MLC write endurance consumed, and temperatures in the high twenties. No failing drive, no thermal throttling, no wear wall. And those sub-millisecond writes from the previous step had already cleared the controller: a PERC with a dead cache battery falls back to write-through, and write latency would be in the milliseconds, not the sub-millisecond figures I was seeing. The hardware was exonerated. The "no-TRIM wall" hypothesis was dead.
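Scanning ten full SMART reports by eye is tedious, so the three attributes that mattered here can be pulled out with a filter. This is a rough sketch under stated assumptions: `smart_summary` is a hypothetical helper, it expects the standard ATA attribute table that `smartctl -A` prints (name in column 2, normalized VALUE in column 4, RAW_VALUE in column 10), and it uses the attribute names shown in the output above. Other vendors name these attributes differently.

```shell
# smart_summary: read `smartctl -A` output on stdin and print only the
# attributes checked above. Assumes the standard ATA attribute table layout:
#   ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_summary() {
    awk '
        $2 == "Reallocated_Sector_Ct" || $2 == "CRC_Error_Count" {
            # For these counters the raw value is the number to watch.
            printf "%s raw=%s\n", $2, $10
        }
        $2 == "Wear_Leveling_Count" {
            # Normalized VALUE counts down from 100, so 100 - VALUE
            # approximates the share of rated endurance already used.
            printf "%s used=%d%%\n", $2, 100 - $4
        }
    '
}

# Usage on a live host, one report per physical disk behind the controller:
#   for i in 0 1 2 3 4 5 6 7 8 9; do
#       smartctl -A -d megaraid,$i /dev/sda | smart_summary
#   done
```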

The plot twist: pressure stall information

This is the point where I nearly declared the whole thing a cosmetic accounting artifact and walked away. Then I checked PSI — the kernel's pressure-stall accounting, which is far more honest than %iowait:

cat /proc/pressure/io

some avg300=81.30
full avg300=68.69

full=68.69 means that for roughly 69% of the trailing five-minute window, every non-idle task on the box was stalled on I/O at the same moment, with nothing able to make progress. That is a real, severe stall, not cosmetic at all. Which produced the central contradiction of the whole investigation: the disk serves I/O in under a millisecond and is barely utilized, yet the kernel insists everything is stalled waiting on it. Reconciling those two facts is the entire puzzle.

A quick check ruled out the obvious confounders. Memory pressure was flat zero, with 165 GB free and swap unused — no thrashing, no refault. CPU pressure was zero. So the stall was real, it was I/O, and it was neither the hardware nor memory.
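The PSI files all share one line format (`some` or `full`, followed by `avg10=`, `avg60=`, `avg300=` and `total=` fields), so a single small parser covers I/O, memory and CPU pressure alike. A minimal sketch, with `psi_value` as a hypothetical helper name:

```shell
# psi_value KIND WINDOW: print one figure from PSI text on stdin.
#   KIND   is "some" or "full"
#   WINDOW is avg10, avg60, avg300 or total
psi_value() {
    awk -v kind="$1" -v win="$2" '
        $1 == kind {
            # Each remaining field looks like name=value; print the one asked for.
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == win) print kv[2]
            }
        }
    '
}

# Usage on a live host: one line per resource.
#   for res in cpu memory io; do
#       printf "%s full avg300=%s\n" "$res" "$(psi_value full avg300 < /proc/pressure/$res)"
#   done
```

Checking all three resources side by side is what separates an I/O stall from memory thrashing that merely shows up as I/O.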

Who is actually doing the I/O

iotop -boa

Total writes across the entire host came to about 7 MB/s, which is trivial. But one VM dominated the rest by roughly ten to one: my LibreNMS instance, writing thousands of tiny fsync'd RRD and MySQL updates every polling cycle. Behind it, the k3s control plane's etcd, which fsyncs on every write. Both are classic small-synchronous-write workloads. That breakdown also quietly retired the patrol-read theory I'd started the week with: the I/O was coming from my own guests, not the controller scrubbing the array in the background. It explained the baseline (a constant trickle of tiny durable writes keeps I/O perpetually in flight) but not why each of those sub-millisecond writes was registering as a catastrophic, everything-is-blocked stall.

The real culprit: a mismatch I'd built myself

Then I looked at what I was actually running, which is the check I should have run first:

pveversion        # -> pve-manager/9.2.3
uname -r          # -> 6.5.13-5-pve

Proxmox VE 9.2.3 userland, on a 6.5 kernel from 2024. That kernel shipped with Proxmox 8. I was running a two-year-old kernel underneath a current hypervisor. Why?

proxmox-boot-tool kernel list

Pinned kernel:
6.5.13-5-pve

Automatically selected kernels:
7.0.6-2-pve
7.0.2-6-pve
6.17.13-13-pve
6.5.13-5-pve

Pinned. And I'd pinned it deliberately: an NVIDIA GPU-sharing driver on this host only worked against that 6.5 kernel, so the pin was the price of keeping the GPUs carved up across VMs. That's a perfectly ordinary reason to pin a kernel — the mistake wasn't pinning it, it was leaving the pin in place, unexamined, while the rest of the stack moved on. The dpkg logs put the 9.0 install at September 2025, so the kernel had been carried forward, pin intact, across every reboot since, while the userland kept advancing around it — most recently to QEMU 11. The high iowait itself only showed up over the past several days, which is the tell: the stale kernel was a latent condition for a while, but nothing went wrong until the other half of the pair moved far enough. Once it did, every reboot simply re-selected the same pinned kernel, which is why the stall survived each one I tried.
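A forgotten pin is easy to catch mechanically. The sketch below compares the pinned kernel against the newest installed one. It is an illustration rather than anything I ran at the time: `pin_check` is a hypothetical helper, it assumes the section headings shown in the listing above, and it relies on a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# pin_check: read `proxmox-boot-tool kernel list` output on stdin and report
# whether the pinned kernel is also the newest one installed.
# Assumes the "Pinned kernel:" and "Automatically selected kernels:" headings
# and a sort(1) with -V (version sort).
pin_check() {
    pc_list=$(cat)
    # The pinned version is the line right after its heading.
    pc_pinned=$(printf '%s\n' "$pc_list" |
        awk '/^Pinned kernel:/ { if ((getline) > 0) print $1; exit }')
    # Collect the automatically selected kernels and keep the highest version.
    pc_newest=$(printf '%s\n' "$pc_list" | awk '
        /^Automatically selected kernels:/ { on = 1; next }
        NF == 0 || /:$/                    { on = 0 }
        on                                 { print $1 }
    ' | sort -V | tail -n 1)

    if [ -z "$pc_pinned" ]; then
        echo "no kernel pinned"
    elif [ "$pc_pinned" = "$pc_newest" ]; then
        echo "pin $pc_pinned is the newest installed kernel"
    else
        echo "pin $pc_pinned differs from newest installed kernel $pc_newest"
    fi
}

# Usage on a live host:
#   proxmox-boot-tool kernel list | pin_check
```

Version sort matters because a plain lexical sort would rank 6.5 above 6.17.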

Meanwhile later Proxmox 9 updates had pulled in a brand-new QEMU 11, which routes every VM's disk I/O through Linux's io_uring asynchronous interface. You can read it right in the launch arguments:

qm showcmd 109 --pretty | grep -o '"aio":"[^"]*"'
# "aio":"io_uring"

So the shape of the problem was: a 2026-era QEMU issuing all guest I/O through io_uring, against a 2024 kernel's io_uring implementation. That interface evolved enormously across the intervening kernel releases. The new userland and the old kernel were a poor match, and the symptom was that every small synchronous write spent an absurd amount of time accounted as stalled, even though the flash underneath answered in under a millisecond. The disk scheduler was unchanged across the two kernels (mq-deadline in both), which rules the elevator out and points squarely at the I/O submission path.
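The single-VM check above extends to every guest on the host. A small sketch, assuming the command line carries the setting in the JSON-style `"aio":"..."` form shown in that output; `aio_modes` is a hypothetical helper, and the loop in the usage comment assumes `qm list` prints one header row followed by one row per VM with the VMID in the first column.

```shell
# aio_modes: print the distinct "aio" settings found in QEMU command-line
# text on stdin. Assumes the JSON-style "aio":"..." form.
aio_modes() {
    # Field 4 when splitting "aio":"io_uring" on double quotes is the mode.
    grep -oE '"aio":"[a-z_]+"' | cut -d '"' -f 4 | sort -u
}

# Usage on a live host: one line per VM.
#   for id in $(qm list | awk 'NR > 1 { print $1 }'); do
#       printf '%s: %s\n' "$id" "$(qm showcmd "$id" --pretty | aio_modes | tr '\n' ' ')"
#   done
```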

The fix, and the proof

The fix is to boot the kernel that matches the userland. Since the safe move on a passthrough host is to control exactly which kernel boots rather than letting it float, I pinned the current one explicitly:

proxmox-boot-tool kernel pin 7.0.6-2-pve
reboot

Same VMs, same LibreNMS fsync storm, measured a few hours later on the new kernel:

| Metric (identical workload) | kernel 6.5.13-5-pve | kernel 7.0.6-2-pve |
| --- | --- | --- |
| PSI io `full avg300` | 68.7% | 0.00% |
| PSI io `some avg300` | 81.3% | 0.00% |
| `%iowait` | ~43% | 0.14% |
| `sda %util` | ~28% | 2.8% |
| `sda w_await` | 0.94 ms | 0.58 ms |
| `sda` write rate | ~540 w/s | ~415 w/s |

The workload didn't change. The disks didn't change. The scheduler didn't change. The only variable was booting the kernel whose io_uring matched the QEMU driving it — and the stall evaporated. I/O pressure went from "everything is blocked two-thirds of the time" to a flat zero, and the array's own utilization dropped by an order of magnitude for the very same writes.

One honest caveat, since this is the internet: this is a natural before-and-after on the same machine, not a controlled A/B where I re-pin the old kernel to reproduce on demand. But the workload is identical, the hardware is identical, the scheduler is identical, and PSI full going to a flat zero is about as unambiguous as field evidence gets.

What I'd tell past me

iowait is a liar; trust PSI. /proc/pressure/io told the real story in both directions — first that there genuinely was a stall (when I was about to dismiss it), and then, by dropping to zero, that the fix had worked. It belongs in your first five commands, not your last.
Check kernel/userland alignment before you suspect hardware. uname -r against pveversion is a five-second check. It would have saved me a full sweep of SMART data, controller cache state, and a half-drafted plan to rebuild a storage array that was never the problem.
A pin is a debt. Pinning a kernel to satisfy a finicky driver — in my case an NVIDIA GPU-sharing driver that only built against 6.5 — is often unavoidable. But it's a loan you have to pay back. Leave it in place and your shiny new hypervisor ends up driving a years-old kernel, and the day the userland advances far enough, the mismatch surfaces.
Trust the running system over your assumptions. A patrol read, then the no-TRIM SSD wall, were both reasonable suspects given the symptom and the hardware — but the box ruled each out at every layer it could: latency, SMART, PSI. The sharpest reminder came near the end, when the AI I was working through it with confidently insisted there was no such thing as a Linux 7.0 kernel yet — and proxmox-boot-tool kernel list was already showing two signed 7.0.x builds installed on the box. Tools, assistants, the assumptions in your head: when any of them disagree with the output on the terminal, the terminal wins.

If your Proxmox dashboard is screaming iowait and your disks swear they're fine, check uname -r before you order new drives.