Joshua Mein

Unattended-Upgrades Was Sending Mail to Gmail for Six Weeks. Gmail Was Silently Dropping All of It.

2026-05-21T22:00:00+01:00

Six weeks ago I rolled out unattended-upgrades across every Linux host in my homelab. 34 servers, one Ansible playbook, msmtp pointed at Brevo as the relay. The deploy went green. Every host’s /var/log/msmtp.log showed smtpstatus=250 ... exitcode=EX_OK for every send. Job done.

To be clear up front: the actual patching worked perfectly the whole time. Zabbix was scraping apt package counts and the systemd timers on every host, so I could see the daily 05:00 / 06:00 runs firing, packages getting upgraded, and reboots happening on schedule. That part was never in doubt. The bit that quietly didn’t work was the mail report on change, which was supposed to land in my Gmail every time a host actually upgraded something so I’d see what changed. “Confirm those reports actually arrive” sat as a backlog item for six weeks, because the rest of the chain was so visibly healthy that it was easy to keep deferring.

When I finally got around to it today, I went looking for one of those reports in Gmail. There were none. Not in Inbox. Not in Spam. Not in All Mail. Not one, ever, since the day of the deploy.

This is the story of how a single missing line of config managed to look perfectly healthy at every layer except the one that actually mattered.

The Setup

For context: the SMTP backbone here is the same one I wrote about in Why I Switched From Gmail to Brevo for All My Homelab Email Alerts. Every host has msmtp-mta installed with a 600-permission /etc/msmtprc pointing at smtp-relay.brevo.com:587. UU’s Mail directive sends to a Gmail address. The path is:

unattended-upgrades  -->  /usr/sbin/sendmail (msmtp symlink)  -->  Brevo SMTP  -->  Gmail

The audit started as “are these even working?” and ended somewhere different. The first pass was easy: SSH to every host, send a tagged test email through that host’s own msmtp, confirm smtpstatus=250 post-send, write a per-group results file. 32 of the 34 hosts passed. One host was missing the msmtp-mta package entirely (a separate problem, fix queued). One was offline (a laptop PBS, expected).

The 32-pass result was correct as far as it went. Brevo was happily accepting every single message.

What I didn’t think to test was the actual delivery. None of those test mails were ever opened by a human. They were just signals that Brevo’s SMTP server was returning 250.

The question I should have been asking: are these even arriving?

DNS First, Because That’s the Easy Box to Tick

If Brevo is queuing messages but Gmail isn’t delivering them, the first place to look is whether the sending domain is even in good standing.

dig +short TXT yourdomain.example
dig +short TXT _dmarc.yourdomain.example
dig +short TXT selector1._domainkey.yourdomain.example

What I found:

Record	Value
SPF	none
DMARC	`v=DMARC1; p=none; rua=mailto:rua@dmarc.brevo.com`
DKIM (`selector1._domainkey`)	present
MX	none

So:

No SPF. Means SPF alignment can’t help us. Whatever Gmail makes of authenticity has to come from DKIM.
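DMARC is p=none. That asks receivers to take no DMARC-specific action, so Gmail won’t bounce a misaligned message on policy grounds. Its own filtering is still free to send it to spam or drop it on the floor, and the only DMARC feedback goes to rua@dmarc.brevo.com as an aggregate report. No NDR comes back to me.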
DKIM is set up correctly by Brevo. They sign with their own keys for d=yourdomain.example because I delegated the selector to them when I switched.

That mostly absolves DNS. Brevo’s DKIM signing was working. So why doesn’t Gmail like the messages?

The A/B That Settled It

I sent two emails from the same host, through the same msmtp config, to the same Gmail address, about a second apart. The only difference was the message-level From: header.

# Test 1: From: root
{
  echo "From: root"
  echo "To: you@gmail.example"
  echo "Subject: [TEST] From: root"
  echo ""
  echo "This is what unattended-upgrades sends by default."
} | /usr/sbin/sendmail -t -oi

# Test 2: From: a real address on the sending domain
{
  echo "From: unattended-upgrades@yourdomain.example"
  echo "To: you@gmail.example"
  echo "Subject: [TEST] From: real-address"
  echo ""
  echo "This is what UU sends with Sender configured."
} | /usr/sbin/sendmail -t -oi

Both came back from msmtp with smtpstatus=250 ... exitcode=EX_OK. Brevo accepted both.

Only the second one arrived in Gmail.

The first one, with From: root, just disappeared.

So Where Was the `From: root` Coming From?

I had to actually open /usr/bin/unattended-upgrade (a Python script despite the name) and grep around. The relevant code is on or near line 1506 of unattended-upgrades 2.9.x:

from_email = apt_pkg.config.find("Unattended-Upgrade::Sender", "root")

Read it once and the bug is right there. UU calls apt_pkg.config.find with the directive name and a default. The default is the string "root". Literal root. No @, no domain.

When Unattended-Upgrade::Sender is unconfigured, UU writes From: root into the message headers before piping the whole thing into /usr/sbin/sendmail. msmtp picks it up, hands it to Brevo. Brevo doesn’t care about the message-level From:; it cares about the SMTP envelope MAIL FROM: (unattended-upgrades@yourdomain.example, from msmtprc), DKIM-signs the message for d=yourdomain.example, and queues it.

Gmail then receives a message that says, in the header:

From: root

And starts asking awkward questions:

RFC 5322 says the From: header must contain at least one mailbox address with a domain. Bare root is not a valid mailbox. That alone is a strong negative signal.
DMARC alignment compares the header From: domain against the DKIM-signed d= domain. Header domain is empty (or whatever Gmail decides to do with root). DKIM d= is yourdomain.example. Alignment fails.
With DMARC p=none, the domain asks for no enforcement, so the outcome is left entirely to Gmail’s own filtering: “don’t bounce, just decide”. Gmail decided. The message is gone.

This is also why the dropped messages don’t appear in Spam. Spam-foldering is a deliberate “this is suspicious but we’ll show it to you anyway” decision. A malformed From: that fails DMARC under p=none can be dropped before it ever gets to a folder.
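
To make the alignment failure concrete, here is a minimal sketch of the comparison. It rests on loud assumptions: from_domain and dmarc_aligned are hypothetical helper names, folded headers are ignored, and the check is exact-match only (real DMARC relaxed alignment compares organizational domains). It says nothing about how Gmail actually weighs the result.

```shell
# Hypothetical helpers for illustration; not part of UU, msmtp, or Gmail.

# Print the domain of the first From: header in a raw message on stdin.
# Prints an empty line when the header carries no "@domain" (e.g. "From: root").
from_domain() {
    awk '
        /^$/ { exit }                        # blank line ends the header block
        tolower($0) ~ /^from:/ {
            line = $0
            sub(/^[^:]*:[ \t]*/, "", line)   # drop the "From:" label
            if (match(line, /<[^>]*>/))      # "Display Name <addr>" form
                line = substr(line, RSTART + 1, RLENGTH - 2)
            if (index(line, "@") == 0) { print ""; exit }
            sub(/^.*@/, "", line)            # keep only what follows the last @
            sub(/[ \t\r]+$/, "", line)
            print tolower(line)
            exit
        }'
}

# Simplified strict alignment: $1 = header From: domain, $2 = DKIM d= domain.
# Succeeds only when the From: domain is non-empty and identical to d=.
dmarc_aligned() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}
```

Fed the two A/B messages, the From: root one yields an empty domain, so dmarc_aligned fails against yourdomain.example, while the second yields yourdomain.example and passes.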

Why Did the 2026-04-07 Deploy Validation Miss This?

The validation criteria for the deploy were:

apt-daily-upgrade.timer enabled and active
/etc/msmtprc correct, mode 600
/etc/apt/apt.conf.d/50unattended-upgrades present with Mail and MailReport
A test send from each host returns msmtp exit 0 with Brevo smtpstatus=250

Every one of those was true on every host. Zabbix on top of that was telling me that the patches were actually landing. So at every monitoring layer, the deploy looked fine.

The thing nobody validated was the very last hop: “open the destination inbox and confirm the on-change report is actually there.” That step sat as a backlog item because the surrounding signal was so good. Hosts were patching themselves, Zabbix was happy, msmtp was returning 250. Why bother eyeballing Gmail?

UU’s MailReport "on-change" semantics make this worse, not better. On a quiet day with no upgrades, an empty inbox is the correct state. So the inbox looks identical whether the pipeline is healthy or completely broken. You only notice the gap on a day where you expect a report (because something upgraded) and one doesn’t show up. And if you’re not checking, you don’t notice.

The lesson is the same one in the blog post on the Proxmox SSL renewal flow: every automated email path needs a “did it actually arrive” check, not just a “did the sender return 0” check. I now have a small audit script that sends a tagged test email from each host with a unique X-Audit-Id, then I grep the destination inbox for the IDs. That’s the test the 2026-04-07 deploy didn’t have.
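
A minimal sketch of what such a tagged probe could look like, not the actual audit script. The helper names, the id format, and the register path are all assumptions made for illustration; the id is repeated in the subject so it can be found with an ordinary inbox search.

```shell
# Hypothetical helpers; names and id format are assumptions for this sketch.

# Build a per-send id that is unique enough to search for later.
new_audit_id() {
    printf '%s-%s-%s\n' "$(uname -n)" "$(date +%s)" "$$"
}

# Print a complete test message.
# $1 = audit id, $2 = From address (must carry a real domain), $3 = recipient.
build_audit_message() {
    printf 'From: %s\n' "$2"
    printf 'To: %s\n' "$3"
    printf 'Subject: [audit] %s\n' "$1"
    printf 'X-Audit-Id: %s\n' "$1"
    printf '\n'
    printf 'Delivery audit probe %s\n' "$1"
}

# Usage on each host (the register path is a placeholder):
#   id=$(new_audit_id)
#   build_audit_message "$id" unattended-upgrades@yourdomain.example you@gmail.example \
#       | /usr/sbin/sendmail -t -oi
#   echo "$id" >> /var/tmp/audit-ids.txt
```

The register of sent ids is what turns “msmtp returned 250” into something checkable: every id in the file should eventually be findable in the destination inbox.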

The Fix

One line. Add Unattended-Upgrade::Sender to your 50unattended-upgrades config:

Unattended-Upgrade::Sender "unattended-upgrades@yourdomain.example";

That value should match whatever address your relay actually DKIM-signs. In my case, that’s unattended-upgrades@yourdomain.example because Brevo signs everything from d=yourdomain.example. With it set, UU writes:

From: unattended-upgrades@yourdomain.example

Brevo signs for d=yourdomain.example. Gmail compares the header From: domain (yourdomain.example) against DKIM d= (yourdomain.example). Aligned. Accepted. Delivered.

In the Ansible playbook that drives my fleet, the change is three lines per role block, a two-line comment plus the directive (one block for VM hosts, one for Proxmox hypervisors):

           Unattended-Upgrade::MailReport "";
+          // Sender added: UU defaults the From: header to the literal "root" if
+          // this is unset, which Gmail drops because DMARC alignment fails.
+          Unattended-Upgrade::Sender "{{ unattended_upgrades_smtp_from }}";
           Unattended-Upgrade::SyslogEnable "true";

I templated it off the existing unattended_upgrades_smtp_from variable that’s already in group_vars/all/vars.yml, since that’s the same value msmtp uses for the SMTP envelope MAIL FROM:. One source of truth, no drift between header and envelope.

Rolling It Out

The playbook handles VM hosts and Proxmox hypervisors with two when: blocks (one for each, because the schedule offsets differ). I ran it with --tags config to only touch the apt config, no package re-installs:

ansible-playbook configure-unattended-upgrades.yml --tags config

Three hosts failed on the first run with:

Failed to get information on remote file (/etc/apt/apt.conf.d/50unattended-upgrades):
  /bin/sh: 1: sudo: not found

Those were the three Proxmox Backup Server VMs. The PBS appliance image runs as root and doesn’t ship sudo. Easy fix: re-run scoped to your [pbs] inventory group with become disabled.

ansible-playbook configure-unattended-upgrades.yml --tags config \
  --limit pbs \
  -e ansible_become=false

Worth fixing in inventory long-term so the override isn’t needed each time, but for a one-shot patch the -e works.

After both runs, the live config on every host in a representative sample showed the new directive:

$ ssh root@host grep '^Unattended-Upgrade::' /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Origins-Pattern { ... };
Unattended-Upgrade::Mail "you@gmail.example";
Unattended-Upgrade::MailReport "on-change";
Unattended-Upgrade::Sender "unattended-upgrades@yourdomain.example";
Unattended-Upgrade::SyslogEnable "true";

A re-run after the patch is the cleanest way to confirm idempotency. All hosts came back changed=0, so the templated value renders to the same bytes as the previous version (which I’d briefly hardcoded during the diagnosis).

Validation, Properly This Time

The first time around, “did the deploy work” stopped at “msmtp exit 0”. That was wrong. Here’s what the new validation actually checks, end to end:

Config file shows the new directive. On a sample host:
```
grep '^Unattended-Upgrade::Sender' /etc/apt/apt.conf.d/50unattended-upgrades
```
Expect the configured address back. Empty result means the playbook didn’t touch this host.
A real UU run produces a real email. UU only sends mail when packages were actually upgraded (because MailReport "on-change"). So either wait for the next day on which a host actually upgrades something, or exercise the same path by hand with a tagged message piped through the same /usr/sbin/sendmail symlink:
```
{
  echo "From: unattended-upgrades@yourdomain.example"
  echo "To: you@gmail.example"
  echo "Subject: [post-deploy verify] $(hostname) $(date -Is)"
  echo ""
  echo "This is the same SMTP path UU uses."
} | /usr/sbin/sendmail -t -oi
```
Then open Gmail. Not “check the msmtp log”. Open Gmail.
Confirm the headers. When the next real on-change report arrives in your inbox, expand the headers and look for:
- From: unattended-upgrades@yourdomain.example (not From: root)
- Authentication-Results: showing dkim=pass header.i=@yourdomain.example
- Authentication-Results: showing dmarc=pass
If any of those are off, you’ve got a different problem than the one in this post. (The most likely candidate is that your relay isn’t DKIM-signing for the domain in your From: header. Check the relay’s domain authentication panel.)
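
The header check can be roughly scripted against a message saved from Gmail (“Show original”, then download it). This is a sketch under assumptions: auth_results_ok is a hypothetical helper, and it only looks for the literal dkim=pass and dmarc=pass tokens anywhere in the header block, which is cruder than parsing Authentication-Results properly.

```shell
# Hypothetical helper: read a raw message on stdin and succeed only if the
# header block contains both "dkim=pass" and "dmarc=pass".
auth_results_ok() {
    # Keep everything up to the first blank line, i.e. the headers only,
    # so text in the body cannot produce a false positive.
    headers=$(awk '/^\r?$/ { exit } { print }')
    printf '%s\n' "$headers" | grep -q 'dkim=pass' &&
        printf '%s\n' "$headers" | grep -q 'dmarc=pass'
}

# Usage (the file name is a placeholder):
#   auth_results_ok < original_msg.eml && echo "auth ok" || echo "check headers"
```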

A Sub-Issue: `recipients=root` 501 Errors

Wholly separate but worth a side note for anyone running the same audit. Several hosts on my fleet had repeating entries in /var/log/msmtp.log like:

recipients=root smtpstatus=501 errormsg='recipient address root not accepted by the server'

These are not from UU. UU explicitly addresses your configured Mail recipient. The recipients=root ones come from something else on the host (commonly cron mailing job output to the crontab owner, which is root for system jobs, or smartd, or apt-listchanges) handing mail to msmtp with envelope RCPT TO: root. Brevo rejects bare-username recipients at SMTP time with 501.

Two ways to fix it cleanly:

Set aliases /etc/aliases in /etc/msmtprc and add a root: someone@somewhere.example line to /etc/aliases. msmtp will rewrite the recipient before handing it to Brevo.
Track down whatever is hardcoding root as a destination and point it at a real address.

On one host the noise was so heavy (every 30 minutes) that the msmtp log had grown to 651 KB of error-only entries since the deploy. I’d missed it the first time around because nothing further downstream was complaining. Worth a fleet-wide grep for smtpstatus=5 if you’re already in the area.
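
A minimal sketch of that fleet-wide check. The function name and hosts.txt (one hostname per line) are assumptions; the pattern simply counts log lines whose smtpstatus is any 5xx code.

```shell
# Hypothetical helper: count msmtp log lines on stdin that record a
# permanent (5xx) SMTP failure. Prints 0 when there are none.
count_smtp_5xx() {
    # grep -c exits non-zero on a zero count; "|| true" keeps callers
    # running under "set -e" from aborting on a clean log.
    grep -c 'smtpstatus=5[0-9][0-9]' || true
}

# Fleet usage (hosts.txt is a placeholder; ssh -n stops ssh from
# swallowing the loop's stdin):
#   while read -r h; do
#       printf '%s %s\n' "$h" "$(ssh -n "root@$h" cat /var/log/msmtp.log | count_smtp_5xx)"
#   done < hosts.txt
```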

Takeaway

The whole bug is one default in one Python file:

from_email = apt_pkg.config.find("Unattended-Upgrade::Sender", "root")

A bare root was a sensible default back when the tool first ran on UNIX boxes where local delivery actually meant something. In 2026, with everything going out through a relay that DKIM-signs for a domain you own, that default is a foot-gun. Set Unattended-Upgrade::Sender to something with a @ and a domain that aligns with whatever your relay is signing, and the whole pipeline lights up.

If you’re running unattended-upgrades through msmtp / Postfix / nullmailer / any external relay, go look at your 50unattended-upgrades right now and make sure Sender is set. If it isn’t, your alerts are probably already vanishing into the void.

The next post in this thread will be the audit script itself, with the per-host audit-id register that lets you grep your inbox for “did this specific host’s specific test mail actually arrive”. Sending a 250 OK is not the same as delivering a message, and after this one I’ll never trust an SMTP relay’s accept response as proof of delivery again.

Automating Nextcloud AIO Updates with Bash and Cron

2026-05-14T22:00:00+01:00

I run Nextcloud All-in-One. It’s great. A bundle of containers wired together and managed through one web UI. One-click updates, sane defaults, and most of the moving parts you’d otherwise have to glue together yourself.

The one thing I wanted to change was the manual click-through flow for updates. AIO is designed around a UI-driven update workflow — open the master container’s web UI, click “Update all containers”, wait, click again. Perfectly fine for occasional use, but I’d much rather it just ran on its own on a sensible schedule and logged what it did.

Here’s how I got there with a small bash script and a cron entry.

How AIO Updates Actually Work

Before writing anything I wanted to understand what the AIO master container actually does when you click “Update all containers” in the UI. Once you peel the wrapper off, it boils down to two things:

Pull the new nextcloud/all-in-one:latest image. The mastercontainer is the brain — it pins compatible image versions for every child container. New AIO release = new mastercontainer image = new pinned versions for the children.
Run StartAndUpdateContainers.php. This is the internal job that orchestrates stopping the old child containers, pulling their new images, and starting them back up. The web UI calls it. The internal cron calls it. So can I.

If I invoke that PHP script directly, I get the same update path the UI button kicks off — just without needing a human to click anything.

The Script

This lives at /root/update_nextcloud_aio.sh. Seven steps, each timestamped, gated by set -e so a failure stops everything cleanly. The only part that’s environment-specific is the docker run block in Step 4 — see the note after the script.

#!/bin/bash
#
# Nextcloud AIO Update Script
#

set -e

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }

log "Starting Nextcloud AIO Update Process"

# Step 1: Pull latest AIO mastercontainer image
log "Step 1/7: Pulling latest Nextcloud AIO image..."
docker pull nextcloud/all-in-one:latest

# Step 2: Stop the existing master container
log "Step 2/7: Stopping nextcloud-aio-mastercontainer..."
docker stop nextcloud-aio-mastercontainer

# Step 3: Remove the existing master container
log "Step 3/7: Removing old container..."
docker rm nextcloud-aio-mastercontainer

# Step 4: Recreate master with the exact same configuration
#         (replace this block with whatever YOUR original `docker run` was)
log "Step 4/7: Recreating master container..."
docker run -d \
  --name nextcloud-aio-mastercontainer \
  --restart always \
  --init \
  -p 8080:8080 \
  -v nextcloud_aio_mastercontainer:/mnt/docker-aio-config \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  nextcloud/all-in-one:latest

# Step 5: Wait for master to settle
log "Step 5/7: Waiting 60 seconds for master container to initialize..."
sleep 60

# Step 6: Force every child container to be recreated next start
log "Step 6/7: Stopping and removing all AIO child containers..."
set +e
for c in $(docker ps     --filter "name=nextcloud-aio-" --format "{{.Names}}" | grep -v mastercontainer); do
    docker stop "$c"
done
for c in $(docker ps -a  --filter "name=nextcloud-aio-" --format "{{.Names}}" | grep -v mastercontainer); do
    docker rm   "$c"
done
set -e

# Step 7: Trigger AIO's internal update cron directly
log "Step 7/7: Triggering Nextcloud update..."
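docker exec --user www-data nextcloud-aio-mastercontainer \
    php /var/www/docker-aio/php/src/Cron/StartAndUpdateContainers.php
# With set -e active, this line is only reached if the update command exited 0.
log "Update command executed successfully"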

log "Nextcloud AIO Update Process Completed"

A few things worth calling out:

Step 4’s docker run block is the only non-portable part. Whatever flags, env vars, ports, and volumes you used when you originally created your master container have to be reproduced exactly, or the new master will look at the existing volume and refuse to start. Don’t copy mine — pull yours straight off your existing container with docker inspect nextcloud-aio-mastercontainer before changing a thing.
Step 6 is the gotcha I learned the hard way. My first version of this script just recreated the master and trusted it to handle the children. It didn’t. The mastercontainer happily came up, decided the children were “already running fine”, and skipped the upgrade entirely. Stopping and removing the children first forces StartAndUpdateContainers.php in Step 7 to recreate them from the new pinned images. Without this step the cron PHP is effectively a no-op.
set +e around Step 6 is intentional. Some children may not exist (e.g. you’ve never enabled Collabora). I don’t want a missing container to abort the whole update.

Wiring it Into Cron

Weekly is the sweet spot. Often enough to never be more than a week behind, infrequent enough that point releases have had a chance to settle.

# root crontab
0 4 * * 0  /root/update_nextcloud_aio.sh >> /root/nextcloud_update.log 2>&1

Sunday morning, well outside any usage window. If anything explodes the worst case is “roll back from last night’s backup” and life carries on.

Verifying It’s Actually Working

A script that “looks like” it’s doing something every Sunday isn’t worth much without proof. Three quick checks:

1. Did the most recent run succeed?

grep -E 'Starting|Update command executed|ERROR|WARNING|Completed' \
    /root/nextcloud_update.log | tail -20

2. How many runs have completed cleanly?

echo "Successful runs: $(grep -c 'Update command executed successfully' /root/nextcloud_update.log)"
echo "Errors/warnings: $(grep -cE 'ERROR|WARNING'                     /root/nextcloud_update.log)"

I’ve been running this for a few months now with 0 errors and 0 warnings across every weekly invocation. Run durations sit between roughly one and five minutes.

3. What version is Nextcloud actually on, and is anything pending?

docker exec -u www-data nextcloud-aio-nextcloud php occ status
docker exec -u www-data nextcloud-aio-nextcloud php occ update:check

occ status confirms the running versionstring, and occ update:check will tell you if a newer point release is available — which is the real test of whether the script is actually moving you forward, not just running successfully.

A Subtle Point: “Successful” Doesn’t Mean “Updated”

Worth flagging because I caught myself on it. If the AIO project hasn’t published a new mastercontainer image since your last run, your docker pull legitimately gets the same digest, the script runs cleanly, the children get recreated from the same pinned versions — and your Nextcloud version doesn’t change. That’s not a script failure, that’s the script doing exactly what it should.

The way to sanity-check is to compare the running version against the Nextcloud changelog. If occ status reports an older point release than the latest in the changelog but the AIO image hasn’t bumped yet, the bottleneck is upstream’s release cadence — not your automation. The next scheduled run will pick it up the moment AIO publishes.
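
One way to make “same digest, nothing to do” visible in the log is to compare the local image id before and after the pull. A sketch, assuming a hypothetical describe_pull helper and the log function from the script above:

```shell
# Hypothetical helper: describe what a pull did.
# $1 = image id before the pull (may be empty), $2 = image id after.
describe_pull() {
    if [ -z "$1" ]; then
        echo "first pull: $2"
    elif [ "$1" = "$2" ]; then
        echo "unchanged: $2"
    else
        echo "updated: $1 -> $2"
    fi
}

# Usage inside the update script, around Step 1:
#   before=$(docker image inspect --format '{{.Id}}' nextcloud/all-in-one:latest 2>/dev/null || true)
#   docker pull nextcloud/all-in-one:latest
#   after=$(docker image inspect --format '{{.Id}}' nextcloud/all-in-one:latest)
#   log "Mastercontainer image $(describe_pull "$before" "$after")"
```

An “unchanged” line on a given Sunday means upstream had not published a new mastercontainer image, which is the case described above.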

What I’d Improve Next

Email alert on failure. Right now I have to grep the log. Trivial to wire mail or an SMTP relay into a trap so any non-zero exit pings me.
Log rotation. The log file just grows. A small logrotate config to weekly-rotate it with a reasonable retention would be tidy.
Pre-flight version capture. Logging the running Nextcloud version before and after would make it obvious at a glance which weekly runs actually delivered a new release.
Master container health probe at the end — a quick check that the mastercontainer came back up cleanly before logging “Completed”.
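
The failure alert could be sketched roughly as follows. Everything here is an assumption: the failure_message helper, the sender and recipient addresses, and the choice of an EXIT trap that checks the exit status. The From: address deliberately carries a real domain.

```shell
# Hypothetical helper: print an alert message.
# $1 = exit status, $2 = From address, $3 = recipient.
failure_message() {
    printf 'From: %s\n' "$2"
    printf 'To: %s\n' "$3"
    printf 'Subject: [nextcloud-aio] update failed on %s (exit %s)\n' "$(uname -n)" "$1"
    printf '\n'
    printf 'See /root/nextcloud_update.log for details.\n'
}

# Installed near the top of the update script, right after "set -e".
# The trap runs on every exit and only sends mail for a non-zero status:
#   trap 'status=$?; [ "$status" -eq 0 ] || failure_message "$status" nextcloud@yourdomain.example you@gmail.example | /usr/sbin/sendmail -t -oi' EXIT
```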

Takeaway

The clean way to automate AIO is to do exactly what the master container does internally — pull the new image, recreate the children, run StartAndUpdateContainers.php — but call it directly so it can run on a schedule without any human in the loop. Wrap it in cron, log it, verify it weekly, and it just runs.

Take the script, swap your own docker run flags into Step 4, and you’re done.

I Built a GNOME Shell Extension for Tailscale — Panel Toggle, Peer Browser, and the Signal-Handler Gotcha That Broke It

2026-05-14T22:00:00+01:00

I run Tailscale on every machine I own. My homelab is stitched together with it, my laptop joins the tailnet on boot, and at this point I treat 100.64.0.0/10 like it’s part of my own LAN. It’s brilliant.

What’s not brilliant is the day-to-day UX on Linux. The CLI is excellent — but it lives in a terminal. GNOME’s built-in VPN panel applet doesn’t speak Tailscale’s control protocol, so all the things I actually click on (toggle, exit node, copy a peer’s IP, check who’s online) live behind tailscale subcommands. Every time I needed a peer’s IPv4 I’d open a terminal, type tailscale status, scroll, and copy. Every time.

So I built the small thing that should have always existed: a GNOME Shell panel indicator that wraps the bits of the Tailscale CLI you actually click on, without changing your daemon configuration unless you explicitly ask it to.

It’s called gnome-tailscale, it ships for GNOME Shell 48, 49, and 50, and it has roughly one bug I’m still slightly embarrassed about. Here’s the writeup.

The Setup

A laptop running Ubuntu 24.04 (GNOME Shell 46 → 48 after upgrade)
A desktop running Fedora 40 (GNOME Shell 48)
A future-me on Ubuntu 26.04 “Resolute Raccoon” (GNOME Shell 50)
Tailscale CLI installed everywhere, tailscaled always running

The goal: a panel indicator that reflects daemon state, lets me toggle the daemon, lists my tailnet, copies peer IPs on click, picks an exit node from a submenu, and surfaces actionable error messages — no terminal required.

Why I Couldn’t Reuse Anything Existing

Before writing a line of GJS, I went through the usual dead ends:

GNOME’s built-in VPN panel. It speaks NetworkManager. Tailscale is a userland mesh — it doesn’t expose itself as an NM connection. Dead end.
The “Tailscale Status” extensions on extensions.gnome.org. Most are stuck on GNOME Shell 42–44 (the old imports.* world of legacy GJS imports). Extensions have been ES modules (import / export) since Shell 45, so the old-API extensions are not loadable at all on Shell 48+. Re-skinning a 3-year-old codebase to ESM was going to take longer than starting fresh.
A standalone tray app via Ayatana AppIndicator. Works, but doesn’t blend into the panel and breaks every time GNOME twitches its mind about tray icons.
A bash script bound to a keyboard shortcut. Toggle works, but there’s nowhere to show the peer list.

The only sensible option was a native shell extension targeted at the GJS ESM era — Shell 48, 49, and 50.

The Architecture

I wanted the extension to be small and testable. GJS is fun until you try to unit-test it; the runtime is bound to GNOME Shell, so anything that touches St, Clutter, or Gio won’t run under plain Node.

So I split the codebase three ways:

File	Runtime	What’s in it
`extension.js`	GJS	The panel indicator — `St` widgets, menu items, the polling loop.
`prefs.js`	GJS (Adwaita)	The preferences window.
`lib/util.js`	Pure JS	Formatting, sorting, argv builders, error classification.

lib/util.js is the trick. Anything that’s pure logic — parsing tailscale status --json, sorting peers, working out which tailscale argv to spawn for a given preference combination, classifying error output into one of about eight known categories — lives there with zero GJS imports. It’s runnable under Node’s built-in test runner, which means CI can lint and test the extension without ever touching a real GNOME Shell.

make test       # runs node --test on tests/*
make lint       # eslint on the whole tree
make schema     # compiles the GSettings schema
make ci         # everything CI runs
make pack       # builds the release zip

The whole thing is < 2,000 lines including tests.

The “Don’t Touch My Daemon” Principle

The single most important design decision: toggling the panel switch does not change your daemon configuration. Ever. By default, the toggle runs tailscale up and that’s it. It does not push --accept-routes, it does not push --accept-dns, it does not run tailscale set for anything.

Why that matters: if you’ve spent an evening tuning your tailscaled flags exactly the way you like them, the last thing you want is a friendly little panel applet quietly overwriting them every time you click it. I’ve been bitten by exactly that on other VPN GUIs.

There’s a single switch in prefs called Override accept-routes / accept-dns on connect. It’s off by default. Turn it on if you want the panel to actively manage those flags via tailscale set after each up. Otherwise the panel is purely an observer plus toggle.

The privileged-command path is similar:

Setting	Value	Behaviour
Use pkexec for up/down	on (default)	Privileged `tailscale up`/`down` go through a polkit dialog.
Use pkexec for up/down	off	Assumes you’ve run `sudo tailscale set --operator=$USER` and `tailscale` runs without sudo.

The prefs window literally tells you the two recipes (Option A: --operator, Option B: a sudoers alias) and explains that Option B is a terminal-only convenience and won’t help the panel toggle. Trying to be a polite citizen of someone else’s machine.

The Polling Loop

tailscale status --json is the source of truth. The extension polls it every 5 seconds (configurable) and rebuilds the menu from scratch each tick. Nothing about the peer list or exit-node list is hardcoded — every menu item exists because it appeared in the most recent JSON.

// extension.js (simplified)
async _tick() {
    let proc;
    try {
        proc = Gio.Subprocess.new(
            ['tailscale', 'status', '--json'],
            Gio.SubprocessFlags.STDOUT_PIPE | Gio.SubprocessFlags.STDERR_PIPE
        );
    } catch (e) {
        return this._showError(classifyError(e));
    }

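    // Assumes Gio._promisify(Gio.Subprocess.prototype, 'communicate_utf8_async')
    // ran at module load. The promisified call resolves to [stdout, stderr];
    // the leading success boolean from the _finish call is dropped.
    const [stdout, stderr] = await proc.communicate_utf8_async(null, null);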
    if (!proc.get_successful()) {
        return this._showError(classifyError(stderr));
    }

    const status = JSON.parse(stdout);
    this._render(status);   // pure: status -> menu items
}
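
Outside the extension, the same field can be eyeballed from a shell when debugging. This is a rough sketch, not how the extension reads it: backend_state is a hypothetical helper, and it is a text match rather than a JSON parser, so it assumes the key sits on its own line as in the CLI’s pretty-printed output.

```shell
# Hypothetical helper: print the first "BackendState" value found in
# `tailscale status --json` output on stdin (e.g. Running, Stopped).
backend_state() {
    sed -n 's/.*"BackendState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage:
#   tailscale status --json | backend_state
```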

classifyError lives in lib/util.js and is unit-tested. It maps stderr blobs onto a small enum:

// lib/util.js
export function classifyError(stderr) {
    if (/command not found|ENOENT/.test(stderr))   return 'CLI_MISSING';
    if (/not running|connection refused/i.test(stderr)) return 'DAEMON_DOWN';
    if (/Logged out|please run.*tailscale up/i.test(stderr)) return 'LOGGED_OUT';
    if (/Authentication cancelled|polkit/i.test(stderr)) return 'PKEXEC_CANCELLED';
    if (/permission denied|operator/i.test(stderr)) return 'NO_OPERATOR';
    return 'UNKNOWN';
}

That enum drives both the user-facing notification copy and a Copy error details item in the menu, so when something does go sideways you can paste the raw stderr into a bug report instead of squinting at a vague “something went wrong”.

The Bug That Took Me a Whole Evening

Here’s the embarrassing one. Early users (i.e. me) reported:

The toggle works the first time. After that, clicking it does nothing.

The toggle is a PopupSwitchMenuItem. Naively, you connect to its 'toggled' signal and call tailscale up or tailscale down accordingly:

// THE BUG
this._toggleItem.connect('toggled', (item, active) => {
    if (active) this._tailscaleUp();
    else        this._tailscaleDown();
});

Then every poll, you reflect the real daemon state back onto the switch:

// THE BUG, continued
_render(status) {
    this._toggleItem.setToggleState(status.BackendState === 'Running');
    // ...
}

Spot it? setToggleState() fires the toggled signal. So:

1. User clicks the switch → 'toggled' fires with active=true → tailscale up runs.
2. Five seconds later, poll completes, setToggleState(true) is called.
3. setToggleState(true) fires 'toggled' again with active=true.
4. tailscale up runs again — harmless because the daemon is already up.
5. User clicks the switch off → 'toggled' fires with active=false → tailscale down runs.
6. Five seconds later, poll completes, setToggleState(false) is called.
7. setToggleState(false) fires 'toggled' again with active=false.
8. tailscale down runs again. Daemon is already down. Still harmless.
9. User clicks the switch on → 'toggled' fires with active=true…

…except by step 9, the recursive 'toggled' from step 7 has also fired, and the user-initiated state change races against the programmatic one. Depending on which finishes first, the switch can end up visually off while my handler genuinely thought the user wanted it on. From the user’s perspective: clicking does nothing.

The fix is one line, sort of:

// THE FIX (extension.js)
this._toggleHandlerId = this._toggleItem.connect('toggled', (_, active) => {
    if (active) this._tailscaleUp();
    else        this._tailscaleDown();
});

_render(status) {
    // Block the handler while we mirror daemon state onto the switch,
    // so the programmatic update doesn't re-fire 'toggled'.
    this._toggleItem.block_signal_handler(this._toggleHandlerId);
    try {
        this._toggleItem.setToggleState(status.BackendState === 'Running');
    } finally {
        this._toggleItem.unblock_signal_handler(this._toggleHandlerId);
    }
    // ...
}

block_signal_handler / unblock_signal_handler are GObject’s standard “shut up for a moment” pair. The try/finally is non-negotiable: if setToggleState ever throws, the finally block still unblocks the handler. Without it the handler would stay blocked and the switch would go dead permanently.

This is the kind of bug that doesn’t show up in unit tests because the unit tests can’t import St. It only shows up when a real human clicks the switch on a real GNOME Shell. Lesson learned: when in doubt, block the handler before mirroring state.

The fix shipped in 0.2.0. There’s even a row in the troubleshooting table for it, because I wanted future-me to be able to find it.

What Made It Onto the Panel

After a few iterations, the menu settled into this shape:

Section	What it shows
Self	This machine’s hostname, MagicDNS short name, OS, online dot. Click copies its Tailscale IPv4.
Toggle	A `PopupSwitchMenuItem` that runs `tailscale up`/`down` (via pkexec by default).
Exit Node ▸	Every peer reported with `--advertise-exit-node`, with a green/grey dot, OS in plain text. Plus a None row to clear.
Peers ▸	All peers from `tailscale status --json` — dot, name, OS, primary IPv4, tags (`exit`, `active`). Click copies IPv4 (or MagicDNS name — configurable).
Quick links	Admin console, manual refresh, preferences.
Errors	When something goes wrong, a notification + a Copy error details item on the menu. Errors are also written to `journalctl` with a `[tailscale]` prefix.

The panel icon flips between connected/disconnected glyphs based on BackendState. That’s it. No popups, no modal dialogs, no surprise reconfigurations. The whole thing is < 2,000 lines including tests.
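As a rough illustration of where those rows come from, here is a Python sketch (the extension itself is JavaScript) that turns a `tailscale status --json` payload into peer rows. The sample payload is made up, and the field names (`BackendState`, `Peer`, `HostName`, `OS`, `Online`, `TailscaleIPs`, `ExitNodeOption`) are what I understand the CLI to emit; treat them as assumptions and check them against your own output.

```python
import json

# Hypothetical, trimmed-down payload for illustration only.
SAMPLE = """
{
  "BackendState": "Running",
  "Self": {"HostName": "laptop", "OS": "linux", "Online": true,
           "TailscaleIPs": ["100.64.0.1", "fd7a:115c:a1e0::1"]},
  "Peer": {
    "nodekey:aaa": {"HostName": "pve1", "OS": "linux", "Online": true,
                    "TailscaleIPs": ["100.64.0.2"], "ExitNodeOption": true},
    "nodekey:bbb": {"HostName": "phone", "OS": "android", "Online": false,
                    "TailscaleIPs": ["100.64.0.3"], "ExitNodeOption": false}
  }
}
"""


def first_ipv4(ips):
    """Return the first address without a colon, or None if there is none."""
    return next((ip for ip in ips if ":" not in ip), None)


def peer_rows(status):
    """Flatten the Peer map into a sorted list of display rows."""
    rows = []
    for peer in (status.get("Peer") or {}).values():
        rows.append({
            "name": peer.get("HostName", "?"),
            "os": peer.get("OS", "?"),
            "online": bool(peer.get("Online")),
            "ipv4": first_ipv4(peer.get("TailscaleIPs") or []),
            "exit_option": bool(peer.get("ExitNodeOption")),
        })
    return sorted(rows, key=lambda row: row["name"])


status = json.loads(SAMPLE)
connected = status["BackendState"] == "Running"   # drives the panel icon
rows = peer_rows(status)
exit_nodes = [row["name"] for row in rows if row["exit_option"]]
```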

Gotchas I Hit Along the Way

A few things that were less obvious than they should have been:

1. Symlink installs will eat your source tree

gnome-extensions install copies your zip into ~/.local/share/gnome-shell/extensions/. If you symlink your dev tree there instead (which I do, via make link), and then you click Uninstall in the GNOME Extensions app — GNOME deletes the contents of the symlink target. That is to say: your source tree.

I’ve put a comically aggressive warning in the README and make link itself prints a reminder. There’s also a make uninstall that does the right thing (remove the symlink, not the target).

2. Shell 48 is ESM. Shell 44 is not.

If you’re porting an old extension, imports.misc.extensionUtils is gone. Main.panel.addToStatusArea is still there. St/Clutter/Gio you import from gi://. The Extension base class has lifecycle methods (enable, disable) that you actually have to implement properly because nothing magic happens for you. The migration is mechanical but tedious.

3. `Gio.Subprocess` is your friend

The naive way to spawn tailscale is GLib.spawn_command_line_sync. Don’t. It blocks the shell, and “the shell” here is literal GNOME Shell, the thing rendering your entire desktop. A 200ms hang in tailscale status becomes a 200ms freeze of every window animation on your screen. Use Gio.Subprocess with communicate_utf8_async (wrapped with Gio._promisify so it returns a promise), await it, and never block.
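The same idea, sketched in Python with asyncio rather than GJS: spawn the process, await its output, and put a timeout on it so a hung CLI cannot stall the caller. `run_command` and its 5-second default are hypothetical names and values chosen for illustration.

```python
import asyncio
import sys


async def run_command(argv, timeout=5.0):
    """Run argv without blocking the event loop; return (code, stdout, stderr)."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # Do not leave a stuck child behind.
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, out.decode(), err.decode()


if __name__ == "__main__":
    # Uses the Python interpreter as a portable stand-in for the tailscale CLI.
    print(asyncio.run(run_command([sys.executable, "-c", "print('ok')"])))
```

While the child runs, the event loop stays free to do other work, which is the property the shell needs from Gio.Subprocess.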

4. Adwaita prefs windows are easier than you’d think

GNOME 42+ extensions can use full Adwaita widgets in prefs.js. AdwPreferencesPage + AdwPreferencesGroup + AdwActionRow/AdwSwitchRow/AdwSpinRow give you a prefs UI that looks identical to GNOME Settings (AdwSwitchRow and AdwSpinRow arrived with libadwaita 1.4, so those two need GNOME 45 or newer). Hooking each row up to a Gio.Settings instance is a one-liner per setting (settings.bind('key', row, 'active', Gio.SettingsBindFlags.DEFAULT)).

Releasing It

The release pipeline is just a Makefile target and a GitHub Actions workflow:

make pack       # produces dist/tailscale@Joshwaamein.github.io.shell-extension.zip

make pack runs make ci first (lint + tests + schema compile), then bundles the extension into the zip layout that gnome-extensions install --force accepts. CI uploads the zip as a release asset on every tag. Anyone can install with:

gnome-extensions install --force tailscale@Joshwaamein.github.io.shell-extension.zip
gnome-extensions enable tailscale@Joshwaamein.github.io

Wayland users have to log out and back in once, because Shell can’t hot-load a new extension on Wayland. X11 users can press Alt+F2, type r, and hit Enter — old-school, but it still works.

What I’d Do Differently

A few things I’d change if starting again:

Bind state with a tiny store. I rebuild the entire menu on every poll. That’s fine at 5-second intervals and a dozen peers, but it does mean you sometimes see a flash if you’re hovering an item exactly when the poll completes. A diff-based renderer (or just remembering which submenu was open and reopening it after rebuild) would be nicer.
Cache tailscale --version once at startup, not on every error path. I currently shell out to it whenever I want to render a “your CLI is too old” hint, which is wasteful.
Push releases to extensions.gnome.org. Right now you install from the GitHub release zip. e.g.o. has a review process I haven’t bothered with yet.
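For the first item, the core of a diff-based renderer is small. Here is a minimal sketch in Python, assuming each poll is reduced to a dict keyed by peer id; `diff_peers` and the sample data are hypothetical.

```python
def diff_peers(old, new):
    """Compare two polls; each maps peer id -> dict of displayed fields."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(key for key in old.keys() & new.keys() if old[key] != new[key])
    return added, removed, changed


before = {"pve1": {"online": True}, "phone": {"online": True}}
after = {"pve1": {"online": True}, "phone": {"online": False}, "nas": {"online": True}}

added, removed, changed = diff_peers(before, after)
```

Only the rows named in `added`, `removed`, and `changed` would need touching, so an item the user is hovering stays in place unless it actually changed.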

The Result

I haven’t typed tailscale status in a terminal for weeks. Toggling the daemon is a click. Copying a peer’s IPv4 is a click. Picking an exit node when I’m on hotel Wi-Fi is two clicks. None of it changes my daemon configuration unless I’ve explicitly opted in. And when something does go wrong — daemon down, CLI missing, login expired — the panel tells me what specifically is broken instead of vaguely failing.

It’s open source, GPL-2.0-or-later (same family as GNOME Shell itself), and lives at github.com/Joshwaamein/gnome-tailscale. PRs welcome — there’s a CONTRIBUTING.md and the lib/util.js split means new logic comes with tests.

Sometimes the tool you want is a 2,000-line Saturday project away.

How I Fixed SSL Certificate Warnings Across My Entire Proxmox Homelab — With Full Auto-Renewal and Email Alerts

2026-04-26T00:00:00+01:00

If you run a Proxmox homelab, you know the drill. You open your PVE or PBS web UI and Chrome hits you with the red “Your connection is not private” screen. You click Advanced, you click Proceed, and you feel slightly bad about it. Every. Single. Time.

I finally fixed it — for all my servers at once, fully automated, with email alerts on every renewal. Here’s the complete guide including the gotcha that broke my backups immediately after, and how I fixed that too.

My Setup

3 × Proxmox VE nodes
4 × Proxmox Backup Server nodes
All private, accessible via Tailscale only
Domain managed by Cloudflare

Why Standard Let’s Encrypt Doesn’t Work Here

The usual HTTP-01 challenge requires your server to be reachable on port 80 from the internet. My servers are behind Tailscale — they’re not reachable from the internet at all. HTTP-01 is a non-starter.

The answer is the DNS-01 challenge. You prove domain ownership by creating a TXT record in your DNS zone instead. Let’s Encrypt checks the TXT record, issues the cert, and your server never needs to be publicly accessible. If your DNS is managed by Cloudflare (or most other major providers), this is fully automatable.
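For the curious, the TXT value is not arbitrary. Per RFC 8555, it is the base64url-encoded SHA-256 digest of the key authorization (the challenge token joined to the account key thumbprint with a dot). A minimal Python sketch follows; the token and thumbprint are placeholder strings, since acme.sh obtains the real ones during the ACME exchange.

```python
import base64
import hashlib


def b64url(data: bytes) -> str:
    """Base64url without padding, as ACME uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def dns01_record(domain: str, token: str, account_thumbprint: str):
    """Return (record name, TXT value) for a DNS-01 challenge."""
    # A wildcard is validated against the base domain's challenge name.
    base = domain[2:] if domain.startswith("*.") else domain
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return f"_acme-challenge.{base}", b64url(digest)


# Placeholder inputs, for illustration only.
name, value = dns01_record("*.yourdomain.com", "example-token", "example-thumbprint")
```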

The Wildcard Strategy

Rather than getting individual certificates — separate challenges, separate renewal timers, separate deploy jobs — I issued a single wildcard certificate for *.yourdomain.com.

One cert. One renewal. One deploy script. Covers every subdomain on the domain.

Step 1: Install acme.sh

acme.sh is a shell script ACME client with native Cloudflare support. Install on your management machine (wherever you SSH from):

curl https://get.acme.sh | sh -s email=your@email.com

This installs to ~/.acme.sh/ and adds a daily cron job automatically.

Step 2: Create a Cloudflare API Token

In Cloudflare: My Profile → API Tokens → Create Token → Edit zone DNS.

Scope it tightly:

Permissions: Zone → DNS → Edit
Zone Resources: Include → Specific zone → yourdomain.com

You also need your Zone ID from the Cloudflare dashboard Overview page.

Step 3: Issue the Wildcard Cert

export CF_Token="your-cloudflare-api-token"
export CF_Zone_ID="your-zone-id"

~/.acme.sh/acme.sh --issue \
  --dns dns_cf \
  -d "*.yourdomain.com" \
  --server letsencrypt

acme.sh:

Creates _acme-challenge.yourdomain.com TXT record via Cloudflare API
Waits for DNS propagation
Asks Let’s Encrypt to verify it
Gets your cert
Deletes the TXT record

Takes about 40 seconds. No ports opened, no firewall changes. The cert lands at:

~/.acme.sh/*.yourdomain.com_ecc/
├── *.yourdomain.com.key      # private key
├── *.yourdomain.com.cer      # certificate
├── ca.cer                    # intermediate CA
└── fullchain.cer             # cert + chain (use this)

Step 4: Deploy to All Servers

Proxmox VE and PBS both support dropping a cert into a specific path and restarting the proxy service.

PVE nodes (port 8006):

scp fullchain.cer root@pve1:/etc/pve/local/pveproxy-ssl.pem
scp *.yourdomain.com.key root@pve1:/etc/pve/local/pveproxy-ssl.key
ssh root@pve1 "systemctl restart pveproxy"

PBS nodes (port 8007):

scp fullchain.cer root@pbs1:/etc/proxmox-backup/proxy.pem
scp *.yourdomain.com.key root@pbs1:/etc/proxmox-backup/proxy.key
ssh root@pbs1 "systemctl restart proxmox-backup-proxy"

PVE gotcha: /etc/pve/ is a FUSE filesystem called pmxcfs. If you try to chmod the cert files you’ll get “Operation not permitted”. This is normal and harmless — ignore it.

I scripted this to loop over all servers. Total deploy time: ~30 seconds.

The Gotcha: PBS Fingerprints in storage.cfg

Here’s the thing nobody mentions. After deploying the new certs, my PVE nodes couldn’t connect to my PBS servers anymore.

The reason: every PBS storage definition in PVE’s storage.cfg contains a fingerprint line — the SHA256 fingerprint of the PBS server’s certificate. PVE uses this to verify it’s talking to the right server:

pbs: pbs1
    server pbs1
    fingerprint fa:f0:14:a5:74:79:e8:...  ← old self-signed cert fingerprint
    username backup@pbs!pve1-backup

When we replaced the PBS cert with the new Let’s Encrypt cert, the fingerprint changed. PVE saw the mismatch and refused the connection.

Fix: update the fingerprint on every PBS storage entry on every PVE node.

# Get the new fingerprint from one PBS server.
# The sed strips everything up to "Fingerprint=", whichever case OpenSSL prints the label in.
NEW_FP=$(echo | openssl s_client -connect pbs1:8007 2>/dev/null \
  | openssl x509 -fingerprint -sha256 -noout 2>/dev/null \
  | sed 's/^.*Fingerprint=//' \
  | tr '[:upper:]' '[:lower:]')

# Update all PBS storages on each PVE node
for storage in $(grep '^pbs:' /etc/pve/storage.cfg | awk '{print $2}'); do
  pvesh set /storage/$storage --fingerprint "$NEW_FP"
done

Run this on each PVE node. PBS connections restored immediately.

This needs to happen every time the cert renews. So I built it into the auto-renewal script.
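The fingerprint itself is nothing more than the SHA-256 of the certificate's DER encoding, written as colon-separated lowercase hex pairs. A small Python sketch of the same computation, with hypothetical function names:

```python
import hashlib
import ssl


def fingerprint_from_der(der: bytes) -> str:
    """SHA-256 of the DER bytes as colon-separated lowercase hex."""
    digest = hashlib.sha256(der).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fingerprint_from_pem(pem: str) -> str:
    """Same, starting from a PEM-encoded certificate."""
    return fingerprint_from_der(ssl.PEM_cert_to_DER_cert(pem))


def fetch_fingerprint(host: str, port: int = 8007) -> str:
    """Fetch the certificate a server presents and fingerprint it.

    Makes a network connection; the certificate is not validated here.
    """
    return fingerprint_from_pem(ssl.get_server_certificate((host, port)))
```

Calling `fetch_fingerprint("pbs1")` from a machine on the tailnet should give the same string as the openssl pipeline above, which makes it a handy cross-check.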

The Second Gotcha: PBS Sync Job Remotes

After fixing the PVE storage fingerprints, I thought I was done. Then my backup PBS node started failing its sync jobs:

WARNING: certificate fingerprint does not match expected fingerprint!
expected: fa:f0:14:a5:74:79:e8:...
certificate validation failed - Certificate fingerprint was not confirmed.

That node pulls backups from the other PBS nodes using PBS sync jobs. Those sync jobs connect via remote definitions — and remote definitions also store the cert fingerprint. These are completely separate from PVE’s storage.cfg.

Fix: update the remote definitions on the syncing PBS node:

for remote in pbs1 pbs2 pbs3; do
  proxmox-backup-manager remote update $remote --fingerprint "$NEW_FP"
done

So there are actually two places fingerprints need updating after a cert change:

PVE storage.cfg — for PVE → PBS backup jobs (via pvesh set /storage/...)
PBS remote definitions — for PBS → PBS sync jobs (via proxmox-backup-manager remote update)

Both are now handled by the deploy script.

Step 5: The Auto-Renewal Script

The script does more than just copy files. Here’s what a production-ready version needs to handle:

Deploy cert to all PVE nodes → restart pveproxy
Deploy cert to all PBS nodes → restart proxmox-backup-proxy
Wait for PBS to come back up (poll, don’t just sleep)
Read the new fingerprint from PBS
Update PBS storage fingerprints on all PVE nodes (pvesh set /storage/...)
Update PBS sync remote fingerprints (proxmox-backup-manager remote update ...)
Email on start, success, and failure

A few gotchas to avoid:

Use $HOME not ~ — tilde doesn’t always expand in non-interactive cron context
Use BatchMode=yes in SSH options — interactive prompts will hang cron indefinitely
Use a heredoc for the email body, not inline quoting — log content containing apostrophes breaks the command
Use set -euo pipefail — fail fast on unexpected errors
Validate cert files exist before doing anything

The key structural pattern:

#!/bin/bash
set -euo pipefail

ACME_DIR="$HOME/.acme.sh"
CERT_DIR="$ACME_DIR/*.yourdomain.com_ecc"
CERT="$CERT_DIR/fullchain.cer"
KEY="$CERT_DIR/*.yourdomain.com.key"   # literal '*' in the file name, same as the directory
LOG_FILE="$ACME_DIR/deploy-proxmox.log"

SSH_OPTS="-o ConnectTimeout=15 -o StrictHostKeyChecking=no -o BatchMode=yes"

# Validate cert files exist
if [ ! -f "$CERT" ] || [ ! -f "$KEY" ]; then
  echo "ERROR: Cert files not found" | tee -a "$LOG_FILE"; exit 1
fi

# Deploy to all servers...

# Poll PBS until it responds instead of blind sleep
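wait_for_port() {
  local host="$1" port="$2" timeout="${3:-30}" elapsed=0
  # Plain grep (not -q) reads all its input, so openssl never gets SIGPIPE under pipefail
  while ! echo | openssl s_client -connect "$host:$port" 2>/dev/null | grep 'BEGIN CERTIFICATE' >/dev/null; do
    sleep 2; elapsed=$((elapsed + 2))
    # An if, not `[ ... ] && return 1`: the short-circuit form leaves status 1
    # behind when the test is false, and the function would then report failure
    if [ "$elapsed" -ge "$timeout" ]; then return 1; fi
  done
  return 0   # explicit success for callers running under set -e
}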

# Read new fingerprint and update both locations
NEW_FP=$(echo | openssl s_client -connect pbs1:8007 2>/dev/null \
  | openssl x509 -fingerprint -sha256 -noout 2>/dev/null \
  | sed 's/^.*Fingerprint=//' | tr '[:upper:]' '[:lower:]')

# 1. PVE storage.cfg fingerprints
for host in $PVE_NODES; do
  ssh $SSH_OPTS root@$host "
    for s in \$(grep '^pbs:' /etc/pve/storage.cfg | awk '{print \$2}'); do
      pvesh set /storage/\$s --fingerprint '$NEW_FP' 2>/dev/null
    done"
done

# 2. PBS sync remote fingerprints
ssh $SSH_OPTS root@$PBS_SYNC_NODE "
  for remote in pbs1 pbs2 pbs3; do
    proxmox-backup-manager remote update \$remote --fingerprint '$NEW_FP' 2>/dev/null
  done"

# Run once, separately from the script above, to register it as acme.sh's reload hook
"$HOME/.acme.sh/acme.sh" --install-cert -d "*.yourdomain.com" \
  --reloadcmd "$HOME/.acme.sh/deploy-proxmox.sh"

Now every time acme.sh renews the cert (automatically, ~60 days in), this script runs and handles the entire chain.

Step 6: Email Notifications

I wanted to know when this ran, whether it succeeded or failed. My management machine doesn’t have sendmail configured, but my PVE nodes do (via msmtp + Brevo). I added a send_email() function to the deploy script that SSHes into pve1 to relay the email:

send_email() {
  local subject="$1"
  local body="$2"
  # Build the message locally and pipe it over SSH, so apostrophes or quotes
  # in the log never pass through the remote shell's quoting.
  printf 'Subject: %s\nFrom: proxmox-alerts@yourdomain.com\nTo: you@email.com\n\n%s\n' \
    "$subject" "$body" \
    | ssh -o BatchMode=yes root@pve1 \
        "/usr/sbin/sendmail -f proxmox-alerts@yourdomain.com you@email.com"
}

# At start:
send_email "🔄 [proxmox] Cert renewal started" "Deploy started at $(date)"

# At end (success); $'\n' is a real newline, "\n" inside double quotes is not:
send_email "✅ [proxmox] Cert renewal succeeded" "$LOG"$'\n'"New fingerprint: $NEW_FP"

# On failure:
send_email "❌ [proxmox] Cert renewal FAILED" "$LOG"

If you haven’t set up SMTP on your PVE nodes yet, I covered that in Why I Switched From Gmail to Brevo for All My Homelab Email Alerts.
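If you would rather build the message with a library than with printf, Python's standard email package takes care of the quoting and header encoding. This is a sketch, not what my script does; `build_report` is a hypothetical helper and the addresses are the same placeholders as above.

```python
from email import policy
from email.message import EmailMessage


def build_report(subject: str, body: str, sender: str, recipient: str) -> bytes:
    """Return a complete message as bytes, safe to pipe into sendmail."""
    msg = EmailMessage(policy=policy.SMTP)
    msg["Subject"] = subject      # non-ASCII subjects are encoded per RFC 2047
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)         # apostrophes and quotes need no escaping
    return msg.as_bytes()


raw = build_report(
    "✅ [proxmox] Cert renewal succeeded",
    "It's done.\nNew fingerprint: aa:bb",
    "proxmox-alerts@yourdomain.com",
    "you@email.com",
)
```

The resulting bytes can be piped to /usr/sbin/sendmail with the same -f and recipient arguments the shell function uses.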

Step 7: DNS Records

The wildcard cert covers *.yourdomain.com, but for your browser to reach pve2.yourdomain.com it needs a DNS A record.

Create records in Cloudflare pointing to your Tailscale IPs, with proxying disabled:

pve1.yourdomain.com  A  100.x.x.x  (proxied: off, TTL: 3600)
pve2.yourdomain.com  A  100.x.x.x
pbs1.yourdomain.com  A  100.x.x.x
...

Tailscale IPs are only routable within your Tailnet. The records are technically public, but anyone outside your network who looks them up gets an IP they can’t reach. It’s security through inaccessibility.

If you’d rather have zero public DNS footprint, add entries to your local /etc/hosts instead:

100.x.x.x  pve2.yourdomain.com
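A quick sanity check worth running before publishing records: confirm every address really is a Tailscale one. Tailscale assigns IPv4 addresses from the 100.64.0.0/10 carrier-grade NAT range, so a few lines of Python can flag anything that would expose a LAN or public IP. The hostnames and addresses below are made-up examples.

```python
import ipaddress

TAILSCALE_V4 = ipaddress.ip_network("100.64.0.0/10")


def is_tailnet_ipv4(addr: str) -> bool:
    """True if addr is an IPv4 address inside the Tailscale range."""
    ip = ipaddress.ip_address(addr)
    return ip.version == 4 and ip in TAILSCALE_V4


# Hypothetical records, for illustration only.
records = {
    "pve1.yourdomain.com": "100.101.102.103",
    "pbs1.yourdomain.com": "192.168.1.10",
}

suspicious = sorted(name for name, addr in records.items() if not is_tailnet_ipv4(addr))
```

Here `suspicious` ends up containing only the record that points at a private LAN address.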

The Full Automated Flow

Every ~60 days, without any manual intervention:

acme.sh cron fires (daily at a random time, checks if renewal needed)
DNS-01 challenge runs — temporary TXT record created and deleted via Cloudflare API
New cert issued by Let’s Encrypt
deploy-proxmox.sh runs:
- 🔄 Start email sent
- New cert deployed to all servers via SCP
- All proxies restarted
- New fingerprint read from PBS
- All PBS storage fingerprints updated on PVE nodes
- PBS sync remote fingerprints updated
- ✅ Success email sent with full log + fingerprint + expiry date
- ❌ Failure email sent if anything went wrong

Zero manual steps required. You get notified either way.
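For the expiry date in the success email, the arithmetic is simple once the notAfter timestamp is parsed. A sketch using Python's standard library, assuming the date string is in the form `openssl x509 -enddate -noout` prints after `notAfter=`; the 30-day warning threshold is an arbitrary choice for illustration.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now_epoch: float) -> int:
    """Whole days between now and the certificate's notAfter time."""
    expires = ssl.cert_time_to_seconds(not_after)
    return int((expires - now_epoch) // SECONDS_PER_DAY)


def needs_attention(not_after: str, now_epoch: float, threshold_days: int = 30) -> bool:
    """True once fewer than threshold_days remain."""
    return days_until_expiry(not_after, now_epoch) < threshold_days


# Fixed example timestamps so the result is reproducible.
now = ssl.cert_time_to_seconds("Dec  6 09:34:43 2017 GMT")
remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT", now)
```

In a real check, `now_epoch` would come from `time.time()`.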

Summary

What	How
Cert type	Let’s Encrypt wildcard `*.yourdomain.com`
ACME challenge	DNS-01 via Cloudflare API
Client	acme.sh
Deployment	scp + systemctl restart
Fingerprint update	pvesh set /storage on all PVE nodes + proxmox-backup-manager remote update on the syncing PBS node
Email alerts	msmtp relay via PVE node
Auto-renewal	acme.sh cron + custom deploy hook
Time to set up	~15 minutes
Ongoing maintenance	None

The PBS fingerprint step is the non-obvious part that will break your backups if you miss it. Build it into your deploy script from the start and you’ll never have to think about it again.

Running ComfyUI on an AMD RX 7900 XTX — Native ROCm 7.1 on Windows

2026-04-05T00:00:00+01:00

AMD ROCm 7.1 now runs natively on Windows. Here’s how I used it to get ComfyUI running on a gaming PC with an RX 7900 XTX — no Zluda, no translation layer, full GPU acceleration.

The Problem

My main machine is a Windows gaming PC with an AMD RX 7900 XTX (24GB VRAM). I can’t switch to Linux because of kernel-level anti-cheat — Riot Vanguard, EasyAntiCheat, BattlEye. These don’t run under Wine or Proton.

The traditional options for running ComfyUI on AMD hardware on Windows were:

DirectML — works, but significantly slower than ROCm or CUDA. Not viable for video generation.
Zluda — a CUDA translation layer for AMD. Works for some models, but requires specific forks, is fragile, and adds complexity.
ROCm on Linux — the gold standard, but requires dual-booting or a separate machine.

Then AMD shipped ROCm 7.1 for Windows in late 2025. torch.cuda.is_available() returns True on the RX 7900 XTX. The full pipeline runs natively on GPU.

What’s Already Required

Before starting, you need:

AMD HIP SDK 7.1 installed — available from AMD’s developer site. The installer sets HIP_PATH as a system environment variable automatically.
AMD Adrenalin driver 25.20.01.17 or newer — the preview driver that enables ROCm on Windows. Check AMD’s release notes for the latest.
Python 3.12 — the ROCm PyTorch wheels are built for cp312 specifically.
Git — for cloning ComfyUI and custom nodes.

You can verify your HIP SDK is installed:

echo $env:HIP_PATH
# Should output: C:\Program Files\AMD\ROCm\7.1\

Installing uv

I use uv as the package manager — it’s significantly faster than pip for large installs like the ROCm SDK wheels (which are several GB).

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

uv installs to C:\Users\<username>\.local\bin\. Since the terminal session you installed from won’t have it on PATH yet, I reference it by full path throughout this guide.

Cloning ComfyUI

git clone https://github.com/comfyanonymous/ComfyUI.git O:\ComfyUI

I’m installing to O:\ComfyUI — a dedicated SSD with plenty of space. Models alone can be 10–50GB+, so pick a drive accordingly.

Creating the Python Environment

C:\Users\joshu\.local\bin\uv.exe venv O:\ComfyUI\.venv --python 3.12

Note: uv venv needs an absolute path to the target directory, not a relative one, when running from a different drive.

Installing ROCm SDK Wheels

AMD publishes ROCm Python wheels at repo.radeon.com. Install the SDK first:

C:\Users\joshu\.local\bin\uv.exe pip install --no-cache `
  --python O:\ComfyUI\.venv\Scripts\python.exe `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_core-0.1.dev0-py3-none-win_amd64.whl `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_devel-0.1.dev0-py3-none-win_amd64.whl `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_libraries_custom-0.1.dev0-py3-none-win_amd64.whl `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm-0.1.dev0.tar.gz

This downloads ~3.3GB. The --no-cache flag is important here — uv’s cache is on C: by default, and these wheels are large enough that you don’t want them cached if C: is tight.

Installing ROCm PyTorch

C:\Users\joshu\.local\bin\uv.exe pip install --no-cache `
  --python O:\ComfyUI\.venv\Scripts\python.exe `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/torch-2.9.0+rocmsdk20251116-cp312-cp312-win_amd64.whl `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/torchaudio-2.9.0+rocmsdk20251116-cp312-cp312-win_amd64.whl `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/torchvision-0.24.0+rocmsdk20251116-cp312-cp312-win_amd64.whl

Installing ComfyUI Requirements

C:\Users\joshu\.local\bin\uv.exe pip install --no-cache `
  --python O:\ComfyUI\.venv\Scripts\python.exe `
  -r O:\ComfyUI\requirements.txt

Custom Nodes

I installed four custom nodes for video generation:

git clone https://github.com/ltdrdata/ComfyUI-Manager         O:\ComfyUI\custom_nodes\ComfyUI-Manager
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper     O:\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper
git clone https://github.com/Lightricks/ComfyUI-LTXVideo       O:\ComfyUI\custom_nodes\ComfyUI-LTXVideo
git clone https://github.com/kijai/ComfyUI-FramePackWrapper    O:\ComfyUI\custom_nodes\ComfyUI-FramePackWrapper

Important: lllyasviel/FramePack is a standalone Gradio app, not a ComfyUI custom node. It has no __init__.py and will fail to load. Use kijai/ComfyUI-FramePackWrapper instead.

Install their requirements. The first three can be installed together:

C:\Users\joshu\.local\bin\uv.exe pip install --no-cache `
  --python O:\ComfyUI\.venv\Scripts\python.exe `
  -r O:\ComfyUI\custom_nodes\ComfyUI-Manager\requirements.txt `
  -r O:\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\requirements.txt `
  -r O:\ComfyUI\custom_nodes\ComfyUI-LTXVideo\requirements.txt

Then FramePackWrapper separately (its requirements are clean and already satisfied):

C:\Users\joshu\.local\bin\uv.exe pip install --no-cache `
  --python O:\ComfyUI\.venv\Scripts\python.exe `
  -r O:\ComfyUI\custom_nodes\ComfyUI-FramePackWrapper\requirements.txt

Why separate? The standalone lllyasviel/FramePack repo pins transformers==4.46.2, which conflicts with ComfyUI-LTXVideo requiring transformers>=4.50.0. If you accidentally install FramePack’s requirements, uv will refuse to resolve the dependency graph. FramePackWrapper doesn’t have this problem.
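The conflict is easy to see if you compare the two constraints directly. This is a deliberately simplified Python sketch that only handles plain dotted versions; real resolvers such as uv follow the full PEP 440 rules.

```python
def parse_version(text: str) -> tuple:
    """Turn '4.46.2' into (4, 46, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(pinned: str, minimum: str) -> bool:
    """True if an exact pin also meets a >= constraint."""
    return parse_version(pinned) >= parse_version(minimum)


# FramePack pins transformers==4.46.2; ComfyUI-LTXVideo wants >=4.50.0.
compatible = satisfies_minimum("4.46.2", "4.50.0")
```

`compatible` is False, so no single transformers version can satisfy both requirement files at once.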

Launcher Scripts

The three environment variables below are essential for stable operation on AMD hardware:

Variable	Value	Effect
`PYTORCH_NO_HIP_MEMORY_CACHING`	`1`	Saves ~1/3 VRAM, prevents OOM on long video runs
`HIP_VISIBLE_DEVICES`	`0`	Targets the RX 7900 XTX, ignores Intel iGPU
`HSA_OVERRIDE_GFX_VERSION`	`11.0.0`	Forces gfx1100 (RDNA3) compatibility

PYTORCH_NO_HIP_MEMORY_CACHING=1 is the most important one. It turns off PyTorch’s caching allocator; without it, freed GPU memory stays reserved in the cache and you’ll hit OOM errors during 81-frame video generation runs.

O:\ComfyUI\launch_comfyui.ps1:

# ComfyUI Launcher for AMD Radeon RX 7900 XTX (ROCm 7.1 / Windows)
$env:PYTORCH_NO_HIP_MEMORY_CACHING = "1"
$env:HIP_VISIBLE_DEVICES = "0"
$env:HSA_OVERRIDE_GFX_VERSION = "11.0.0"

& "$PSScriptRoot\.venv\Scripts\Activate.ps1"

Write-Host "Starting ComfyUI on http://127.0.0.1:8188 ..." -ForegroundColor Cyan
& "$PSScriptRoot\.venv\Scripts\python.exe" "$PSScriptRoot\main.py" --listen 0.0.0.0 --port 8188

O:\ComfyUI\launch_comfyui.bat (double-click launcher):

@echo off
powershell.exe -ExecutionPolicy Bypass -File "%~dp0launch_comfyui.ps1"
pause

Validating the GPU

Before launching ComfyUI, verify the GPU is detected:

O:\ComfyUI\.venv\Scripts\python.exe -c "
import torch
print('Torch version:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
print('Device name:', torch.cuda.get_device_name(0))
"

Expected output:

[WARNING] failed to run amdgpu-arch: binary not found.
Torch version: 2.9.0+rocmsdk20251116
CUDA available: True
Device name: AMD Radeon RX 7900 XTX

The amdgpu-arch warning is harmless — it’s a compile-time tool that isn’t needed at runtime.

Run a quick GPU compute test:

O:\ComfyUI\.venv\Scripts\python.exe -c "
import torch
x = torch.randn(1000, 1000).cuda()
y = torch.randn(1000, 1000).cuda()
z = torch.mm(x, y)
print('GPU matmul OK, sum:', z.sum().item())
"

First Launch

.\launch_comfyui.bat

Navigate to http://127.0.0.1:8188.

Note: Use 127.0.0.1:8188, not localhost:8188. Chrome sometimes returns a 403 on localhost due to HSTS preloading.

ComfyUI startup output confirms everything is working:

pytorch version: 2.9.0+rocmsdk20251116
Set: torch.backends.cudnn.enabled = False for better AMD performance.
AMD arch: gfx1100
ROCm version: (7, 1)
Total VRAM 24560 MB, total RAM 32482 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7900 XTX : native

Key things to check:

AMD arch: gfx1100 — correct RDNA3 architecture
Device: cuda:0 AMD Radeon RX 7900 XTX : native — running natively, not via a translation layer
Set vram state to: NORMAL_VRAM — 24GB is enough that ComfyUI isn’t in a reduced-VRAM mode

The comfy-aimdo warning on startup is also harmless — it’s an Nvidia-only optimisation that self-reports as unsupported and skips itself.

Model Placement

ComfyUI uses separate folders for each model type. The default LTX-Video workflow that loads on first launch needs three models (19.27 GB total) — click “Download all” in the Missing Models dialog and ComfyUI places them automatically.

For manual placement:

Model type	Folder
Diffusion model (main checkpoint)	`O:\ComfyUI\models\diffusion_models\`
Text encoders (T5, CLIP, Qwen)	`O:\ComfyUI\models\text_encoders\`
VAE	`O:\ComfyUI\models\vae\`
CLIP Vision (for image-to-video)	`O:\ComfyUI\models\clip_vision\`
LoRAs	`O:\ComfyUI\models\loras\`
Upscale models	`O:\ComfyUI\models\upscale_models\`

Wan2.1 i2v 480p

File	Folder
`wan2.1_i2v_480p_14B_fp8_scaled.safetensors`	`diffusion_models\`
`umt5-xxl_fp8_e4m3fn.safetensors`	`text_encoders\`
`wan_2.1_vae.safetensors`	`vae\`
`clip_vision_h.safetensors`	`clip_vision\`

Use ComfyUI-Manager → Model Manager to download models directly into the correct folders without having to know the paths.
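If you place files by hand or script your downloads, the tables above reduce to a lookup. A small Python sketch, assuming the O:\ComfyUI install path used in this guide; `destination` is a hypothetical helper, and PureWindowsPath is used so the snippet behaves the same on any OS.

```python
from pathlib import PureWindowsPath

MODELS_ROOT = PureWindowsPath(r"O:\ComfyUI\models")

# File-to-folder mapping for the Wan2.1 i2v 480p set listed above.
WAN21_I2V_480P = {
    "wan2.1_i2v_480p_14B_fp8_scaled.safetensors": "diffusion_models",
    "umt5-xxl_fp8_e4m3fn.safetensors": "text_encoders",
    "wan_2.1_vae.safetensors": "vae",
    "clip_vision_h.safetensors": "clip_vision",
}


def destination(filename: str, layout=WAN21_I2V_480P, root=MODELS_ROOT) -> PureWindowsPath:
    """Full path a model file should be saved to."""
    return root / layout[filename] / filename


vae_path = str(destination("wan_2.1_vae.safetensors"))
```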

Performance

Benchmarked on RX 7900 XTX, ROCm 7.1, PYTORCH_NO_HIP_MEMORY_CACHING=1:

Workflow	Resolution	Frames	Steps	Time
Wan2.1 i2v	480×704	81	25	~40 min
Wan2.1 t2v	480×704	81	25	~5–6 min
LTX-Video t2v	512×512	25	20	~2–3 min

These are slow compared to CUDA on equivalent Nvidia hardware, but they work reliably without OOM errors. The DirectML backend is significantly slower still — ROCm is the right path for AMD on Windows.

Quality vs Speed: FP8 vs BF16

The models come in different precision variants. Understanding the trade-offs helps you get the most out of 24GB VRAM:

Format	Memory	Quality	Best for
BF16	2 bytes/param	★★★★	Final renders, maximum detail
FP8 (scaled)	1 byte/param	★★★☆	Good balance
FP8 (e4m3fn)	1 byte/param	★★★	Fast iteration, finding compositions

Quality ranking: bf16 > fp8_scaled > fp8_e4m3fn

With 24GB VRAM you can run BF16 variants of most models. The practical workflow I use:

Draft — fp8 model, 15–20 steps, find a good seed and composition
Final render — BF16 model, same seed, 35–50 steps

BF16 has FP32-like dynamic range (8-bit exponent) which means fewer NaN/overflow issues and better preservation of fine detail in hair, skin, and fabric. FP8 halves the VRAM requirement, which matters if you want to push to 720p or longer sequences.

If you see banding, posterisation, or loss of micro-detail, switch from fp8_e4m3fn to fp8_scaled or BF16.
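To put numbers on the memory column, weight size is just parameter count times bytes per parameter. A back-of-the-envelope Python sketch for a 14B-parameter model such as the Wan2.1 checkpoint above; it counts weights only, so activations, the text encoder, and the VAE come on top.

```python
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}
GIB = 2 ** 30


def weight_gib(params: float, fmt: str) -> float:
    """Approximate size of the model weights in GiB."""
    return params * BYTES_PER_PARAM[fmt] / GIB


bf16_gib = weight_gib(14e9, "bf16")
fp8_gib = weight_gib(14e9, "fp8")
print(f"bf16: {bf16_gib:.1f} GiB, fp8: {fp8_gib:.1f} GiB")
```

On these numbers the BF16 weights of a 14B model are larger than 24GB on their own, so a BF16 run at that size depends on ComfyUI offloading part of the model to system RAM, while the FP8 weights fit with room to spare.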

Known Issues

Issue	Fix
`FramePack` fails to load — `__init__.py` not found	Use `kijai/ComfyUI-FramePackWrapper`, not `lllyasviel/FramePack`
`transformers==4.46.2` conflict when installing FramePack requirements	Install FramePackWrapper separately; don’t use FramePack’s `requirements.txt`
`uv pip install` — “No virtual environment found”	Use `--python O:\ComfyUI\.venv\Scripts\python.exe` explicitly
Browser 403 on `localhost:8188`	Use `http://127.0.0.1:8188` instead
OOM during 81-frame video generation	Ensure `PYTORCH_NO_HIP_MEMORY_CACHING=1` is set before launch

Lessons Learned

ROCm on Windows works now. AMD shipped ROCm 7.1 for Windows in late 2025. torch.cuda.is_available() returns True on RDNA3. No Zluda, no translation layer, no Linux required.
PYTORCH_NO_HIP_MEMORY_CACHING=1 is essential. Without it, ROCm caches GPU memory aggressively and you’ll hit OOM on longer video runs. This single env var saves roughly a third of VRAM.
Use kijai/ComfyUI-FramePackWrapper, not lllyasviel/FramePack. The original FramePack repo is a standalone Gradio app. It has no __init__.py and will fail to load as a ComfyUI custom node. The kijai wrapper is the correct one.
uv needs explicit --python flags when the venv is on a different drive. uv pip install looks for a venv relative to the current working directory. If your venv is on O: and you’re running from C:, it won’t find it. Pass --python O:\ComfyUI\.venv\Scripts\python.exe explicitly.
Don’t install FramePack’s standalone requirements.txt. It pins transformers==4.46.2, which conflicts with LTX-Video’s requirement for >=4.50.0. Install FramePackWrapper’s requirements separately — they’re clean.
BF16 for final renders, FP8 for drafts. With 24GB VRAM you have the headroom to run BF16 models. Use FP8 to find a good seed quickly, then switch to BF16 for the final high-step render.

Zero-Shot Voice Cloning on AMD — ROCm 7.1 on Windows, F5-TTS, and the ONNX Fallback

2026-04-04T22:00:00+01:00

AMD ROCm 7.1 now runs natively on Windows. Here’s how I used it to build a zero-shot voice cloning pipeline on a gaming machine that can’t switch to Linux.

The Setup

My main machine is a Windows gaming PC with an AMD RX 7900 XTX. I can’t switch to Linux because I play games with kernel-level anti-cheat — Riot Vanguard, EasyAntiCheat, BattlEye. These systems require Windows and won’t run under Wine, Proton, or any compatibility layer. Dual-booting is theoretically possible but kills any iterative AI workflow.

The goal: zero-shot voice cloning on GPU, on Windows, with AMD hardware.

Zero-shot means no fine-tuning — you give the model a short reference clip of any speaker, and it synthesises new speech in their voice. The model I chose is F5-TTS, a flow-matching TTS model that does this well and is fully open source.

The Journey (Short Version)

Before ROCm on Windows existed, I went through several dead ends:

torch-directml — DirectML doesn’t support ComplexFloat (FFT ops). F5-TTS uses STFT for mel spectrograms. Fatal incompatibility.
VMware PCIe passthrough — NOT_IMPLEMENTED on Windows hosts. Linux host required.
ROCm on Windows — didn’t exist. PyTorch ROCm wheels were Linux-only.
ZLUDA — CUDA compatibility layer for AMD. torch.stft explicitly broken.

The workaround I built was an ONNX + DirectML hybrid — export F5-TTS to three ONNX models, run the transformer on DirectML GPU and the FFT-heavy preprocessing/decode on CPU. It worked, but it was a compromise.

Then AMD shipped ROCm 7.1 for Windows.

ROCm 7.1 on Windows — The Real Solution

AMD’s HIP SDK for Windows is now available at repo.radeon.com, and PyTorch 2.9.0 ROCm wheels are included. torch.cuda.is_available() returns True on the RX 7900 XTX. The full pipeline — mel spectrogram, transformer, vocoder — runs on GPU.

Setting Up the ROCm Venv

Create a dedicated virtual environment (keep it separate from your main Python env):

# Use Python 3.12
python -m venv venv_rocm

Install the ROCm SDK and PyTorch from AMD’s repo:

.\venv_rocm\Scripts\python.exe -m pip install `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm-0.1.dev0.tar.gz `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_core-0.1.dev0-py3-none-win_amd64.whl `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_devel-0.1.dev0-py3-none-win_amd64.whl `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_libraries_custom-0.1.dev0-py3-none-win_amd64.whl `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/torch-2.9.0+rocmsdk20251116-cp312-cp312-win_amd64.whl `
  https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/torchaudio-2.9.0+rocmsdk20251116-cp312-cp312-win_amd64.whl

Install f5-tts and dependencies:

.\venv_rocm\Scripts\python.exe -m pip install f5-tts soundfile pydub pyyaml numpy

Verify the GPU is detected:

import torch
print(torch.__version__)          # 2.9.0+rocmsdk20251116
print(torch.cuda.is_available())  # True
print(torch.cuda.get_device_name(0))  # AMD Radeon RX 7900 XTX

Required Environment Variables

ROCm on Windows needs three env vars set before running. I put these in a launcher script:

# scripts/launch_voice_rocm.ps1
$env:PYTORCH_NO_HIP_MEMORY_CACHING = "1"   # saves ~1/3 VRAM, prevents OOM
$env:HIP_VISIBLE_DEVICES = "0"              # target RX 7900 XTX, ignore iGPU
$env:HSA_OVERRIDE_GFX_VERSION = "11.0.0"   # force gfx1100 (RDNA3) compatibility

PYTORCH_NO_HIP_MEMORY_CACHING=1 is particularly important — without it, ROCm caches GPU memory aggressively and you’ll hit OOM on longer runs.
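If you would rather launch from Python than PowerShell, the same three variables can be injected into a child process. This is a minimal sketch, not the launcher from the project: the helper names rocm_env and launch are hypothetical, and it assumes it is run with the venv_rocm interpreter so that sys.executable points at the ROCm environment.

```python
import os
import subprocess
import sys

# The three variables set by launch_voice_rocm.ps1, as plain strings.
ROCM_ENV = {
    "PYTORCH_NO_HIP_MEMORY_CACHING": "1",
    "HIP_VISIBLE_DEVICES": "0",
    "HSA_OVERRIDE_GFX_VERSION": "11.0.0",
}


def rocm_env(base=None):
    """Return a copy of the environment with the ROCm variables applied."""
    env = dict(os.environ if base is None else base)
    env.update(ROCM_ENV)
    return env


def launch(script="scripts/generate_f5_rocm.py", extra_args=()):
    """Run the backend in a child process so the variables exist before torch is imported."""
    cmd = [sys.executable, script, *extra_args]
    return subprocess.run(cmd, env=rocm_env(), check=True)
```

Setting the variables on the child process rather than inside the script avoids any ordering problem with imports that initialise the HIP runtime.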

Compatibility Patches

ROCm 7.1 + PyTorch 2.9 + f5-tts 1.1.18 required four patches to work together. None are fundamental issues — they’re version incompatibilities that will be fixed upstream:

File	Issue	Fix
`encodec/distrib.py`	`torch.distributed.ReduceOp` moved in PyTorch 2.9	`try/except` fallback
`torchaudio/__init__.py`	torchaudio 2.9 requires torchcodec (no Windows DLLs)	soundfile fallback
`f5_tts/model/cfm.py`	Sway sampling produces duplicate ODE timesteps	`torch.unique()`
`f5_tts/infer/utils_infer.py`	`ThreadPoolExecutor` causes tensor size mismatches	Sequential loop

The torchaudio patch is the most interesting: torchaudio 2.9 replaced its load() function with a torchcodec-only implementation, but torchcodec’s Windows DLLs don’t ship with the ROCm build. The fix is a short try/except fallback to soundfile:

# torchaudio/__init__.py — patched load()
try:
    return load_with_torchcodec(uri, ...)
except (ImportError, OSError):
    import soundfile as _sf
    data, sample_rate = _sf.read(str(uri), dtype="float32", always_2d=True)
    return torch.from_numpy(data.T if channels_first else data), sample_rate

Running It

# Default (NFE=32, fast)
.\scripts\launch_voice_rocm.ps1

# Higher quality
.\scripts\launch_voice_rocm.ps1 --nfe 64

# Best quality
.\scripts\launch_voice_rocm.ps1 --nfe 128

The Architecture: Full GPU vs Hybrid

ROCm Native (Full GPU)

Reference Audio + Text
         │
         ▼
┌─────────────────────┐
│  Mel Spectrogram    │  ← ROCm GPU (STFT — works natively!)
│  Text Tokenisation  │
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│  F5 Transformer     │  ← ROCm GPU (flow-matching, 32-128 steps)
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│  Vocoder (Vocos)    │  ← ROCm GPU (mel → waveform)
└─────────────────────┘
         │
         ▼
      output.wav

Everything runs on GPU. No CPU↔GPU transfers between stages.

ONNX + DirectML (Hybrid Fallback)

Reference Audio + Text
         │
         ▼
┌─────────────────────┐
│  F5_Preprocess.onnx │  ← CPU (ComplexFloat/FFT — DirectML can't do this)
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│ F5_Transformer.onnx │  ← DirectML GPU (pure float ops — works fine)
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│   F5_Decode.onnx    │  ← CPU (ISTFT/vocoder — same FFT issue)
└─────────────────────┘
         │
         ▼
      output.wav

The preprocessing and decode stages run on CPU because DirectML doesn’t support ComplexFloat (FFT). Only the transformer runs on GPU.

Reference Audio Pipeline

The quality of the output depends heavily on the reference clip. I built an ingest pipeline to automate finding and preparing good clips:

# scripts/ingest.py
# 1. Download from YouTube
ydl_opts = {
    "format": "bestaudio/best",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
}

# 2. Trim to the clean section
ffmpeg.input(raw_wav, ss=start_time, to=end_time) \
      .output(trimmed_wav, ar=22050, ac=1) \
      .run(overwrite_output=True)

# 3. Transcribe with Whisper
model = whisper.load_model("base")
result = model.transcribe(trimmed_wav)
transcript = result["text"].strip()

What makes a good reference clip:

6–30 seconds — long enough for voice characteristics, short enough to avoid drift
Clean audio — no background music, minimal reverb, no compression artefacts
Consistent delivery — don’t use a clip where the speaker is shouting or whispering

For Neil deGrasse Tyson (my test voice), I used an 11.9-second clip from a YouTube lecture, trimmed to a section with clean, energetic speech and no background noise.

The transcript must match the audio exactly — F5-TTS uses it to align voice conditioning. An accurate transcript noticeably improves output quality.
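The measurable parts of the checklist above (duration, and the mono output the ffmpeg step produces) can be checked automatically before a clip is used. A minimal sketch using only the standard library; check_reference_clip is a hypothetical helper, not part of ingest.py, and it only handles PCM WAV files. Audio cleanliness and delivery still need a human ear.

```python
import wave


def check_reference_clip(path, min_s=6.0, max_s=30.0):
    """Return (duration_seconds, problems) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        channels = wav.getnchannels()
    problems = []
    if duration < min_s:
        problems.append(f"too short: {duration:.1f}s < {min_s:.0f}s")
    if duration > max_s:
        problems.append(f"too long: {duration:.1f}s > {max_s:.0f}s")
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    return duration, problems
```

An empty problems list means the clip falls inside the 6–30 second window and is mono; anything else is worth re-trimming before spending GPU time on it.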

Configuration

Everything is driven by config.yaml:

voice:
  name: neil_degrasse_tyson
  audio_path: reference_audio/neil_degrasse_tyson/ndgt_ref_new.wav
  transcript: "So, here in the United States, we completely freaked out for
    multiple reasons. First, they beat us at something technological that
    they're not supposed to, because they're like communists."
  language: en

model:
  backend: f5_onnx_dml   # or f5_rocm via launch_voice_rocm.ps1
  onnx_model_dir: onnx_models/F5-TTS-ONNX-GPU-NFE128-CFG3
  nfe_step: 128
  speed: 0.75
  device_id: 0

output:
  output_dir: outputs/runs
  target_duration: 5.0
  silence_thresh_db: -40
  keep_raw: true

sentences:
  - "The universe is under no obligation to make sense to you."
  - "We are all connected — to each other, biologically; to the earth,
    chemically; to the rest of the universe, atomically."
  - "The good thing about science is that it's true whether or not you
    believe in it."

Machine-specific paths go in .env (not committed):

VOICE_GENERATOR_MODEL_DIR=C:\Users\joshu\...\onnx_models\F5-TTS-ONNX-GPU-NFE128-CFG3
VOICE_GENERATOR_OUTPUT_DIR=C:\Users\joshu\...\outputs\runs
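One way the .env values could take precedence over config.yaml is a small override step after the YAML is loaded. This is a sketch under assumptions: the mapping from variable name to config key is my guess at the obvious pairing, and apply_env_overrides is a hypothetical helper rather than the loader in lib/config.py.

```python
import copy
import os

# Assumed mapping: environment variable -> (config section, key)
ENV_OVERRIDES = {
    "VOICE_GENERATOR_MODEL_DIR": ("model", "onnx_model_dir"),
    "VOICE_GENERATOR_OUTPUT_DIR": ("output", "output_dir"),
}


def apply_env_overrides(config, environ=None):
    """Return a copy of config where any set environment variable wins."""
    environ = os.environ if environ is None else environ
    merged = copy.deepcopy(config)
    for var, (section, key) in ENV_OVERRIDES.items():
        value = environ.get(var)
        if value:
            merged.setdefault(section, {})[key] = value
    return merged
```

Returning a copy keeps the parsed config.yaml untouched, so the committed defaults and the machine-specific values stay easy to compare.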

Performance

Benchmarked on AMD RX 7900 XTX, 10-second reference clip, speed=0.75:

Backend	NFE	Precision	Time/clip	Notes
ONNX + DirectML	128	FP16	~33s	Stable, no SDK needed
ONNX + DirectML	256	FP32	~64s	Higher quality
ROCm native	32	FP32	~10s	3x faster than ONNX
ROCm native	64	FP32	~17s	Sweet spot
ROCm native	128	FP32	~30s	Best quality

The sweet spot is ROCm native at NFE=64: noticeably better quality than NFE=32 (double the sampling steps), still about 2x faster than the ONNX+DirectML run at NFE=128 (~17s vs ~33s), and the quality improvement from 64→128 is marginal for most use cases.

At NFE=128, ROCm native (~30s) is roughly equivalent to ONNX+DirectML (~33s) in speed, but better in quality because the full pipeline runs in FP32 with no precision loss between stages.
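A quick way to compare rows that use different NFE values is to divide each time by its step count. The sketch below uses only the numbers from the benchmark table; treating clip time as proportional to NFE is a simplifying assumption that ignores fixed overhead such as model load and vocoding.

```python
# (backend, NFE, seconds per clip) copied from the benchmark table
BENCHMARKS = [
    ("ONNX + DirectML", 128, 33),
    ("ONNX + DirectML", 256, 64),
    ("ROCm native", 32, 10),
    ("ROCm native", 64, 17),
    ("ROCm native", 128, 30),
]


def seconds_per_step(rows):
    """Approximate cost of one sampling step, ignoring fixed overhead."""
    return {(name, nfe): secs / nfe for name, nfe, secs in rows}


per_step = seconds_per_step(BENCHMARKS)
for (name, nfe), cost in per_step.items():
    print(f"{name:16s} NFE={nfe:<4d} {cost:.2f} s/step")
```

Under that rough model every row lands between about 0.23 and 0.31 seconds per step, which suggests the NFE setting, more than the backend, drives the wall-clock differences in the table.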

Project Structure

voice_generator/
├── config.yaml                         # All settings
├── .env                                # Machine-specific paths (not committed)
├── .env.example                        # Template
├── scripts/
│   ├── generate_f5_rocm.py             # ROCm native backend
│   ├── generate_f5_onnx_dml.py         # ONNX+DirectML fallback
│   ├── launch_voice_rocm.ps1           # ROCm launcher (sets env vars)
│   ├── ingest.py                       # YouTube → trimmed WAV + transcript
│   └── transcribe.py                   # Whisper transcription
├── lib/
│   ├── audio.py                        # FFmpeg, normalisation, silence trim
│   ├── vocab.py                        # F5-TTS vocabulary handling
│   └── config.py                       # Config dataclasses + loader
├── venv_rocm/                          # ROCm Python environment
├── onnx_models/
│   ├── F5-TTS-ONNX-GPU-NFE128-CFG3/   # ONNX FP16 (DirectML)
│   └── F5-TTS-ONNX-GPU-FP32-NFE256/   # ONNX FP32 (DirectML)
├── reference_audio/
│   └── neil_degrasse_tyson/
│       └── ndgt_ref_new.wav            # 11.9s reference clip
├── outputs/runs/                       # Generated audio
└── tests/                              # 78 pytest unit tests

The ONNX + DirectML Fallback

If you don’t want to install the full ROCm SDK (~3.5GB), the ONNX + DirectML approach still works well. It requires only standard AMD Adrenalin drivers and ONNX Runtime with the DirectML execution provider.

The ONNX models are exported from F5-TTS with NFE and CFG baked in:

# onnx_export/Export_F5.py
use_fp16_transformer = True   # FP16 for DirectML
NFE_STEP = 128
CFG_STRENGTH = 3.0
OUTPUT_DIR = r"onnx_models\F5-TTS-ONNX-GPU-NFE128-CFG3"

The transformer runs on DirectML GPU, preprocessing and decode run on CPU:

# DirectML for transformer
ort_session_b = onnxruntime.InferenceSession(
    "F5_Transformer.onnx",
    providers=["DmlExecutionProvider"],
)

# CPU for preprocessing and decode
ort_session_a = onnxruntime.InferenceSession(
    "F5_Preprocess.onnx",
    providers=["CPUExecutionProvider"],
)

When to use ONNX + DirectML:

You don’t want to install the 3.5GB ROCm SDK
You need to run on a non-AMD GPU (NVIDIA, Intel — DirectML works on all DirectX 12 GPUs)
You want FP16 precision to save VRAM
You need a more stable, less patchy setup

Test Suite

The project has 78 pytest unit tests:

tests/
├── test_lib_audio.py       # 19 tests
├── test_lib_vocab.py       # 18 tests
├── test_lib_config.py      # 22 tests
├── test_integration_smoke.py  # GPU required
└── test_e2e_full_run.py       # GPU required

pytest tests/ -v          # 78 unit tests, no GPU needed
pytest tests/ -m integration  # requires DirectML GPU
pytest tests/ -m e2e          # full pipeline test

Lessons Learned

ROCm on Windows works now. AMD shipped ROCm 7.1 for Windows in late 2025. torch.cuda.is_available() returns True on RDNA3. The ecosystem is still maturing but it’s functional.
The ONNX hybrid is still worth knowing. If you don’t want the ROCm SDK overhead, or you need to run on non-AMD hardware, ONNX + DirectML is a solid fallback that works on any DirectX 12 GPU.
NFE=64 is the sweet spot for ROCm native. Noticeably better quality than NFE=32 (double the sampling steps), still about 2x faster than ONNX+DirectML at NFE=128, and the marginal quality gain from 64→128 rarely justifies the roughly 2x time cost.
Reference audio quality matters more than model parameters. A clean 12-second clip beats a noisy 30-second clip every time. Get the transcript right — it directly affects voice conditioning quality.
PYTORCH_NO_HIP_MEMORY_CACHING=1 is essential. Without it, ROCm caches GPU memory aggressively and you’ll hit OOM on longer runs. This env var saves roughly a third of VRAM.
Separate config from machine-specific paths. Using .env for absolute paths means the same config.yaml works on any machine without modification.

Optimizing Proxmox Backup Server with S3: Regional Migration and Fixing a Glacier Misconfiguration

2026-04-04T13:00:00+01:00

How I migrated my PBS S3 datastore to a closer regional endpoint, resolved a Glacier lifecycle misconfiguration, and properly optimized the setup

The Most Important Thing First: Glacier is Incompatible with PBS

Before anything else — if you’re running Proxmox Backup Server with an S3 backend, do not use Glacier lifecycle policies. This includes Glacier Instant Retrieval, Glacier Flexible Retrieval, and Glacier Deep Archive.

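PBS needs immediate, on-demand access to chunks for garbage collection, verification, deduplication, and restores. Objects in Glacier Flexible Retrieval and Glacier Deep Archive must be restored before they can be read, which takes anywhere from minutes to 48 hours depending on tier and retrieval option. Glacier Instant Retrieval does serve reads in milliseconds, but it adds per-GB retrieval charges and a 90-day minimum storage duration. The moment PBS tries to read an archived chunk that hasn’t been restored, the request fails. This breaks GC, verification, and restores, either silently or with cryptic errors.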

The correct storage class for PBS S3 is S3 Intelligent-Tiering — it automatically moves infrequently accessed data to cheaper tiers, but everything remains immediately accessible with no retrieval delays or fees.

Background

I run a Proxmox homelab with multiple PVE nodes and PBS servers. One of my PBS servers uses AWS S3 as a backend for offsite backups. PBS 4.x supports S3 as a “technology preview” feature — it uses a local cache disk and syncs chunks to S3.

The setup had been running for several months and had accumulated a number of issues:

Intermittent connection errors (“bytes remaining on stream”, “Transport endpoint not connected”)
The S3 cache disk was growing without bound
S3 costs were higher than expected due to Glacier retrieval fees

I decided to do a thorough investigation and fix everything properly.

The Investigation

Infrastructure Overview

Component	Details
PBS Server	VM on Proxmox
S3 Backend	AWS S3
Cache Disk	850 GB ext4

Key Findings

1. Wrong Regional Endpoint The PBS server and the S3 bucket were in different regions, so every S3 API call incurred unnecessary cross-region latency. With millions of small chunk objects, this latency compounds significantly, because a PBS datastore on S3 is a high-request-count workload.

2. Glacier Lifecycle Disaster A lifecycle policy was transitioning objects through Glacier tiers:

Day 14 → Glacier Instant Retrieval
Day 104 → Glacier Flexible Retrieval
Day 194 → Glacier Deep Archive

As covered above, this is fundamentally incompatible with PBS. It was silently breaking GC and verification, and would have made restores impossible for older backups.

3. Unbounded Cache Growth The 850 GB cache disk was 65% full with 1.67M chunk files across 65,536 subdirectories. PBS docs recommend only 64–128 GiB for the cache.

Cache breakdown:

~71% of chunks were 0-byte marker files (cache index markers)
~29% contained actual cached data
Chunks from months ago were still in the cache
No automatic cache eviction exists in this PBS version

4. TCP Keepalive Too Slow Default tcp_keepalive_time was 7200 seconds (2 hours). Dead S3 connections weren’t detected for hours, causing the “Transport endpoint not connected” errors. High latency to a distant S3 region made this worse — more connections timing out silently.

5. Ext4 Wasted Space The cache disk had 4.18% reserved blocks — about 37 GB wasted on a disk where root reservation serves no purpose.

6. GC Schedule Needed Review Garbage collection frequency needs careful consideration with S3 backends — every GC run makes a large number of LIST and HEAD API calls against S3, which cost money. Running GC too frequently wastes money; too infrequently leaves orphaned chunks accumulating. Weekly is a reasonable balance for most setups.

7. S3 Endpoint Style The endpoint was using path-style addressing (s3.amazonaws.com/bucket/key) instead of the recommended vhost-style (bucket.s3.region.amazonaws.com/key).

The Optimizations

Phase 1: No-Downtime Changes

1. GC Schedule

proxmox-backup-manager datastore update pbs-s3 --gc-schedule "sat 02:00"

Weekly GC on Saturday at 2am. Frequent enough to keep orphaned chunks in check, infrequent enough to keep S3 API costs reasonable.

2. Ext4 Reserved Blocks: 4.18% → 1%

tune2fs -m 1 /dev/sdc

Freed ~28 GB immediately. No reason to reserve 37 GB for root on a cache disk.

3. TCP Keepalive Tuning

cat > /etc/sysctl.d/99-s3-tuning.conf << EOF
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6
net.ipv4.tcp_fin_timeout = 30
EOF
sysctl -p /etc/sysctl.d/99-s3-tuning.conf

Dead S3 connections now detected in ~2 minutes instead of 2 hours. Essential when connection latency is non-trivial.
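The "~2 minutes" figure follows directly from the three sysctl values: the idle time before the first probe, plus the probe interval multiplied by the number of unanswered probes. A small worked example, where the "default" row uses the stock Linux values (7200 / 75 / 9). Note that these settings only affect sockets that have SO_KEEPALIVE enabled.

```python
def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Worst-case seconds before the kernel declares an idle peer dead."""
    return time_s + intvl_s * probes


tuned = keepalive_detection_seconds(60, 10, 6)       # values from 99-s3-tuning.conf
default = keepalive_detection_seconds(7200, 75, 9)   # stock Linux defaults
print(f"tuned: {tuned}s, default: {default}s")
```

With the tuned values a dead connection is dropped after 120 seconds; with the defaults it takes 7,875 seconds, a little over two hours.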

4. Ext4 Mount Options

Updated /etc/fstab:

UUID=<cache-disk-uuid> /mnt/S3BackupCache ext4 noatime,commit=120 0 2

noatime — Eliminates metadata writes on every access across 1.67M files
commit=120 — Reduces journal commit frequency (cache is reconstructible from S3)
UUID-based mount for stability across disk reorders

5. S3 Endpoint: Path-style → Vhost-style

proxmox-backup-manager s3 endpoint update pbs-s3 \
    --endpoint '<bucket>.s3.<region>.amazonaws.com' \
    --region <region> \
    --delete path-style

Direct regional routing rather than the global endpoint.

Phase 2: Restart Required

6. RAM: Increased

More RAM means better filesystem caching for 1.67M chunk files — the OS page cache can hold more of the chunk index in memory.

The Migration: Moving to a Closer Region

The Problem

The PBS server and S3 bucket were in different regions. Every backup chunk upload and every GC/verification API call was crossing region boundaries. This was the root cause of the elevated latency and connection instability.

Step 1: Create New Bucket in the Correct Region

aws s3api create-bucket \
    --bucket <new-bucket> \
    --region <new-region> \
    --create-bucket-configuration LocationConstraint=<new-region>

Configured with:

S3 Intelligent-Tiering lifecycle (no Glacier)
Server-side encryption
Randomized bucket name for security

Step 2: Restore Glacier Objects

The biggest challenge — over 860,000 objects were in Glacier or Deep Archive and needed to be restored before they could be copied.

Storage Class	Objects
STANDARD	~828,000
GLACIER_IR	~212,000
GLACIER	~433,000
DEEP_ARCHIVE	~430,000

First Attempt: Individual API Calls (Too Slow)

Started with parallel aws s3api restore-object calls. At ~1–2 seconds per call with 860K objects, this would have taken days.

Solution: S3 Batch Operations

Used S3 Batch Operations to restore all Glacier objects server-side:

Generated a CSV manifest of all Glacier objects
Created an IAM role for batch operations
Submitted the batch job via the AWS console

Result: ~810,000 succeeded, ~51,000 “failed” with RestoreAlreadyInProgress (from our earlier individual attempts — not real failures). Completed in ~2 hours entirely on AWS infrastructure.
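The manifest step can be sketched as a filter over an object listing. This is an illustration rather than the script I used: manifest_csv is a hypothetical helper, and it assumes the input dicts look like the entries boto3’s list_objects_v2 returns ("Key" and "StorageClass"). As far as I know, Batch Operations expects a bucket,key CSV with URL-encoded keys, so the sketch encodes everything except the slashes.

```python
import csv
import io
from urllib.parse import quote

# Storage classes that need a restore request before they can be copied.
ARCHIVED_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def manifest_csv(bucket, objects):
    """Build a bucket,key CSV for the objects that need restoring."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for obj in objects:
        if obj.get("StorageClass") in ARCHIVED_CLASSES:
            # URL-encode the key but keep "/" so prefixes stay readable.
            writer.writerow([bucket, quote(obj["Key"], safe="/")])
    return buf.getvalue()
```

GLACIER_IR objects are deliberately left out: they can be read immediately, so they do not need a restore job.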

Step 3: Copy Data to New Region

Standard Objects (`aws s3 sync`)

aws s3 sync s3://<old-bucket> s3://<new-bucket> \
    --region <new-region> \
    --source-region <old-region> \
    --storage-class INTELLIGENT_TIERING

However, aws s3 sync skips objects with the GLACIER storage class by default, even after they’ve been restored (the CLI’s --force-glacier-transfer flag exists to override this).

Glacier Objects (boto3)

Used Python boto3 to copy the restored Glacier objects:

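from concurrent.futures import ThreadPoolExecutor
import boto3

# Placeholders: substitute your own bucket names and destination region.
SRC_BUCKET = '<old-bucket>'
DST_BUCKET = '<new-bucket>'
DST_REGION = '<new-region>'

s3_dst = boto3.client('s3', region_name=DST_REGION)

def copy_one(key):
    # Server-side copy into the new bucket, rewriting the storage class.
    s3_dst.copy_object(
        Bucket=DST_BUCKET,
        Key=key,
        CopySource={'Bucket': SRC_BUCKET, 'Key': key},
        StorageClass='INTELLIGENT_TIERING'
    )

# glacier_keys: iterable of restored object keys, built elsewhere (e.g. from the batch manifest)
with ThreadPoolExecutor(max_workers=20) as executor:
    # list() drains the result iterator so an exception in any worker
    # is raised here instead of being silently dropped.
    list(executor.map(copy_one, glacier_keys))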

Result: ~833,000 objects copied, 0 failures. ✅

Step 4: Switch PBS to New Bucket

# Maintenance mode
proxmox-backup-manager datastore update pbs-s3 \
    --maintenance-mode 'type=offline,message="Migrating region"'

# Update endpoint region
proxmox-backup-manager s3 endpoint update pbs-s3 --region <new-region>

# Update bucket name in config
sed -i 's/bucket=<old-bucket>/bucket=<new-bucket>/' \
    /etc/proxmox-backup/datastore.cfg

# Verify connectivity
proxmox-backup-manager s3 check pbs-s3 <new-bucket>

# Remove maintenance mode
proxmox-backup-manager datastore update pbs-s3 --delete maintenance-mode

Step 5: Full Verification

proxmox-backup-manager verify-job update <job-id> --ignore-verified false
proxmox-backup-manager verify-job run <job-id>

Lifecycle Policy: The Right Way

❌ Wrong (What I Had)

Day 0   → S3 Standard
Day 14  → Glacier Instant Retrieval
Day 104 → Glacier Flexible Retrieval
Day 194 → Glacier Deep Archive

This breaks PBS completely — GC, verification, dedup, and restores all require immediate chunk access.

✅ Correct

{
    "Rules": [{
        "ID": "pbs-intelligent-tiering",
        "Status": "Enabled",
        "Filter": {},
        "Transitions": [{
            "Days": 1,
            "StorageClass": "INTELLIGENT_TIERING"
        }]
    }]
}

S3 Intelligent-Tiering automatically moves infrequently accessed data to cheaper tiers, but everything remains immediately accessible with no retrieval fees or delays.
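Because the wrong policy fails quietly, it is worth checking a bucket’s lifecycle configuration mechanically. A minimal sketch: glacier_transitions is a hypothetical helper that takes the parsed JSON in the same shape as the policy shown here (a top-level "Rules" list) and reports any transition into a Glacier class. It only inspects "Transitions", so noncurrent-version rules would need the same treatment.

```python
import json

# Storage classes this post treats as incompatible with PBS.
GLACIER_CLASSES = {"GLACIER_IR", "GLACIER", "DEEP_ARCHIVE"}


def glacier_transitions(policy):
    """Return (rule_id, days, storage_class) for every Glacier transition found."""
    found = []
    for rule in policy.get("Rules", []):
        if rule.get("Status") != "Enabled":
            continue  # disabled rules never move objects
        for transition in rule.get("Transitions", []):
            if transition.get("StorageClass") in GLACIER_CLASSES:
                found.append(
                    (rule.get("ID"), transition.get("Days"), transition["StorageClass"])
                )
    return found


good = json.loads(
    '{"Rules": [{"ID": "pbs-intelligent-tiering", "Status": "Enabled", "Filter": {},'
    ' "Transitions": [{"Days": 1, "StorageClass": "INTELLIGENT_TIERING"}]}]}'
)
print(glacier_transitions(good))
```

An empty list means no enabled rule moves objects into a Glacier class; anything else should be removed before PBS writes to the bucket.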

Cache Disk Shrink

After migration, the cache disk was shrunk from 850 GB to 128 GiB:

Add new smaller disk to the VM
Put datastore in maintenance mode, stop proxy
Format new disk: mkfs.ext4 -L S3BackupCache /dev/sdX && tune2fs -m 1 /dev/sdX
Update /etc/fstab with UUID of new disk
Mount, start proxy, remove maintenance mode
Run proxmox-backup-manager datastore s3-refresh pbs-s3 — this pulls all manifest/index files from S3 so existing backups become visible in the new cache
Remove old disk

Important: After replacing the cache disk, run s3-refresh. The new disk starts empty — PBS won’t know about existing S3 backups until the manifests are downloaded. This is a one-time operation.

Before & After

Metric	Before	After
S3 Region	Distant region	Closer regional endpoint
API Latency	High	Low
Endpoint Style	path-style	vhost-style
Lifecycle	Glacier cascade	Intelligent-Tiering
GC Frequency	Monthly	Weekly
TCP Keepalive	2 hours	60 seconds
Mount Options	defaults	noatime,commit=120
Reserved Blocks	4.18% (37 GB wasted)	1%
Cache Disk	850 GB (unbounded)	128 GiB
Connection Errors	Frequent	Gone
Backup Performance	Unoptimised	Optimised

Lessons Learned

Never use Glacier lifecycle policies with PBS S3. PBS needs immediate access to all chunks. Use Intelligent-Tiering instead. Check this before doing anything else.
S3 region matters. Put the bucket in the same or closest available region to the PBS server. Cross-region latency compounds badly with high object counts.
GC frequency vs. S3 API cost is a real tradeoff. Every GC run makes thousands of API calls. Don’t run it more frequently than necessary — weekly is a good default for most homelab setups.
TCP keepalive tuning is critical for S3. The default 2-hour timeout means dead connections go undetected. With any meaningful latency, this causes intermittent backup failures.
The PBS S3 cache needs deliberate sizing. 64–128 GiB is recommended. An oversized cache disk just fills with stale data that is never evicted.
After replacing the cache disk, run s3-refresh. The new disk starts empty — existing S3 backups won’t be visible until manifests are downloaded.
aws s3 sync skips GLACIER-class objects by default, even when restored. Use boto3 copy_object() for those, or look at the CLI’s --force-glacier-transfer flag.
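ext4 noatime is essential with millions of small files. Under the default relatime option a read can still trigger an access-time write (when the stored atime is older than the file’s mtime or ctime, or more than a day old); noatime removes those writes entirely, which makes a noticeable difference.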

Tags: proxmox, pbs, s3, aws, glacier, backup, optimization, homelab

Tuning Open WebUI + AWS Bedrock for Complex AI Workflows — Timeouts, Code Execution, and Custom Patches

2026-03-28T00:00:00+00:00

My self-hosted AI setup runs Open WebUI backed by AWS Bedrock via a custom gateway. Simple queries work fine. But complex workflows — sub-agents making dozens of tool calls, web searches, and code execution — kept timing out, dropping connections, or just hanging indefinitely.

This post covers the full diagnosis and every customisation I’ve made, including the trade-offs and drawbacks of each one.

🏗️ The Architecture

Browser → Open WebUI (Docker)
              ↓
         Bedrock Gateway (Docker, internal network)
              ↓
         AWS Bedrock API (eu-west-2)
              ↓
         SearXNG (web search) / Tika (document parsing) / Jupyter (code execution)

Six Docker containers on a shared bridge network, all communicating internally. Open WebUI is the only container with an exposed port. The Bedrock gateway translates OpenAI-compatible API calls into AWS Bedrock’s ConverseStream format, with cross-region inference enabled so models appear with global.* prefixes and route automatically.

🐛 The Problem

Complex queries with sub-agents or code execution would fail in three ways:

WebSocket drops — the browser connection would silently die mid-response
Code execution hangs — Python code blocks would take 30+ seconds or never return
Bedrock validation errors — tool-use conversations would hit 400 Bad Request after many iterations

Simple one-shot queries worked perfectly. The failures only surfaced during multi-turn, tool-heavy workflows.

🔍 The Investigation

WebSocket Keepalive Failures

The Open WebUI logs showed repeated errors:

keepalive ping failed
AssertionError
  File "websockets/legacy/protocol.py", line 308, in _drain_helper
    assert waiter is None or waiter.cancelled()

This is a known bug in websockets v16.0 — the library’s legacy protocol throws an AssertionError when trying to send a ping to a connection that’s mid-drain. During complex queries, the server is busy processing tool calls and can’t respond to WebSocket pings in time.

The default WEBSOCKET_SERVER_PING_TIMEOUT is 20 seconds. A single sub-agent iteration with web search, code execution, and LLM response easily exceeds that.

Code Execution Round-Trip

Open WebUI’s default code execution engine is Pyodide — a WebAssembly Python runtime that runs in the browser. The execution path for every code block is:

Server → WebSocket event → Browser → Pyodide WASM → Browser → WebSocket → Server → Bedrock API

Every code block makes a full round-trip through the browser’s WebSocket connection. On a multi-step sub-agent workflow running 3-5 code blocks, this adds 30-60 seconds of pure overhead — and if the WebSocket drops mid-execution, the entire workflow fails silently.

Bedrock Validation Errors

Two specific errors appeared in the gateway logs during long conversations:

ValidationException: The toolConfig field must be defined when using
toolUse and toolResult content blocks.

ValidationException: prompt is too long: 2,084,831 tokens > 1,000,000 maximum

The first indicates tool configuration wasn’t being forwarded properly on follow-up turns. The second shows conversation history accumulating past Bedrock’s 1M token context window — a natural consequence of sub-agents that generate hundreds of tool call results.

API Latency

The Bedrock gateway was configured to use us-east-1 (Virginia). Every API call — and there are dozens per sub-agent workflow — was crossing the Atlantic and back. With the server physically located in the UK, this added 100-200ms per request, compounding across multi-turn conversations.

🛠️ The Fixes

Fix 1: Increase WebSocket and HTTP Timeouts

Three environment variables on the Open WebUI container:

-e WEBSOCKET_SERVER_PING_TIMEOUT=120    # Was 20s — prevents keepalive failures
-e WEBSOCKET_EVENT_CALLER_TIMEOUT=600   # Was 300s — allows longer tool chains
-e AIOHTTP_CLIENT_TIMEOUT=600           # Was 300s — prevents HTTP client timeouts

Why: The defaults assume short request-response cycles. Sub-agent workflows with tool calls, web searches, and code execution routinely exceed 5 minutes end-to-end.

Drawback: Higher timeouts mean genuinely broken connections take longer to detect. A hung request will now sit for 10 minutes before timing out, consuming a server thread the entire time. On a resource-constrained system, this could become a problem under concurrent usage.

Fix 2: Server-Side Code Execution with Jupyter

Replaced the browser-side Pyodide engine with a server-side Jupyter notebook container:

services:
  jupyter:
    image: jupyter/scipy-notebook:latest
    container_name: jupyter
    restart: always
    environment:
      - JUPYTER_TOKEN=
    command: start-notebook.py --NotebookApp.allow_origin='*' --NotebookApp.ip='0.0.0.0'
    networks:
      - ai-services

Open WebUI configured with:

-e CODE_EXECUTION_ENGINE=jupyter
-e CODE_INTERPRETER_ENGINE=jupyter
-e CODE_EXECUTION_JUPYTER_URL=http://jupyter:8888
-e CODE_INTERPRETER_JUPYTER_URL=http://jupyter:8888
-e CODE_EXECUTION_JUPYTER_AUTH=token
-e CODE_INTERPRETER_JUPYTER_AUTH=token
-e CODE_EXECUTION_JUPYTER_AUTH_TOKEN=
-e CODE_INTERPRETER_JUPYTER_AUTH_TOKEN=
-e CODE_EXECUTION_JUPYTER_TIMEOUT=60
-e CODE_INTERPRETER_JUPYTER_TIMEOUT=60

The execution path is now:

Server → Jupyter HTTP API → Server

No browser round-trip, no WebSocket dependency, and scipy-notebook ships with NumPy, pandas, matplotlib, and SciPy pre-installed.

Why: Eliminates the browser round-trip entirely. Code execution drops from 10-30 seconds to 1-3 seconds. The Jupyter kernel persists state across code blocks within a session, so variables and imports carry over.

Drawback: The jupyter/scipy-notebook image is ~1.5GB and uses significant RAM. On a memory-constrained system, this adds pressure. The Jupyter server also has full access to the Docker network — any code the LLM generates runs server-side with the same network access as every other container. This is a real security consideration for multi-user deployments.

Fix 3: Move Bedrock to eu-west-2 (London)

Changed the gateway’s AWS region from us-east-1 to eu-west-2:

environment:
  - AWS_REGION=eu-west-2

With cross-region inference enabled, global.* model prefixes automatically route to the nearest available capacity.

Why: Reduces API latency by ~100-200ms per request. Over a 20-turn sub-agent workflow, that’s 2-4 seconds saved — and more importantly, fewer timeout-inducing delays.

Drawback: If a specific model isn’t available in eu-west-2, the cross-region routing adds its own overhead. Model availability can vary by region, though with global.* prefixes this is mostly transparent.
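The "2-4 seconds" estimate for Fix 3 is simple multiplication, shown here as a worked example. It assumes one Bedrock request per turn; workflows that make several calls per turn would save proportionally more.

```python
def latency_saved_seconds(turns, saved_ms_per_request):
    """Total seconds saved when each turn makes one request to Bedrock."""
    return turns * saved_ms_per_request / 1000


low = latency_saved_seconds(20, 100)
high = latency_saved_seconds(20, 200)
print(f"{low:.0f}-{high:.0f} seconds saved over 20 turns")
```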

🔬 Custom Code Patches

I maintain four patches across three files, which are bind-mounted into the containers to override upstream code. Each one exists to solve a specific problem, but they all come with maintenance costs.

Patch 1: Empty Model Cache Guard (`models.py`)

The problem: When the Bedrock gateway is temporarily unreachable, Open WebUI’s model list refresh returns empty. The upstream code caches this empty result, causing every subsequent request to fail with “Model not found” until the next successful refresh. During sub-agent workflows where the model list is re-checked between tool calls, this creates a cascade of failures.

The fix:

# Only update the cache if we got a non-empty model list
if models_dict:
    if isinstance(request.app.state.MODELS, RedisDict):
        request.app.state.MODELS.set(models_dict)
    else:
        request.app.state.MODELS = models_dict
else:
    log.warning('get_all_models() returned empty model list, keeping previous cache')

Same pattern applied to BASE_MODELS.

Drawback: If a model is genuinely removed from Bedrock, the stale cache will continue serving it until a successful refresh eventually returns the updated list. This could cause confusing errors if a user selects a model that exists in cache but no longer exists upstream.

Patch 2: Default Feature Flags (`middleware.py`)

The problem: Open WebUI requires users to manually enable web search and memory recall per-chat. For a single-user setup where you always want these features, this is friction.

The fix:

features = form_data.pop('features', None) or {}
features.setdefault('web_search', True)
features.setdefault('memory', True)

Drawback: Every single chat now triggers a web search — even for simple “hello” messages. This adds 2-5 seconds of latency to every response, increases API costs (SearXNG queries + RAG processing), and occasionally returns irrelevant search results that confuse the model. Memory retrieval runs on every message too, adding its own overhead.

Patch 3: Default max_tokens (`middleware.py`)

The problem: Without an explicit max_tokens, some Bedrock models default to very low token limits, causing truncated responses. This is particularly harmful for tool-use scenarios where the model needs to output complete JSON for function call arguments.

The fix:

if 'max_tokens' not in form_data:
    form_data['max_tokens'] = 16384

Drawback: Higher token limits increase API costs per request. A 16K token limit means every single request — including short yes/no answers — is budgeted for 16K tokens of output. The cost impact is real but manageable for single-user usage.

Patch 4: Bedrock Gateway Model Caching (`model_patched.py`)

The problem: The upstream Bedrock gateway calls AWS’s ListFoundationModels and ListInferenceProfiles APIs on every single /models request. These are synchronous boto3 calls that block the async event loop and take 1-3 seconds each.

The fix:

import time

_cached_models = None
_cache_timestamp = 0
_CACHE_TTL = 300  # 5 minutes

def _get_models_cached():
    global _cached_models, _cache_timestamp
    now = time.time()
    if _cached_models is not None and (now - _cache_timestamp) < _CACHE_TTL:
        return _cached_models
    try:
        models = chat_model.list_models()
        _cached_models = models
        _cache_timestamp = now
        return models
    except Exception:
        if _cached_models is not None:
            return _cached_models  # Stale cache on error
        raise

Also wrapped in run_in_threadpool to prevent event loop blocking.

Drawback: New models deployed to Bedrock won’t appear for up to 5 minutes. There’s no cache invalidation mechanism — the only way to force a refresh is to restart the gateway container. The global mutable state could theoretically have race conditions under high concurrency.
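The race-condition and invalidation drawbacks could be addressed with a lock and an explicit invalidate method. A minimal sketch of one way to do that, not the patch described above: the TTLCache class and its loader argument are hypothetical, and in the gateway the loader would be the existing list_models call, still invoked through run_in_threadpool.

```python
import threading
import time


class TTLCache:
    """Cache one value for ttl seconds; serve the stale value if a refresh fails."""

    def __init__(self, loader, ttl=300.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._value = None
        self._has_value = False
        self._expires_at = 0.0

    def get(self):
        # The lock serialises refreshes, so concurrent callers never
        # interleave writes to the cached value and its expiry time.
        with self._lock:
            now = self._clock()
            if self._has_value and now < self._expires_at:
                return self._value
            try:
                value = self._loader()
            except Exception:
                if self._has_value:
                    return self._value  # stale data beats an error
                raise
            self._value = value
            self._has_value = True
            self._expires_at = now + self._ttl
            return value

    def invalidate(self):
        """Force the next get() to call the loader again."""
        with self._lock:
            self._expires_at = float("-inf")
```

Holding the lock while the loader runs means other callers wait for the refresh instead of triggering their own, which trades a little latency for a single upstream call per expiry.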

⚠️ The Cost of Custom Patches

All four patches are applied via Docker bind mounts — the patched files are stored on the host and mounted over the container’s originals at startup. This means:

Watchtower updates don’t break the patches — the bind mounts persist across image updates
But upstream API changes can break everything — if an Open WebUI update changes internal function signatures that the patches depend on, the container will crash on startup
Version drift accumulates — the longer you maintain patches, the harder it becomes to merge upstream improvements

I originally maintained a fully pinned middleware.py (all 4,887 lines), but the drift became unsustainable. The pinned version was missing over a dozen upstream fixes including strip_empty_content_blocks() (which prevents Claude/Gemini errors), merge_system_messages() (which prevents template parsing failures), and proper done: True completion markers.

The current approach is better: start from the latest upstream, apply minimal targeted patches. The four patches above total ~20 lines of actual changes. When upstream updates, re-extracting the base files and re-applying the patches takes minutes, not hours.

📊 Results

Metric	Before	After
Sub-agent success rate	~60% (intermittent drops)	~100%
Code execution time	10-30s per block (Pyodide)	1-3s per block (Jupyter)
WebSocket “keepalive ping failed”	Every few minutes	Rare (idle connections only)
Bedrock API latency	~200ms (us-east-1)	~50ms (eu-west-2)
Custom patch maintenance	4,887-line pinned file	~20 lines across 3 files

💡 Lessons Learned

1. Pyodide Is the Wrong Tool for Server-Side AI Workflows

Browser-based code execution makes sense for interactive notebooks. For autonomous AI agents running multi-step code workflows, the WebSocket round-trip is a dealbreaker. Jupyter is heavier but eliminates an entire class of failure modes.

2. Default Timeouts Assume Simple Conversations

Most AI UIs are designed for single-turn Q&A. When you add sub-agents, tool calls, web search, code execution, and RAG — all in a single conversation turn — the default 20-second WebSocket ping timeout is laughably short. Know your workload and set timeouts accordingly.

3. Maintain Patches, Not Forks

Pinning an entire 5,000-line file to avoid upstream breakage feels safe, but it’s a trap. You lose every upstream bugfix and improvement. Minimal, targeted patches that can be re-applied to fresh upstream files are far more sustainable.

4. Every Customisation Has a Cost

Defaulting web search to “always on” sounds great until every trivial question adds 3 seconds of latency. Setting max_tokens=16384 prevents truncation but increases API costs. Server-side Jupyter execution is fast but widens the attack surface. Document the trade-offs, not just the benefits.

5. Cache Defensively

Never replace good data with empty data. Whether it’s model lists, DNS caches, or configuration stores — if the upstream source is temporarily unavailable, serving stale data is almost always better than serving nothing.
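As a minimal sketch of that rule, assuming a fetch function that can return an empty list or raise: GuardedCache and flaky_fetch are hypothetical names for illustration, not code from Open WebUI or the gateway.

```python
class GuardedCache:
    """Keeps the last non-empty value it has seen (hypothetical helper)."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._value = None

    def refresh(self):
        try:
            fresh = self._fetch()
        except Exception:
            fresh = None  # treat an upstream failure like an empty answer
        if fresh:  # only overwrite when there is real data
            self._value = fresh
        return self._value


# Simulated upstream: good data, an empty reply, an outage, then good data again.
responses = iter([["m1"], [], ConnectionError("upstream down"), ["m1", "m2"]])


def flaky_fetch():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


cache = GuardedCache(flaky_fetch)
history = [cache.refresh() for _ in range(4)]
```

The empty reply and the outage both leave the earlier list in place, and the cache only moves forward once real data comes back.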

I Audited Every VM in My Homelab — Here’s What I Found (and Fixed)

2026-03-26T12:00:00+00:00

My homelab has been running for a couple of years now. Three Proxmox hosts, 27 VMs, a mix of blockchain validators, DNS, monitoring, backup servers, and various projects I’ve spun up and half-forgotten about. It works, mostly. But I’d never actually sat down and audited everything — checking what’s over-provisioned, what’s under-monitored, what’s running outdated software, and what’s one bad day away from a disk-full meltdown.

So I did exactly that. And it wasn’t pretty.

The Audit

The audit covered every running VM across all three hosts, pulling data from Proxmox configs (qm config), Zabbix API metrics, and in-VM checks via qm guest exec. For each VM, I checked CPU and RAM utilisation against what was allocated, disk usage, backup coverage, monitoring status, guest agent health, OS version, and hardware configuration.
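The utilisation-versus-allocation check can be sketched roughly like this. The 50% target, the VmUsage fields and the VMIDs are assumptions for illustration; the real numbers came from Zabbix and qm config.

```python
import math
from dataclasses import dataclass


@dataclass
class VmUsage:
    vmid: int            # hypothetical IDs in the demo below
    cores: int           # cores allocated in the VM config
    avg_cpu_pct: float   # average CPU utilisation over the audit window


def suggest_cores(vm, target_pct=50.0, min_cores=1):
    """Smallest core count that keeps average load at or under target_pct."""
    busy_cores = vm.cores * vm.avg_cpu_pct / 100
    needed = math.ceil(busy_cores / (target_pct / 100))
    # Never suggest more than the VM already has, or fewer than min_cores.
    return max(min_cores, min(vm.cores, needed))


fleet = [
    VmUsage(vmid=9001, cores=6, avg_cpu_pct=3.0),   # mostly idle
    VmUsage(vmid=9002, cores=4, avg_cpu_pct=70.0),  # genuinely busy
]
suggestions = {vm.vmid: suggest_cores(vm) for vm in fleet}
```

Averages hide bursts, so a suggestion like this is a starting point for a manual look at the peak graphs rather than something to apply blindly.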

The Scary Findings

A Disk About to Explode

My AI server was sitting at 81% disk usage with only 5GB free on a 32GB disk. It was one large model download away from grinding to a halt.

# The fix was straightforward
qm disk resize 3465 scsi0 +16G                    # Proxmox side
qm guest exec 3465 -- growpart /dev/sda 3          # Grow partition
qm guest exec 3465 -- pvresize /dev/sda3           # Extend PV
qm guest exec 3465 -- lvextend -l +100%FREE /dev/ubuntu-vg/ubuntu-lv
qm guest exec 3465 -- resize2fs /dev/ubuntu-vg/ubuntu-lv

32GB to 48GB. Usage dropped from 81% to 49%. Crisis averted.

Four VMs With Broken Monitoring

I had 4 VMs registered in Zabbix that were returning zero for every metric. They showed as “monitored” in the dashboard, but the Zabbix agent wasn’t actually running inside any of them. The hosts existed in Zabbix, the agent was installed, but the service was dead — so every graph was a flat line at zero.

If any of those VMs had a problem, I’d have had no alert. The fix was reinstalling zabbix-agent2 v7.0 from the official repo on all four, configuring the Zabbix server address, restarting the service, and verifying data started flowing through the Zabbix API.
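A check for this failure mode could look something like the sketch below, assuming recent samples per host have already been pulled (for example via the Zabbix API's history.get). The host names and values are made up.

```python
def find_silent_hosts(samples_by_host):
    """Return hosts whose recent samples are missing or all zero."""
    return sorted(
        host
        for host, samples in samples_by_host.items()
        if not samples or all(value == 0 for value in samples)
    )


# Hypothetical CPU-load samples keyed by host name.
recent = {
    "vm-dns": [0.4, 0.6, 0.5],
    "vm-validator": [0, 0, 0],   # "monitored", but flat at zero
    "vm-backup": [],             # registered, never reported anything
}
silent = find_silent_hosts(recent)
```

Picking a metric that can never legitimately be zero on a running machine, such as uptime or memory used, avoids flagging hosts that are simply idle.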

Ghost VMs

Two VMs had no QEMU guest agent at all, meaning Proxmox couldn’t cleanly shut them down, couldn’t run commands inside them, and couldn’t even see their IP addresses. One of them was stopped and had no onboot flag set, so it wouldn’t have started on its own after a host reboot either.

Getting the guest agent onto a stopped VM with no SSH access required mounting its raw disk on the host, injecting an SSH key, starting it, and then installing the agent. Not fun, but it worked.

VMs Still on Ubuntu 22.04

Seven of my Ubuntu VMs were still on 22.04 Jammy. Not end-of-life yet, but approaching standard support end. I’d been putting off the upgrades because doing them one-by-one is tedious and risky if something breaks. More on how I batched these later.

PBS Servers Without Unattended-Upgrades

My three Proxmox Backup Server instances — the systems responsible for protecting everything else — didn’t have unattended-upgrades configured. Now, these servers are patched regularly by my Ansible update playbook, so they weren’t actually unpatched. But Ansible runs on a schedule, and there’s always a gap between a critical CVE dropping and the next playbook run. Adding unattended-upgrades as a safety net means security patches get applied daily regardless of when Ansible runs next — belt and suspenders.

# The key is not just installing the package, but configuring it to actually run
cat > /etc/apt/apt.conf.d/20auto-upgrades << 'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
APT::Periodic::Download-Upgradeable-Packages "1";
APT::Periodic::AutocleanInterval "7";
EOF

VMs on Directory Storage

Six VMs had their disks stored as qcow2/raw files on directory storage instead of LVM-thin. Image files sit on top of a host filesystem, so every guest write passes through an extra layer, which generally means worse I/O performance and more overhead than a block-level LVM-thin volume. Most of my other VMs were already on LVM-thin; these were just stragglers from older deployments.

A Backup Job Pointing to a Deleted VM

One of the backup jobs was referencing a VMID that doesn’t exist anymore. Meanwhile, a VM that I’d recently created wasn’t in any backup job. Classic.
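Both halves of that problem fall out of a set comparison. This is a sketch under the assumption that the existing VMIDs and each job's VMID list have already been collected; the job names and IDs are hypothetical.

```python
def audit_backup_jobs(existing_vmids, job_vmids):
    """Compare backup job membership against the VMs that actually exist.

    Returns (ghosts, unprotected): VMIDs per job that no longer exist,
    and existing VMIDs that no job covers.
    """
    existing = set(existing_vmids)
    covered = set().union(*job_vmids.values())
    ghosts = {
        job: sorted(set(ids) - existing)
        for job, ids in job_vmids.items()
        if set(ids) - existing
    }
    unprotected = sorted(existing - covered)
    return ghosts, unprotected


# Hypothetical inventory: 9003 was deleted, 9004 was created recently.
existing = [9001, 9002, 9004]
jobs = {"nightly": [9001, 9003], "weekly": [9002]}
ghosts, unprotected = audit_backup_jobs(existing, jobs)
```

Run on a schedule, a check like this would have flagged both the deleted VMID and the new VM on the day they diverged.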

The Remediation

Here’s what I actually did, roughly in order of priority:

Critical

Expanded the AI server disk from 32GB to 48GB (live, no downtime)
Fixed 4 broken Zabbix agents — reinstalled zabbix-agent2 v7.0, configured the Zabbix server address, verified data flow through the API
Installed guest agents on 2 VMs that were previously unmanageable
Added an unmonitored VM to Zabbix — it had no monitoring at all

High Priority

Fixed backup jobs — added the missing VM and removed the ghost VMID
Configured unattended-upgrades on all 3 PBS VMs as a safety net alongside Ansible

Medium Priority

Fixed boot orders on 5 VMs — removed unnecessary PXE boot entries that were slowing down startup
Reduced CPU allocations on 3 over-provisioned VMs (one had 6 cores on an 8-core host at 3% usage)
Added iothread to 2 VMs that were missing it. In Proxmox, enabling iothread on a SCSI disk (with the VirtIO SCSI single controller) hands that disk’s I/O to a dedicated thread instead of QEMU’s main loop thread. This reduces latency and improves throughput, especially under heavy disk load. It’s close to a free performance win; the only catch is that it requires a brief VM restart to apply:

qm set 401 --scsi0 :vm-401-disk-0,iothread=1,size=32G,ssd=1
qm set 555 --scsi1 :vm-555-disk-0,iothread=1,size=100G

Migrated 6 disks from directory storage to LVM-thin (live migration, no downtime):

# Move from local qcow2 to LVM-thin — live, no VM shutdown needed
qm disk move 239 scsi0 CRUCIAL_SSD1 --delete 1
qm disk move 404 scsi0 CRUCIAL_SSD1 --delete 1
qm disk move 4070 scsi0 usb-crucial-ssd-1 --delete 1

The Big One: Batch OS Upgrades

Seven VMs needed upgrading from Ubuntu 22.04 to 24.04. Rather than doing them one at a time over several weeks, I decided to batch them all at once. The reasoning: if the upgrade process has a systemic issue (like a broken package or incompatible config), I’d rather find out across all VMs simultaneously and fix it once, than discover it seven separate times.

The trick was deploying an upgrade script to each VM via qm guest exec (base64 encoded to avoid quoting hell), then launching it as a systemd transient service so it persists after the guest exec connection drops:

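# Deploy script via guest agent (base64 avoids shell quoting nightmares).
# Encode on the Proxmox host first; base64 output is safe to embed unquoted.
b64=$(base64 -w0 do-upgrade.sh)
qm guest exec "$vmid" -- bash -c "echo $b64 | base64 -d > /root/do-upgrade.sh && chmod +x /root/do-upgrade.sh"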

# Launch as a persistent service that survives the guest exec timeout
qm guest exec $vmid -- systemd-run --unit=os-upgrade /root/do-upgrade.sh

The script itself:

#!/bin/bash
export DEBIAN_FRONTEND=noninteractive
apt-get update -qq
do-release-upgrade -f DistUpgradeViewNonInteractive
reboot
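The base64 deploy step can also be driven from Python, which sidesteps host-side shell quoting entirely by passing an argument list. A minimal sketch, assuming a hypothetical helper name and VMID; it only builds the command and leaves running it (for example with subprocess.run) to the caller.

```python
import base64
import shlex


def build_deploy_argv(vmid, script_text, dest="/root/do-upgrade.sh"):
    """Build the qm guest exec argv that writes script_text to dest in the VM."""
    payload = base64.b64encode(script_text.encode("utf-8")).decode("ascii")
    # Base64 output only contains A-Z, a-z, 0-9, +, / and =, so it is safe unquoted.
    inner = (
        f"echo {payload} | base64 -d > {shlex.quote(dest)}"
        f" && chmod +x {shlex.quote(dest)}"
    )
    return ["qm", "guest", "exec", str(vmid), "--", "bash", "-c", inner]


script = "#!/bin/bash\necho 'quotes \"and\" $vars survive'\n"
argv = build_deploy_argv(9001, script)  # 9001 is a made-up VMID
```

Because the script travels as one opaque token, quotes, dollar signs and newlines inside it never meet a shell until it is decoded inside the guest.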

I launched all seven simultaneously across both hosts and monitored for failures. Six of seven upgraded cleanly. One needed a dpkg repair via chroot after the upgrade was interrupted mid-package-install, but it was nothing a dpkg --configure -a couldn’t fix.

After the upgrades, one VM triggered a Zabbix disk space alert — the OS upgrade consumed enough extra space to push it over 80%. Turned out the 32GB virtual disk only had 15GB allocated to LVM with 17GB sitting unused. A quick lvextend and resize2fs sorted it without even needing to resize the virtual disk.

Cleanup

Archived 20 stale Zabbix hosts — old VMs, deleted devices, test entries. Tagged them archived=true via the API rather than deleting, in case I need to reference the historical data.
Added missing tags to VM configs for consistency
Fixed backup job references — removed non-existent VMIDs and added newly created VMs

Lessons Learned

Audit regularly. Technical debt compounds silently. Four VMs with broken monitoring could have been months of invisible outages.
Don’t put VM disks on directory storage. LVM-thin gives you thin provisioning and snapshots at the block level, with no host filesystem in between, and better I/O as a result. Reserve local for ISOs and templates.
systemd-run is your friend. When you need to launch a long-running process via qm guest exec that would otherwise time out, systemd-run --unit=name /path/to/script creates a persistent service that survives the connection drop.
Unattended-upgrades needs configuration, not just installation. The package alone does nothing — you need the 20auto-upgrades and 50unattended-upgrades config files with the right origins. Even if you have Ansible handling updates, it’s worth having as a safety net.
Batch your upgrades and monitor for failures. Doing OS upgrades one-by-one across weeks means you discover the same issues seven times. Batching them lets you catch systemic problems early and fix them once.
Base64 encode scripts when passing them through multiple layers of SSH/shell quoting. Saves hours of escaping hell.

What’s Left

PBS-S3 optimisation — My S3-backed PBS datastore kept dropping connections under load during the pre-flight backups. Needs a separate deep dive into cache management and retention policies.

Final State

27 VMs audited. 15 remediation steps executed and verified. 7 OS upgrades. 6 disk migrations. 4 monitoring fixes. 20 stale hosts archived. Zero data lost.

Outlook Classic Not Syncing New Gmail Folders

2026-03-26T12:00:00+00:00

A friend had their Gmail account set up in Outlook Classic on Windows using IMAP/SMTP. The problem: whenever they created new folders or labels in Gmail’s web UI, they’d show up on their iPhone and iPad immediately, but never in Outlook. I’d previously fixed it for them by manually editing the IMAP subscribed folders list, but didn’t want to keep doing that every time they created a new label.

What I Tried

First, the obvious: unchecking “When displaying hierarchy in Outlook, show only subscribed folders” in the IMAP Folders dialog. Didn’t help on its own.

Querying folders in the IMAP Folders dialog confirmed the missing folder existed on the server — Outlook could see it was there. But none of the usual tricks worked:

Send/Receive — no change
Collapsing and expanding the folder tree — no change
Restarting Outlook — no change
Unsubscribing and resubscribing — no change

The folder was on the server. Outlook knew it was there. It just refused to display it.

The Fix

Renaming the Gmail OST file with a .bak extension and relaunching Outlook forced a complete resync from the IMAP server. When Outlook starts and can’t find its OST file, it creates a new one and pulls everything down fresh. This was the only thing that reliably brought in the new folders.

The OST file lives at:

%localappdata%\Microsoft\Outlook\

Automating It

Rather than having them manually rename the file every time, I wrote a batch script that deletes the cache outright on startup:

@echo off
:: Kill Outlook if running
taskkill /f /im OUTLOOK.EXE >nul 2>&1

:: Wait for the file to be released
timeout /t 3 /nobreak >nul

:: Delete every OST cache in the Outlook folder (narrow the pattern if the profile has other accounts)
del "%localappdata%\Microsoft\Outlook\*.ost" /q >nul 2>&1

:: Relaunch Outlook
start "" "C:\Program Files (x86)\Microsoft Office\root\Office16\OUTLOOK.EXE"

Saved as OutlookFresh.bat and dropped into the Windows startup folder (shell:startup). Now every time they log in, Outlook starts fresh with a full resync from Gmail’s servers. The OST rebuild takes a minute or two depending on mailbox size, but after that everything — including any new folders created on other devices — is there.

Why This Works

The OST file is Outlook’s local cache of the IMAP mailbox. When the folder structure changes server-side, Outlook is supposed to pick it up during sync. In practice, it sometimes doesn’t — especially with Gmail’s label-as-folder IMAP mapping, which has always been a bit odd. Deleting the cache and forcing a rebuild from scratch bypasses whatever state Outlook has gotten itself into.

It’s not elegant, but it’s reliable. And for a non-technical user who just wants their folders to appear, a startup script they never have to think about is the right solution.

Environment: Windows, Outlook Classic (32-bit), Gmail via IMAP/SMTP.

Joshua Mein

Unattended-Upgrades Was Sending Mail to Gmail for Six Weeks. Gmail Was Silently Dropping All of It.

The Setup

DNS First, Because That’s the Easy Box to Tick

The A/B That Settled It

So Where Was the From: root Coming From?

Why Did the 2026-04-07 Deploy Validation Miss This?

The Fix

Rolling It Out

Validation, Properly This Time

A Sub-Issue: recipients=root 501 Errors

Takeaway

Automating Nextcloud AIO Updates with Bash and Cron

How AIO Updates Actually Work

The Script

Wiring it Into Cron

Verifying It’s Actually Working

A Subtle Point: “Successful” Doesn’t Mean “Updated”

What I’d Improve Next

Takeaway

I Built a GNOME Shell Extension for Tailscale — Panel Toggle, Peer Browser, and the Signal-Handler Gotcha That Broke It

The Setup

Why I Couldn’t Reuse Anything Existing

The Architecture

The “Don’t Touch My Daemon” Principle

The Polling Loop

The Bug That Took Me a Whole Evening

What Made It Onto the Panel

Gotchas I Hit Along the Way

1. Symlink installs will eat your source tree

2. Shell 48 is ESM. Shell 45 is not.

3. Gio.Subprocess is your friend

4. Adwaita prefs windows are easier than you’d think

Releasing It

What I’d Do Differently

The Result

How I Fixed SSL Certificate Warnings Across My Entire Proxmox Homelab — With Full Auto-Renewal and Email Alerts

My Setup

Why Standard Let’s Encrypt Doesn’t Work Here

The Wildcard Strategy

Step 1: Install acme.sh

Step 2: Create a Cloudflare API Token

Step 3: Issue the Wildcard Cert

Step 4: Deploy to All Servers

The Gotcha: PBS Fingerprints in storage.cfg

The Second Gotcha: PBS Sync Job Remotes

Step 5: The Auto-Renewal Script

Step 6: Email Notifications

Step 7: DNS Records

The Full Automated Flow

Summary

Running ComfyUI on an AMD RX 7900 XTX — Native ROCm 7.1 on Windows

The Problem

What’s Already Required

Installing uv

Cloning ComfyUI

Creating the Python Environment

Installing ROCm SDK Wheels

Installing ROCm PyTorch

Installing ComfyUI Requirements

Custom Nodes

Launcher Scripts

Validating the GPU

First Launch

Model Placement

Wan2.1 i2v 480p

Performance

Quality vs Speed: FP8 vs BF16

Known Issues

Lessons Learned

Zero-Shot Voice Cloning on AMD — ROCm 7.1 on Windows, F5-TTS, and the ONNX Fallback

The Setup

The Journey (Short Version)

ROCm 7.1 on Windows — The Real Solution

Setting Up the ROCm Venv

Required Environment Variables

Compatibility Patches

Running It

The Architecture: Full GPU vs Hybrid

ROCm Native (Full GPU)
