<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://joshwaamein.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://joshwaamein.github.io/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-05-22T00:19:58+01:00</updated><id>https://joshwaamein.github.io/feed.xml</id><title type="html">Joshua Mein</title><subtitle>A tech blog covering homelab setups, Linux administration, DevOps, cloud computing, networking, security, and IoT projects.</subtitle><author><name>Joshua Mein</name></author><entry><title type="html">Unattended-Upgrades Was Sending Mail to Gmail for Six Weeks. Gmail Was Silently Dropping All of It.</title><link href="https://joshwaamein.github.io/posts/unattended-upgrades-mail-from-header-gmail-silent-drop/" rel="alternate" type="text/html" title="Unattended-Upgrades Was Sending Mail to Gmail for Six Weeks. Gmail Was Silently Dropping All of It." /><published>2026-05-21T22:00:00+01:00</published><updated>2026-05-21T22:00:00+01:00</updated><id>https://joshwaamein.github.io/posts/unattended-upgrades-mail-from-header-gmail-silent-drop</id><content type="html" xml:base="https://joshwaamein.github.io/posts/unattended-upgrades-mail-from-header-gmail-silent-drop/"><![CDATA[<p>Six weeks ago I rolled out <code class="language-plaintext highlighter-rouge">unattended-upgrades</code> across every Linux host in my homelab. 34 servers, one Ansible playbook, msmtp pointed at Brevo as the relay. The deploy went green. Every host’s <code class="language-plaintext highlighter-rouge">/var/log/msmtp.log</code> showed <code class="language-plaintext highlighter-rouge">smtpstatus=250 ... exitcode=EX_OK</code> for every send. Job done.</p>

<p>To be clear up front: the actual <em>patching</em> worked perfectly the whole time. Zabbix was scraping <code class="language-plaintext highlighter-rouge">apt</code> package counts and the systemd timers on every host, so I could see the daily 05:00 / 06:00 runs firing, packages getting upgraded, and reboots happening on schedule. That part was never in doubt. The bit that quietly didn’t work was the <strong>mail report on change</strong>, which was supposed to land in my Gmail every time a host actually upgraded something so I’d see what changed. “Confirm those reports actually arrive” sat as a backlog item for six weeks, because the rest of the chain was so visibly healthy that it was easy to keep deferring.</p>

<p>When I finally got around to it today, I went looking for one of those reports in Gmail. There were none. Not in Inbox. Not in Spam. Not in All Mail. Not one, ever, since the day of the deploy.</p>

<p>This is the story of how a single missing line of config managed to look perfectly healthy at every layer except the one that actually mattered.</p>

<h2 id="the-setup">The Setup</h2>

<p>For context: the SMTP backbone here is the same one I wrote about in <a href="/posts/why-i-switched-from-gmail-to-brevo-for-homelab-email-alerts/">Why I Switched From Gmail to Brevo for All My Homelab Email Alerts</a>. Every host has <code class="language-plaintext highlighter-rouge">msmtp-mta</code> installed with a 600-permission <code class="language-plaintext highlighter-rouge">/etc/msmtprc</code> pointing at <code class="language-plaintext highlighter-rouge">smtp-relay.brevo.com:587</code>. UU’s <code class="language-plaintext highlighter-rouge">Mail</code> directive sends to a Gmail address. The path is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unattended-upgrades  --&gt;  /usr/sbin/sendmail (msmtp symlink)  --&gt;  Brevo SMTP  --&gt;  Gmail
</code></pre></div></div>

<p>The audit started as “are these even working?” and ended somewhere different. The first pass was easy: SSH to every host, send a tagged test email through that host’s own msmtp, confirm <code class="language-plaintext highlighter-rouge">smtpstatus=250</code> after the send, and write a per-group results file. 32 of the 34 hosts passed. One host was missing the <code class="language-plaintext highlighter-rouge">msmtp-mta</code> package entirely (a separate problem, fix queued). One was offline (a laptop PBS, expected).</p>

<p>The 32-pass result was correct as far as it went. <strong>Brevo was happily accepting every single message.</strong></p>

<p>What I didn’t think to test was the actual delivery. None of those test mails were ever opened by a human. They were just signals that Brevo’s SMTP server was returning 250.</p>

<p>Good question to ask: are these even arriving?</p>

<h2 id="dns-first-because-thats-the-easy-box-to-tick">DNS First, Because That’s the Easy Box to Tick</h2>

<p>If Brevo is queuing messages but Gmail isn’t delivering them, the first place to look is whether the sending domain is even in good standing.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dig +short TXT yourdomain.example
dig +short TXT _dmarc.yourdomain.example
dig +short TXT selector1._domainkey.yourdomain.example
</code></pre></div></div>

<p>What I found:</p>

<table>
  <thead>
    <tr>
      <th>Record</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SPF</td>
      <td>none</td>
    </tr>
    <tr>
      <td>DMARC</td>
      <td><code class="language-plaintext highlighter-rouge">v=DMARC1; p=none; rua=mailto:rua@dmarc.brevo.com</code></td>
    </tr>
    <tr>
      <td>DKIM (<code class="language-plaintext highlighter-rouge">selector1._domainkey</code>)</td>
      <td>present</td>
    </tr>
    <tr>
      <td>MX</td>
      <td>none</td>
    </tr>
  </tbody>
</table>

<p>So:</p>

<ul>
  <li><strong>No SPF.</strong> Means SPF alignment can’t help us. Whatever Gmail makes of authenticity has to come from DKIM.</li>
  <li><strong>DMARC is <code class="language-plaintext highlighter-rouge">p=none</code>.</strong> Gmail won’t bounce a misaligned message; it’ll either send it to spam or drop it on the floor and tell <code class="language-plaintext highlighter-rouge">rua@dmarc.brevo.com</code> about it. No NDR comes back to me.</li>
  <li><strong>DKIM is set up correctly</strong> by Brevo. They sign with their own keys for <code class="language-plaintext highlighter-rouge">d=yourdomain.example</code> because I delegated the selector to them when I switched.</li>
</ul>

<p>That mostly absolves DNS. Brevo’s DKIM signing was working. So why doesn’t Gmail like the messages?</p>

<h2 id="the-ab-that-settled-it">The A/B That Settled It</h2>

<p>I sent two emails from the same host, through the same msmtp config, to the same Gmail address, about a second apart. The only difference was the message-level <code class="language-plaintext highlighter-rouge">From:</code> header.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Test 1: From: root</span>
<span class="o">{</span>
  <span class="nb">echo</span> <span class="s2">"From: root"</span>
  <span class="nb">echo</span> <span class="s2">"To: you@gmail.example"</span>
  <span class="nb">echo</span> <span class="s2">"Subject: [TEST] From: root"</span>
  <span class="nb">echo</span> <span class="s2">""</span>
  <span class="nb">echo</span> <span class="s2">"This is what unattended-upgrades sends by default."</span>
<span class="o">}</span> | /usr/sbin/sendmail <span class="nt">-t</span> <span class="nt">-oi</span>

<span class="c"># Test 2: From: a real address on the sending domain</span>
<span class="o">{</span>
  <span class="nb">echo</span> <span class="s2">"From: unattended-upgrades@yourdomain.example"</span>
  <span class="nb">echo</span> <span class="s2">"To: you@gmail.example"</span>
  <span class="nb">echo</span> <span class="s2">"Subject: [TEST] From: real-address"</span>
  <span class="nb">echo</span> <span class="s2">""</span>
  <span class="nb">echo</span> <span class="s2">"This is what UU sends with Sender configured."</span>
<span class="o">}</span> | /usr/sbin/sendmail <span class="nt">-t</span> <span class="nt">-oi</span>
</code></pre></div></div>

<p>Both came back from msmtp with <code class="language-plaintext highlighter-rouge">smtpstatus=250 ... exitcode=EX_OK</code>. Brevo accepted both.</p>

<p>Only the second one arrived in Gmail.</p>

<p>The first one, with <code class="language-plaintext highlighter-rouge">From: root</code>, just disappeared.</p>

<h2 id="so-where-was-the-from-root-coming-from">So Where Was the <code class="language-plaintext highlighter-rouge">From: root</code> Coming From?</h2>

<p>I had to actually open <code class="language-plaintext highlighter-rouge">/usr/bin/unattended-upgrade</code> (a Python script despite the name) and grep around. The relevant code is on or near line 1506 of unattended-upgrades 2.9.x:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">from_email</span> <span class="o">=</span> <span class="n">apt_pkg</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="nf">find</span><span class="p">(</span><span class="sh">"</span><span class="s">Unattended-Upgrade::Sender</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">root</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Read it once and the bug is right there. UU calls <code class="language-plaintext highlighter-rouge">apt_pkg.config.find</code> with the directive name and a default. The default is the string <code class="language-plaintext highlighter-rouge">"root"</code>. Literal <code class="language-plaintext highlighter-rouge">root</code>. No <code class="language-plaintext highlighter-rouge">@</code>, no domain.</p>

<p>When <code class="language-plaintext highlighter-rouge">Unattended-Upgrade::Sender</code> is unconfigured, UU writes <code class="language-plaintext highlighter-rouge">From: root</code> into the message headers before piping the whole thing into <code class="language-plaintext highlighter-rouge">/usr/sbin/sendmail</code>. msmtp picks it up and hands it to Brevo. Brevo doesn’t care about the message-level <code class="language-plaintext highlighter-rouge">From:</code>; it cares about the SMTP envelope <code class="language-plaintext highlighter-rouge">MAIL FROM:</code> (<code class="language-plaintext highlighter-rouge">unattended-upgrades@yourdomain.example</code>, from msmtprc), DKIM-signs the message for <code class="language-plaintext highlighter-rouge">d=yourdomain.example</code>, and queues it.</p>

<p>Gmail then receives a message that says, in the header:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>From: root
</code></pre></div></div>

<p>And starts asking awkward questions:</p>

<ol>
  <li>RFC 5322 says the <code class="language-plaintext highlighter-rouge">From:</code> header must contain at least one mailbox address with a domain. Bare <code class="language-plaintext highlighter-rouge">root</code> is not a valid mailbox. That alone is a strong negative signal.</li>
  <li>DMARC alignment compares the <strong>header <code class="language-plaintext highlighter-rouge">From:</code> domain</strong> against the DKIM-signed <code class="language-plaintext highlighter-rouge">d=</code> domain. Header domain is empty (or whatever Gmail decides to do with <code class="language-plaintext highlighter-rouge">root</code>). DKIM <code class="language-plaintext highlighter-rouge">d=</code> is <code class="language-plaintext highlighter-rouge">yourdomain.example</code>. Alignment fails.</li>
  <li>With DMARC <code class="language-plaintext highlighter-rouge">p=none</code>, Gmail’s policy is “don’t bounce, just decide”. Gmail decided. The message is gone.</li>
</ol>
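
<p>The alignment step in that list can be sketched in a few lines of Python. This is only an illustration of the comparison, not a description of what Gmail actually runs: <code class="language-plaintext highlighter-rouge">header_from_domain</code> and <code class="language-plaintext highlighter-rouge">dkim_aligned</code> are hypothetical helper names, and the subdomain check is a simplification of the organizational-domain comparison a real DMARC implementation performs.</p>

```python
from email.utils import parseaddr


def header_from_domain(from_header: str) -> str:
    """Return the lowercased domain of a From: header value, or "" if it has none."""
    _display_name, addr = parseaddr(from_header)
    _local, sep, domain = addr.rpartition("@")
    return domain.lower() if sep else ""


def dkim_aligned(from_header: str, dkim_d: str) -> bool:
    """Simplified relaxed alignment: same domain, or a subdomain of the d= domain.

    Assumption: real DMARC compares organizational domains using the Public
    Suffix List; this suffix check only approximates that.
    """
    domain = header_from_domain(from_header)
    d = dkim_d.lower()
    return bool(domain) and (domain == d or domain.endswith("." + d))


print(dkim_aligned("root", "yourdomain.example"))  # prints False
print(dkim_aligned("unattended-upgrades@yourdomain.example", "yourdomain.example"))  # prints True
```

<p>A bare <code class="language-plaintext highlighter-rouge">root</code> has no domain to compare, so no signing domain can ever align with it.</p>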

<p>This is also why the dropped messages don’t appear in Spam. Spam-foldering is a deliberate “this is suspicious but we’ll show it to you anyway” decision. A malformed <code class="language-plaintext highlighter-rouge">From:</code> that fails DMARC under <code class="language-plaintext highlighter-rouge">p=none</code> can be dropped before it ever gets to a folder.</p>

<h2 id="why-did-the-2026-04-07-deploy-validation-miss-this">Why Did the 2026-04-07 Deploy Validation Miss This?</h2>

<p>The validation criteria for the deploy were:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">apt-daily-upgrade.timer</code> enabled and active</li>
  <li><code class="language-plaintext highlighter-rouge">/etc/msmtprc</code> correct, mode 600</li>
  <li><code class="language-plaintext highlighter-rouge">/etc/apt/apt.conf.d/50unattended-upgrades</code> present with <code class="language-plaintext highlighter-rouge">Mail</code> and <code class="language-plaintext highlighter-rouge">MailReport</code></li>
  <li>A test send from each host returns msmtp exit 0 with Brevo <code class="language-plaintext highlighter-rouge">smtpstatus=250</code></li>
</ul>

<p>Every one of those was true on every host. Zabbix on top of that was telling me that the patches were actually landing. So at every monitoring layer, the deploy looked fine.</p>

<p>The thing nobody validated was the very last hop: “open the destination inbox and confirm the on-change report is actually there.” That step sat as a backlog item because the surrounding signal was so good. Hosts were patching themselves, Zabbix was happy, msmtp was returning 250. Why bother eyeballing Gmail?</p>

<p>UU’s <code class="language-plaintext highlighter-rouge">MailReport "on-change"</code> semantics make this worse, not better. On a quiet day with no upgrades, an empty inbox is the <em>correct</em> state. So the inbox looks identical whether the pipeline is healthy or completely broken. You only notice the gap on a day where you <em>expect</em> a report (because something upgraded) and one doesn’t show up. And if you’re not checking, you don’t notice.</p>

<p>The lesson is the same one in the blog post on the Proxmox SSL renewal flow: every automated email path needs a “did it actually arrive” check, not just a “did the sender return 0” check. I now have a small audit script that sends a tagged test email from each host with a unique X-Audit-Id, then I grep the destination inbox for the IDs. That’s the test the 2026-04-07 deploy didn’t have.</p>
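
<p>As a rough idea of what such a tagged send could look like, here is a minimal Python sketch. It is not the audit script itself: the function names are hypothetical, and the id is repeated in the subject and body on the assumption that the inbox search may not index custom headers such as <code class="language-plaintext highlighter-rouge">X-Audit-Id</code>.</p>

```python
import subprocess
import uuid
from email.message import EmailMessage


def build_audit_message(sender: str, recipient: str, host: str) -> tuple[str, EmailMessage]:
    """Build a tagged test message and return (audit_id, message)."""
    audit_id = uuid.uuid4().hex
    msg = EmailMessage()
    msg["From"] = sender  # must be a full address, never a bare "root"
    msg["To"] = recipient
    msg["Subject"] = f"[AUDIT] {host} {audit_id}"
    msg["X-Audit-Id"] = audit_id
    msg.set_content(f"Delivery audit from {host}. Search the inbox for {audit_id}.")
    return audit_id, msg


def send_via_sendmail(msg: EmailMessage) -> None:
    """Pipe the message into the local sendmail binary (msmtp, on these hosts)."""
    subprocess.run(["/usr/sbin/sendmail", "-t", "-oi"], input=msg.as_bytes(), check=True)
```

<p>Each host would call <code class="language-plaintext highlighter-rouge">build_audit_message</code>, record the returned id, and pass the message to <code class="language-plaintext highlighter-rouge">send_via_sendmail</code>. The audit only passes once every recorded id has been found in the destination inbox.</p>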

<h2 id="the-fix">The Fix</h2>

<p>One line. Add <code class="language-plaintext highlighter-rouge">Unattended-Upgrade::Sender</code> to your <code class="language-plaintext highlighter-rouge">50unattended-upgrades</code> config:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Unattended-Upgrade::Sender "unattended-upgrades@yourdomain.example";
</code></pre></div></div>

<p>That value should match whatever address your relay actually DKIM-signs. In my case, that’s <code class="language-plaintext highlighter-rouge">unattended-upgrades@yourdomain.example</code> because Brevo signs everything from <code class="language-plaintext highlighter-rouge">d=yourdomain.example</code>. With it set, UU writes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>From: unattended-upgrades@yourdomain.example
</code></pre></div></div>

<p>Brevo signs for <code class="language-plaintext highlighter-rouge">d=yourdomain.example</code>. Gmail compares the header <code class="language-plaintext highlighter-rouge">From:</code> domain (<code class="language-plaintext highlighter-rouge">yourdomain.example</code>) against DKIM <code class="language-plaintext highlighter-rouge">d=</code> (<code class="language-plaintext highlighter-rouge">yourdomain.example</code>). Aligned. Accepted. Delivered.</p>

<p>In the Ansible playbook that drives my fleet, the change lands in two places, once per role block (one for VM hosts, one for Proxmox hypervisors):</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code>           Unattended-Upgrade::MailReport "";
<span class="gi">+          // Sender added: UU defaults the From: header to the literal "root" if
+          // this is unset, which Gmail drops because DMARC alignment fails.
+          Unattended-Upgrade::Sender "{{ unattended_upgrades_smtp_from }}";
</span>           Unattended-Upgrade::SyslogEnable "true";
</code></pre></div></div>

<p>I templated it off the existing <code class="language-plaintext highlighter-rouge">unattended_upgrades_smtp_from</code> variable that’s already in <code class="language-plaintext highlighter-rouge">group_vars/all/vars.yml</code>, since that’s the same value msmtp uses for the SMTP envelope <code class="language-plaintext highlighter-rouge">MAIL FROM:</code>. One source of truth, no drift between header and envelope.</p>

<h2 id="rolling-it-out">Rolling It Out</h2>

<p>The playbook handles VM hosts and Proxmox hypervisors with two <code class="language-plaintext highlighter-rouge">when:</code> blocks (one for each, because the schedule offsets differ). I ran it with <code class="language-plaintext highlighter-rouge">--tags config</code> to only touch the apt config, no package re-installs:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ansible-playbook configure-unattended-upgrades.yml <span class="nt">--tags</span> config
</code></pre></div></div>

<p>Three hosts failed on the first run with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Failed to get information on remote file (/etc/apt/apt.conf.d/50unattended-upgrades):
  /bin/sh: 1: sudo: not found
</code></pre></div></div>

<p>Those were the three Proxmox Backup Server VMs. The PBS appliance image runs as <code class="language-plaintext highlighter-rouge">root</code> and doesn’t ship <code class="language-plaintext highlighter-rouge">sudo</code>. Easy fix: re-run scoped to your <code class="language-plaintext highlighter-rouge">[pbs]</code> inventory group with <code class="language-plaintext highlighter-rouge">become</code> disabled.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ansible-playbook configure-unattended-upgrades.yml <span class="nt">--tags</span> config <span class="se">\</span>
  <span class="nt">--limit</span> pbs <span class="se">\</span>
  <span class="nt">-e</span> <span class="nv">ansible_become</span><span class="o">=</span><span class="nb">false</span>
</code></pre></div></div>

<p>Worth fixing in inventory long-term so the override isn’t needed each time, but for a one-shot patch the <code class="language-plaintext highlighter-rouge">-e</code> works.</p>

<p>After both runs, the live config on a representative sample showed the new directive everywhere:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh root@host <span class="nb">grep</span> <span class="s1">'^Unattended-Upgrade::'</span> /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Origins-Pattern <span class="o">{</span> ... <span class="o">}</span><span class="p">;</span>
Unattended-Upgrade::Mail <span class="s2">"you@gmail.example"</span><span class="p">;</span>
Unattended-Upgrade::MailReport <span class="s2">"on-change"</span><span class="p">;</span>
Unattended-Upgrade::Sender <span class="s2">"unattended-upgrades@yourdomain.example"</span><span class="p">;</span>
Unattended-Upgrade::SyslogEnable <span class="s2">"true"</span><span class="p">;</span>
</code></pre></div></div>

<p>A re-run after the patch is the cleanest way to confirm idempotency. All hosts came back <code class="language-plaintext highlighter-rouge">changed=0</code>, so the templated value renders to the same bytes as the previous version (which I’d briefly hardcoded during the diagnosis).</p>

<h2 id="validation-properly-this-time">Validation, Properly This Time</h2>

<p>The first time around, “did the deploy work” stopped at “msmtp exit 0”. That was wrong. Here’s what the new validation actually checks, end to end:</p>

<ol>
  <li>
    <p><strong>Config file shows the new directive.</strong> On a sample host:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">grep</span> <span class="s1">'^Unattended-Upgrade::Sender'</span> /etc/apt/apt.conf.d/50unattended-upgrades
</code></pre></div>    </div>

    <p>Expect the configured address back. Empty result means the playbook didn’t touch this host.</p>
  </li>
  <li>
    <p><strong>A real UU run produces a real email.</strong> UU only sends mail when packages were actually upgraded (because <code class="language-plaintext highlighter-rouge">MailReport "on-change"</code>). To test the path, either wait for the next day on which something actually upgrades, or approximate it manually with a tagged message that goes through the same <code class="language-plaintext highlighter-rouge">/usr/sbin/sendmail</code> symlink:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">{</span>
  <span class="nb">echo</span> <span class="s2">"From: unattended-upgrades@yourdomain.example"</span>
  <span class="nb">echo</span> <span class="s2">"To: you@gmail.example"</span>
  <span class="nb">echo</span> <span class="s2">"Subject: [post-deploy verify] </span><span class="si">$(</span><span class="nb">hostname</span><span class="si">)</span><span class="s2"> </span><span class="si">$(</span><span class="nb">date</span> <span class="nt">-Is</span><span class="si">)</span><span class="s2">"</span>
  <span class="nb">echo</span> <span class="s2">""</span>
  <span class="nb">echo</span> <span class="s2">"This is the same SMTP path UU uses."</span>
<span class="o">}</span> | /usr/sbin/sendmail <span class="nt">-t</span> <span class="nt">-oi</span>
</code></pre></div>    </div>

    <p>Then open Gmail. Not “check the msmtp log”. <em>Open Gmail</em>.</p>
  </li>
  <li>
    <p><strong>Confirm the headers.</strong> When the next real on-change report arrives in your inbox, expand the headers and look for:</p>

    <ul>
      <li><code class="language-plaintext highlighter-rouge">From: unattended-upgrades@yourdomain.example</code> (not <code class="language-plaintext highlighter-rouge">From: root</code>)</li>
      <li><code class="language-plaintext highlighter-rouge">Authentication-Results:</code> showing <code class="language-plaintext highlighter-rouge">dkim=pass header.i=@yourdomain.example</code></li>
      <li><code class="language-plaintext highlighter-rouge">Authentication-Results:</code> showing <code class="language-plaintext highlighter-rouge">dmarc=pass</code></li>
    </ul>

    <p>If any of those are off, you’ve got a different problem than the one in this post. (The most likely candidate is that your relay isn’t DKIM-signing for the domain in your <code class="language-plaintext highlighter-rouge">From:</code> header. Check the relay’s domain authentication panel.)</p>
  </li>
</ol>
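
<p>The header checks in step 3 can also be scripted against a raw message saved from the mail client. The following is a sketch under assumptions: <code class="language-plaintext highlighter-rouge">check_report_headers</code> is a hypothetical helper, and it only does a substring match on <code class="language-plaintext highlighter-rouge">Authentication-Results</code> rather than parsing that header properly.</p>

```python
from email import message_from_string
from email.utils import parseaddr


def check_report_headers(raw_message: str, expected_domain: str) -> list[str]:
    """Return the problems found in the headers; an empty list means all checks passed."""
    msg = message_from_string(raw_message)
    problems = []

    # Check 1: the header From: must be a real address on the signing domain.
    _name, addr = parseaddr(msg.get("From", ""))
    if not addr.lower().endswith("@" + expected_domain.lower()):
        problems.append(f"From: is {addr!r}, expected an address at {expected_domain}")

    # Checks 2 and 3: the receiver must have recorded dkim=pass and dmarc=pass.
    auth_results = " ".join(msg.get_all("Authentication-Results", [])).lower()
    for token in ("dkim=pass", "dmarc=pass"):
        if token not in auth_results:
            problems.append(f"Authentication-Results lacks {token}")
    return problems
```

<p>An empty list means the message passed all three checks. Anything else points at the relay’s signing setup rather than at UU.</p>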

<h2 id="a-sub-issue-recipientsroot-501-errors">A Sub-Issue: <code class="language-plaintext highlighter-rouge">recipients=root</code> 501 Errors</h2>

<p>Wholly separate but worth a side note for anyone running the same audit. Several hosts on my fleet had repeating entries in <code class="language-plaintext highlighter-rouge">/var/log/msmtp.log</code> like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>recipients=root smtpstatus=501 errormsg='recipient address root not accepted by the server'
</code></pre></div></div>

<p>These are <em>not</em> from UU. UU explicitly addresses your configured <code class="language-plaintext highlighter-rouge">Mail</code> recipient. The <code class="language-plaintext highlighter-rouge">recipients=root</code> ones come from something else on the host (commonly cron’s default <code class="language-plaintext highlighter-rouge">MAILTO=root</code>, smartd, or apt-listchanges) handing mail to msmtp with envelope <code class="language-plaintext highlighter-rouge">RCPT TO: root</code>. Brevo rejects bare-username recipients at SMTP time with 501.</p>

<p>Two ways to fix it cleanly:</p>

<ol>
  <li>Set <code class="language-plaintext highlighter-rouge">aliases /etc/aliases</code> in <code class="language-plaintext highlighter-rouge">/etc/msmtprc</code> and add a <code class="language-plaintext highlighter-rouge">root: someone@somewhere.example</code> line to <code class="language-plaintext highlighter-rouge">/etc/aliases</code>. msmtp will rewrite the recipient before handing it to Brevo.</li>
  <li>Track down whatever is hardcoding <code class="language-plaintext highlighter-rouge">root</code> as a destination and point it at a real address.</li>
</ol>

<p>On one host the noise was so heavy (every 30 minutes) that the msmtp log had grown to 651 KB of error-only entries since the deploy. I’d missed it the first time around because nothing further downstream was complaining. Worth a fleet-wide grep for <code class="language-plaintext highlighter-rouge">smtpstatus=5</code> if you’re already in the area.</p>
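
<p>If a plain grep gets too noisy, a few lines of Python can group the rejections by recipient. This sketch assumes each log line carries <code class="language-plaintext highlighter-rouge">recipients=</code> and <code class="language-plaintext highlighter-rouge">smtpstatus=</code> fields in the shape shown in the excerpt above; <code class="language-plaintext highlighter-rouge">count_permanent_failures</code> is a hypothetical helper name.</p>

```python
import re
from collections import Counter

# Assumption: each log line carries recipients=... and smtpstatus=NNN fields,
# as in the excerpt above; every other field is ignored.
_RECIPIENTS = re.compile(r"\brecipients=(\S+)")
_STATUS_5XX = re.compile(r"\bsmtpstatus=(5\d\d)\b")


def count_permanent_failures(lines):
    """Count 5xx rejections per (recipients, status) pair."""
    counts = Counter()
    for line in lines:
        status = _STATUS_5XX.search(line)
        if not status:
            continue  # 250s and lines without a status are not failures
        rcpt = _RECIPIENTS.search(line)
        counts[(rcpt.group(1) if rcpt else "?", status.group(1))] += 1
    return counts
```

<p>Passing it an open handle on <code class="language-plaintext highlighter-rouge">/var/log/msmtp.log</code> and printing <code class="language-plaintext highlighter-rouge">most_common(5)</code> on the result shows which recipients account for most of the noise on a host.</p>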

<h2 id="takeaway">Takeaway</h2>

<p>The whole bug is one default in one Python file:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">from_email</span> <span class="o">=</span> <span class="n">apt_pkg</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="nf">find</span><span class="p">(</span><span class="sh">"</span><span class="s">Unattended-Upgrade::Sender</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">root</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<p>A bare <code class="language-plaintext highlighter-rouge">root</code> as a default is a sensible thing for a tool that grew up on UNIX boxes where local delivery actually meant something. In 2026, with everything going out through a relay that DKIM-signs for a domain you own, that default is a foot-gun. Set <code class="language-plaintext highlighter-rouge">Unattended-Upgrade::Sender</code> to something with a <code class="language-plaintext highlighter-rouge">@</code> and a domain that aligns with whatever your relay is signing, and the whole pipeline lights up.</p>

<p>If you’re running unattended-upgrades through msmtp / Postfix / nullmailer / any external relay, go look at your <code class="language-plaintext highlighter-rouge">50unattended-upgrades</code> right now and make sure <code class="language-plaintext highlighter-rouge">Sender</code> is set. If it isn’t, your alerts are probably already vanishing into the void.</p>
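
<p>That check is easy to automate. Here is a minimal sketch, assuming the directive is written on one line in the fully qualified <code class="language-plaintext highlighter-rouge">Unattended-Upgrade::Sender "...";</code> form and that you pass in the contents of the one file you care about. It does not merge the rest of <code class="language-plaintext highlighter-rouge">apt.conf.d</code> the way apt does, and <code class="language-plaintext highlighter-rouge">sender_problem</code> is a hypothetical name.</p>

```python
import re

# Matches an uncommented, single-line Sender directive and captures its value.
_SENDER = re.compile(r'^\s*Unattended-Upgrade::Sender\s+"([^"]*)"\s*;', re.MULTILINE)


def sender_problem(conf_text: str):
    """Return a description of what is wrong with Sender, or None if it looks usable."""
    match = _SENDER.search(conf_text)
    if match is None:
        return "Sender is not set, so the From: header falls back to root"
    _local, sep, domain = match.group(1).rpartition("@")
    if not sep or "." not in domain:
        return f"Sender {match.group(1)!r} has no usable domain part"
    return None
```

<p>A return value of <code class="language-plaintext highlighter-rouge">None</code> only says the directive is syntactically plausible. Whether the domain aligns with what the relay signs still has to be confirmed in the received headers.</p>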

<p>The next post in this thread will be the audit script itself, with the per-host audit-id register that lets you grep your inbox for “did this specific host’s specific test mail actually arrive”. Sending a 250 OK is not the same as delivering a message, and after this one I’ll never trust an SMTP relay’s accept response as proof of delivery again.</p>]]></content><author><name>Joshua Mein</name></author><category term="Homelab" /><category term="DevOps" /><category term="unattended-upgrades" /><category term="ubuntu" /><category term="debian" /><category term="msmtp" /><category term="brevo" /><category term="smtp" /><category term="dmarc" /><category term="dkim" /><category term="ansible" /><category term="linux" /><summary type="html"><![CDATA[How a one-line apt config default left every host on my fleet sending email with From: root, why Brevo accepted those messages but Gmail dropped them on DMARC alignment, and the live A/B test that nailed it down to a single missing directive.]]></summary></entry><entry><title type="html">Automating Nextcloud AIO Updates with Bash and Cron</title><link href="https://joshwaamein.github.io/posts/automating-nextcloud-aio-updates-with-bash-and-cron/" rel="alternate" type="text/html" title="Automating Nextcloud AIO Updates with Bash and Cron" /><published>2026-05-14T22:00:00+01:00</published><updated>2026-05-14T22:00:00+01:00</updated><id>https://joshwaamein.github.io/posts/automating-nextcloud-aio-updates-with-bash-and-cron</id><content type="html" xml:base="https://joshwaamein.github.io/posts/automating-nextcloud-aio-updates-with-bash-and-cron/"><![CDATA[<p>I run <a href="https://github.com/nextcloud/all-in-one">Nextcloud All-in-One</a>. It’s great. A bundle of containers wired together and managed through one web UI. One-click updates, sane defaults, and most of the moving parts you’d otherwise have to glue together yourself.</p>

<p>The one thing I wanted to change was the manual click-through flow for updates. AIO is designed around a UI-driven update workflow — open the master container’s web UI, click “Update all containers”, wait, click again. Perfectly fine for occasional use, but I’d much rather it just ran on its own on a sensible schedule and logged what it did.</p>

<p>Here’s how I got there with a small bash script and a cron entry.</p>

<h2 id="how-aio-updates-actually-work">How AIO Updates Actually Work</h2>

<p>Before writing anything I wanted to understand what the AIO master container actually <em>does</em> when you click “Update all containers” in the UI. Once you peel the wrapper off, it boils down to two things:</p>

<ol>
  <li><strong>Pull the new <code class="language-plaintext highlighter-rouge">nextcloud/all-in-one:latest</code> image.</strong> The mastercontainer is the brain — it pins compatible image versions for every child container. New AIO release = new mastercontainer image = new pinned versions for the children.</li>
  <li><strong>Run <a href="https://github.com/nextcloud/all-in-one/blob/main/php/src/Cron/StartAndUpdateContainers.php"><code class="language-plaintext highlighter-rouge">StartAndUpdateContainers.php</code></a>.</strong> This is the internal job that orchestrates stopping the old child containers, pulling their new images, and starting them back up. The web UI calls it. The internal cron calls it. So can I.</li>
</ol>

<p>If I invoke that PHP script directly, I get the same update path the UI button kicks off — just without needing a human to click anything.</p>

<h2 id="the-script">The Script</h2>

<p>This lives at <code class="language-plaintext highlighter-rouge">/root/update_nextcloud_aio.sh</code>. Seven steps, each timestamped, gated by <code class="language-plaintext highlighter-rouge">set -e</code> so a failure stops everything cleanly. The only part that’s environment-specific is the <code class="language-plaintext highlighter-rouge">docker run</code> block in Step 4 — see the note after the script.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c">#</span>
<span class="c"># Nextcloud AIO Update Script</span>
<span class="c">#</span>

<span class="nb">set</span> <span class="nt">-e</span>

log<span class="o">()</span> <span class="o">{</span> <span class="nb">echo</span> <span class="s2">"[</span><span class="si">$(</span><span class="nb">date</span> <span class="s1">'+%Y-%m-%d %H:%M:%S'</span><span class="si">)</span><span class="s2">] </span><span class="nv">$1</span><span class="s2">"</span><span class="p">;</span> <span class="o">}</span>

log <span class="s2">"Starting Nextcloud AIO Update Process"</span>

<span class="c"># Step 1: Pull latest AIO mastercontainer image</span>
log <span class="s2">"Step 1/7: Pulling latest Nextcloud AIO image..."</span>
docker pull nextcloud/all-in-one:latest

<span class="c"># Step 2: Stop the existing master container</span>
log <span class="s2">"Step 2/7: Stopping nextcloud-aio-mastercontainer..."</span>
docker stop nextcloud-aio-mastercontainer

<span class="c"># Step 3: Remove the existing master container</span>
log <span class="s2">"Step 3/7: Removing old container..."</span>
docker <span class="nb">rm </span>nextcloud-aio-mastercontainer

<span class="c"># Step 4: Recreate master with the exact same configuration</span>
<span class="c">#         (replace this block with whatever YOUR original `docker run` was)</span>
log <span class="s2">"Step 4/7: Recreating master container..."</span>
docker run <span class="nt">-d</span> <span class="se">\</span>
  <span class="nt">--name</span> nextcloud-aio-mastercontainer <span class="se">\</span>
  <span class="nt">--restart</span> always <span class="se">\</span>
  <span class="nt">--init</span> <span class="se">\</span>
  <span class="nt">-p</span> 8080:8080 <span class="se">\</span>
  <span class="nt">-v</span> nextcloud_aio_mastercontainer:/mnt/docker-aio-config <span class="se">\</span>
  <span class="nt">-v</span> /var/run/docker.sock:/var/run/docker.sock:ro <span class="se">\</span>
  nextcloud/all-in-one:latest

<span class="c"># Step 5: Wait for master to settle</span>
log <span class="s2">"Step 5/7: Waiting 60 seconds for master container to initialize..."</span>
<span class="nb">sleep </span>60

<span class="c"># Step 6: Force every child container to be recreated next start</span>
log <span class="s2">"Step 6/7: Stopping and removing all AIO child containers..."</span>
<span class="nb">set</span> +e
<span class="k">for </span>c <span class="k">in</span> <span class="si">$(</span>docker ps     <span class="nt">--filter</span> <span class="s2">"name=nextcloud-aio-"</span> <span class="nt">--format</span> <span class="s2">"{{.Names}}"</span> | <span class="nb">grep</span> <span class="nt">-v</span> mastercontainer<span class="si">)</span><span class="p">;</span> <span class="k">do
    </span>docker stop <span class="s2">"</span><span class="nv">$c</span><span class="s2">"</span>
<span class="k">done
for </span>c <span class="k">in</span> <span class="si">$(</span>docker ps <span class="nt">-a</span>  <span class="nt">--filter</span> <span class="s2">"name=nextcloud-aio-"</span> <span class="nt">--format</span> <span class="s2">"{{.Names}}"</span> | <span class="nb">grep</span> <span class="nt">-v</span> mastercontainer<span class="si">)</span><span class="p">;</span> <span class="k">do
    </span>docker <span class="nb">rm</span>   <span class="s2">"</span><span class="nv">$c</span><span class="s2">"</span>
<span class="k">done
</span><span class="nb">set</span> <span class="nt">-e</span>

<span class="c"># Step 7: Trigger AIO's internal update cron directly</span>
log <span class="s2">"Step 7/7: Triggering Nextcloud update..."</span>
docker <span class="nb">exec</span> <span class="nt">--user</span> www-data nextcloud-aio-mastercontainer <span class="se">\</span>
    php /var/www/docker-aio/php/src/Cron/StartAndUpdateContainers.php

log <span class="s2">"Nextcloud AIO Update Process Completed"</span>
</code></pre></div></div>

<p>A few things worth calling out:</p>

<ul>
  <li><strong>Step 4’s <code class="language-plaintext highlighter-rouge">docker run</code> block is the only non-portable part.</strong> Whatever flags, env vars, ports, and volumes you used when you originally created your master container have to be reproduced exactly, or the new master will look at the existing volume and refuse to start. Don’t copy mine — pull yours straight off your existing container with <code class="language-plaintext highlighter-rouge">docker inspect nextcloud-aio-mastercontainer</code> before changing a thing.</li>
  <li><strong>Step 6 is the gotcha I learned the hard way.</strong> My first version of this script just recreated the master and trusted it to handle the children. It didn’t. The mastercontainer happily came up, decided the children were “already running fine”, and skipped the upgrade entirely. Stopping and removing the children <em>first</em> forces <code class="language-plaintext highlighter-rouge">StartAndUpdateContainers.php</code> in Step 7 to recreate them from the new pinned images. Without this step the cron PHP is effectively a no-op.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">set +e</code> around Step 6 is intentional.</strong> Some children may not exist (e.g. you’ve never enabled Collabora). I don’t want a missing container to abort the whole update.</li>
</ul>
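
<p>As a rough sketch of that <code class="language-plaintext highlighter-rouge">docker inspect</code> step, the helper below prints the handful of fields that matter when rebuilding the <code class="language-plaintext highlighter-rouge">docker run</code> line. The function name <code class="language-plaintext highlighter-rouge">show_master_config</code> is made up for illustration, and the template paths assume the usual <code class="language-plaintext highlighter-rouge">docker inspect</code> JSON layout, so compare the output with the raw JSON before relying on it.</p>

```shell
# Hypothetical helper (not part of the original script): print the settings
# needed to rebuild the `docker run` line for an existing container.
show_master_config() {
    name="${1:-nextcloud-aio-mastercontainer}"
    # Each --format template reads one field of the `docker inspect` JSON.
    printf 'restart: %s\n' "$(docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$name")"
    printf 'init:    %s\n' "$(docker inspect --format '{{json .HostConfig.Init}}' "$name")"
    printf 'ports:   %s\n' "$(docker inspect --format '{{json .HostConfig.PortBindings}}' "$name")"
    printf 'binds:   %s\n' "$(docker inspect --format '{{json .HostConfig.Binds}}' "$name")"
    printf 'env:     %s\n' "$(docker inspect --format '{{json .Config.Env}}' "$name")"
    printf 'image:   %s\n' "$(docker inspect --format '{{.Config.Image}}' "$name")"
}

# Usage: save the output somewhere safe before touching the container.
#   show_master_config > /root/aio-master-config.txt
```

<p>Keeping that output next to the script gives you something to diff against if a recreated master ever refuses to start.</p>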

<h2 id="wiring-it-into-cron">Wiring it Into Cron</h2>

<p>Weekly is the sweet spot. Often enough to never be more than a week behind, infrequent enough that point releases have had a chance to settle.</p>

<pre><code class="language-cron"># root crontab
0 4 * * 0  /root/update_nextcloud_aio.sh &gt;&gt; /root/nextcloud_update.log 2&gt;&amp;1
</code></pre>

<p>Sunday morning, well outside any usage window. If anything explodes, the worst case is “roll back from last night’s backup” and life carries on.</p>

<h2 id="verifying-its-actually-working">Verifying It’s Actually Working</h2>

<p>A script that “looks like” it’s doing something every Sunday isn’t worth much without proof. Three quick checks:</p>

<p><strong>1. Did the most recent run succeed?</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">grep</span> <span class="nt">-E</span> <span class="s1">'Starting|Update command executed|ERROR|WARNING|Completed'</span> <span class="se">\</span>
    /root/nextcloud_update.log | <span class="nb">tail</span> <span class="nt">-20</span>
</code></pre></div></div>

<p><strong>2. How many runs have completed cleanly?</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Successful runs: </span><span class="si">$(</span><span class="nb">grep</span> <span class="nt">-c</span> <span class="s1">'Nextcloud AIO Update Process Completed'</span> /root/nextcloud_update.log<span class="si">)</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="s2">"Errors/warnings: </span><span class="si">$(</span><span class="nb">grep</span> <span class="nt">-cE</span> <span class="s1">'ERROR|WARNING'</span>                     /root/nextcloud_update.log<span class="si">)</span><span class="s2">"</span>
</code></pre></div></div>

<p>I’ve been running this for a few months now with 0 errors and 0 warnings across every weekly invocation. Run durations sit between roughly one and five minutes.</p>

<p><strong>3. What version is Nextcloud actually on, and is anything pending?</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker <span class="nb">exec</span> <span class="nt">-u</span> www-data nextcloud-aio-nextcloud php occ status
docker <span class="nb">exec</span> <span class="nt">-u</span> www-data nextcloud-aio-nextcloud php occ update:check
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">occ status</code> confirms the running <code class="language-plaintext highlighter-rouge">versionstring</code>, and <code class="language-plaintext highlighter-rouge">occ update:check</code> will tell you if a newer point release is available — which is the real test of whether the script is actually moving you forward, not just running successfully.</p>
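
<p>To make that check scriptable, here is a small sketch that pulls just the version out of <code class="language-plaintext highlighter-rouge">occ status</code>. It assumes the plain-text output contains a line of the form <code class="language-plaintext highlighter-rouge">- versionstring: X.Y.Z</code>, and the function name <code class="language-plaintext highlighter-rouge">nextcloud_version</code> is my own invention for this example.</p>

```shell
# Hypothetical helper: print only the running Nextcloud version string.
# Assumes `occ status` emits a line like "  - versionstring: X.Y.Z".
nextcloud_version() {
    docker exec -u www-data nextcloud-aio-nextcloud php occ status \
        | awk -F': ' '/versionstring/ { print $2 }'
}

# Example wiring inside the update script (log() comes from that script):
#   log "Version before update: $(nextcloud_version)"
#   ...steps 1 to 7...
#   log "Version after update:  $(nextcloud_version)"
```

<p>Logging it at the start and end of a run makes it obvious from the log alone which Sundays actually moved the version.</p>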

<h2 id="a-subtle-point-successful-doesnt-mean-updated">A Subtle Point: “Successful” Doesn’t Mean “Updated”</h2>

<p>Worth flagging because I caught myself on it. If the AIO project hasn’t published a new mastercontainer image since your last run, your <code class="language-plaintext highlighter-rouge">docker pull</code> legitimately gets the same digest, the script runs cleanly, the children get recreated from the same pinned versions — and your Nextcloud version doesn’t change. That’s not a script failure, that’s the script doing exactly what it should.</p>

<p>The way to sanity-check is to compare the running version against the <a href="https://nextcloud.com/changelog/">Nextcloud changelog</a>. If <code class="language-plaintext highlighter-rouge">occ status</code> reports an older point release than the latest in the changelog but the AIO image hasn’t bumped yet, the bottleneck is upstream’s release cadence — not your automation. The next scheduled run will pick it up the moment AIO publishes.</p>
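
<p>One way to tell the two cases apart in the log is to record the image ID before and after the pull. This is only a sketch: <code class="language-plaintext highlighter-rouge">image_id</code> and <code class="language-plaintext highlighter-rouge">pull_and_report</code> are hypothetical names, and <code class="language-plaintext highlighter-rouge">log</code> is redefined here in the same shape as the script’s helper so the snippet runs on its own.</p>

```shell
IMAGE="nextcloud/all-in-one:latest"

# Same shape as the log() helper in the update script.
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }

# Print the local image ID for a tag, or nothing if the tag is not present.
image_id() {
    docker image inspect --format '{{.Id}}' "$1" 2>/dev/null
}

# Pull the image and log whether the tag now points at a different image.
pull_and_report() {
    before="$(image_id "$IMAGE")" || true   # a missing image is not fatal
    docker pull "$IMAGE" >/dev/null || return 1
    after="$(image_id "$IMAGE")" || true
    if [ "$before" = "$after" ]; then
        log "Image unchanged ($after): expect no version bump this run"
    else
        log "Image changed: $before -> $after"
    fi
}
```

<p>Swapping this in for the bare <code class="language-plaintext highlighter-rouge">docker pull</code> in Step 1 would leave one line per run saying whether upstream had published anything new.</p>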

<h2 id="what-id-improve-next">What I’d Improve Next</h2>

<ul>
  <li><strong>Email alert on failure.</strong> Right now I have to grep the log. Trivial to wire <code class="language-plaintext highlighter-rouge">mail</code> or an SMTP relay into a <code class="language-plaintext highlighter-rouge">trap</code> so any non-zero exit pings me.</li>
  <li><strong>Log rotation.</strong> The log file just grows. A small <code class="language-plaintext highlighter-rouge">logrotate</code> config to weekly-rotate it with a reasonable retention would be tidy.</li>
  <li><strong>Pre-flight version capture.</strong> Logging the running Nextcloud version <em>before</em> and <em>after</em> would make it obvious at a glance which weekly runs actually delivered a new release.</li>
  <li><strong>Master container health probe</strong> at the end — a quick check that the mastercontainer came back up cleanly before logging “Completed”.</li>
</ul>
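
<p>The first and last items could look roughly like this. Treat it as a sketch under assumptions: it presumes a working <code class="language-plaintext highlighter-rouge">mail</code> command that can reach a relay, the address in <code class="language-plaintext highlighter-rouge">ALERT_TO</code> is a placeholder, and <code class="language-plaintext highlighter-rouge">on_exit</code> and <code class="language-plaintext highlighter-rouge">master_is_running</code> are names invented for the example.</p>

```shell
ALERT_TO="${ALERT_TO:-root@localhost}"              # placeholder recipient
LOG_FILE="${LOG_FILE:-/root/nextcloud_update.log}"  # log path used by the cron entry

# Called from the EXIT trap: send mail only when the script failed.
on_exit() {
    rc="$1"
    [ "$rc" -eq 0 ] && return 0
    {
        echo "update_nextcloud_aio.sh exited with status $rc on $(hostname)"
        echo "Last log lines:"
        tail -n 20 "$LOG_FILE" 2>/dev/null
    } | mail -s "Nextcloud AIO update FAILED (exit $rc)" "$ALERT_TO"
}

# Succeeds only if Docker reports the mastercontainer as running.
master_is_running() {
    state="$(docker inspect --format '{{.State.Running}}' \
        nextcloud-aio-mastercontainer 2>/dev/null)"
    [ "$state" = "true" ]
}

# Install the trap once, right after `set -e` in the update script.
# "$?" is expanded when the trap fires, so it carries the real exit status.
trap 'on_exit "$?"' EXIT

# Just before the final "Completed" log line in the update script:
#   master_is_running || { log "ERROR: mastercontainer is not running"; exit 1; }
```

<p>An <code class="language-plaintext highlighter-rouge">EXIT</code> trap that checks the status covers both a <code class="language-plaintext highlighter-rouge">set -e</code> abort and an explicit <code class="language-plaintext highlighter-rouge">exit 1</code>, which is why I would reach for it before anything fancier.</p>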

<h2 id="takeaway">Takeaway</h2>

<p>The clean way to automate AIO is to do exactly what the master container does internally — pull the new image, recreate the children, run <code class="language-plaintext highlighter-rouge">StartAndUpdateContainers.php</code> — but call it directly so it can run on a schedule without any human in the loop. Wrap it in cron, log it, verify it weekly, and it just runs.</p>

<p>Take the script, swap your own <code class="language-plaintext highlighter-rouge">docker run</code> flags into Step 4, and you’re done.</p>]]></content><author><name>Joshua Mein</name></author><category term="Homelab" /><category term="DevOps" /><category term="nextcloud" /><category term="docker" /><category term="bash" /><category term="automation" /><category term="cron" /><category term="linux" /><category term="self-hosting" /><summary type="html"><![CDATA[A small bash script that pulls the latest Nextcloud All-in-One image, recreates the child containers, and triggers the update by calling AIO's internal cron PHP directly — turning AIO's manual UI update flow into something that just runs on a schedule, with proper logging and verification.]]></summary></entry><entry><title type="html">I Built a GNOME Shell Extension for Tailscale — Panel Toggle, Peer Browser, and the Signal-Handler Gotcha That Broke It</title><link href="https://joshwaamein.github.io/posts/gnome-tailscale-shell-extension/" rel="alternate" type="text/html" title="I Built a GNOME Shell Extension for Tailscale — Panel Toggle, Peer Browser, and the Signal-Handler Gotcha That Broke It" /><published>2026-05-14T22:00:00+01:00</published><updated>2026-05-14T22:00:00+01:00</updated><id>https://joshwaamein.github.io/posts/gnome-tailscale-shell-extension</id><content type="html" xml:base="https://joshwaamein.github.io/posts/gnome-tailscale-shell-extension/"><![CDATA[<p>I run Tailscale on every machine I own. My homelab is stitched together with it, my laptop joins the tailnet on boot, and at this point I treat <code class="language-plaintext highlighter-rouge">100.64.0.0/10</code> like it’s part of my own LAN. It’s brilliant.</p>

<p>What’s <em>not</em> brilliant is the day-to-day UX on Linux. The CLI is excellent — but it lives in a terminal. GNOME’s built-in <strong>VPN</strong> panel applet doesn’t speak Tailscale’s control protocol, so all the things I actually click on (toggle, exit node, copy a peer’s IP, check who’s online) live behind <code class="language-plaintext highlighter-rouge">tailscale</code> subcommands. Every time I needed a peer’s IPv4 I’d open a terminal, type <code class="language-plaintext highlighter-rouge">tailscale status</code>, scroll, and copy. Every time.</p>

<p>So I built the small thing that should have always existed: a GNOME Shell panel indicator that wraps the bits of the Tailscale CLI you actually click on, without changing your daemon configuration unless you explicitly ask it to.</p>

<p>It’s called <a href="https://github.com/Joshwaamein/gnome-tailscale"><code class="language-plaintext highlighter-rouge">gnome-tailscale</code></a>, it ships for GNOME Shell <strong>48, 49, and 50</strong>, and it has roughly one bug I’m still slightly embarrassed about. Here’s the writeup.</p>

<hr />

<h2 id="the-setup">The Setup</h2>

<ul>
  <li>A laptop running Ubuntu 24.04 (GNOME Shell 46 → 48 after upgrade)</li>
  <li>A desktop running Fedora 40 (GNOME Shell 48)</li>
  <li>A future-me on Ubuntu 26.04 “Resolute Raccoon” (GNOME Shell 50)</li>
  <li>Tailscale CLI installed everywhere, <code class="language-plaintext highlighter-rouge">tailscaled</code> always running</li>
</ul>

<p>The goal: a <strong>panel indicator</strong> that reflects daemon state, lets me toggle the daemon, lists my tailnet, copies peer IPs on click, picks an exit node from a submenu, and surfaces actionable error messages — no terminal required.</p>

<h2 id="why-i-couldnt-reuse-anything-existing">Why I Couldn’t Reuse Anything Existing</h2>

<p>Before writing a line of GJS, I went through the usual dead ends:</p>

<ol>
  <li><strong>GNOME’s built-in VPN panel.</strong> It speaks NetworkManager. Tailscale is a userland mesh — it doesn’t expose itself as an <code class="language-plaintext highlighter-rouge">NM</code> connection. Dead end.</li>
  <li><strong>The “Tailscale Status” extensions on extensions.gnome.org.</strong> Most are stuck on GNOME Shell 42–44 (the old <code class="language-plaintext highlighter-rouge">imports.*</code> legacy-module world). GNOME Shell switched to ESM (<code class="language-plaintext highlighter-rouge">import</code> / <code class="language-plaintext highlighter-rouge">export</code>) in version 45, so the old-API extensions are not loadable at all on Shell 48+. Porting a 3-year-old codebase to ESM was going to take longer than starting fresh.</li>
  <li><strong>A standalone tray app via Ayatana AppIndicator.</strong> Works, but doesn’t blend into the panel and breaks every time GNOME changes its mind about tray icons.</li>
  <li><strong>A bash script bound to a keyboard shortcut.</strong> Toggle works, but there’s nowhere to <em>show</em> the peer list.</li>
</ol>

<p>The only sensible option was a native shell extension targeted at the <strong>GJS ESM era</strong> — Shell 48, 49, and 50.</p>

<hr />

<h2 id="the-architecture">The Architecture</h2>

<p>I wanted the extension to be small and testable. GJS is fun until you try to unit-test it; the runtime is bound to GNOME Shell, so anything that touches <code class="language-plaintext highlighter-rouge">St</code>, <code class="language-plaintext highlighter-rouge">Clutter</code>, or <code class="language-plaintext highlighter-rouge">Gio</code> won’t run under plain Node.</p>

<p>So I split the codebase three ways:</p>

<table>
  <thead>
    <tr>
      <th>File</th>
      <th>Runtime</th>
      <th>What’s in it</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">extension.js</code></td>
      <td>GJS</td>
      <td>The panel indicator — <code class="language-plaintext highlighter-rouge">St</code> widgets, menu items, the polling loop.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">prefs.js</code></td>
      <td>GJS (Adwaita)</td>
      <td>The preferences window.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lib/util.js</code></td>
      <td><strong>Pure JS</strong></td>
      <td>Formatting, sorting, argv builders, error classification.</td>
    </tr>
  </tbody>
</table>

<p><code class="language-plaintext highlighter-rouge">lib/util.js</code> is the trick. Anything that’s pure logic — parsing <code class="language-plaintext highlighter-rouge">tailscale status --json</code>, sorting peers, working out which <code class="language-plaintext highlighter-rouge">tailscale</code> argv to spawn for a given preference combination, classifying error output into one of about eight known categories — lives there with <strong>zero GJS imports</strong>. It’s runnable under Node’s built-in test runner, which means CI can lint and test the extension without ever touching a real GNOME Shell.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make <span class="nb">test</span>       <span class="c"># runs node --test on tests/*</span>
make lint       <span class="c"># eslint on the whole tree</span>
make schema     <span class="c"># compiles the GSettings schema</span>
make ci         <span class="c"># everything CI runs</span>
make pack       <span class="c"># builds the release zip</span>
</code></pre></div></div>

<p>The whole thing is under 2,000 lines including tests.</p>

<hr />

<h2 id="the-dont-touch-my-daemon-principle">The “Don’t Touch My Daemon” Principle</h2>

<p>The single most important design decision: <strong>toggling the panel switch does not change your daemon configuration</strong>. Ever. By default, the toggle runs <code class="language-plaintext highlighter-rouge">tailscale up</code> and <em>that’s it</em>. It does not push <code class="language-plaintext highlighter-rouge">--accept-routes</code>, it does not push <code class="language-plaintext highlighter-rouge">--accept-dns</code>, it does not run <code class="language-plaintext highlighter-rouge">tailscale set</code> for anything.</p>

<p>Why that matters: if you’ve spent an evening tuning your <code class="language-plaintext highlighter-rouge">tailscaled</code> flags exactly the way you like them, the last thing you want is a friendly little panel applet quietly overwriting them every time you click it. I’ve been bitten by exactly that on other VPN GUIs.</p>

<p>There’s a single switch in prefs called <em>Override accept-routes / accept-dns on connect</em>. It’s <strong>off by default</strong>. Turn it on if you want the panel to actively manage those flags via <code class="language-plaintext highlighter-rouge">tailscale set</code> after each <code class="language-plaintext highlighter-rouge">up</code>. Otherwise the panel is purely an <em>observer plus toggle</em>.</p>

<p>The privileged-command path is similar:</p>

<table>
  <thead>
    <tr>
      <th>Setting</th>
      <th>Value</th>
      <th>Behaviour</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Use pkexec for up/down</td>
      <td><strong>on</strong> (default)</td>
      <td>Privileged <code class="language-plaintext highlighter-rouge">tailscale up</code>/<code class="language-plaintext highlighter-rouge">down</code> go through a polkit dialog.</td>
    </tr>
    <tr>
      <td>Use pkexec for up/down</td>
      <td>off</td>
      <td>Assumes you’ve run <code class="language-plaintext highlighter-rouge">sudo tailscale set --operator=$USER</code> and <code class="language-plaintext highlighter-rouge">tailscale</code> runs without sudo.</td>
    </tr>
  </tbody>
</table>

<p>The prefs window literally tells you the two recipes (Option A: <code class="language-plaintext highlighter-rouge">--operator</code>, Option B: a sudoers alias) and explains that Option B is a terminal-only convenience and won’t help the panel toggle. Trying to be a polite citizen of someone else’s machine.</p>

<hr />

<h2 id="the-polling-loop">The Polling Loop</h2>

<p><code class="language-plaintext highlighter-rouge">tailscale status --json</code> is the source of truth. The extension polls it every 5 seconds (configurable) and rebuilds the menu from scratch each tick. Nothing about the peer list or exit-node list is hardcoded — every menu item exists because it appeared in the most recent JSON.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// extension.js (simplified)</span>
<span class="k">async</span> <span class="nf">_tick</span><span class="p">()</span> <span class="p">{</span>
    <span class="kd">let</span> <span class="nx">proc</span><span class="p">;</span>
    <span class="k">try</span> <span class="p">{</span>
        <span class="nx">proc</span> <span class="o">=</span> <span class="nx">Gio</span><span class="p">.</span><span class="nx">Subprocess</span><span class="p">.</span><span class="k">new</span><span class="p">(</span>
            <span class="p">[</span><span class="dl">'</span><span class="s1">tailscale</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">status</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">--json</span><span class="dl">'</span><span class="p">],</span>
            <span class="nx">Gio</span><span class="p">.</span><span class="nx">SubprocessFlags</span><span class="p">.</span><span class="nx">STDOUT_PIPE</span> <span class="o">|</span> <span class="nx">Gio</span><span class="p">.</span><span class="nx">SubprocessFlags</span><span class="p">.</span><span class="nx">STDERR_PIPE</span>
        <span class="p">);</span>
    <span class="p">}</span> <span class="k">catch </span><span class="p">(</span><span class="nx">e</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="k">this</span><span class="p">.</span><span class="nf">_showError</span><span class="p">(</span><span class="nf">classifyError</span><span class="p">(</span><span class="nx">e</span><span class="p">));</span>
    <span class="p">}</span>

    <span class="c1">// Assumes Gio._promisify() was applied to communicate_utf8_async, so this resolves to [stdout, stderr].</span>
    <span class="kd">const</span> <span class="p">[</span><span class="nx">stdout</span><span class="p">,</span> <span class="nx">stderr</span><span class="p">]</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">proc</span><span class="p">.</span><span class="nf">communicate_utf8_async</span><span class="p">(</span><span class="kc">null</span><span class="p">,</span> <span class="kc">null</span><span class="p">);</span>
    <span class="k">if </span><span class="p">(</span><span class="o">!</span><span class="nx">proc</span><span class="p">.</span><span class="nf">get_successful</span><span class="p">())</span> <span class="p">{</span>
        <span class="k">return</span> <span class="k">this</span><span class="p">.</span><span class="nf">_showError</span><span class="p">(</span><span class="nf">classifyError</span><span class="p">(</span><span class="nx">stderr</span><span class="p">));</span>
    <span class="p">}</span>

    <span class="kd">const</span> <span class="nx">status</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="nx">stdout</span><span class="p">);</span>
    <span class="k">this</span><span class="p">.</span><span class="nf">_render</span><span class="p">(</span><span class="nx">status</span><span class="p">);</span>   <span class="c1">// pure: status -&gt; menu items</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">classifyError</code> lives in <code class="language-plaintext highlighter-rouge">lib/util.js</code> and is unit-tested. It maps stderr blobs onto a small enum:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// lib/util.js</span>
<span class="k">export</span> <span class="kd">function</span> <span class="nf">classifyError</span><span class="p">(</span><span class="nx">stderr</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if </span><span class="p">(</span><span class="sr">/command not found|ENOENT/</span><span class="p">.</span><span class="nf">test</span><span class="p">(</span><span class="nx">stderr</span><span class="p">))</span>   <span class="k">return</span> <span class="dl">'</span><span class="s1">CLI_MISSING</span><span class="dl">'</span><span class="p">;</span>
    <span class="k">if </span><span class="p">(</span><span class="sr">/not running|connection refused/i</span><span class="p">.</span><span class="nf">test</span><span class="p">(</span><span class="nx">stderr</span><span class="p">))</span> <span class="k">return</span> <span class="dl">'</span><span class="s1">DAEMON_DOWN</span><span class="dl">'</span><span class="p">;</span>
    <span class="k">if </span><span class="p">(</span><span class="sr">/Logged out|please run.*tailscale up/i</span><span class="p">.</span><span class="nf">test</span><span class="p">(</span><span class="nx">stderr</span><span class="p">))</span> <span class="k">return</span> <span class="dl">'</span><span class="s1">LOGGED_OUT</span><span class="dl">'</span><span class="p">;</span>
    <span class="k">if </span><span class="p">(</span><span class="sr">/Authentication cancelled|polkit/i</span><span class="p">.</span><span class="nf">test</span><span class="p">(</span><span class="nx">stderr</span><span class="p">))</span> <span class="k">return</span> <span class="dl">'</span><span class="s1">PKEXEC_CANCELLED</span><span class="dl">'</span><span class="p">;</span>
    <span class="k">if </span><span class="p">(</span><span class="sr">/permission denied|operator/i</span><span class="p">.</span><span class="nf">test</span><span class="p">(</span><span class="nx">stderr</span><span class="p">))</span> <span class="k">return</span> <span class="dl">'</span><span class="s1">NO_OPERATOR</span><span class="dl">'</span><span class="p">;</span>
    <span class="k">return</span> <span class="dl">'</span><span class="s1">UNKNOWN</span><span class="dl">'</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That enum drives both the user-facing notification copy <em>and</em> a <em>Copy error details</em> item in the menu, so when something does go sideways you can paste the raw stderr into a bug report instead of squinting at a vague “something went wrong”.</p>

<hr />

<h2 id="the-bug-that-took-me-a-whole-evening">The Bug That Took Me a Whole Evening</h2>

<p>Here’s the embarrassing one. Early users (i.e. me) reported:</p>

<blockquote>
  <p>The toggle works the <strong>first</strong> time. After that, clicking it does nothing.</p>
</blockquote>

<p>The toggle is a <code class="language-plaintext highlighter-rouge">PopupSwitchMenuItem</code>. Naively, you connect to its <code class="language-plaintext highlighter-rouge">'toggled'</code> signal and call <code class="language-plaintext highlighter-rouge">tailscale up</code> or <code class="language-plaintext highlighter-rouge">tailscale down</code> accordingly:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// THE BUG</span>
<span class="k">this</span><span class="p">.</span><span class="nx">_toggleItem</span><span class="p">.</span><span class="nf">connect</span><span class="p">(</span><span class="dl">'</span><span class="s1">toggled</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">item</span><span class="p">,</span> <span class="nx">active</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="k">if </span><span class="p">(</span><span class="nx">active</span><span class="p">)</span> <span class="k">this</span><span class="p">.</span><span class="nf">_tailscaleUp</span><span class="p">();</span>
    <span class="k">else</span>        <span class="k">this</span><span class="p">.</span><span class="nf">_tailscaleDown</span><span class="p">();</span>
<span class="p">});</span>
</code></pre></div></div>

<p>Then every poll, you reflect the <em>real</em> daemon state back onto the switch:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// THE BUG, continued</span>
<span class="nf">_render</span><span class="p">(</span><span class="nx">status</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">this</span><span class="p">.</span><span class="nx">_toggleItem</span><span class="p">.</span><span class="nf">setToggleState</span><span class="p">(</span><span class="nx">status</span><span class="p">.</span><span class="nx">BackendState</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">Running</span><span class="dl">'</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Spot it? <code class="language-plaintext highlighter-rouge">setToggleState()</code> <strong>fires the <code class="language-plaintext highlighter-rouge">toggled</code> signal</strong>. So:</p>

<ol>
  <li>User clicks the switch → <code class="language-plaintext highlighter-rouge">'toggled'</code> fires with <code class="language-plaintext highlighter-rouge">active=true</code> → <code class="language-plaintext highlighter-rouge">tailscale up</code> runs.</li>
  <li>Five seconds later, poll completes, <code class="language-plaintext highlighter-rouge">setToggleState(true)</code> is called.</li>
  <li><code class="language-plaintext highlighter-rouge">setToggleState(true)</code> fires <code class="language-plaintext highlighter-rouge">'toggled'</code> <em>again</em> with <code class="language-plaintext highlighter-rouge">active=true</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">tailscale up</code> runs <em>again</em> — harmless because the daemon is already up.</li>
  <li>User clicks the switch <em>off</em> → <code class="language-plaintext highlighter-rouge">'toggled'</code> fires with <code class="language-plaintext highlighter-rouge">active=false</code> → <code class="language-plaintext highlighter-rouge">tailscale down</code> runs.</li>
  <li>Five seconds later, poll completes, <code class="language-plaintext highlighter-rouge">setToggleState(false)</code> is called.</li>
  <li><code class="language-plaintext highlighter-rouge">setToggleState(false)</code> fires <code class="language-plaintext highlighter-rouge">'toggled'</code> again with <code class="language-plaintext highlighter-rouge">active=false</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">tailscale down</code> runs <em>again</em>. Daemon is already down. Still harmless.</li>
  <li>User clicks the switch <em>on</em> → <code class="language-plaintext highlighter-rouge">'toggled'</code> fires with <code class="language-plaintext highlighter-rouge">active=true</code>…</li>
</ol>

<p>…except by step 9, the recursive <code class="language-plaintext highlighter-rouge">'toggled'</code> from step 7 has <em>also</em> fired, and the user-initiated state change races against the programmatic one. Depending on which finishes first, the switch can end up visually <em>off</em> while my handler genuinely thought the user wanted it <em>on</em>. From the user’s perspective: clicking does nothing.</p>

<p>The fix is one line, sort of:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// THE FIX (extension.js)</span>
<span class="k">this</span><span class="p">.</span><span class="nx">_toggleHandlerId</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">_toggleItem</span><span class="p">.</span><span class="nf">connect</span><span class="p">(</span><span class="dl">'</span><span class="s1">toggled</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">_</span><span class="p">,</span> <span class="nx">active</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="k">if </span><span class="p">(</span><span class="nx">active</span><span class="p">)</span> <span class="k">this</span><span class="p">.</span><span class="nf">_tailscaleUp</span><span class="p">();</span>
    <span class="k">else</span>        <span class="k">this</span><span class="p">.</span><span class="nf">_tailscaleDown</span><span class="p">();</span>
<span class="p">});</span>

<span class="nf">_render</span><span class="p">(</span><span class="nx">status</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Block the handler while we mirror daemon state onto the switch,</span>
    <span class="c1">// so the programmatic update doesn't re-fire 'toggled'.</span>
    <span class="k">this</span><span class="p">.</span><span class="nx">_toggleItem</span><span class="p">.</span><span class="nf">block_signal_handler</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">_toggleHandlerId</span><span class="p">);</span>
    <span class="k">try</span> <span class="p">{</span>
        <span class="k">this</span><span class="p">.</span><span class="nx">_toggleItem</span><span class="p">.</span><span class="nf">setToggleState</span><span class="p">(</span><span class="nx">status</span><span class="p">.</span><span class="nx">BackendState</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">Running</span><span class="dl">'</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">finally</span> <span class="p">{</span>
        <span class="k">this</span><span class="p">.</span><span class="nx">_toggleItem</span><span class="p">.</span><span class="nf">unblock_signal_handler</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">_toggleHandlerId</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">block_signal_handler</code> / <code class="language-plaintext highlighter-rouge">unblock_signal_handler</code> are GObject’s standard “shut up for a moment” pair. The <code class="language-plaintext highlighter-rouge">try/finally</code> is non-negotiable: if <code class="language-plaintext highlighter-rouge">setToggleState</code> ever throws, the <code class="language-plaintext highlighter-rouge">finally</code> branch still unblocks the handler. Without it the handler would stay blocked, the <em>next</em> poll could not recover, and the switch would go dead permanently.</p>

<p>This is the kind of bug that doesn’t show up in unit tests because the unit tests can’t import <code class="language-plaintext highlighter-rouge">St</code>. It only shows up when a real human clicks the switch on a real GNOME Shell. Lesson learned: when in doubt, <strong>block the handler before mirroring state</strong>.</p>

<p>The fix shipped in 0.2.0. There’s even a row in the troubleshooting table for it, because I wanted future-me to be able to find it.</p>

<hr />

<h2 id="what-made-it-onto-the-panel">What Made It Onto the Panel</h2>

<p>After a few iterations, the menu settled into this shape:</p>

<table>
  <thead>
    <tr>
      <th>Section</th>
      <th>What it shows</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Self</strong></td>
      <td>This machine’s hostname, MagicDNS short name, OS, online dot. Click copies its Tailscale IPv4.</td>
    </tr>
    <tr>
      <td><strong>Toggle</strong></td>
      <td>A <code class="language-plaintext highlighter-rouge">PopupSwitchMenuItem</code> that runs <code class="language-plaintext highlighter-rouge">tailscale up</code>/<code class="language-plaintext highlighter-rouge">down</code> (via pkexec by default).</td>
    </tr>
    <tr>
      <td><strong>Exit Node ▸</strong></td>
      <td>Every peer reported with <code class="language-plaintext highlighter-rouge">--advertise-exit-node</code>, with a green/grey dot, OS in plain text. Plus a <em>None</em> row to clear.</td>
    </tr>
    <tr>
      <td><strong>Peers ▸</strong></td>
      <td>All peers from <code class="language-plaintext highlighter-rouge">tailscale status --json</code> — dot, name, OS, primary IPv4, tags (<code class="language-plaintext highlighter-rouge">exit</code>, <code class="language-plaintext highlighter-rouge">active</code>). Click copies IPv4 (or MagicDNS name — configurable).</td>
    </tr>
    <tr>
      <td><strong>Quick links</strong></td>
      <td>Admin console, manual refresh, preferences.</td>
    </tr>
    <tr>
      <td><strong>Errors</strong></td>
      <td>When something goes wrong, a notification + a <em>Copy error details</em> item on the menu. Errors are also written to the systemd journal (readable with <code class="language-plaintext highlighter-rouge">journalctl</code>) with a <code class="language-plaintext highlighter-rouge">[tailscale]</code> prefix.</td>
    </tr>
  </tbody>
</table>

<p>The panel icon flips between connected/disconnected glyphs based on <code class="language-plaintext highlighter-rouge">BackendState</code>. That’s it. No popups, no modal dialogs, no surprise reconfigurations. The whole thing is &lt; 2,000 lines including tests.</p>

<hr />

<h2 id="gotchas-i-hit-along-the-way">Gotchas I Hit Along the Way</h2>

<p>A few things that were less obvious than they should have been:</p>

<h3 id="1-symlink-installs-will-eat-your-source-tree">1. Symlink installs will eat your source tree</h3>

<p><code class="language-plaintext highlighter-rouge">gnome-extensions install</code> copies your zip into <code class="language-plaintext highlighter-rouge">~/.local/share/gnome-shell/extensions/</code>. If you symlink your dev tree there instead (which I do, via <code class="language-plaintext highlighter-rouge">make link</code>), and then you click <em>Uninstall</em> in the GNOME Extensions app — <strong>GNOME deletes the contents of the symlink target</strong>. That is to say: your source tree.</p>

<p>I’ve put a comically aggressive warning in the README and <code class="language-plaintext highlighter-rouge">make link</code> itself prints a reminder. There’s also a <code class="language-plaintext highlighter-rouge">make uninstall</code> that does the right thing (remove the symlink, not the target).</p>
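
<p>For illustration, here is a minimal sketch of what “remove the symlink, not the target” can look like. The function name <code class="language-plaintext highlighter-rouge">safe_unlink_extension</code> is a hypothetical helper made up for this example, not the contents of the project’s actual Makefile:</p>

```shell
# Hypothetical helper: remove an extension install only if it is a symlink.
# It never recurses into the link target, so a linked dev tree is left untouched.
safe_unlink_extension() {
  local ext_path="${1%/}"      # drop a trailing slash so the link itself is tested
  if [ -L "$ext_path" ]; then
    rm -- "$ext_path"          # removes the link, not what it points at
    echo "removed symlink: $ext_path"
  elif [ -e "$ext_path" ]; then
    echo "refusing: $ext_path is not a symlink" >&2
    return 1
  else
    echo "nothing to do: $ext_path does not exist"
  fi
}
```

<p>Called with the extension’s path under <code class="language-plaintext highlighter-rouge">~/.local/share/gnome-shell/extensions/</code>, it deletes a dev symlink and refuses to touch a real directory.</p>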

<h3 id="2-shell-48-is-esm-shell-45-is-not">2. Shell 45+ is ESM. Shell 44 and earlier are not.</h3>

<p>If you’re porting an old extension, <code class="language-plaintext highlighter-rouge">imports.misc.extensionUtils</code> is gone. <code class="language-plaintext highlighter-rouge">Main.panel.addToStatusArea</code> is still there. <code class="language-plaintext highlighter-rouge">St</code>/<code class="language-plaintext highlighter-rouge">Clutter</code>/<code class="language-plaintext highlighter-rouge">Gio</code> you import from <code class="language-plaintext highlighter-rouge">gi://</code>. The <code class="language-plaintext highlighter-rouge">Extension</code> base class has lifecycle methods (<code class="language-plaintext highlighter-rouge">enable</code>, <code class="language-plaintext highlighter-rouge">disable</code>) that you actually have to implement properly because nothing magic happens for you. The migration is mechanical but tedious.</p>

<h3 id="3-giosubprocess-is-your-friend">3. <code class="language-plaintext highlighter-rouge">Gio.Subprocess</code> is your friend</h3>

<p>The naive way to spawn <code class="language-plaintext highlighter-rouge">tailscale</code> is <code class="language-plaintext highlighter-rouge">GLib.spawn_command_line_sync</code>. <strong>Don’t.</strong> It blocks the shell — and “the shell” here is <em>literal</em> GNOME Shell, the thing rendering your entire desktop. A 200ms hang in <code class="language-plaintext highlighter-rouge">tailscale status</code> becomes a 200ms freeze of every window animation on your screen. Use <code class="language-plaintext highlighter-rouge">Gio.Subprocess</code> with <code class="language-plaintext highlighter-rouge">communicate_utf8_async</code> (promisified once via <code class="language-plaintext highlighter-rouge">Gio._promisify</code>, since the raw method takes a callback), <code class="language-plaintext highlighter-rouge">await</code> the promise, and never block.</p>

<h3 id="4-adwaita-prefs-windows-are-easier-than-youd-think">4. Adwaita prefs windows are easier than you’d think</h3>

<p>GNOME 42+ extensions can use full Adwaita widgets in <code class="language-plaintext highlighter-rouge">prefs.js</code>. <code class="language-plaintext highlighter-rouge">AdwPreferencesPage</code> + <code class="language-plaintext highlighter-rouge">AdwPreferencesGroup</code> + <code class="language-plaintext highlighter-rouge">AdwActionRow</code>/<code class="language-plaintext highlighter-rouge">AdwSwitchRow</code>/<code class="language-plaintext highlighter-rouge">AdwSpinRow</code> give you a prefs UI that looks identical to GNOME Settings. Hooking each row up to a <code class="language-plaintext highlighter-rouge">Gio.Settings</code> instance is a one-liner per setting (<code class="language-plaintext highlighter-rouge">settings.bind('key', row, 'active', Gio.SettingsBindFlags.DEFAULT)</code>).</p>

<hr />

<h2 id="releasing-it">Releasing It</h2>

<p>The release pipeline is just a Makefile target and a GitHub Actions workflow:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make pack       <span class="c"># produces dist/tailscale@Joshwaamein.github.io.shell-extension.zip</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">make pack</code> runs <code class="language-plaintext highlighter-rouge">make ci</code> first (lint + tests + schema compile), then bundles the extension into the zip layout that <code class="language-plaintext highlighter-rouge">gnome-extensions install --force &lt;zip&gt;</code> accepts. CI uploads the zip as a release asset on every tag. Anyone can install with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gnome-extensions <span class="nb">install</span> <span class="nt">--force</span> tailscale@Joshwaamein.github.io.shell-extension.zip
gnome-extensions <span class="nb">enable </span>tailscale@Joshwaamein.github.io
</code></pre></div></div>

<p>Wayland users have to log out and back in once, because Shell can’t hot-load a new extension on Wayland. X11 users can press <code class="language-plaintext highlighter-rouge">Alt+F2</code>, type <code class="language-plaintext highlighter-rouge">r</code>, and hit Enter — old-school, but it still works.</p>
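
<p>If you want an install script to print the right hint, a small sketch could branch on <code class="language-plaintext highlighter-rouge">XDG_SESSION_TYPE</code>. The <code class="language-plaintext highlighter-rouge">reload_hint</code> function is a hypothetical helper for this example, not something the extension ships:</p>

```shell
# Hypothetical helper: tell the user how to load a newly installed extension.
# Takes the session type as an argument and defaults to $XDG_SESSION_TYPE.
reload_hint() {
  local session="${1:-${XDG_SESSION_TYPE:-unknown}}"
  case "$session" in
    wayland) echo "Log out and back in to load the extension." ;;
    x11)     echo "Press Alt+F2, type r, and hit Enter to restart the Shell." ;;
    *)       echo "Unknown session type '$session'; log out and back in to be safe." ;;
  esac
}
```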

<hr />

<h2 id="what-id-do-differently">What I’d Do Differently</h2>

<p>A few things I’d change if starting again:</p>

<ul>
  <li><strong>Bind state with a tiny store.</strong> I rebuild the entire menu on every poll. That’s fine at 5-second intervals and a dozen peers, but it does mean you sometimes see a flash if you’re hovering an item exactly when the poll completes. A diff-based renderer (or just remembering which submenu was open and reopening it after rebuild) would be nicer.</li>
  <li><strong>Cache <code class="language-plaintext highlighter-rouge">tailscale --version</code> once at startup</strong>, not on every error path. I currently shell out to it whenever I want to render a “your CLI is too old” hint, which is wasteful.</li>
  <li><strong>Push releases to extensions.gnome.org.</strong> Right now you install from the GitHub release zip. e.g.o. has a review process I haven’t bothered with yet.</li>
</ul>

<hr />

<h2 id="the-result">The Result</h2>

<p>I haven’t typed <code class="language-plaintext highlighter-rouge">tailscale status</code> in a terminal for weeks. Toggling the daemon is a click. Copying a peer’s IPv4 is a click. Picking an exit node when I’m on hotel Wi-Fi is two clicks. None of it changes my daemon configuration unless I’ve explicitly opted in. And when something does go wrong — daemon down, CLI missing, login expired — the panel tells me what specifically is broken instead of vaguely failing.</p>

<p>It’s open source, GPL-2.0-or-later (same family as GNOME Shell itself), and lives at <a href="https://github.com/Joshwaamein/gnome-tailscale">github.com/Joshwaamein/gnome-tailscale</a>. PRs welcome — there’s a <a href="https://github.com/Joshwaamein/gnome-tailscale/blob/main/CONTRIBUTING.md"><code class="language-plaintext highlighter-rouge">CONTRIBUTING.md</code></a> and the <code class="language-plaintext highlighter-rouge">lib/util.js</code> split means new logic comes with tests.</p>

<p>Sometimes the tool you want is a 2,000-line Saturday project away.</p>]]></content><author><name>Joshua Mein</name></author><category term="Code" /><category term="Linux" /><category term="tailscale" /><category term="gnome" /><category term="linux" /><category term="javascript" /><category term="gjs" /><category term="vpn" /><category term="automation" /><category term="networking" /><summary type="html"><![CDATA[Why GNOME's built-in VPN panel can't drive Tailscale, how I shipped a small GJS extension to fix that across GNOME Shell 48–50, and the signal-handler re-fire bug that made my toggle behave like a one-shot fuse.]]></summary></entry><entry><title type="html">How I Fixed SSL Certificate Warnings Across My Entire Proxmox Homelab — With Full Auto-Renewal and Email Alerts</title><link href="https://joshwaamein.github.io/posts/proxmox-wildcard-cert-letsencrypt-dns01/" rel="alternate" type="text/html" title="How I Fixed SSL Certificate Warnings Across My Entire Proxmox Homelab — With Full Auto-Renewal and Email Alerts" /><published>2026-04-26T00:00:00+01:00</published><updated>2026-04-26T00:00:00+01:00</updated><id>https://joshwaamein.github.io/posts/proxmox-wildcard-cert-letsencrypt-dns01</id><content type="html" xml:base="https://joshwaamein.github.io/posts/proxmox-wildcard-cert-letsencrypt-dns01/"><![CDATA[<p>If you run a Proxmox homelab, you know the drill. You open your PVE or PBS web UI and Chrome hits you with the red “Your connection is not private” screen. You click Advanced, you click Proceed, and you feel slightly bad about it. Every. Single. Time.</p>

<p>I finally fixed it — for all my servers at once, fully automated, with email alerts on every renewal. Here’s the complete guide including the gotcha that broke my backups immediately after, and how I fixed that too.</p>

<hr />

<h2 id="my-setup">My Setup</h2>

<ul>
  <li>3 × Proxmox VE nodes</li>
  <li>4 × Proxmox Backup Server nodes</li>
  <li>All private, accessible via Tailscale only</li>
  <li>Domain managed by Cloudflare</li>
</ul>

<h2 id="why-standard-lets-encrypt-doesnt-work-here">Why Standard Let’s Encrypt Doesn’t Work Here</h2>

<p>The usual HTTP-01 challenge requires your server to be reachable on port 80 from the internet. My servers are behind Tailscale — they’re not reachable from the internet at all. HTTP-01 is a non-starter.</p>

<p>The answer is the <strong>DNS-01 challenge</strong>. You prove domain ownership by creating a TXT record in your DNS zone instead. Let’s Encrypt checks the TXT record, issues the cert, and your server never needs to be publicly accessible. If your DNS is managed by Cloudflare (or most other major providers), this is fully automatable.</p>
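
<p>To make the mechanics concrete, here is a small sketch of how a certificate name maps to the TXT record that gets checked. The <code class="language-plaintext highlighter-rouge">challenge_name</code> helper is my own illustration; acme.sh does this internally:</p>

```shell
# Map a certificate name to the TXT record name used for DNS-01 validation.
# A wildcard and its base domain share the same challenge name.
challenge_name() {
  local domain="${1#\*.}"      # strip a leading "*." if present
  echo "_acme-challenge.$domain"
}

# To watch the record while a challenge is in flight (needs dig and network access):
#   dig +short TXT "$(challenge_name '*.yourdomain.com')"
```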

<h2 id="the-wildcard-strategy">The Wildcard Strategy</h2>

<p>Rather than getting individual certificates — separate challenges, separate renewal timers, separate deploy jobs — I issued a single <strong>wildcard certificate</strong> for <code class="language-plaintext highlighter-rouge">*.yourdomain.com</code>.</p>

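<p>One cert. One renewal. One deploy script. Covers every first-level subdomain of the domain (a wildcard matches a single label, so not the bare domain and not <code class="language-plaintext highlighter-rouge">a.b.yourdomain.com</code>).</p>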
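
<p>A quick way to sanity-check which hostnames a wildcard will cover, given that it matches exactly one label. This is a sketch; <code class="language-plaintext highlighter-rouge">covered_by_wildcard</code> is a hypothetical helper and it ignores case-insensitivity for brevity:</p>

```shell
# Return 0 if HOST is covered by the wildcard *.BASE, 1 otherwise.
# A wildcard matches exactly one extra label: not the bare domain, not deeper names.
covered_by_wildcard() {
  local host="$1" base="$2" label
  case "$host" in
    *".$base") label="${host%".$base"}" ;;
    *) return 1 ;;
  esac
  case "$label" in
    ""|*.*) return 1 ;;   # empty label, or more than one label
    *) return 0 ;;
  esac
}
```

<p>So <code class="language-plaintext highlighter-rouge">covered_by_wildcard pve1.yourdomain.com yourdomain.com</code> succeeds, while the bare domain and <code class="language-plaintext highlighter-rouge">a.b.yourdomain.com</code> do not.</p>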

<hr />

<h2 id="step-1-install-acmesh">Step 1: Install acme.sh</h2>

<p><a href="https://github.com/acmesh-official/acme.sh">acme.sh</a> is a shell script ACME client with native Cloudflare support. Install on your management machine (wherever you SSH from):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl https://get.acme.sh | sh <span class="nt">-s</span> <span class="nv">email</span><span class="o">=</span>your@email.com
</code></pre></div></div>

<p>This installs to <code class="language-plaintext highlighter-rouge">~/.acme.sh/</code> and adds a daily cron job automatically.</p>

<h2 id="step-2-create-a-cloudflare-api-token">Step 2: Create a Cloudflare API Token</h2>

<p>In Cloudflare: <strong>My Profile → API Tokens → Create Token → Edit zone DNS</strong>.</p>

<p>Scope it tightly:</p>
<ul>
  <li>Permissions: <code class="language-plaintext highlighter-rouge">Zone → DNS → Edit</code></li>
  <li>Zone Resources: <code class="language-plaintext highlighter-rouge">Include → Specific zone → yourdomain.com</code></li>
</ul>

<p>You also need your Zone ID from the Cloudflare dashboard Overview page.</p>

<h2 id="step-3-issue-the-wildcard-cert">Step 3: Issue the Wildcard Cert</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">CF_Token</span><span class="o">=</span><span class="s2">"your-cloudflare-api-token"</span>
<span class="nb">export </span><span class="nv">CF_Zone_ID</span><span class="o">=</span><span class="s2">"your-zone-id"</span>

~/.acme.sh/acme.sh <span class="nt">--issue</span> <span class="se">\</span>
  <span class="nt">--dns</span> dns_cf <span class="se">\</span>
  <span class="nt">-d</span> <span class="s2">"*.yourdomain.com"</span> <span class="se">\</span>
  <span class="nt">--server</span> letsencrypt
</code></pre></div></div>

<p>acme.sh:</p>
<ol>
  <li>Creates <code class="language-plaintext highlighter-rouge">_acme-challenge.yourdomain.com</code> TXT record via Cloudflare API</li>
  <li>Waits for DNS propagation</li>
  <li>Asks Let’s Encrypt to verify it</li>
  <li>Gets your cert</li>
  <li>Deletes the TXT record</li>
</ol>

<p>Takes about 40 seconds. No ports opened, no firewall changes. The cert lands at:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/.acme.sh/*.yourdomain.com_ecc/
├── *.yourdomain.com.key      # private key
├── *.yourdomain.com.cer      # certificate
├── ca.cer                    # intermediate CA
└── fullchain.cer             # cert + chain (use this)
</code></pre></div></div>

<h2 id="step-4-deploy-to-all-servers">Step 4: Deploy to All Servers</h2>

<p>Proxmox VE and PBS both support dropping a cert into a specific path and restarting the proxy service.</p>

<p><strong>PVE nodes (port 8006):</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scp fullchain.cer root@pve1:/etc/pve/local/pveproxy-ssl.pem
scp <span class="k">*</span>.yourdomain.com.key root@pve1:/etc/pve/local/pveproxy-ssl.key
ssh root@pve1 <span class="s2">"systemctl restart pveproxy"</span>
</code></pre></div></div>

<p><strong>PBS nodes (port 8007):</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scp fullchain.cer root@pbs1:/etc/proxmox-backup/proxy.pem
scp <span class="k">*</span>.yourdomain.com.key root@pbs1:/etc/proxmox-backup/proxy.key
ssh root@pbs1 <span class="s2">"systemctl restart proxmox-backup-proxy"</span>
</code></pre></div></div>

<blockquote>
  <p><strong>PVE gotcha:</strong> <code class="language-plaintext highlighter-rouge">/etc/pve/</code> is a FUSE filesystem called pmxcfs. If you try to <code class="language-plaintext highlighter-rouge">chmod</code> the cert files you’ll get “Operation not permitted”. This is normal and harmless — ignore it.</p>
</blockquote>

<p>I scripted this to loop over all servers. Total deploy time: ~30 seconds.</p>
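
<p>A minimal sketch of such a loop is below. The host lists beyond <code class="language-plaintext highlighter-rouge">pve1</code> and <code class="language-plaintext highlighter-rouge">pbs1</code>, the <code class="language-plaintext highlighter-rouge">run</code> wrapper, and the <code class="language-plaintext highlighter-rouge">DRY_RUN</code> switch are assumptions for illustration rather than my exact script:</p>

```shell
# Sketch of the deploy loop. Host names are placeholders; DRY_RUN=1 only prints
# the commands so the loop can be checked before it touches real servers.
PVE_NODES="${PVE_NODES:-pve1 pve2 pve3}"
PBS_NODES="${PBS_NODES:-pbs1 pbs2 pbs3 pbs4}"
CERT="${CERT:-fullchain.cer}"
KEY="${KEY:-*.yourdomain.com.key}"   # literal file name; acme.sh keeps the "*"

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

deploy_all() {
  local host
  for host in $PVE_NODES; do
    run scp "$CERT" "root@$host:/etc/pve/local/pveproxy-ssl.pem"
    run scp "$KEY"  "root@$host:/etc/pve/local/pveproxy-ssl.key"
    run ssh "root@$host" "systemctl restart pveproxy"
  done
  for host in $PBS_NODES; do
    run scp "$CERT" "root@$host:/etc/proxmox-backup/proxy.pem"
    run scp "$KEY"  "root@$host:/etc/proxmox-backup/proxy.key"
    run ssh "root@$host" "systemctl restart proxmox-backup-proxy"
  done
}
```

<p>Running <code class="language-plaintext highlighter-rouge">DRY_RUN=1 deploy_all</code> from the cert directory prints every <code class="language-plaintext highlighter-rouge">scp</code> and <code class="language-plaintext highlighter-rouge">ssh</code> call without executing any of them.</p>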

<h2 id="the-gotcha-pbs-fingerprints-in-storagecfg">The Gotcha: PBS Fingerprints in storage.cfg</h2>

<p>Here’s the thing nobody mentions. <strong>After deploying the new certs, my PVE nodes couldn’t connect to my PBS servers anymore.</strong></p>

<p>The reason: every PBS storage definition in PVE’s <code class="language-plaintext highlighter-rouge">storage.cfg</code> contains a <code class="language-plaintext highlighter-rouge">fingerprint</code> line — the SHA256 fingerprint of the PBS server’s certificate. PVE uses this to verify it’s talking to the right server:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pbs: pbs1
    server pbs1
    fingerprint fa:f0:14:a5:74:79:e8:...  ← old self-signed cert fingerprint
    username backup@pbs!pve1-backup
</code></pre></div></div>

<p>When we replaced the PBS cert with the new Let’s Encrypt cert, the fingerprint changed. PVE saw the mismatch and refused the connection.</p>

<p><strong>Fix:</strong> update the fingerprint on every PBS storage entry on every PVE node.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Get the new fingerprint from one PBS server</span>
<span class="nv">NEW_FP</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> | openssl s_client <span class="nt">-connect</span> pbs1:8007 2&gt;/dev/null <span class="se">\</span>
  | openssl x509 <span class="nt">-fingerprint</span> <span class="nt">-sha256</span> <span class="nt">-noout</span> 2&gt;/dev/null <span class="se">\</span>
  | <span class="nb">sed</span> <span class="s1">'s/^[^=]*=//'</span> <span class="se">\</span>
  | <span class="nb">tr</span> <span class="s1">'[:upper:]'</span> <span class="s1">'[:lower:]'</span><span class="si">)</span>

<span class="c"># Update all PBS storages on each PVE node</span>
<span class="k">for </span>storage <span class="k">in</span> <span class="si">$(</span><span class="nb">grep</span> <span class="s1">'^pbs:'</span> /etc/pve/storage.cfg | <span class="nb">awk</span> <span class="s1">'{print $2}'</span><span class="si">)</span><span class="p">;</span> <span class="k">do
  </span>pvesh <span class="nb">set</span> /storage/<span class="nv">$storage</span> <span class="nt">--fingerprint</span> <span class="s2">"</span><span class="nv">$NEW_FP</span><span class="s2">"</span>
<span class="k">done</span>
</code></pre></div></div>

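<p>Run this on each PVE node. Because every PBS node now serves the same wildcard cert, the one fingerprint read from pbs1 is valid for all of them. PBS connections were restored immediately.</p>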
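
<p>If you would rather only touch the config when something actually changed, a sketch like this compares the stored fingerprint with the live one first. Both helper names are hypothetical. As far as I’m aware, the label that <code class="language-plaintext highlighter-rouge">openssl</code> prints before the <code class="language-plaintext highlighter-rouge">=</code> is capitalized differently across OpenSSL versions, so the normalizer strips everything up to the <code class="language-plaintext highlighter-rouge">=</code> instead of matching the label:</p>

```shell
# Normalize `openssl x509 -fingerprint` output to lowercase colon-separated hex.
# Strips everything up to "=", so the label's capitalization does not matter.
normalize_fp() {
  sed 's/^[^=]*=//' | tr '[:upper:]' '[:lower:]'
}

# Print the fingerprint currently configured for one PBS storage entry.
# Arguments: storage name, path to a storage.cfg-style file.
configured_fp() {
  awk -v name="$1" '
    $1 == "pbs:"                           { current = $2 }
    $1 == "fingerprint" && current == name { print $2; exit }
    /^[a-z]+:/ && $1 != "pbs:"             { current = "" }
  ' "$2"
}
```

<p>On a PVE node you could compare <code class="language-plaintext highlighter-rouge">configured_fp pbs1 /etc/pve/storage.cfg</code> against the live value piped through <code class="language-plaintext highlighter-rouge">normalize_fp</code>, and only call <code class="language-plaintext highlighter-rouge">pvesh set</code> when they differ.</p>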

<p><strong>This needs to happen every time the cert renews.</strong> So I built it into the auto-renewal script.</p>

<h2 id="the-second-gotcha-pbs-sync-job-remotes">The Second Gotcha: PBS Sync Job Remotes</h2>

<p>After fixing the PVE storage fingerprints, I thought I was done. Then my backup PBS node started failing its sync jobs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WARNING: certificate fingerprint does not match expected fingerprint!
expected: fa:f0:14:a5:74:79:e8:...
certificate validation failed - Certificate fingerprint was not confirmed.
</code></pre></div></div>

<p>That node pulls backups from the other PBS nodes using PBS sync jobs. Those sync jobs connect via <strong>remote definitions</strong> — and remote definitions also store the cert fingerprint. These are completely separate from PVE’s <code class="language-plaintext highlighter-rouge">storage.cfg</code>.</p>

<p>Fix: update the remote definitions on the syncing PBS node:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for </span>remote <span class="k">in </span>pbs1 pbs2 pbs3<span class="p">;</span> <span class="k">do
  </span>proxmox-backup-manager remote update <span class="nv">$remote</span> <span class="nt">--fingerprint</span> <span class="s2">"</span><span class="nv">$NEW_FP</span><span class="s2">"</span>
<span class="k">done</span>
</code></pre></div></div>

<p>So there are actually <strong>two places</strong> fingerprints need updating after a cert change:</p>
<ol>
  <li><strong>PVE <code class="language-plaintext highlighter-rouge">storage.cfg</code></strong> — for PVE → PBS backup jobs (via <code class="language-plaintext highlighter-rouge">pvesh set /storage/...</code>)</li>
  <li><strong>PBS remote definitions</strong> — for PBS → PBS sync jobs (via <code class="language-plaintext highlighter-rouge">proxmox-backup-manager remote update</code>)</li>
</ol>

<p>Both are now handled by the deploy script.</p>

<h2 id="step-5-the-auto-renewal-script">Step 5: The Auto-Renewal Script</h2>

<p>The script does more than just copy files. Here’s what a production-ready version needs to handle:</p>

<ol>
  <li>Deploy cert to all PVE nodes → restart pveproxy</li>
  <li>Deploy cert to all PBS nodes → restart proxmox-backup-proxy</li>
  <li>Wait for PBS to come back up (poll, don’t just sleep)</li>
  <li>Read the new fingerprint from PBS</li>
  <li>Update PBS storage fingerprints on all PVE nodes (<code class="language-plaintext highlighter-rouge">pvesh set /storage/...</code>)</li>
  <li>Update PBS sync remote fingerprints (<code class="language-plaintext highlighter-rouge">proxmox-backup-manager remote update ...</code>)</li>
  <li>Email on start, success, and failure</li>
</ol>

<p>A few gotchas to avoid:</p>
<ul>
  <li>Use <code class="language-plaintext highlighter-rouge">$HOME</code> not <code class="language-plaintext highlighter-rouge">~</code>: the shell does not expand a tilde inside quotes, so a quoted path in a cron script silently stays literal</li>
  <li>Use <code class="language-plaintext highlighter-rouge">BatchMode=yes</code> in SSH options — interactive prompts will hang cron indefinitely</li>
  <li>Use a heredoc for the email body, not inline quoting — log content containing apostrophes breaks the command</li>
  <li>Use <code class="language-plaintext highlighter-rouge">set -euo pipefail</code> — fail fast on unexpected errors</li>
  <li>Validate cert files exist before doing anything</li>
</ul>
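
<p>Two of those gotchas are easy to show in a few lines. This is a sketch with a hypothetical <code class="language-plaintext highlighter-rouge">build_message</code> helper, reusing the placeholder addresses from this post:</p>

```shell
# Tilde is only expanded when it is unquoted at the start of a word.
# Inside quotes it stays a literal "~", which is why $HOME is safer in scripts.
quoted="~/.acme.sh"           # stays exactly: ~/.acme.sh
safe="$HOME/.acme.sh"         # expands to the real home directory

# Build a mail message with a heredoc so apostrophes, quotes and newlines
# in the log text never have to survive another layer of shell quoting.
build_message() {
  local subject="$1" body="$2"
  cat <<EOF
Subject: $subject
From: proxmox-alerts@yourdomain.com
To: you@email.com

$body
EOF
}
```

<p>The expanded <code class="language-plaintext highlighter-rouge">$body</code> is inserted as-is and is not re-parsed by the shell, so a log line such as <code class="language-plaintext highlighter-rouge">can't connect</code> passes through untouched.</p>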

<p>The key structural pattern:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="nb">set</span> <span class="nt">-euo</span> pipefail

<span class="nv">ACME_DIR</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/.acme.sh"</span>
<span class="nv">CERT_DIR</span><span class="o">=</span><span class="s2">"</span><span class="nv">$ACME_DIR</span><span class="s2">/*.yourdomain.com_ecc"</span>
<span class="nv">CERT</span><span class="o">=</span><span class="s2">"</span><span class="nv">$CERT_DIR</span><span class="s2">/fullchain.cer"</span>
<span class="nv">KEY</span><span class="o">=</span><span class="s2">"</span><span class="nv">$CERT_DIR</span><span class="s2">/</span><span class="si">$(</span><span class="nb">ls</span> <span class="s2">"</span><span class="nv">$CERT_DIR</span><span class="s2">"</span> | <span class="nb">grep</span> <span class="s1">'\.key$'</span> | <span class="nb">grep</span> <span class="nt">-v</span> fullchain | <span class="nb">head</span> <span class="nt">-1</span><span class="si">)</span><span class="s2">"</span>
<span class="nv">LOG_FILE</span><span class="o">=</span><span class="s2">"</span><span class="nv">$ACME_DIR</span><span class="s2">/deploy-proxmox.log"</span>

<span class="nv">SSH_OPTS</span><span class="o">=</span><span class="s2">"-o ConnectTimeout=15 -o StrictHostKeyChecking=no -o BatchMode=yes"</span>

<span class="c"># Validate cert files exist</span>
<span class="k">if</span> <span class="o">[</span> <span class="o">!</span> <span class="nt">-f</span> <span class="s2">"</span><span class="nv">$CERT</span><span class="s2">"</span> <span class="o">]</span> <span class="o">||</span> <span class="o">[</span> <span class="o">!</span> <span class="nt">-f</span> <span class="s2">"</span><span class="nv">$KEY</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">echo</span> <span class="s2">"ERROR: Cert files not found"</span> | <span class="nb">tee</span> <span class="nt">-a</span> <span class="s2">"</span><span class="nv">$LOG_FILE</span><span class="s2">"</span><span class="p">;</span> <span class="nb">exit </span>1
<span class="k">fi</span>

<span class="c"># Deploy to all servers...</span>

<span class="c"># Poll PBS until it responds instead of blind sleep</span>
wait_for_port<span class="o">()</span> <span class="o">{</span>
  <span class="nb">local </span><span class="nv">host</span><span class="o">=</span><span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> <span class="nv">port</span><span class="o">=</span><span class="s2">"</span><span class="nv">$2</span><span class="s2">"</span> <span class="nb">timeout</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">3</span><span class="k">:-</span><span class="nv">30</span><span class="k">}</span><span class="s2">"</span> <span class="nv">elapsed</span><span class="o">=</span>0
  <span class="k">while</span> <span class="o">!</span> <span class="nb">echo</span> | openssl s_client <span class="nt">-connect</span> <span class="s2">"</span><span class="nv">$host</span><span class="s2">:</span><span class="nv">$port</span><span class="s2">"</span> 2&gt;/dev/null | <span class="nb">grep</span> <span class="nt">-q</span> <span class="s1">'BEGIN CERTIFICATE'</span><span class="p">;</span> <span class="k">do
    </span><span class="nb">sleep </span>2<span class="p">;</span> <span class="nv">elapsed</span><span class="o">=</span><span class="k">$((</span>elapsed <span class="o">+</span> <span class="m">2</span><span class="k">))</span>
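    <span class="k">if</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$elapsed</span><span class="s2">"</span> <span class="nt">-ge</span> <span class="s2">"</span><span class="nv">$timeout</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then return </span>1<span class="p">;</span> <span class="k">fi</span>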
  <span class="k">done</span>
<span class="o">}</span>

<span class="c"># Read new fingerprint and update both locations</span>
<span class="nv">NEW_FP</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> | openssl s_client <span class="nt">-connect</span> pbs1:8007 2&gt;/dev/null <span class="se">\</span>
  | openssl x509 <span class="nt">-fingerprint</span> <span class="nt">-sha256</span> <span class="nt">-noout</span> 2&gt;/dev/null <span class="se">\</span>
  | <span class="nb">sed</span> <span class="s1">'s/^[^=]*=//'</span> | <span class="nb">tr</span> <span class="s1">'[:upper:]'</span> <span class="s1">'[:lower:]'</span><span class="si">)</span>

<span class="c"># 1. PVE storage.cfg fingerprints</span>
<span class="k">for </span>host <span class="k">in</span> <span class="nv">$PVE_NODES</span><span class="p">;</span> <span class="k">do
  </span>ssh <span class="nv">$SSH_OPTS</span> root@<span class="nv">$host</span> <span class="s2">"
    for s in </span><span class="se">\$</span><span class="s2">(grep '^pbs:' /etc/pve/storage.cfg | awk '{print </span><span class="se">\$</span><span class="s2">2}'); do
      pvesh set /storage/</span><span class="se">\$</span><span class="s2">s --fingerprint '</span><span class="nv">$NEW_FP</span><span class="s2">' 2&gt;/dev/null
    done"</span>
<span class="k">done</span>

<span class="c"># 2. PBS sync remote fingerprints</span>
ssh <span class="nv">$SSH_OPTS</span> root@<span class="nv">$PBS_SYNC_NODE</span> <span class="s2">"
  for remote in pbs1 pbs2 pbs3; do
    proxmox-backup-manager remote update </span><span class="se">\$</span><span class="s2">remote --fingerprint '</span><span class="nv">$NEW_FP</span><span class="s2">' 2&gt;/dev/null
  done"</span>
</code></pre></div></div>
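
<p>The polling idea generalizes beyond one port check. Here is a sketch of a generic helper (the name <code class="language-plaintext highlighter-rouge">wait_until</code> is my own). The explicit <code class="language-plaintext highlighter-rouge">return 0</code> matters under <code class="language-plaintext highlighter-rouge">set -e</code>: a loop’s exit status is that of the last command run in its body, so a polling function can otherwise report failure even though the service came up:</p>

```shell
# Poll a command until it succeeds or a timeout (in seconds) is reached.
# Usage: wait_until TIMEOUT INTERVAL command [args...]
wait_until() {
  local timeout="$1" interval="$2" elapsed=0
  shift 2
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1                 # gave up: the command never succeeded in time
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0                     # explicit success, independent of the loop body
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">wait_until 30 2 some_check pbs1 8007</code> retries a check function of your choosing every two seconds for up to thirty seconds.</p>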

<p>Register it with acme.sh:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/.acme.sh/acme.sh <span class="nt">--install-cert</span> <span class="nt">-d</span> <span class="s2">"*.yourdomain.com"</span> <span class="se">\</span>
  <span class="nt">--reloadcmd</span> <span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/.acme.sh/deploy-proxmox.sh"</span>
</code></pre></div></div>

<p>Now every time acme.sh renews the cert (automatically, ~60 days in), this script runs and handles the entire chain.</p>

<h2 id="step-6-email-notifications">Step 6: Email Notifications</h2>

<p>I wanted to know when this ran — success or failure. My management machine doesn’t have sendmail configured, but my PVE nodes do (via msmtp + Brevo). I added a <code class="language-plaintext highlighter-rouge">send_email()</code> function to the deploy script that SSHes into pve1 to relay the email:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>send_email<span class="o">()</span> <span class="o">{</span>
  <span class="nb">local </span><span class="nv">subject</span><span class="o">=</span><span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span>
  <span class="nb">local </span><span class="nv">body</span><span class="o">=</span><span class="s2">"</span><span class="nv">$2</span><span class="s2">"</span>
  <span class="c"># Build the message locally and pipe it over SSH, so quotes in the subject or log cannot break the remote command.</span>
  <span class="nb">printf</span> <span class="s1">'Subject: %s\nFrom: proxmox-alerts@yourdomain.com\nTo: you@email.com\n\n%s\n'</span> <span class="s2">"</span><span class="nv">$subject</span><span class="s2">"</span> <span class="s2">"</span><span class="nv">$body</span><span class="s2">"</span> <span class="se">\</span>
    <span class="p">|</span> ssh root@pve1 <span class="s2">"/usr/sbin/sendmail -f proxmox-alerts@yourdomain.com you@email.com"</span>
<span class="o">}</span>

<span class="c"># At start:</span>
send_email <span class="s2">"🔄 [proxmox] Cert renewal started"</span> <span class="s2">"Deploy started at </span><span class="si">$(</span><span class="nb">date</span><span class="si">)</span><span class="s2">"</span>

<span class="c"># At end (success):</span>
send_email <span class="s2">"✅ [proxmox] Cert renewal succeeded"</span> <span class="s2">"</span><span class="nv">$LOG</span><span class="s2">"</span><span class="s1">$'</span><span class="se">\n</span><span class="s1">'</span><span class="s2">"New fingerprint: </span><span class="nv">$NEW_FP</span><span class="s2">"</span>

<span class="c"># On failure:</span>
send_email <span class="s2">"❌ [proxmox] Cert renewal FAILED"</span> <span class="s2">"</span><span class="nv">$LOG</span><span class="s2">"</span>
</code></pre></div></div>
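
<p>The emoji in those subject lines are non-ASCII header text, which is meant to be RFC 2047-encoded before it goes on the wire. Here is a minimal sketch of the same relay idea in Python, where the standard <code class="language-plaintext highlighter-rouge">email</code> library does that encoding. The <code class="language-plaintext highlighter-rouge">build_message()</code> and <code class="language-plaintext highlighter-rouge">relay_via_pve()</code> helpers are hypothetical names for illustration, and the addresses and <code class="language-plaintext highlighter-rouge">pve1</code> host are the same placeholders as above:</p>

```python
import subprocess
from email.message import EmailMessage

SENDER = "proxmox-alerts@yourdomain.com"  # placeholder, same as the shell version
RECIPIENT = "you@email.com"               # placeholder


def build_message(subject: str, body: str) -> bytes:
    """Return a complete message; non-ASCII header text is RFC 2047-encoded."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = SENDER
    msg["To"] = RECIPIENT
    msg.set_content(body)
    return msg.as_bytes()


def relay_via_pve(raw: bytes, host: str = "root@pve1") -> None:
    """Hypothetical helper: pipe the message over SSH into sendmail on the PVE node."""
    subprocess.run(
        ["ssh", host, f"/usr/sbin/sendmail -f {SENDER} {RECIPIENT}"],
        input=raw,
        check=True,  # raise if ssh or sendmail exits non-zero
    )

# Usage (needs SSH access to the relay host, so it is left commented out):
# relay_via_pve(build_message("✅ [proxmox] Cert renewal succeeded", "Deploy finished"))
```

<p>Because the message travels over stdin rather than inside the remote command string, apostrophes and newlines in the log need no escaping.</p>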

<blockquote>
  <p>If you haven’t set up SMTP on your PVE nodes yet, I covered that in <a href="/posts/why-i-switched-from-gmail-to-brevo-for-homelab-email-alerts/">Why I Switched From Gmail to Brevo for All My Homelab Email Alerts</a>.</p>
</blockquote>

<h2 id="step-7-dns-records">Step 7: DNS Records</h2>

<p>The wildcard cert covers <code class="language-plaintext highlighter-rouge">*.yourdomain.com</code>, but for your browser to reach <code class="language-plaintext highlighter-rouge">pve2.yourdomain.com</code> it needs a DNS A record.</p>

<p>Create records in Cloudflare pointing to your Tailscale IPs, with proxying <strong>disabled</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pve1.yourdomain.com  A  100.x.x.x  (proxied: off, TTL: 3600)
pve2.yourdomain.com  A  100.x.x.x
pbs1.yourdomain.com  A  100.x.x.x
...
</code></pre></div></div>

<p>Tailscale IPs are only routable within your Tailnet. The records are technically public, but anyone outside your network who looks them up gets an IP they can’t reach. It’s security through inaccessibility.</p>
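
<p>Tailscale hands out node addresses from the CGNAT block <code class="language-plaintext highlighter-rouge">100.64.0.0/10</code>, so it is easy to check that you are not about to publish a LAN or public address by mistake. A small sketch, assuming you keep the planned records in a dictionary; the two sample addresses are made up:</p>

```python
import ipaddress

# Tailscale assigns IPv4 addresses from the CGNAT range.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(address: str) -> bool:
    """True if the address sits inside 100.64.0.0/10."""
    return ipaddress.ip_address(address) in TAILSCALE_RANGE


# Hypothetical records; substitute the real values before creating them.
records = {
    "pve1.yourdomain.com": "100.101.102.103",
    "pve2.yourdomain.com": "192.168.1.20",
}

for name, ip in records.items():
    status = "ok" if is_tailscale_ip(ip) else "NOT a Tailscale address"
    print(f"{name} -> {ip}: {status}")
```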

<p>If you’d rather have zero public DNS footprint, add entries to your local <code class="language-plaintext highlighter-rouge">/etc/hosts</code> instead:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>100.x.x.x  pve2.yourdomain.com
</code></pre></div></div>

<hr />

<h2 id="the-full-automated-flow">The Full Automated Flow</h2>

<p>Every ~60 days, without any manual intervention:</p>

<ol>
  <li><strong>acme.sh cron fires</strong> (daily at a random time, checks if renewal needed)</li>
  <li><strong>DNS-01 challenge</strong> runs — temporary TXT record created and deleted via Cloudflare API</li>
  <li><strong>New cert issued</strong> by Let’s Encrypt</li>
  <li><strong>deploy-proxmox.sh runs:</strong>
    <ul>
      <li>🔄 Start email sent</li>
      <li>New cert deployed to all servers via SCP</li>
      <li>All proxies restarted</li>
      <li>New fingerprint read from PBS</li>
      <li>All PBS storage fingerprints updated on PVE nodes</li>
      <li>PBS sync remote fingerprints updated</li>
      <li>✅ Success email sent with full log + fingerprint + expiry date</li>
      <li>❌ Failure email sent if anything went wrong</li>
    </ul>
  </li>
</ol>

<p>Zero manual steps required. You get notified either way.</p>
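
<p>If you want an independent check that the flow really is running, the remaining validity of the served cert is a good signal. With renewal at ~60 days and assuming a 90-day certificate lifetime, it should normally stay above roughly 30 days. The sketch below is an illustration rather than part of my deploy script: the function names are hypothetical, and port 8006 is the PVE web UI.</p>

```python
import socket
import ssl
import time
from typing import Optional


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days until a cert's notAfter timestamp, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host: str, port: int = 8006) -> str:
    """Connect with normal verification and return the served cert's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

# Usage (needs network access to the node, so it is left commented out):
# left = days_remaining(fetch_not_after("pve1.yourdomain.com"))
# print(f"{left:.0f} days left" + ("" if left > 25 else " - renewal may be stuck"))
```

<p>Since <code class="language-plaintext highlighter-rouge">fetch_not_after()</code> uses the default verifying context, a failed handshake is itself a useful alarm: it means the node is not serving a trusted cert for that hostname.</p>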

<hr />

<h2 id="summary">Summary</h2>

<table>
  <thead>
    <tr>
      <th>What</th>
      <th>How</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cert type</td>
      <td>Let’s Encrypt wildcard <code class="language-plaintext highlighter-rouge">*.yourdomain.com</code></td>
    </tr>
    <tr>
      <td>ACME challenge</td>
      <td>DNS-01 via Cloudflare API</td>
    </tr>
    <tr>
      <td>Client</td>
      <td>acme.sh</td>
    </tr>
    <tr>
      <td>Deployment</td>
      <td>scp + systemctl restart</td>
    </tr>
    <tr>
      <td>Fingerprint update</td>
      <td>pvesh set /storage on all PVE nodes</td>
    </tr>
    <tr>
      <td>Email alerts</td>
      <td>msmtp relay via PVE node</td>
    </tr>
    <tr>
      <td>Auto-renewal</td>
      <td>acme.sh cron + custom deploy hook</td>
    </tr>
    <tr>
      <td>Time to set up</td>
      <td>~15 minutes</td>
    </tr>
    <tr>
      <td>Ongoing maintenance</td>
      <td>None</td>
    </tr>
  </tbody>
</table>

<p>The PBS fingerprint step is the non-obvious part that will break your backups if you miss it. Build it into your deploy script from the start and you’ll never have to think about it again.</p>]]></content><author><name>Joshua Mein</name></author><category term="Homelab" /><category term="Security" /><category term="proxmox" /><category term="linux" /><category term="ssl" /><category term="tls" /><category term="letsencrypt" /><category term="acme" /><category term="cloudflare" /><category term="dns" /><category term="tailscale" /><category term="automation" /><category term="certificates" /><category term="proxmox-backup-server" /><summary type="html"><![CDATA[A complete guide to issuing a Let's Encrypt wildcard cert via DNS-01, deploying it to all Proxmox VE and PBS nodes, handling the PBS fingerprint gotcha that will break your backups, and wiring up email alerts on every renewal.]]></summary></entry><entry><title type="html">Running ComfyUI on an AMD RX 7900 XTX — Native ROCm 7.1 on Windows</title><link href="https://joshwaamein.github.io/posts/comfyui-amd-rx7900xtx-rocm-windows/" rel="alternate" type="text/html" title="Running ComfyUI on an AMD RX 7900 XTX — Native ROCm 7.1 on Windows" /><published>2026-04-05T00:00:00+01:00</published><updated>2026-04-05T00:00:00+01:00</updated><id>https://joshwaamein.github.io/posts/comfyui-amd-rx7900xtx-rocm-windows</id><content type="html" xml:base="https://joshwaamein.github.io/posts/comfyui-amd-rx7900xtx-rocm-windows/"><![CDATA[<p><em>AMD ROCm 7.1 now runs natively on Windows. Here’s how I used it to get ComfyUI running on a gaming PC with an RX 7900 XTX — no Zluda, no translation layer, full GPU acceleration.</em></p>

<hr />

<h2 id="the-problem">The Problem</h2>

<p>My main machine is a Windows gaming PC with an AMD RX 7900 XTX (24GB VRAM). I can’t switch to Linux because of kernel-level anti-cheat: Riot Vanguard, EasyAntiCheat, BattlEye. Vanguard doesn’t run under Wine or Proton at all, and EasyAntiCheat and BattlEye only work there for games whose developers have opted in.</p>

<p>The traditional options for running ComfyUI on AMD hardware on Windows were:</p>

<ul>
  <li><strong>DirectML</strong> — works, but significantly slower than ROCm or CUDA. Not viable for video generation.</li>
  <li><strong>Zluda</strong> — a CUDA translation layer for AMD. Works for some models, but requires specific forks, is fragile, and adds complexity.</li>
  <li><strong>ROCm on Linux</strong> — the gold standard, but requires dual-booting or a separate machine.</li>
</ul>

<p>Then AMD shipped <strong>ROCm 7.1 for Windows</strong> in late 2025. <code class="language-plaintext highlighter-rouge">torch.cuda.is_available()</code> returns <code class="language-plaintext highlighter-rouge">True</code> on the RX 7900 XTX. The full pipeline runs natively on GPU.</p>

<hr />

<h2 id="whats-already-required">What’s Already Required</h2>

<p>Before starting, you need:</p>

<ul>
  <li><strong>AMD HIP SDK 7.1</strong> installed — available from <a href="https://www.amd.com/en/developer/rocm-hub/hip-sdk.html">AMD’s developer site</a>. The installer sets <code class="language-plaintext highlighter-rouge">HIP_PATH</code> as a system environment variable automatically.</li>
  <li><strong>AMD Adrenalin driver 25.20.01.17 or newer</strong> — the preview driver that enables ROCm on Windows. Check AMD’s release notes for the latest.</li>
  <li><strong>Python 3.12</strong> — the ROCm PyTorch wheels are built for cp312 specifically.</li>
  <li><strong>Git</strong> — for cloning ComfyUI and custom nodes.</li>
</ul>

<p>You can verify your HIP SDK is installed:</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">echo</span><span class="w"> </span><span class="nv">$</span><span class="nn">env</span><span class="p">:</span><span class="nv">HIP_PATH</span><span class="w">
</span><span class="c"># Should output: C:\Program Files\AMD\ROCm\7.1\</span><span class="w">
</span></code></pre></div></div>

<hr />

<h2 id="installing-uv">Installing uv</h2>

<p>I use <a href="https://github.com/astral-sh/uv">uv</a> as the package manager — it’s significantly faster than pip for large installs like the ROCm SDK wheels (which are several GB).</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">powershell</span><span class="w"> </span><span class="nt">-ExecutionPolicy</span><span class="w"> </span><span class="nx">ByPass</span><span class="w"> </span><span class="nt">-c</span><span class="w"> </span><span class="s2">"irm https://astral.sh/uv/install.ps1 | iex"</span><span class="w">
</span></code></pre></div></div>

<p>uv installs to <code class="language-plaintext highlighter-rouge">C:\Users\&lt;you&gt;\.local\bin\</code>. Terminal sessions that were already open won’t have it on PATH until they are restarted, so I reference it by full path throughout this guide.</p>

<hr />

<h2 id="cloning-comfyui">Cloning ComfyUI</h2>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">git</span><span class="w"> </span><span class="nx">clone</span><span class="w"> </span><span class="nx">https://github.com/comfyanonymous/ComfyUI.git</span><span class="w"> </span><span class="nx">O:\ComfyUI</span><span class="w">
</span></code></pre></div></div>

<p>I’m installing to <code class="language-plaintext highlighter-rouge">O:\ComfyUI</code> — a dedicated SSD with plenty of space. Models alone can be 10–50GB+, so pick a drive accordingly.</p>

<hr />

<h2 id="creating-the-python-environment">Creating the Python Environment</h2>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">C:\Users\joshu\.local\bin\uv.exe</span><span class="w"> </span><span class="nx">venv</span><span class="w"> </span><span class="nx">O:\ComfyUI\.venv</span><span class="w"> </span><span class="nt">--python</span><span class="w"> </span><span class="nx">3.12</span><span class="w">
</span></code></pre></div></div>

<p>Note: <code class="language-plaintext highlighter-rouge">uv venv</code> needs an absolute path to the target directory, not a relative one, when running from a different drive.</p>

<hr />

<h2 id="installing-rocm-sdk-wheels">Installing ROCm SDK Wheels</h2>

<p>AMD publishes ROCm Python wheels at <code class="language-plaintext highlighter-rouge">repo.radeon.com</code>. Install the SDK first:</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">C:\Users\joshu\.local\bin\uv.exe</span><span class="w"> </span><span class="nx">pip</span><span class="w"> </span><span class="nx">install</span><span class="w"> </span><span class="nt">--no-cache</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nt">--python</span><span class="w"> </span><span class="nx">O:\ComfyUI\.venv\Scripts\python.exe</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_core-0.1.dev0-py3-none-win_amd64.whl</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_devel-0.1.dev0-py3-none-win_amd64.whl</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_libraries_custom-0.1.dev0-py3-none-win_amd64.whl</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm-0.1.dev0.tar.gz</span><span class="w">
</span></code></pre></div></div>

<p>This downloads ~3.3GB. The <code class="language-plaintext highlighter-rouge">--no-cache</code> flag is important here — uv’s cache is on C: by default, and these wheels are large enough that you don’t want them cached if C: is tight.</p>

<hr />

<h2 id="installing-rocm-pytorch">Installing ROCm PyTorch</h2>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">C:\Users\joshu\.local\bin\uv.exe</span><span class="w"> </span><span class="nx">pip</span><span class="w"> </span><span class="nx">install</span><span class="w"> </span><span class="nt">--no-cache</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nt">--python</span><span class="w"> </span><span class="nx">O:\ComfyUI\.venv\Scripts\python.exe</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/torch-2.9.0</span><span class="o">+</span><span class="nx">rocmsdk20251116-cp312-cp312-win_amd64.whl</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/torchaudio-2.9.0</span><span class="o">+</span><span class="nx">rocmsdk20251116-cp312-cp312-win_amd64.whl</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/torchvision-0.24.0</span><span class="o">+</span><span class="nx">rocmsdk20251116-cp312-cp312-win_amd64.whl</span><span class="w">
</span></code></pre></div></div>

<hr />

<h2 id="installing-comfyui-requirements">Installing ComfyUI Requirements</h2>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">C:\Users\joshu\.local\bin\uv.exe</span><span class="w"> </span><span class="nx">pip</span><span class="w"> </span><span class="nx">install</span><span class="w"> </span><span class="nt">--no-cache</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nt">--python</span><span class="w"> </span><span class="nx">O:\ComfyUI\.venv\Scripts\python.exe</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nt">-r</span><span class="w"> </span><span class="nx">O:\ComfyUI\requirements.txt</span><span class="w">
</span></code></pre></div></div>

<hr />

<h2 id="custom-nodes">Custom Nodes</h2>

<p>I installed four custom nodes, ComfyUI-Manager plus three for video generation:</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">git</span><span class="w"> </span><span class="nx">clone</span><span class="w"> </span><span class="nx">https://github.com/ltdrdata/ComfyUI-Manager</span><span class="w">         </span><span class="nx">O:\ComfyUI\custom_nodes\ComfyUI-Manager</span><span class="w">
</span><span class="n">git</span><span class="w"> </span><span class="nx">clone</span><span class="w"> </span><span class="nx">https://github.com/kijai/ComfyUI-WanVideoWrapper</span><span class="w">     </span><span class="nx">O:\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper</span><span class="w">
</span><span class="n">git</span><span class="w"> </span><span class="nx">clone</span><span class="w"> </span><span class="nx">https://github.com/Lightricks/ComfyUI-LTXVideo</span><span class="w">       </span><span class="nx">O:\ComfyUI\custom_nodes\ComfyUI-LTXVideo</span><span class="w">
</span><span class="n">git</span><span class="w"> </span><span class="nx">clone</span><span class="w"> </span><span class="nx">https://github.com/kijai/ComfyUI-FramePackWrapper</span><span class="w">    </span><span class="nx">O:\ComfyUI\custom_nodes\ComfyUI-FramePackWrapper</span><span class="w">
</span></code></pre></div></div>

<blockquote>
  <p><strong>Important:</strong> <code class="language-plaintext highlighter-rouge">lllyasviel/FramePack</code> is a standalone Gradio app, not a ComfyUI custom node. It has no <code class="language-plaintext highlighter-rouge">__init__.py</code> and will fail to load. Use <code class="language-plaintext highlighter-rouge">kijai/ComfyUI-FramePackWrapper</code> instead.</p>
</blockquote>

<p>Install their requirements. The first three can be installed together:</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">C:\Users\joshu\.local\bin\uv.exe</span><span class="w"> </span><span class="nx">pip</span><span class="w"> </span><span class="nx">install</span><span class="w"> </span><span class="nt">--no-cache</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nt">--python</span><span class="w"> </span><span class="nx">O:\ComfyUI\.venv\Scripts\python.exe</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nt">-r</span><span class="w"> </span><span class="nx">O:\ComfyUI\custom_nodes\ComfyUI-Manager\requirements.txt</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nt">-r</span><span class="w"> </span><span class="nx">O:\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\requirements.txt</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nt">-r</span><span class="w"> </span><span class="nx">O:\ComfyUI\custom_nodes\ComfyUI-LTXVideo\requirements.txt</span><span class="w">
</span></code></pre></div></div>

<p>Then FramePackWrapper separately (its requirements are clean and already satisfied):</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">C:\Users\joshu\.local\bin\uv.exe</span><span class="w"> </span><span class="nx">pip</span><span class="w"> </span><span class="nx">install</span><span class="w"> </span><span class="nt">--no-cache</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nt">--python</span><span class="w"> </span><span class="nx">O:\ComfyUI\.venv\Scripts\python.exe</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nt">-r</span><span class="w"> </span><span class="nx">O:\ComfyUI\custom_nodes\ComfyUI-FramePackWrapper\requirements.txt</span><span class="w">
</span></code></pre></div></div>

<blockquote>
  <p><strong>Why separate?</strong> The standalone <code class="language-plaintext highlighter-rouge">lllyasviel/FramePack</code> repo pins <code class="language-plaintext highlighter-rouge">transformers==4.46.2</code>, which conflicts with <code class="language-plaintext highlighter-rouge">ComfyUI-LTXVideo</code> requiring <code class="language-plaintext highlighter-rouge">transformers&gt;=4.50.0</code>. If you accidentally install FramePack’s requirements, uv will refuse to resolve the dependency graph. FramePackWrapper doesn’t have this problem.</p>
</blockquote>

<hr />

<h2 id="launcher-scripts">Launcher Scripts</h2>

<p>The three environment variables below are essential for stable operation on AMD hardware:</p>

<table>
  <thead>
    <tr>
      <th>Variable</th>
      <th>Value</th>
      <th>Effect</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">PYTORCH_NO_HIP_MEMORY_CACHING</code></td>
      <td><code class="language-plaintext highlighter-rouge">1</code></td>
      <td>Saves ~1/3 VRAM, prevents OOM on long video runs</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">HIP_VISIBLE_DEVICES</code></td>
      <td><code class="language-plaintext highlighter-rouge">0</code></td>
      <td>Targets the RX 7900 XTX, ignores Intel iGPU</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">HSA_OVERRIDE_GFX_VERSION</code></td>
      <td><code class="language-plaintext highlighter-rouge">11.0.0</code></td>
      <td>Forces gfx1100 (RDNA3) compatibility</td>
    </tr>
  </tbody>
</table>

<p><code class="language-plaintext highlighter-rouge">PYTORCH_NO_HIP_MEMORY_CACHING=1</code> is the most important one. Without it, PyTorch’s caching allocator holds on to freed GPU memory instead of returning it to the driver, and you’ll hit OOM errors during 81-frame video generation runs.</p>

<p><strong><code class="language-plaintext highlighter-rouge">O:\ComfyUI\launch_comfyui.ps1</code>:</strong></p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># ComfyUI Launcher for AMD Radeon RX 7900 XTX (ROCm 7.1 / Windows)</span><span class="w">
</span><span class="nv">$</span><span class="nn">env</span><span class="p">:</span><span class="nv">PYTORCH_NO_HIP_MEMORY_CACHING</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"1"</span><span class="w">
</span><span class="nv">$</span><span class="nn">env</span><span class="p">:</span><span class="nv">HIP_VISIBLE_DEVICES</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"0"</span><span class="w">
</span><span class="nv">$</span><span class="nn">env</span><span class="p">:</span><span class="nv">HSA_OVERRIDE_GFX_VERSION</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"11.0.0"</span><span class="w">

</span><span class="o">&amp;</span><span class="w"> </span><span class="s2">"</span><span class="bp">$PSScriptRoot</span><span class="s2">\.venv\Scripts\Activate.ps1"</span><span class="w">

</span><span class="n">Write-Host</span><span class="w"> </span><span class="s2">"Starting ComfyUI on http://127.0.0.1:8188 ..."</span><span class="w"> </span><span class="nt">-ForegroundColor</span><span class="w"> </span><span class="nx">Cyan</span><span class="w">
</span><span class="o">&amp;</span><span class="w"> </span><span class="s2">"</span><span class="bp">$PSScriptRoot</span><span class="s2">\.venv\Scripts\python.exe"</span><span class="w"> </span><span class="s2">"</span><span class="bp">$PSScriptRoot</span><span class="s2">\main.py"</span><span class="w"> </span><span class="nt">--listen</span><span class="w"> </span><span class="mf">0.0</span><span class="o">.</span><span class="nf">0</span><span class="o">.</span><span class="nf">0</span><span class="w"> </span><span class="nt">--port</span><span class="w"> </span><span class="mi">8188</span><span class="w">
</span></code></pre></div></div>

<p><strong><code class="language-plaintext highlighter-rouge">O:\ComfyUI\launch_comfyui.bat</code></strong> (double-click launcher):</p>

<div class="language-bat highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@echo <span class="na">off</span>
<span class="kd">powershell</span><span class="err">.exe</span> <span class="na">-ExecutionPolicy </span><span class="kd">Bypass</span> <span class="na">-File </span><span class="s2">"</span><span class="vm">%~dp0</span><span class="s2">launch_comfyui.ps1"</span>
<span class="nb">pause</span>
</code></pre></div></div>

<hr />

<h2 id="validating-the-gpu">Validating the GPU</h2>

<p>Before launching ComfyUI, verify the GPU is detected:</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">O:\ComfyUI\.venv\Scripts\python.exe</span><span class="w"> </span><span class="nt">-c</span><span class="w"> </span><span class="s2">"
import torch
print('Torch version:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
print('Device name:', torch.cuda.get_device_name(0))
"</span><span class="w">
</span></code></pre></div></div>

<p>Expected output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[WARNING] failed to run amdgpu-arch: binary not found.
Torch version: 2.9.0+rocmsdk20251116
CUDA available: True
Device name: AMD Radeon RX 7900 XTX
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">amdgpu-arch</code> warning is harmless — it’s a compile-time tool that isn’t needed at runtime.</p>

<p>Run a quick GPU compute test:</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">O:\ComfyUI\.venv\Scripts\python.exe</span><span class="w"> </span><span class="nt">-c</span><span class="w"> </span><span class="s2">"
import torch
x = torch.randn(1000, 1000).cuda()
y = torch.randn(1000, 1000).cuda()
z = torch.mm(x, y)
print('GPU matmul OK, sum:', z.sum().item())
"</span><span class="w">
</span></code></pre></div></div>
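
<p><code class="language-plaintext highlighter-rouge">torch.cuda.is_available()</code> returning <code class="language-plaintext highlighter-rouge">True</code> does not by itself tell you which backend you got, because ROCm builds of PyTorch reuse the <code class="language-plaintext highlighter-rouge">torch.cuda</code> namespace. On a ROCm build <code class="language-plaintext highlighter-rouge">torch.version.hip</code> is set, and on a CUDA build it is <code class="language-plaintext highlighter-rouge">None</code>. A small sketch of that check; <code class="language-plaintext highlighter-rouge">describe_backend()</code> is a hypothetical helper that takes the imported module as an argument:</p>

```python
def describe_backend(torch_module) -> str:
    """Summarise which GPU backend an imported torch module is using."""
    if not torch_module.cuda.is_available():
        return "no GPU visible to PyTorch"
    device = torch_module.cuda.get_device_name(0)
    # torch.version.hip is a version string on ROCm builds, None otherwise.
    hip = getattr(torch_module.version, "hip", None)
    if hip:
        return f"ROCm/HIP build {hip} on {device}"
    return f"CUDA build {torch_module.version.cuda} on {device}"

# Usage inside the ComfyUI venv:
# import torch
# print(describe_backend(torch))
```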

<hr />

<h2 id="first-launch">First Launch</h2>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">.</span><span class="n">\launch_comfyui.bat</span><span class="w">
</span></code></pre></div></div>

<p>Navigate to <code class="language-plaintext highlighter-rouge">http://127.0.0.1:8188</code>.</p>

<blockquote>
  <p><strong>Note:</strong> Use <code class="language-plaintext highlighter-rouge">127.0.0.1:8188</code>, not <code class="language-plaintext highlighter-rouge">localhost:8188</code>. Chrome sometimes returns a 403 on <code class="language-plaintext highlighter-rouge">localhost</code>, possibly due to HSTS preloading.</p>
</blockquote>

<p>ComfyUI startup output confirms everything is working:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pytorch version: 2.9.0+rocmsdk20251116
Set: torch.backends.cudnn.enabled = False for better AMD performance.
AMD arch: gfx1100
ROCm version: (7, 1)
Total VRAM 24560 MB, total RAM 32482 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7900 XTX : native
</code></pre></div></div>

<p>Key things to check:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">AMD arch: gfx1100</code> — correct RDNA3 architecture</li>
  <li><code class="language-plaintext highlighter-rouge">Device: cuda:0 AMD Radeon RX 7900 XTX : native</code> — running natively, not via a translation layer</li>
  <li><code class="language-plaintext highlighter-rouge">Set vram state to: NORMAL_VRAM</code> — 24GB is enough that ComfyUI isn’t in a reduced-VRAM mode</li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">comfy-aimdo</code> warning on startup is also harmless — it’s an Nvidia-only optimisation that self-reports as unsupported and skips itself.</p>

<hr />

<h2 id="model-placement">Model Placement</h2>

<p>ComfyUI uses separate folders for each model type. The default LTX-Video workflow that loads on first launch needs three models (19.27 GB total) — click <strong>“Download all”</strong> in the Missing Models dialog and ComfyUI places them automatically.</p>

<p>For manual placement:</p>

<table>
  <thead>
    <tr>
      <th>Model type</th>
      <th>Folder</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Diffusion model (main checkpoint)</td>
      <td><code class="language-plaintext highlighter-rouge">O:\ComfyUI\models\diffusion_models\</code></td>
    </tr>
    <tr>
      <td>Text encoders (T5, CLIP, Qwen)</td>
      <td><code class="language-plaintext highlighter-rouge">O:\ComfyUI\models\text_encoders\</code></td>
    </tr>
    <tr>
      <td>VAE</td>
      <td><code class="language-plaintext highlighter-rouge">O:\ComfyUI\models\vae\</code></td>
    </tr>
    <tr>
      <td>CLIP Vision (for image-to-video)</td>
      <td><code class="language-plaintext highlighter-rouge">O:\ComfyUI\models\clip_vision\</code></td>
    </tr>
    <tr>
      <td>LoRAs</td>
      <td><code class="language-plaintext highlighter-rouge">O:\ComfyUI\models\loras\</code></td>
    </tr>
    <tr>
      <td>Upscale models</td>
      <td><code class="language-plaintext highlighter-rouge">O:\ComfyUI\models\upscale_models\</code></td>
    </tr>
  </tbody>
</table>

<h3 id="wan21-i2v-480p">Wan2.1 i2v 480p</h3>

<table>
  <thead>
    <tr>
      <th>File</th>
      <th>Folder</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">wan2.1_i2v_480p_14B_fp8_scaled.safetensors</code></td>
      <td><code class="language-plaintext highlighter-rouge">diffusion_models\</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">umt5-xxl_fp8_e4m3fn.safetensors</code></td>
      <td><code class="language-plaintext highlighter-rouge">text_encoders\</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">wan_2.1_vae.safetensors</code></td>
      <td><code class="language-plaintext highlighter-rouge">vae\</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">clip_vision_h.safetensors</code></td>
      <td><code class="language-plaintext highlighter-rouge">clip_vision\</code></td>
    </tr>
  </tbody>
</table>

<p>Use <strong>ComfyUI-Manager → Model Manager</strong> to download models directly into the correct folders without having to know the paths.</p>
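
<p>If you place files by hand, a quick existence check saves a confusing “model not found” error later. This is a sketch using the file names and folders from the Wan2.1 table above; <code class="language-plaintext highlighter-rouge">missing_models()</code> is a hypothetical helper, not part of ComfyUI:</p>

```python
from pathlib import Path

# File names and folders copied from the Wan2.1 i2v 480p table.
EXPECTED = {
    "diffusion_models": ["wan2.1_i2v_480p_14B_fp8_scaled.safetensors"],
    "text_encoders": ["umt5-xxl_fp8_e4m3fn.safetensors"],
    "vae": ["wan_2.1_vae.safetensors"],
    "clip_vision": ["clip_vision_h.safetensors"],
}


def missing_models(models_root, expected=EXPECTED):
    """Return the relative paths of expected model files that are not on disk."""
    root = Path(models_root)
    return [
        str(Path(folder) / name)
        for folder, names in expected.items()
        for name in names
        if not (root / folder / name).is_file()
    ]

# Usage:
# print(missing_models(r"O:\ComfyUI\models"))
```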

<hr />

<h2 id="performance">Performance</h2>

<p>Benchmarked on RX 7900 XTX, ROCm 7.1, <code class="language-plaintext highlighter-rouge">PYTORCH_NO_HIP_MEMORY_CACHING=1</code>:</p>

<table>
  <thead>
    <tr>
      <th>Workflow</th>
      <th>Resolution</th>
      <th>Frames</th>
      <th>Steps</th>
      <th>Time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Wan2.1 i2v</td>
      <td>480×704</td>
      <td>81</td>
      <td>25</td>
      <td>~40 min</td>
    </tr>
    <tr>
      <td>Wan2.1 t2v</td>
      <td>480×704</td>
      <td>81</td>
      <td>25</td>
      <td>~5–6 min</td>
    </tr>
    <tr>
      <td>LTX-Video t2v</td>
      <td>512×512</td>
      <td>25</td>
      <td>20</td>
      <td>~2–3 min</td>
    </tr>
  </tbody>
</table>

<p>These are slow compared to CUDA on equivalent Nvidia hardware, but they work reliably without OOM errors. The DirectML backend is significantly slower still — ROCm is the right path for AMD on Windows.</p>

<hr />

<h2 id="quality-vs-speed-fp8-vs-bf16">Quality vs Speed: FP8 vs BF16</h2>

<p>The models come in different precision variants. Understanding the trade-offs helps you get the most out of 24GB VRAM:</p>

<table>
  <thead>
    <tr>
      <th>Format</th>
      <th>Memory</th>
      <th>Quality</th>
      <th>Best for</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>BF16</td>
      <td>2 bytes/param</td>
      <td>★★★★</td>
      <td>Final renders, maximum detail</td>
    </tr>
    <tr>
      <td>FP8 (scaled)</td>
      <td>1 byte/param</td>
      <td>★★★☆</td>
      <td>Good balance</td>
    </tr>
    <tr>
      <td>FP8 (e4m3fn)</td>
      <td>1 byte/param</td>
      <td>★★★</td>
      <td>Fast iteration, finding compositions</td>
    </tr>
  </tbody>
</table>

<p><strong>Quality ranking:</strong> <code class="language-plaintext highlighter-rouge">bf16 &gt; fp8_scaled &gt; fp8_e4m3fn</code></p>

<p>With 24GB VRAM you can run BF16 variants of most models. The practical workflow I use:</p>

<ol>
  <li><strong>Draft</strong> — fp8 model, 15–20 steps, find a good seed and composition</li>
  <li><strong>Final render</strong> — BF16 model, same seed, 35–50 steps</li>
</ol>

<p>BF16 keeps FP32’s dynamic range (8-bit exponent), which means fewer NaN/overflow issues, and its 7 mantissa bits (versus 3 in FP8 e4m3) preserve fine detail in hair, skin, and fabric better. FP8 halves the memory needed for the weights, which matters if you want to push to 720p or longer sequences.</p>

<p>If you see banding, posterisation, or loss of micro-detail, switch from <code class="language-plaintext highlighter-rouge">fp8_e4m3fn</code> to <code class="language-plaintext highlighter-rouge">fp8_scaled</code> or BF16.</p>
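<p>To make the memory column concrete, here is a minimal sketch of the weight-only arithmetic. The 5-billion-parameter figure is a made-up example, not one of the models above, and the result ignores activations, the text encoder, and the VAE, so treat it as a lower bound.</p>

```python
# Bytes per parameter for the formats in the table above.
BYTES_PER_PARAM = {"bf16": 2, "fp8_scaled": 1, "fp8_e4m3fn": 1}


def weight_gib(num_params, fmt):
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[fmt] / 2**30


if __name__ == "__main__":
    # Hypothetical parameter count, chosen only for illustration.
    params = 5_000_000_000
    for fmt in BYTES_PER_PARAM:
        print(f"{fmt:12s} {weight_gib(params, fmt):6.2f} GiB")
```

<p>Whatever the real parameter count is, the FP8 figure is always half the BF16 one, and that difference is the headroom you gain for higher resolution or more frames.</p>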

<hr />

<h2 id="known-issues">Known Issues</h2>

<table>
  <thead>
    <tr>
      <th>Issue</th>
      <th>Fix</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">FramePack</code> fails to load — <code class="language-plaintext highlighter-rouge">__init__.py</code> not found</td>
      <td>Use <code class="language-plaintext highlighter-rouge">kijai/ComfyUI-FramePackWrapper</code>, not <code class="language-plaintext highlighter-rouge">lllyasviel/FramePack</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">transformers==4.46.2</code> conflict when installing FramePack requirements</td>
      <td>Install FramePackWrapper separately; don’t use FramePack’s <code class="language-plaintext highlighter-rouge">requirements.txt</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">uv pip install</code> — “No virtual environment found”</td>
      <td>Use <code class="language-plaintext highlighter-rouge">--python O:\ComfyUI\.venv\Scripts\python.exe</code> explicitly</td>
    </tr>
    <tr>
      <td>Browser 403 on <code class="language-plaintext highlighter-rouge">localhost:8188</code></td>
      <td>Use <code class="language-plaintext highlighter-rouge">http://127.0.0.1:8188</code> instead</td>
    </tr>
    <tr>
      <td>OOM during 81-frame video generation</td>
      <td>Ensure <code class="language-plaintext highlighter-rouge">PYTORCH_NO_HIP_MEMORY_CACHING=1</code> is set before launch</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="lessons-learned">Lessons Learned</h2>

<ol>
  <li>
    <p><strong>ROCm on Windows works now.</strong> AMD shipped ROCm 7.1 for Windows in late 2025. <code class="language-plaintext highlighter-rouge">torch.cuda.is_available()</code> returns <code class="language-plaintext highlighter-rouge">True</code> on RDNA3. No Zluda, no translation layer, no Linux required.</p>
  </li>
  <li>
    <p><strong><code class="language-plaintext highlighter-rouge">PYTORCH_NO_HIP_MEMORY_CACHING=1</code> is essential.</strong> Without it, PyTorch’s caching allocator holds on to freed GPU memory and you’ll hit OOM on longer video runs. In my runs this single env var saved roughly a third of VRAM.</p>
  </li>
  <li>
    <p><strong>Use <code class="language-plaintext highlighter-rouge">kijai/ComfyUI-FramePackWrapper</code>, not <code class="language-plaintext highlighter-rouge">lllyasviel/FramePack</code>.</strong> The original FramePack repo is a standalone Gradio app. It has no <code class="language-plaintext highlighter-rouge">__init__.py</code> and will fail to load as a ComfyUI custom node. The kijai wrapper is the correct one.</p>
  </li>
  <li>
    <p><strong>uv needs explicit <code class="language-plaintext highlighter-rouge">--python</code> flags when the venv is on a different drive.</strong> <code class="language-plaintext highlighter-rouge">uv pip install</code> looks for a venv relative to the current working directory. If your venv is on <code class="language-plaintext highlighter-rouge">O:</code> and you’re running from <code class="language-plaintext highlighter-rouge">C:</code>, it won’t find it. Pass <code class="language-plaintext highlighter-rouge">--python O:\ComfyUI\.venv\Scripts\python.exe</code> explicitly.</p>
  </li>
  <li>
    <p><strong>Don’t install FramePack’s standalone <code class="language-plaintext highlighter-rouge">requirements.txt</code>.</strong> It pins <code class="language-plaintext highlighter-rouge">transformers==4.46.2</code>, which conflicts with LTX-Video’s requirement for <code class="language-plaintext highlighter-rouge">&gt;=4.50.0</code>. Install FramePackWrapper’s requirements separately — they’re clean.</p>
  </li>
  <li>
    <p><strong>BF16 for final renders, FP8 for drafts.</strong> With 24GB VRAM you have the headroom to run BF16 models. Use FP8 to find a good seed quickly, then switch to BF16 for the final high-step render.</p>
  </li>
</ol>]]></content><author><name>Joshua Mein</name></author><category term="Code" /><category term="AI" /><category term="python" /><category term="ai" /><category term="amd" /><category term="windows" /><category term="rocm" /><category term="comfyui" /><category term="video-generation" /><category term="stable-diffusion" /><summary type="html"><![CDATA[How I got ComfyUI running natively on an AMD RX 7900 XTX on Windows 11 using ROCm 7.1, without Zluda, with Wan2.1, LTX-Video, and FramePack custom nodes — and the exact steps to replicate it.]]></summary></entry><entry><title type="html">Zero-Shot Voice Cloning on AMD — ROCm 7.1 on Windows, F5-TTS, and the ONNX Fallback</title><link href="https://joshwaamein.github.io/posts/zero-shot-voice-cloning-amd-gpu-windows/" rel="alternate" type="text/html" title="Zero-Shot Voice Cloning on AMD — ROCm 7.1 on Windows, F5-TTS, and the ONNX Fallback" /><published>2026-04-04T22:00:00+01:00</published><updated>2026-04-04T22:00:00+01:00</updated><id>https://joshwaamein.github.io/posts/zero-shot-voice-cloning-amd-gpu-windows</id><content type="html" xml:base="https://joshwaamein.github.io/posts/zero-shot-voice-cloning-amd-gpu-windows/"><![CDATA[<p><em>AMD ROCm 7.1 now runs natively on Windows. Here’s how I used it to build a zero-shot voice cloning pipeline on a gaming machine that can’t switch to Linux.</em></p>

<hr />

<h2 id="the-setup">The Setup</h2>

<p>My main machine is a Windows gaming PC with an AMD RX 7900 XTX. I can’t switch to Linux because I play games with kernel-level anti-cheat — Riot Vanguard, EasyAntiCheat, BattlEye. These systems require Windows and won’t run under Wine, Proton, or any compatibility layer. Dual-booting is theoretically possible but kills any iterative AI workflow.</p>

<p>The goal: <strong>zero-shot voice cloning on GPU, on Windows, with AMD hardware.</strong></p>

<p>Zero-shot means no fine-tuning — you give the model a short reference clip of any speaker, and it synthesises new speech in their voice. The model I chose is <strong>F5-TTS</strong>, a flow-matching TTS model that does this well and is fully open source.</p>

<hr />

<h2 id="the-journey-short-version">The Journey (Short Version)</h2>

<p>Before ROCm on Windows existed, I went through several dead ends:</p>

<ol>
  <li><strong>torch-directml</strong> — DirectML doesn’t support <code class="language-plaintext highlighter-rouge">ComplexFloat</code> (FFT ops). F5-TTS uses STFT for mel spectrograms. Fatal incompatibility.</li>
  <li><strong>VMware PCIe passthrough</strong> — <code class="language-plaintext highlighter-rouge">NOT_IMPLEMENTED</code> on Windows hosts. Linux host required.</li>
  <li><strong>ROCm on Windows</strong> — didn’t exist. PyTorch ROCm wheels were Linux-only.</li>
  <li><strong>ZLUDA</strong> — CUDA compatibility layer for AMD. <code class="language-plaintext highlighter-rouge">torch.stft</code> explicitly broken.</li>
</ol>

<p>The workaround I built was an <strong>ONNX + DirectML hybrid</strong> — export F5-TTS to three ONNX models, run the transformer on DirectML GPU and the FFT-heavy preprocessing/decode on CPU. It worked, but it was a compromise.</p>

<p>Then AMD shipped <strong>ROCm 7.1 for Windows</strong>.</p>

<hr />

<h2 id="rocm-71-on-windows--the-real-solution">ROCm 7.1 on Windows — The Real Solution</h2>

<p>AMD’s HIP SDK for Windows is now available at <code class="language-plaintext highlighter-rouge">repo.radeon.com</code>, and PyTorch 2.9.0 ROCm wheels are included. <code class="language-plaintext highlighter-rouge">torch.cuda.is_available()</code> returns <code class="language-plaintext highlighter-rouge">True</code> on the RX 7900 XTX. The full pipeline — mel spectrogram, transformer, vocoder — runs on GPU.</p>

<h3 id="setting-up-the-rocm-venv">Setting Up the ROCm Venv</h3>

<p>Create a dedicated virtual environment (keep it separate from your main Python env):</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Use Python 3.12</span><span class="w">
</span><span class="n">python</span><span class="w"> </span><span class="nt">-m</span><span class="w"> </span><span class="nx">venv</span><span class="w"> </span><span class="nx">venv_rocm</span><span class="w">
</span></code></pre></div></div>

<p>Install the ROCm SDK and PyTorch from AMD’s repo:</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">.</span><span class="n">\venv_rocm\Scripts\python.exe</span><span class="w"> </span><span class="nt">-m</span><span class="w"> </span><span class="nx">pip</span><span class="w"> </span><span class="nx">install</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm-0.1.dev0.tar.gz</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_core-0.1.dev0-py3-none-win_amd64.whl</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_devel-0.1.dev0-py3-none-win_amd64.whl</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/rocm_sdk_libraries_custom-0.1.dev0-py3-none-win_amd64.whl</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/torch-2.9.0</span><span class="o">+</span><span class="nx">rocmsdk20251116-cp312-cp312-win_amd64.whl</span><span class="w"> </span><span class="se">`
</span><span class="w">  </span><span class="nx">https://repo.radeon.com/rocm/windows/rocm-rel-7.1.1/torchaudio-2.9.0</span><span class="o">+</span><span class="nx">rocmsdk20251116-cp312-cp312-win_amd64.whl</span><span class="w">
</span></code></pre></div></div>

<p>Install f5-tts and dependencies:</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">.</span><span class="n">\venv_rocm\Scripts\python.exe</span><span class="w"> </span><span class="nt">-m</span><span class="w"> </span><span class="nx">pip</span><span class="w"> </span><span class="nx">install</span><span class="w"> </span><span class="nx">f5-tts</span><span class="w"> </span><span class="nx">soundfile</span><span class="w"> </span><span class="nx">pydub</span><span class="w"> </span><span class="nx">pyyaml</span><span class="w"> </span><span class="nx">numpy</span><span class="w">
</span></code></pre></div></div>

<p>Verify the GPU is detected:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">torch</span>
<span class="nf">print</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">__version__</span><span class="p">)</span>          <span class="c1"># 2.9.0+rocmsdk20251116
</span><span class="nf">print</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="nf">is_available</span><span class="p">())</span>  <span class="c1"># True
</span><span class="nf">print</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="nf">get_device_name</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>  <span class="c1"># AMD Radeon RX 7900 XTX
</span></code></pre></div></div>

<h3 id="required-environment-variables">Required Environment Variables</h3>

<p>ROCm on Windows needs three env vars set before running. I put these in a launcher script:</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># scripts/launch_voice_rocm.ps1</span><span class="w">
</span><span class="nv">$</span><span class="nn">env</span><span class="p">:</span><span class="nv">PYTORCH_NO_HIP_MEMORY_CACHING</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"1"</span><span class="w">   </span><span class="c"># saves ~1/3 VRAM, prevents OOM</span><span class="w">
</span><span class="nv">$</span><span class="nn">env</span><span class="p">:</span><span class="nv">HIP_VISIBLE_DEVICES</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"0"</span><span class="w">              </span><span class="c"># target RX 7900 XTX, ignore iGPU</span><span class="w">
</span><span class="nv">$</span><span class="nn">env</span><span class="p">:</span><span class="nv">HSA_OVERRIDE_GFX_VERSION</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"11.0.0"</span><span class="w">   </span><span class="c"># force gfx1100 (RDNA3) compatibility</span><span class="w">
</span></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">PYTORCH_NO_HIP_MEMORY_CACHING=1</code> is particularly important: without it, PyTorch’s caching allocator holds on to freed GPU memory and you’ll hit OOM on longer runs.</p>
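<p>If you would rather launch from Python than PowerShell, the same three variables can be passed to a child process. This is a rough sketch, not the launcher from the repo: <code class="language-plaintext highlighter-rouge">rocm_env</code> and <code class="language-plaintext highlighter-rouge">launch</code> are hypothetical helper names. The point is that the variables must be in place before <code class="language-plaintext highlighter-rouge">torch</code> is imported, which a fresh child process guarantees.</p>

```python
import os
import subprocess
import sys

# The three variables from launch_voice_rocm.ps1.
ROCM_ENV = {
    "PYTORCH_NO_HIP_MEMORY_CACHING": "1",
    "HIP_VISIBLE_DEVICES": "0",
    "HSA_OVERRIDE_GFX_VERSION": "11.0.0",
}


def rocm_env(base=None):
    """Return a copy of the environment with the ROCm variables applied."""
    env = dict(os.environ if base is None else base)
    env.update(ROCM_ENV)
    return env


def launch(script, *args):
    """Run a Python script in a child process that sees the ROCm variables."""
    cmd = [sys.executable, script, *args]
    return subprocess.run(cmd, env=rocm_env(), check=False).returncode
```

<p>Calling <code class="language-plaintext highlighter-rouge">launch("scripts/generate_f5_rocm.py")</code> from the ROCm venv’s interpreter would then mirror the PowerShell launcher. Which arguments that script accepts is not shown here.</p>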

<h3 id="compatibility-patches">Compatibility Patches</h3>

<p>ROCm 7.1 + PyTorch 2.9 + f5-tts 1.1.18 required four patches to work together. None are fundamental issues; they’re version incompatibilities that should be fixed upstream:</p>

<table>
  <thead>
    <tr>
      <th>File</th>
      <th>Issue</th>
      <th>Fix</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">encodec/distrib.py</code></td>
      <td><code class="language-plaintext highlighter-rouge">torch.distributed.ReduceOp</code> moved in PyTorch 2.9</td>
      <td><code class="language-plaintext highlighter-rouge">try/except</code> fallback</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">torchaudio/__init__.py</code></td>
      <td>torchaudio 2.9 requires torchcodec (no Windows DLLs)</td>
      <td>soundfile fallback</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">f5_tts/model/cfm.py</code></td>
      <td>Sway sampling produces duplicate ODE timesteps</td>
      <td><code class="language-plaintext highlighter-rouge">torch.unique()</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">f5_tts/infer/utils_infer.py</code></td>
      <td><code class="language-plaintext highlighter-rouge">ThreadPoolExecutor</code> causes tensor size mismatches</td>
      <td>Sequential loop</td>
    </tr>
  </tbody>
</table>

<p>The torchaudio patch is the most interesting: torchaudio 2.9 replaced its <code class="language-plaintext highlighter-rouge">load()</code> function with a torchcodec-only implementation, but torchcodec’s Windows DLLs don’t ship with the ROCm build. The fix is a short fallback to soundfile:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># torchaudio/__init__.py — patched load()
</span><span class="k">try</span><span class="p">:</span>
    <span class="k">return</span> <span class="nf">load_with_torchcodec</span><span class="p">(</span><span class="n">uri</span><span class="p">,</span> <span class="p">...)</span>
<span class="nf">except </span><span class="p">(</span><span class="nb">ImportError</span><span class="p">,</span> <span class="nb">OSError</span><span class="p">):</span>
    <span class="kn">import</span> <span class="n">soundfile</span> <span class="k">as</span> <span class="n">_sf</span>
    <span class="n">data</span><span class="p">,</span> <span class="n">sample_rate</span> <span class="o">=</span> <span class="n">_sf</span><span class="p">.</span><span class="nf">read</span><span class="p">(</span><span class="nf">str</span><span class="p">(</span><span class="n">uri</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="sh">"</span><span class="s">float32</span><span class="sh">"</span><span class="p">,</span> <span class="n">always_2d</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="nf">from_numpy</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">T</span> <span class="k">if</span> <span class="n">channels_first</span> <span class="k">else</span> <span class="n">data</span><span class="p">),</span> <span class="n">sample_rate</span>
</code></pre></div></div>

<h3 id="running-it">Running It</h3>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Default (NFE=32, fast)</span><span class="w">
</span><span class="o">.</span><span class="n">\scripts\launch_voice_rocm.ps1</span><span class="w">

</span><span class="c"># Higher quality</span><span class="w">
</span><span class="o">.</span><span class="n">\scripts\launch_voice_rocm.ps1</span><span class="w"> </span><span class="nt">--nfe</span><span class="w"> </span><span class="nx">64</span><span class="w">

</span><span class="c"># Best quality</span><span class="w">
</span><span class="o">.</span><span class="n">\scripts\launch_voice_rocm.ps1</span><span class="w"> </span><span class="nt">--nfe</span><span class="w"> </span><span class="nx">128</span><span class="w">
</span></code></pre></div></div>

<hr />

<h2 id="the-architecture-full-gpu-vs-hybrid">The Architecture: Full GPU vs Hybrid</h2>

<h3 id="rocm-native-full-gpu">ROCm Native (Full GPU)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Reference Audio + Text
         │
         ▼
┌─────────────────────┐
│  Mel Spectrogram    │  ← ROCm GPU (STFT — works natively!)
│  Text Tokenisation  │
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│  F5 Transformer     │  ← ROCm GPU (flow-matching, 32-128 steps)
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│  Vocoder (Vocos)    │  ← ROCm GPU (mel → waveform)
└─────────────────────┘
         │
         ▼
      output.wav
</code></pre></div></div>

<p>Everything runs on GPU. No CPU↔GPU transfers between stages.</p>

<h3 id="onnx--directml-hybrid-fallback">ONNX + DirectML (Hybrid Fallback)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Reference Audio + Text
         │
         ▼
┌─────────────────────┐
│  F5_Preprocess.onnx │  ← CPU (ComplexFloat/FFT — DirectML can't do this)
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│ F5_Transformer.onnx │  ← DirectML GPU (pure float ops — works fine)
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│   F5_Decode.onnx    │  ← CPU (ISTFT/vocoder — same FFT issue)
└─────────────────────┘
         │
         ▼
      output.wav
</code></pre></div></div>

<p>The preprocessing and decode stages run on CPU because DirectML doesn’t support <code class="language-plaintext highlighter-rouge">ComplexFloat</code> (FFT). Only the transformer runs on GPU.</p>
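<p>In ONNX Runtime terms, this split comes down to which execution provider each session is created with. The sketch below assumes the <code class="language-plaintext highlighter-rouge">onnxruntime-directml</code> package is installed and that the three files sit in one directory. <code class="language-plaintext highlighter-rouge">STAGE_PROVIDERS</code> and <code class="language-plaintext highlighter-rouge">load_sessions</code> are illustrative names, not the ones used in my scripts.</p>

```python
from pathlib import Path

# Execution providers per stage, following the diagram above.
STAGE_PROVIDERS = {
    "F5_Preprocess.onnx": ["CPUExecutionProvider"],
    "F5_Transformer.onnx": ["DmlExecutionProvider", "CPUExecutionProvider"],
    "F5_Decode.onnx": ["CPUExecutionProvider"],
}


def load_sessions(model_dir):
    """Create one ONNX Runtime session per stage."""
    # Imported lazily so the mapping above is usable without onnxruntime.
    import onnxruntime as ort

    return {
        name: ort.InferenceSession(str(Path(model_dir) / name), providers=providers)
        for name, providers in STAGE_PROVIDERS.items()
    }
```

<p>Listing the CPU provider after DirectML for the transformer gives ONNX Runtime somewhere to place any node the GPU provider cannot handle.</p>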

<hr />

<h2 id="reference-audio-pipeline">Reference Audio Pipeline</h2>

<p>The quality of the output depends heavily on the reference clip. I built an ingest pipeline to automate finding and preparing good clips:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># scripts/ingest.py
# 1. Download from YouTube
</span><span class="n">ydl_opts</span> <span class="o">=</span> <span class="p">{</span>
    <span class="sh">"</span><span class="s">format</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">bestaudio/best</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">postprocessors</span><span class="sh">"</span><span class="p">:</span> <span class="p">[{</span><span class="sh">"</span><span class="s">key</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">FFmpegExtractAudio</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">preferredcodec</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">wav</span><span class="sh">"</span><span class="p">}],</span>
<span class="p">}</span>

<span class="c1"># 2. Trim to the clean section
</span><span class="n">ffmpeg</span><span class="p">.</span><span class="nf">input</span><span class="p">(</span><span class="n">raw_wav</span><span class="p">,</span> <span class="n">ss</span><span class="o">=</span><span class="n">start_time</span><span class="p">,</span> <span class="n">to</span><span class="o">=</span><span class="n">end_time</span><span class="p">)</span> \
      <span class="p">.</span><span class="nf">output</span><span class="p">(</span><span class="n">trimmed_wav</span><span class="p">,</span> <span class="n">ar</span><span class="o">=</span><span class="mi">22050</span><span class="p">,</span> <span class="n">ac</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> \
      <span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="n">overwrite_output</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="c1"># 3. Transcribe with Whisper
</span><span class="n">model</span> <span class="o">=</span> <span class="n">whisper</span><span class="p">.</span><span class="nf">load_model</span><span class="p">(</span><span class="sh">"</span><span class="s">base</span><span class="sh">"</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="nf">transcribe</span><span class="p">(</span><span class="n">trimmed_wav</span><span class="p">)</span>
<span class="n">transcript</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="sh">"</span><span class="s">text</span><span class="sh">"</span><span class="p">].</span><span class="nf">strip</span><span class="p">()</span>
</code></pre></div></div>

<p>What makes a good reference clip:</p>
<ul>
  <li><strong>6–30 seconds</strong> — long enough for voice characteristics, short enough to avoid drift</li>
  <li><strong>Clean audio</strong> — no background music, minimal reverb, no compression artefacts</li>
  <li><strong>Consistent delivery</strong> — don’t use a clip where the speaker is shouting or whispering</li>
</ul>
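<p>The duration rule is the only one of these that is easy to check automatically. A minimal sketch using the standard-library <code class="language-plaintext highlighter-rouge">wave</code> module follows. The function names are hypothetical, and it only handles PCM WAV files such as the trimmed output of the ffmpeg step.</p>

```python
import wave


def clip_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_ok(path, min_s=6.0, max_s=30.0):
    """True if the clip length falls inside the 6 to 30 second window."""
    seconds = clip_seconds(path)
    return seconds >= min_s and max_s >= seconds
```

<p>Cleanliness and consistent delivery still need a listen; this only catches clips that are obviously too short or too long.</p>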

<p>For Neil deGrasse Tyson (my test voice), I used an 11.9-second clip from a YouTube lecture, trimmed to a section with clean, energetic speech and no background noise.</p>

<p>The transcript must match the audio exactly — F5-TTS uses it to align voice conditioning. An accurate transcript noticeably improves output quality.</p>

<hr />

<h2 id="configuration">Configuration</h2>

<p>Everything is driven by <code class="language-plaintext highlighter-rouge">config.yaml</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">voice</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">neil_degrasse_tyson</span>
  <span class="na">audio_path</span><span class="pi">:</span> <span class="s">reference_audio/neil_degrasse_tyson/ndgt_ref_new.wav</span>
  <span class="na">transcript</span><span class="pi">:</span> <span class="s2">"</span><span class="s">So,</span><span class="nv"> </span><span class="s">here</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">the</span><span class="nv"> </span><span class="s">United</span><span class="nv"> </span><span class="s">States,</span><span class="nv"> </span><span class="s">we</span><span class="nv"> </span><span class="s">completely</span><span class="nv"> </span><span class="s">freaked</span><span class="nv"> </span><span class="s">out</span><span class="nv"> </span><span class="s">for</span>
    <span class="s">multiple</span><span class="nv"> </span><span class="s">reasons.</span><span class="nv"> </span><span class="s">First,</span><span class="nv"> </span><span class="s">they</span><span class="nv"> </span><span class="s">beat</span><span class="nv"> </span><span class="s">us</span><span class="nv"> </span><span class="s">at</span><span class="nv"> </span><span class="s">something</span><span class="nv"> </span><span class="s">technological</span><span class="nv"> </span><span class="s">that</span>
    <span class="s">they're</span><span class="nv"> </span><span class="s">not</span><span class="nv"> </span><span class="s">supposed</span><span class="nv"> </span><span class="s">to,</span><span class="nv"> </span><span class="s">because</span><span class="nv"> </span><span class="s">they're</span><span class="nv"> </span><span class="s">like</span><span class="nv"> </span><span class="s">communists."</span>
  <span class="na">language</span><span class="pi">:</span> <span class="s">en</span>

<span class="na">model</span><span class="pi">:</span>
  <span class="na">backend</span><span class="pi">:</span> <span class="s">f5_onnx_dml</span>   <span class="c1"># or f5_rocm via launch_voice_rocm.ps1</span>
  <span class="na">onnx_model_dir</span><span class="pi">:</span> <span class="s">onnx_models/F5-TTS-ONNX-GPU-NFE128-CFG3</span>
  <span class="na">nfe_step</span><span class="pi">:</span> <span class="m">128</span>
  <span class="na">speed</span><span class="pi">:</span> <span class="m">0.75</span>
  <span class="na">device_id</span><span class="pi">:</span> <span class="m">0</span>

<span class="na">output</span><span class="pi">:</span>
  <span class="na">output_dir</span><span class="pi">:</span> <span class="s">outputs/runs</span>
  <span class="na">target_duration</span><span class="pi">:</span> <span class="m">5.0</span>
  <span class="na">silence_thresh_db</span><span class="pi">:</span> <span class="s">-40</span>
  <span class="na">keep_raw</span><span class="pi">:</span> <span class="kc">true</span>

<span class="na">sentences</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">The</span><span class="nv"> </span><span class="s">universe</span><span class="nv"> </span><span class="s">is</span><span class="nv"> </span><span class="s">under</span><span class="nv"> </span><span class="s">no</span><span class="nv"> </span><span class="s">obligation</span><span class="nv"> </span><span class="s">to</span><span class="nv"> </span><span class="s">make</span><span class="nv"> </span><span class="s">sense</span><span class="nv"> </span><span class="s">to</span><span class="nv"> </span><span class="s">you."</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">We</span><span class="nv"> </span><span class="s">are</span><span class="nv"> </span><span class="s">all</span><span class="nv"> </span><span class="s">connected</span><span class="nv"> </span><span class="s">—</span><span class="nv"> </span><span class="s">to</span><span class="nv"> </span><span class="s">each</span><span class="nv"> </span><span class="s">other,</span><span class="nv"> </span><span class="s">biologically;</span><span class="nv"> </span><span class="s">to</span><span class="nv"> </span><span class="s">the</span><span class="nv"> </span><span class="s">earth,</span>
    <span class="s">chemically;</span><span class="nv"> </span><span class="s">to</span><span class="nv"> </span><span class="s">the</span><span class="nv"> </span><span class="s">rest</span><span class="nv"> </span><span class="s">of</span><span class="nv"> </span><span class="s">the</span><span class="nv"> </span><span class="s">universe,</span><span class="nv"> </span><span class="s">atomically."</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">The</span><span class="nv"> </span><span class="s">good</span><span class="nv"> </span><span class="s">thing</span><span class="nv"> </span><span class="s">about</span><span class="nv"> </span><span class="s">science</span><span class="nv"> </span><span class="s">is</span><span class="nv"> </span><span class="s">that</span><span class="nv"> </span><span class="s">it's</span><span class="nv"> </span><span class="s">true</span><span class="nv"> </span><span class="s">whether</span><span class="nv"> </span><span class="s">or</span><span class="nv"> </span><span class="s">not</span><span class="nv"> </span><span class="s">you</span>
    <span class="s">believe</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">it."</span>
</code></pre></div></div>

<p>Machine-specific paths go in <code class="language-plaintext highlighter-rouge">.env</code> (not committed):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">VOICE_GENERATOR_MODEL_DIR</span><span class="o">=</span>C:<span class="se">\U</span>sers<span class="se">\j</span>oshu<span class="se">\.</span>..<span class="se">\o</span>nnx_models<span class="se">\F</span>5-TTS-ONNX-GPU-NFE128-CFG3
<span class="nv">VOICE_GENERATOR_OUTPUT_DIR</span><span class="o">=</span>C:<span class="se">\U</span>sers<span class="se">\j</span>oshu<span class="se">\.</span>..<span class="se">\o</span>utputs<span class="se">\r</span>uns
</code></pre></div></div>
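<p>One way to tie the two files together is a small dataclass for the <code class="language-plaintext highlighter-rouge">model</code> section, with the <code class="language-plaintext highlighter-rouge">.env</code> value taking precedence over the path in <code class="language-plaintext highlighter-rouge">config.yaml</code>. This is a sketch under assumptions rather than the contents of <code class="language-plaintext highlighter-rouge">lib/config.py</code>: the precedence rule, the default values, and the names <code class="language-plaintext highlighter-rouge">ModelConfig</code> and <code class="language-plaintext highlighter-rouge">load_model_config</code> are all illustrative. The <code class="language-plaintext highlighter-rouge">raw</code> argument stands for the already-parsed <code class="language-plaintext highlighter-rouge">model</code> mapping.</p>

```python
import os
from dataclasses import dataclass


@dataclass
class ModelConfig:
    backend: str
    onnx_model_dir: str
    nfe_step: int = 32      # assumed default
    speed: float = 1.0      # assumed default
    device_id: int = 0


def load_model_config(raw, env=None):
    """Build a ModelConfig from the parsed 'model' mapping of config.yaml.

    Assumption: VOICE_GENERATOR_MODEL_DIR, when set, overrides onnx_model_dir.
    """
    env = os.environ if env is None else env
    cfg = ModelConfig(**raw)
    cfg.onnx_model_dir = env.get("VOICE_GENERATOR_MODEL_DIR", cfg.onnx_model_dir)
    if 1 > cfg.nfe_step:
        raise ValueError("nfe_step must be a positive integer")
    return cfg
```

<p>Passing <code class="language-plaintext highlighter-rouge">env</code> explicitly keeps the function easy to unit test without touching the real environment.</p>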

<hr />

<h2 id="performance">Performance</h2>

<p>Benchmarked on AMD RX 7900 XTX, 10-second reference clip, speed=0.75:</p>

<table>
  <thead>
    <tr>
      <th>Backend</th>
      <th>NFE</th>
      <th>Precision</th>
      <th>Time/clip</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ONNX + DirectML</td>
      <td>128</td>
      <td>FP16</td>
      <td>~33s</td>
      <td>Stable, no SDK needed</td>
    </tr>
    <tr>
      <td>ONNX + DirectML</td>
      <td>256</td>
      <td>FP32</td>
      <td>~64s</td>
      <td>Higher quality</td>
    </tr>
    <tr>
      <td><strong>ROCm native</strong></td>
      <td><strong>32</strong></td>
      <td><strong>FP32</strong></td>
      <td><strong>~10s</strong></td>
      <td><strong>~3x faster than ONNX FP16, at a quarter of the NFE</strong></td>
    </tr>
    <tr>
      <td><strong>ROCm native</strong></td>
      <td><strong>64</strong></td>
      <td><strong>FP32</strong></td>
      <td><strong>~17s</strong></td>
      <td><strong>Sweet spot</strong></td>
    </tr>
    <tr>
      <td><strong>ROCm native</strong></td>
      <td><strong>128</strong></td>
      <td><strong>FP32</strong></td>
      <td><strong>~30s</strong></td>
      <td><strong>Best quality</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>The sweet spot is ROCm native at NFE=64</strong>: twice the function evaluations of NFE=32, roughly 2x faster than ONNX+DirectML at NFE=128 (~17s vs ~33s), and the quality improvement from 64→128 is marginal for most use cases.</p>

<p>At NFE=128, ROCm native (~30s) is roughly equivalent to ONNX+DirectML (~33s) in speed, but better in quality because the full pipeline runs in FP32 with no precision loss between stages.</p>
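<p>Dividing each time by its NFE makes the comparison easier to read. The sketch below only re-derives numbers from the table above. It treats the whole clip time as step time, so fixed costs such as model loading and the vocoder are folded into the per-step figure.</p>

```python
# (backend, NFE, seconds per clip) copied from the table above.
RUNS = [
    ("ONNX + DirectML FP16", 128, 33),
    ("ONNX + DirectML FP32", 256, 64),
    ("ROCm native", 32, 10),
    ("ROCm native", 64, 17),
    ("ROCm native", 128, 30),
]


def seconds_per_step(nfe, seconds):
    """Average wall-clock seconds per function evaluation, overhead included."""
    return seconds / nfe


if __name__ == "__main__":
    for backend, nfe, seconds in RUNS:
        rate = seconds_per_step(nfe, seconds)
        print(f"{backend:22s} NFE={nfe:3d}  {rate:.3f} s/step")
```

<p>ROCm’s per-step cost drops from about 0.31 s at NFE=32 to about 0.23 s at NFE=128, which suggests a fixed per-clip overhead that matters less as the step count grows. At NFE=128 the two backends land within a few hundredths of a second per step of each other.</p>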

<hr />

<h2 id="project-structure">Project Structure</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>voice_generator/
├── config.yaml                         # All settings
├── .env                                # Machine-specific paths (not committed)
├── .env.example                        # Template
├── scripts/
│   ├── generate_f5_rocm.py             # ROCm native backend
│   ├── generate_f5_onnx_dml.py         # ONNX+DirectML fallback
│   ├── launch_voice_rocm.ps1           # ROCm launcher (sets env vars)
│   ├── ingest.py                       # YouTube → trimmed WAV + transcript
│   └── transcribe.py                   # Whisper transcription
├── lib/
│   ├── audio.py                        # FFmpeg, normalisation, silence trim
│   ├── vocab.py                        # F5-TTS vocabulary handling
│   └── config.py                       # Config dataclasses + loader
├── venv_rocm/                          # ROCm Python environment
├── onnx_models/
│   ├── F5-TTS-ONNX-GPU-NFE128-CFG3/   # ONNX FP16 (DirectML)
│   └── F5-TTS-ONNX-GPU-FP32-NFE256/   # ONNX FP32 (DirectML)
├── reference_audio/
│   └── neil_degrasse_tyson/
│       └── ndgt_ref_new.wav            # 11.9s reference clip
├── outputs/runs/                       # Generated audio
└── tests/                              # 78 pytest unit tests
</code></pre></div></div>

<hr />

<h2 id="the-onnx--directml-fallback">The ONNX + DirectML Fallback</h2>

<p>If you don’t want to install the full ROCm SDK (~3.5GB), the ONNX + DirectML approach still works well. It requires only standard AMD Adrenalin drivers and ONNX Runtime with the DirectML execution provider.</p>

<p>The ONNX models are exported from F5-TTS with NFE and CFG baked in:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># onnx_export/Export_F5.py
</span><span class="n">use_fp16_transformer</span> <span class="o">=</span> <span class="bp">True</span>   <span class="c1"># FP16 for DirectML
</span><span class="n">NFE_STEP</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">CFG_STRENGTH</span> <span class="o">=</span> <span class="mf">3.0</span>
<span class="n">OUTPUT_DIR</span> <span class="o">=</span> <span class="sa">r</span><span class="sh">"</span><span class="s">onnx_models\F5-TTS-ONNX-GPU-NFE128-CFG3</span><span class="sh">"</span>
</code></pre></div></div>

<p>The transformer runs on the GPU through DirectML, while preprocessing and decode run on the CPU:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># DirectML for transformer
</span><span class="n">ort_session_b</span> <span class="o">=</span> <span class="n">onnxruntime</span><span class="p">.</span><span class="nc">InferenceSession</span><span class="p">(</span>
    <span class="sh">"</span><span class="s">F5_Transformer.onnx</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">providers</span><span class="o">=</span><span class="p">[</span><span class="sh">"</span><span class="s">DmlExecutionProvider</span><span class="sh">"</span><span class="p">],</span>
<span class="p">)</span>

<span class="c1"># CPU for preprocessing and decode
</span><span class="n">ort_session_a</span> <span class="o">=</span> <span class="n">onnxruntime</span><span class="p">.</span><span class="nc">InferenceSession</span><span class="p">(</span>
    <span class="sh">"</span><span class="s">F5_Preprocess.onnx</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">providers</span><span class="o">=</span><span class="p">[</span><span class="sh">"</span><span class="s">CPUExecutionProvider</span><span class="sh">"</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div></div>

<p>When to use ONNX + DirectML:</p>
<ul>
  <li>You don’t want to install the 3.5GB ROCm SDK</li>
  <li>You need to run on a non-AMD GPU (NVIDIA, Intel — DirectML works on all DirectX 12 GPUs)</li>
  <li>You want FP16 precision to save VRAM</li>
  <li>You need a more stable, less patchy setup</li>
</ul>
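
<p>If the same script has to run on machines with and without a DirectML-capable GPU, the provider choice can be made explicit. A minimal sketch, assuming a hypothetical helper <code class="language-plaintext highlighter-rouge">pick_providers</code> that is not part of the project:</p>

```python
# Preference order is an assumption for this sketch: try DirectML first,
# and keep the CPU provider as the last resort.
PREFERRED = ("DmlExecutionProvider", "CPUExecutionProvider")


def pick_providers(available, preferred=PREFERRED):
    """Return the preferred providers that this ONNX Runtime build actually offers."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError("No usable execution provider in: " + ", ".join(available))
    return chosen
```

<p>In practice the <code class="language-plaintext highlighter-rouge">available</code> list would come from <code class="language-plaintext highlighter-rouge">onnxruntime.get_available_providers()</code>, and the result would be passed as <code class="language-plaintext highlighter-rouge">providers=</code> when creating the <code class="language-plaintext highlighter-rouge">InferenceSession</code>.</p>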

<hr />

<h2 id="test-suite">Test Suite</h2>

<p>The project has 78 pytest unit tests:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tests/
├── test_lib_audio.py       # 19 tests
├── test_lib_vocab.py       # 18 tests
├── test_lib_config.py      # 22 tests
├── test_integration_smoke.py  # GPU required
└── test_e2e_full_run.py       # GPU required
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pytest tests/ <span class="nt">-v</span>          <span class="c"># 78 unit tests, no GPU needed</span>
pytest tests/ <span class="nt">-m</span> integration  <span class="c"># requires DirectML GPU</span>
pytest tests/ <span class="nt">-m</span> e2e          <span class="c"># full pipeline test</span>
</code></pre></div></div>

<hr />

<h2 id="lessons-learned">Lessons Learned</h2>

<ol>
  <li>
    <p><strong>ROCm on Windows works now.</strong> AMD shipped ROCm 7.1 for Windows in late 2025. <code class="language-plaintext highlighter-rouge">torch.cuda.is_available()</code> returns <code class="language-plaintext highlighter-rouge">True</code> on RDNA3. The ecosystem is still maturing but it’s functional.</p>
  </li>
  <li>
    <p><strong>The ONNX hybrid is still worth knowing.</strong> If you don’t want the ROCm SDK overhead, or you need to run on non-AMD hardware, ONNX + DirectML is a solid fallback that works on any DirectX 12 GPU.</p>
  </li>
  <li>
    <p><strong>NFE=64 is the sweet spot for ROCm native.</strong> It runs twice as many steps as NFE=32 for better quality, is still roughly 2x faster than ONNX+DirectML at NFE=128, and the marginal quality gain from 64→128 rarely justifies nearly doubling the time (~17s to ~30s).</p>
  </li>
  <li>
    <p><strong>Reference audio quality matters more than model parameters.</strong> A clean 12-second clip beats a noisy 30-second clip every time. Get the transcript right — it directly affects voice conditioning quality.</p>
  </li>
  <li>
    <p><strong><code class="language-plaintext highlighter-rouge">PYTORCH_NO_HIP_MEMORY_CACHING=1</code> is essential.</strong> Without it, ROCm caches GPU memory aggressively and you’ll hit OOM on longer runs. This env var saves roughly a third of VRAM.</p>
  </li>
  <li>
    <p><strong>Separate config from machine-specific paths.</strong> Using <code class="language-plaintext highlighter-rouge">.env</code> for absolute paths means the same <code class="language-plaintext highlighter-rouge">config.yaml</code> works on any machine without modification.</p>
  </li>
</ol>]]></content><author><name>Joshua Mein</name></author><category term="Code" /><category term="AI" /><category term="python" /><category term="ai" /><category term="tts" /><category term="amd" /><category term="windows" /><category term="onnx" /><category term="automation" /><category term="rocm" /><summary type="html"><![CDATA[How I got zero-shot voice cloning running on a Windows gaming machine with an AMD RX 7900 XTX — using ROCm 7.1 natively on Windows, with an ONNX+DirectML fallback for when you don't want the full SDK.]]></summary></entry><entry><title type="html">Optimizing Proxmox Backup Server with S3: Regional Migration and Fixing a Glacier Misconfiguration</title><link href="https://joshwaamein.github.io/posts/proxmox-pbs-s3-optimization/" rel="alternate" type="text/html" title="Optimizing Proxmox Backup Server with S3: Regional Migration and Fixing a Glacier Misconfiguration" /><published>2026-04-04T13:00:00+01:00</published><updated>2026-04-04T13:00:00+01:00</updated><id>https://joshwaamein.github.io/posts/proxmox-pbs-s3-optimization</id><content type="html" xml:base="https://joshwaamein.github.io/posts/proxmox-pbs-s3-optimization/"><![CDATA[<p><em>How I migrated my PBS S3 datastore to a closer regional endpoint, resolved a Glacier lifecycle misconfiguration, and properly optimized the setup</em></p>

<hr />

<h2 id="the-most-important-thing-first-glacier-is-incompatible-with-pbs">The Most Important Thing First: Glacier is Incompatible with PBS</h2>

<p>Before anything else — <strong>if you’re running Proxmox Backup Server with an S3 backend, do not use Glacier lifecycle policies.</strong> This includes Glacier Instant Retrieval, Glacier Flexible Retrieval, and Glacier Deep Archive.</p>

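<p>PBS needs immediate, on-demand access to chunks for garbage collection, verification, deduplication, and restores. Objects in Glacier Flexible Retrieval and Glacier Deep Archive must be restored before they can be read, which takes anywhere from minutes to 48 hours depending on tier. Glacier Instant Retrieval serves reads in milliseconds, but it charges per-GB retrieval fees and has a minimum storage duration. The moment PBS tries to read a chunk that needs a restore, the request fails. This breaks GC, verification, and restores, either silently or with cryptic errors.</p>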

<p>The correct storage class for PBS S3 is <strong>S3 Intelligent-Tiering</strong> — it automatically moves infrequently accessed data to cheaper tiers, but everything remains immediately accessible with no retrieval delays or fees.</p>

<hr />

<h2 id="background">Background</h2>

<p>I run a Proxmox homelab with multiple PVE nodes and PBS servers. One of my PBS servers uses AWS S3 as a backend for offsite backups. PBS 4.x supports S3 as a “technology preview” feature — it uses a local cache disk and syncs chunks to S3.</p>

<p>The setup had been running for several months and had accumulated a number of issues:</p>
<ul>
  <li>Intermittent connection errors (“bytes remaining on stream”, “Transport endpoint not connected”)</li>
  <li>The S3 cache disk was growing without bound</li>
  <li>S3 costs were higher than expected due to Glacier retrieval fees</li>
</ul>

<p>I decided to do a thorough investigation and fix everything properly.</p>

<hr />

<h2 id="the-investigation">The Investigation</h2>

<h3 id="infrastructure-overview">Infrastructure Overview</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Details</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>PBS Server</td>
      <td>VM on Proxmox</td>
    </tr>
    <tr>
      <td>S3 Backend</td>
      <td>AWS S3</td>
    </tr>
    <tr>
      <td>Cache Disk</td>
      <td>850 GB ext4</td>
    </tr>
  </tbody>
</table>

<h3 id="key-findings">Key Findings</h3>

<p><strong>1. Wrong Regional Endpoint</strong>
The PBS server and the S3 bucket were in different regions, so every S3 API call incurred unnecessary cross-region latency. With millions of small chunk objects, that latency compounds significantly, because PBS on S3 is a high-request-count workload.</p>

<p><strong>2. Glacier Lifecycle Disaster</strong>
A lifecycle policy was transitioning objects through Glacier tiers:</p>
<ul>
  <li>Day 14 → Glacier Instant Retrieval</li>
  <li>Day 104 → Glacier Flexible Retrieval</li>
  <li>Day 194 → Glacier Deep Archive</li>
</ul>

<p>As covered above, this is fundamentally incompatible with PBS. It was silently breaking GC and verification, and would have made restores impossible for older backups.</p>

<p><strong>3. Unbounded Cache Growth</strong>
The 850 GB cache disk was 65% full with 1.67M chunk files across 65,536 subdirectories. PBS docs recommend only 64–128 GiB for the cache.</p>

<p>Cache breakdown:</p>
<ul>
  <li>~71% of chunks were 0-byte marker files (cache index markers)</li>
  <li>~29% contained actual cached data</li>
  <li>Chunks from months ago were still in the cache</li>
  <li>No automatic cache eviction exists in this PBS version</li>
</ul>
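
<p>A breakdown like this can be reproduced by walking the cache directory and splitting files by size. A minimal sketch, not the exact commands used here; the function name is hypothetical:</p>

```python
import os


def cache_breakdown(chunk_root):
    """Count zero-byte marker files and files holding data under chunk_root."""
    markers = 0
    with_data = 0
    for dirpath, _dirnames, filenames in os.walk(chunk_root):
        for name in filenames:
            # lstat avoids following symlinks; only the size is needed
            size = os.lstat(os.path.join(dirpath, name)).st_size
            if size == 0:
                markers += 1
            else:
                with_data += 1
    return markers, with_data
```

<p>Pointed at the chunk store on the cache disk (assumed here to live in a <code class="language-plaintext highlighter-rouge">.chunks</code> directory under the mount point), the two counts give the marker and data percentages directly.</p>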

<p><strong>4. TCP Keepalive Too Slow</strong>
Default <code class="language-plaintext highlighter-rouge">tcp_keepalive_time</code> was 7200 seconds (2 hours). Dead S3 connections weren’t detected for hours, causing the “Transport endpoint not connected” errors. High latency to a distant S3 region made this worse — more connections timing out silently.</p>

<p><strong>5. Ext4 Wasted Space</strong>
The cache disk had 4.18% reserved blocks — about 37 GB wasted on a disk where root reservation serves no purpose.</p>

<p><strong>6. GC Schedule Needed Review</strong>
Garbage collection frequency needs careful consideration with S3 backends — every GC run makes a large number of LIST and HEAD API calls against S3, which cost money. Running GC too frequently wastes money; too infrequently leaves orphaned chunks accumulating. Weekly is a reasonable balance for most setups.</p>

<p><strong>7. S3 Endpoint Style</strong>
The endpoint was configured for path-style addressing (<code class="language-plaintext highlighter-rouge">s3.amazonaws.com/bucket/key</code>) instead of the recommended vhost-style (<code class="language-plaintext highlighter-rouge">bucket.s3.region.amazonaws.com/key</code>).</p>

<hr />

<h2 id="the-optimizations">The Optimizations</h2>

<h3 id="phase-1-no-downtime-changes">Phase 1: No-Downtime Changes</h3>

<h4 id="1-gc-schedule">1. GC Schedule</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>proxmox-backup-manager datastore update pbs-s3 <span class="nt">--gc-schedule</span> <span class="s2">"sat 02:00"</span>
</code></pre></div></div>

<p>Weekly GC on Saturday at 2am. Frequent enough to keep orphaned chunks in check, infrequent enough to keep S3 API costs reasonable.</p>

<h4 id="2-ext4-reserved-blocks-418--1">2. Ext4 Reserved Blocks: 4.18% → 1%</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tune2fs <span class="nt">-m</span> 1 /dev/sdc
</code></pre></div></div>

<p>Freed ~28 GB immediately. No reason to reserve 37 GB for root on a cache disk.</p>

<h4 id="3-tcp-keepalive-tuning">3. TCP Keepalive Tuning</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> <span class="o">&gt;</span> /etc/sysctl.d/99-s3-tuning.conf <span class="o">&lt;&lt;</span> <span class="no">EOF</span><span class="sh">
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6
net.ipv4.tcp_fin_timeout = 30
</span><span class="no">EOF
</span>sysctl <span class="nt">-p</span> /etc/sysctl.d/99-s3-tuning.conf
</code></pre></div></div>

<p>Dead S3 connections now detected in ~2 minutes instead of 2 hours. Essential when connection latency is non-trivial.</p>
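
<p>The ~2 minute figure follows from the three keepalive settings. A small worked example; the stock <code class="language-plaintext highlighter-rouge">intvl</code> and <code class="language-plaintext highlighter-rouge">probes</code> values in the second call are the usual Linux defaults, not values measured on this host:</p>

```python
def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Approximate time for the kernel to give up on an idle, dead connection.

    The first probe goes out after `time_s` of idleness; the connection is
    dropped once `probes` consecutive probes, `intvl_s` apart, go unanswered.
    """
    return time_s + intvl_s * probes


print(keepalive_detection_seconds(60, 10, 6))    # tuned values from the sysctl file: 120
print(keepalive_detection_seconds(7200, 75, 9))  # usual Linux defaults: 7875
```

<p>So the tuned settings detect a dead peer after about 120 seconds of silence, compared with a little over two hours on the defaults.</p>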

<h4 id="4-ext4-mount-options">4. Ext4 Mount Options</h4>

<p>Updated <code class="language-plaintext highlighter-rouge">/etc/fstab</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UUID=&lt;disk-uuid&gt; /mnt/S3BackupCache ext4 noatime,commit=120 0 2
</code></pre></div></div>

<ul>
  <li><code class="language-plaintext highlighter-rouge">noatime</code> — Eliminates metadata writes on every access across 1.67M files</li>
  <li><code class="language-plaintext highlighter-rouge">commit=120</code> — Reduces journal commit frequency (cache is reconstructible from S3)</li>
  <li>UUID-based mount for stability across disk reorders</li>
</ul>

<h4 id="5-s3-endpoint-path-style--vhost-style">5. S3 Endpoint: Path-style → Vhost-style</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>proxmox-backup-manager s3 endpoint update pbs-s3 <span class="se">\</span>
    <span class="nt">--endpoint</span> <span class="s1">'{{bucket}}.s3.{{region}}.amazonaws.com'</span> <span class="se">\</span>
    <span class="nt">--region</span> &lt;your-region&gt; <span class="se">\</span>
    <span class="nt">--delete</span> path-style
</code></pre></div></div>

<p>Direct regional routing rather than the global endpoint.</p>

<h3 id="phase-2-restart-required">Phase 2: Restart Required</h3>

<h4 id="6-ram-increased">6. RAM: Increased</h4>

<p>More RAM means better filesystem caching for 1.67M chunk files — the OS page cache can hold more of the chunk index in memory.</p>

<hr />

<h2 id="the-migration-moving-to-a-closer-region">The Migration: Moving to a Closer Region</h2>

<h3 id="the-problem">The Problem</h3>

<p>The PBS server and S3 bucket were in different regions. Every backup chunk upload and every GC/verification API call was crossing region boundaries. This was the root cause of the elevated latency and connection instability.</p>

<h3 id="step-1-create-new-bucket-in-the-correct-region">Step 1: Create New Bucket in the Correct Region</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws s3api create-bucket <span class="se">\</span>
    <span class="nt">--bucket</span> &lt;your-new-bucket-name&gt; <span class="se">\</span>
    <span class="nt">--region</span> &lt;closer-region&gt; <span class="se">\</span>
    <span class="nt">--create-bucket-configuration</span> <span class="nv">LocationConstraint</span><span class="o">=</span>&lt;closer-region&gt;
</code></pre></div></div>

<p>Configured with:</p>
<ul>
  <li><strong>S3 Intelligent-Tiering</strong> lifecycle (no Glacier)</li>
  <li>Server-side encryption</li>
  <li>Randomized bucket name for security</li>
</ul>

<h3 id="step-2-restore-glacier-objects">Step 2: Restore Glacier Objects</h3>

<p>The biggest challenge — over 860,000 objects were in Glacier or Deep Archive and needed to be restored before they could be copied.</p>

<table>
  <thead>
    <tr>
      <th>Storage Class</th>
      <th>Objects</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>STANDARD</td>
      <td>~828,000</td>
    </tr>
    <tr>
      <td>GLACIER_IR</td>
      <td>~212,000</td>
    </tr>
    <tr>
      <td>GLACIER</td>
      <td>~433,000</td>
    </tr>
    <tr>
      <td>DEEP_ARCHIVE</td>
      <td>~430,000</td>
    </tr>
  </tbody>
</table>

<h4 id="first-attempt-individual-api-calls-too-slow">First Attempt: Individual API Calls (Too Slow)</h4>

<p>Started with parallel <code class="language-plaintext highlighter-rouge">aws s3api restore-object</code> calls. At ~1–2 seconds per call with 860K objects, this would have taken days.</p>

<h4 id="solution-s3-batch-operations">Solution: S3 Batch Operations</h4>

<p>Used S3 Batch Operations to restore all Glacier objects server-side:</p>

<ol>
  <li>Generated a CSV manifest of all Glacier objects</li>
  <li>Created an IAM role for batch operations</li>
  <li>Submitted the batch job via the AWS console</li>
</ol>
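
<p>One way such a manifest could be produced is sketched below. This is not the exact script used here: the function names are hypothetical, and the URL-encoding of keys is my reading of the Batch Operations manifest format, so it is worth checking against the AWS documentation:</p>

```python
import csv
from urllib.parse import quote

# Storage classes that need a restore request before they can be read or copied.
NEEDS_RESTORE = {"GLACIER", "DEEP_ARCHIVE"}


def iter_bucket_objects(bucket):
    """Page through the bucket with boto3 (imported lazily; needs AWS credentials)."""
    import boto3
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        yield from page.get("Contents", [])


def manifest_rows(bucket, objects):
    """Yield (bucket, key) rows for objects that must be restored.

    `objects` is any iterable of dicts shaped like the `Contents` entries
    returned by list_objects_v2 (keys `Key` and `StorageClass`).
    """
    for obj in objects:
        if obj.get("StorageClass") in NEEDS_RESTORE:
            # Assumption: CSV manifests expect URL-encoded object keys.
            yield bucket, quote(obj["Key"], safe="/")


def write_manifest(bucket, objects, out_file):
    """Write the rows as a header-less CSV and return how many were written."""
    writer = csv.writer(out_file, lineterminator="\n")
    count = 0
    for row in manifest_rows(bucket, objects):
        writer.writerow(row)
        count += 1
    return count
```

<p>Calling <code class="language-plaintext highlighter-rouge">write_manifest(bucket, iter_bucket_objects(bucket), fh)</code> with an open file would then give a two-column CSV to upload for the batch job.</p>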

<p><strong>Result:</strong> ~810,000 succeeded, ~51,000 “failed” with <code class="language-plaintext highlighter-rouge">RestoreAlreadyInProgress</code> (left over from my earlier individual attempts, so not real failures). The job completed in ~2 hours, entirely on AWS infrastructure.</p>

<h3 id="step-3-copy-data-to-new-region">Step 3: Copy Data to New Region</h3>

<h4 id="standard-objects-aws-s3-sync">Standard Objects (<code class="language-plaintext highlighter-rouge">aws s3 sync</code>)</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws s3 <span class="nb">sync </span>s3://&lt;source-bucket&gt; s3://&lt;dest-bucket&gt; <span class="se">\</span>
    <span class="nt">--region</span> &lt;dest-region&gt; <span class="se">\</span>
    <span class="nt">--source-region</span> &lt;source-region&gt; <span class="se">\</span>
    <span class="nt">--storage-class</span> INTELLIGENT_TIERING
</code></pre></div></div>

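<p>However, by default <code class="language-plaintext highlighter-rouge">aws s3 sync</code> skips objects in the GLACIER storage class, even after they’ve been restored. The CLI’s <code class="language-plaintext highlighter-rouge">--force-glacier-transfer</code> flag exists for this case, but here the restored objects were copied with boto3 instead.</p>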

<h4 id="glacier-objects-boto3">Glacier Objects (boto3)</h4>

<p>Used Python boto3 to copy the restored Glacier objects:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">concurrent.futures</span> <span class="kn">import</span> <span class="n">ThreadPoolExecutor</span>
<span class="kn">import</span> <span class="n">boto3</span>

<span class="n">s3_dst</span> <span class="o">=</span> <span class="n">boto3</span><span class="p">.</span><span class="nf">client</span><span class="p">(</span><span class="sh">'</span><span class="s">s3</span><span class="sh">'</span><span class="p">,</span> <span class="n">region_name</span><span class="o">=</span><span class="sh">'</span><span class="s">&lt;dest-region&gt;</span><span class="sh">'</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">copy_one</span><span class="p">(</span><span class="n">key</span><span class="p">):</span>
    <span class="n">s3_dst</span><span class="p">.</span><span class="nf">copy_object</span><span class="p">(</span>
        <span class="n">Bucket</span><span class="o">=</span><span class="sh">'</span><span class="s">&lt;dest-bucket&gt;</span><span class="sh">'</span><span class="p">,</span>
        <span class="n">Key</span><span class="o">=</span><span class="n">key</span><span class="p">,</span>
        <span class="n">CopySource</span><span class="o">=</span><span class="p">{</span><span class="sh">'</span><span class="s">Bucket</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">&lt;source-bucket&gt;</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">Key</span><span class="sh">'</span><span class="p">:</span> <span class="n">key</span><span class="p">},</span>
        <span class="n">StorageClass</span><span class="o">=</span><span class="sh">'</span><span class="s">INTELLIGENT_TIERING</span><span class="sh">'</span>
    <span class="p">)</span>

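<span class="c1"># glacier_keys: keys of the restored Glacier objects (assumed to be loaded earlier, e.g. from the batch manifest)
</span><span class="k">with</span> <span class="nc">ThreadPoolExecutor</span><span class="p">(</span><span class="n">max_workers</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span> <span class="k">as</span> <span class="n">executor</span><span class="p">:</span>
    <span class="c1"># list() drains the results so an exception raised in any worker surfaces here
</span>    <span class="nb">list</span><span class="p">(</span><span class="n">executor</span><span class="p">.</span><span class="nf">map</span><span class="p">(</span><span class="n">copy_one</span><span class="p">,</span> <span class="n">glacier_keys</span><span class="p">))</span>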
</code></pre></div></div>

<p><strong>Result:</strong> ~833,000 objects copied, 0 failures. ✅</p>

<h3 id="step-4-switch-pbs-to-new-bucket">Step 4: Switch PBS to New Bucket</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Maintenance mode</span>
proxmox-backup-manager datastore update pbs-s3 <span class="se">\</span>
    <span class="nt">--maintenance-mode</span> <span class="s1">'type=offline,message="Migrating region"'</span>

<span class="c"># Update endpoint region</span>
proxmox-backup-manager s3 endpoint update pbs-s3 <span class="nt">--region</span> &lt;new-region&gt;

<span class="c"># Update bucket name in config</span>
<span class="nb">sed</span> <span class="nt">-i</span> <span class="s1">'s/bucket=&lt;old-bucket&gt;/bucket=&lt;new-bucket&gt;/'</span> <span class="se">\</span>
    /etc/proxmox-backup/datastore.cfg

<span class="c"># Verify connectivity</span>
proxmox-backup-manager s3 check pbs-s3 &lt;new-bucket&gt;

<span class="c"># Remove maintenance mode</span>
proxmox-backup-manager datastore update pbs-s3 <span class="nt">--delete</span> maintenance-mode
</code></pre></div></div>

<h3 id="step-5-full-verification">Step 5: Full Verification</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>proxmox-backup-manager verify-job update &lt;verify-job-id&gt; <span class="nt">--ignore-verified</span> <span class="nb">false
</span>proxmox-backup-manager verify-job run &lt;verify-job-id&gt;
</code></pre></div></div>

<hr />

<h2 id="lifecycle-policy-the-right-way">Lifecycle Policy: The Right Way</h2>

<h3 id="-wrong-what-i-had">❌ Wrong (What I Had)</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Day 0   → S3 Standard
Day 14  → Glacier Instant Retrieval
Day 104 → Glacier Flexible Retrieval
Day 194 → Glacier Deep Archive
</code></pre></div></div>

<p>This breaks PBS completely — GC, verification, dedup, and restores all require immediate chunk access.</p>

<h3 id="-correct">✅ Correct</h3>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"Rules"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w">
        </span><span class="nl">"ID"</span><span class="p">:</span><span class="w"> </span><span class="s2">"pbs-intelligent-tiering"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"Status"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enabled"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"Filter"</span><span class="p">:</span><span class="w"> </span><span class="p">{},</span><span class="w">
        </span><span class="nl">"Transitions"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w">
            </span><span class="nl">"Days"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
            </span><span class="nl">"StorageClass"</span><span class="p">:</span><span class="w"> </span><span class="s2">"INTELLIGENT_TIERING"</span><span class="w">
        </span><span class="p">}]</span><span class="w">
    </span><span class="p">}]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>S3 Intelligent-Tiering automatically moves infrequently accessed data to cheaper tiers, but everything remains <strong>immediately accessible</strong> with no retrieval fees or delays.</p>
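
<p>The same rule can be applied from Python. A minimal sketch using boto3’s <code class="language-plaintext highlighter-rouge">put_bucket_lifecycle_configuration</code>; the two function names are hypothetical, and the call replaces the bucket’s whole lifecycle configuration, so any old Glacier rules are dropped at the same time:</p>

```python
def intelligent_tiering_lifecycle(days=1, rule_id="pbs-intelligent-tiering"):
    """Build the lifecycle configuration shown above as a Python dict."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {},
                "Transitions": [
                    {"Days": days, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Replace the bucket's lifecycle configuration (boto3 imported lazily)."""
    import boto3
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

<p>Keeping the rule in a function makes it easy to assert that no Glacier storage class has crept back in before the configuration is applied.</p>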

<hr />

<h2 id="cache-disk-shrink">Cache Disk Shrink</h2>

<p>After migration, the cache disk was shrunk from 850 GB to 128 GiB:</p>

<ol>
  <li>Add new smaller disk to the VM</li>
  <li>Put datastore in maintenance mode, stop proxy</li>
  <li>Format new disk: <code class="language-plaintext highlighter-rouge">mkfs.ext4 -L S3BackupCache /dev/sdX &amp;&amp; tune2fs -m 1 /dev/sdX</code></li>
  <li>Update <code class="language-plaintext highlighter-rouge">/etc/fstab</code> with UUID of new disk</li>
  <li>Mount, start proxy, remove maintenance mode</li>
  <li>Run <code class="language-plaintext highlighter-rouge">proxmox-backup-manager datastore s3-refresh pbs-s3</code> — this pulls all manifest/index files from S3 so existing backups become visible in the new cache</li>
  <li>Remove old disk</li>
</ol>

<blockquote>
  <p><strong>Important:</strong> After replacing the cache disk, run <code class="language-plaintext highlighter-rouge">s3-refresh</code>. The new disk starts empty — PBS won’t know about existing S3 backups until the manifests are downloaded. This is a one-time operation.</p>
</blockquote>

<hr />

<h2 id="before--after">Before &amp; After</h2>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Before</th>
      <th>After</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>S3 Region</td>
      <td>Distant region</td>
      <td><strong>Closer regional endpoint</strong></td>
    </tr>
    <tr>
      <td>API Latency</td>
      <td>High</td>
      <td><strong>Low</strong></td>
    </tr>
    <tr>
      <td>Endpoint Style</td>
      <td>path-style</td>
      <td><strong>vhost-style</strong></td>
    </tr>
    <tr>
      <td>Lifecycle</td>
      <td>Glacier cascade</td>
      <td><strong>Intelligent-Tiering</strong></td>
    </tr>
    <tr>
      <td>GC Frequency</td>
      <td>Monthly</td>
      <td><strong>Weekly</strong></td>
    </tr>
    <tr>
      <td>TCP Keepalive</td>
      <td>2 hours</td>
      <td><strong>60 seconds</strong></td>
    </tr>
    <tr>
      <td>Mount Options</td>
      <td>defaults</td>
      <td><strong>noatime,commit=120</strong></td>
    </tr>
    <tr>
      <td>Reserved Blocks</td>
      <td>4.18% (37 GB wasted)</td>
      <td><strong>1%</strong></td>
    </tr>
    <tr>
      <td>Cache Disk</td>
      <td>850 GB (unbounded)</td>
      <td><strong>128 GiB</strong></td>
    </tr>
    <tr>
      <td>Connection Errors</td>
      <td>Frequent</td>
      <td><strong>Gone</strong></td>
    </tr>
    <tr>
      <td>Backup Performance</td>
      <td>Unoptimised</td>
      <td><strong>Optimised</strong></td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="lessons-learned">Lessons Learned</h2>

<ol>
  <li>
    <p><strong>Never use Glacier lifecycle policies with PBS S3.</strong> PBS needs immediate access to all chunks. Use Intelligent-Tiering instead. Check this before doing anything else.</p>
  </li>
  <li>
    <p><strong>S3 region matters.</strong> Put the bucket in the same or closest available region to the PBS server. Cross-region latency compounds badly with high object counts.</p>
  </li>
  <li>
    <p><strong>GC frequency vs. S3 API cost is a real tradeoff.</strong> Every GC run makes thousands of API calls. Don’t run it more frequently than necessary — weekly is a good default for most homelab setups.</p>
  </li>
  <li>
    <p><strong>TCP keepalive tuning is critical for S3.</strong> The default 2-hour timeout means dead connections go undetected. With any meaningful latency, this causes intermittent backup failures.</p>
  </li>
  <li>
    <p><strong>The PBS S3 cache needs deliberate sizing.</strong> 64–128 GiB is recommended. An oversized cache disk just fills with stale data that is never evicted.</p>
  </li>
  <li>
    <p><strong>After replacing the cache disk, run <code class="language-plaintext highlighter-rouge">s3-refresh</code>.</strong> The new disk starts empty — existing S3 backups won’t be visible until manifests are downloaded.</p>
  </li>
  <li>
    <p><strong><code class="language-plaintext highlighter-rouge">aws s3 sync</code> skips GLACIER-class objects by default</strong>, even when restored. Use <code class="language-plaintext highlighter-rouge">--force-glacier-transfer</code> or boto3 <code class="language-plaintext highlighter-rouge">copy_object()</code> for those.</p>
  </li>
  <li>
    <p><strong>ext4 <code class="language-plaintext highlighter-rouge">noatime</code> is essential</strong> with millions of small files. Even under the default <code class="language-plaintext highlighter-rouge">relatime</code>, reads still trigger periodic access-time metadata updates, and eliminating that overhead makes a noticeable difference.</p>
  </li>
</ol>

<hr />

<p><em>Tags: proxmox, pbs, s3, aws, glacier, backup, optimization, homelab</em></p>]]></content><author><name>Joshua Mein</name></author><category term="Homelab" /><category term="DevOps" /><category term="proxmox" /><category term="pbs" /><category term="s3" /><category term="aws" /><category term="glacier" /><category term="backup" /><category term="optimization" /><category term="linux" /><summary type="html"><![CDATA[How I investigated and resolved PBS S3 connection issues, migrated to a closer regional endpoint, and properly optimized backups after a Glacier lifecycle misconfiguration.]]></summary></entry><entry><title type="html">Tuning Open WebUI + AWS Bedrock for Complex AI Workflows — Timeouts, Code Execution, and Custom Patches</title><link href="https://joshwaamein.github.io/posts/tuning-openwebui-bedrock-complex-ai-workflows/" rel="alternate" type="text/html" title="Tuning Open WebUI + AWS Bedrock for Complex AI Workflows — Timeouts, Code Execution, and Custom Patches" /><published>2026-03-28T00:00:00+00:00</published><updated>2026-03-28T00:00:00+00:00</updated><id>https://joshwaamein.github.io/posts/tuning-openwebui-bedrock-complex-ai-workflows</id><content type="html" xml:base="https://joshwaamein.github.io/posts/tuning-openwebui-bedrock-complex-ai-workflows/"><![CDATA[<p>My self-hosted AI setup runs <a href="https://github.com/open-webui/open-webui">Open WebUI</a> backed by AWS Bedrock via a custom gateway. Simple queries work fine. But complex workflows — sub-agents making dozens of tool calls, web searches, and code execution — kept timing out, dropping connections, or just hanging indefinitely.</p>

<p>This post covers the full diagnosis and every customisation I’ve made, including the trade-offs and drawbacks of each one.</p>

<hr />

<h2 id="️-the-architecture">🏗️ The Architecture</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Browser → Open WebUI (Docker)
              ↓
         Bedrock Gateway (Docker, internal network)
              ↓
         AWS Bedrock API (eu-west-2)
              ↓
         SearXNG (web search) / Tika (document parsing) / Jupyter (code execution)
</code></pre></div></div>

<p>Six Docker containers on a shared bridge network, all communicating internally. Open WebUI is the only container with an exposed port. The Bedrock gateway translates OpenAI-compatible API calls into AWS Bedrock’s ConverseStream format, with cross-region inference enabled so models appear with <code class="language-plaintext highlighter-rouge">global.*</code> prefixes and route automatically.</p>

<hr />

<h2 id="-the-problem">🐛 The Problem</h2>

<p>Complex queries with sub-agents or code execution would fail in three ways:</p>

<ol>
  <li><strong>WebSocket drops</strong> — the browser connection would silently die mid-response</li>
  <li><strong>Code execution hangs</strong> — Python code blocks would take 30+ seconds or never return</li>
  <li><strong>Bedrock validation errors</strong> — tool-use conversations would hit <code class="language-plaintext highlighter-rouge">400 Bad Request</code> after many iterations</li>
</ol>

<p>Simple one-shot queries worked perfectly. The failures only surfaced during multi-turn, tool-heavy workflows.</p>

<hr />

<h2 id="-the-investigation">🔍 The Investigation</h2>

<h3 id="websocket-keepalive-failures">WebSocket Keepalive Failures</h3>

<p>The Open WebUI logs showed repeated errors:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>keepalive ping failed
AssertionError
  File "websockets/legacy/protocol.py", line 308, in _drain_helper
    assert waiter is None or waiter.cancelled()
</code></pre></div></div>

<p>This is a <a href="https://github.com/python-websockets/websockets/issues">known bug in websockets v16.0</a> — the library’s legacy protocol throws an <code class="language-plaintext highlighter-rouge">AssertionError</code> when trying to send a ping to a connection that’s mid-drain. During complex queries, the server is busy processing tool calls and can’t respond to WebSocket pings in time.</p>

<p>The default <code class="language-plaintext highlighter-rouge">WEBSOCKET_SERVER_PING_TIMEOUT</code> is <strong>20 seconds</strong>. A single sub-agent iteration with web search, code execution, and LLM response easily exceeds that.</p>

<h3 id="code-execution-round-trip">Code Execution Round-Trip</h3>

<p>Open WebUI’s default code execution engine is <strong>Pyodide</strong> — a WebAssembly Python runtime that runs <em>in the browser</em>. The execution path for every code block is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Server → WebSocket event → Browser → Pyodide WASM → Browser → WebSocket → Server → Bedrock API
</code></pre></div></div>

<p>Every code block makes a full round-trip through the browser’s WebSocket connection. On a multi-step sub-agent workflow running 3-5 code blocks, this adds 30-60 seconds of pure overhead — and if the WebSocket drops mid-execution, the entire workflow fails silently.</p>

<h3 id="bedrock-validation-errors">Bedrock Validation Errors</h3>

<p>Two specific errors appeared in the gateway logs during long conversations:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ValidationException: The toolConfig field must be defined when using
toolUse and toolResult content blocks.
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ValidationException: prompt is too long: 2,084,831 tokens &gt; 1,000,000 maximum
</code></pre></div></div>

<p>The first indicates tool configuration wasn’t being forwarded properly on follow-up turns. The second shows conversation history accumulating past Bedrock’s 1M token context window — a natural consequence of sub-agents that generate hundreds of tool call results.</p>
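
<p>One possible guard against the second error is to trim the oldest turns before a request goes out. The sketch below is not what the gateway does. It assumes a rough four-characters-per-token estimate, plain-string message content, and a hypothetical <code class="language-plaintext highlighter-rouge">trim_history</code> helper. It also ignores the need to keep <code class="language-plaintext highlighter-rouge">toolUse</code> and <code class="language-plaintext highlighter-rouge">toolResult</code> blocks paired, which a real implementation would have to respect.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def estimate_tokens(message):
    """Very rough estimate: about four characters per token (an assumption)."""
    return max(1, len(message["content"]) // 4)


def trim_history(messages, budget):
    """Drop the oldest non-system messages until the estimate fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    total = sum(estimate_tokens(m) for m in system + rest)
    # Always keep the most recent message, even if it alone exceeds the budget.
    while total &gt; budget and len(rest) &gt; 1:
        total -= estimate_tokens(rest.pop(0))
    return system + rest
</code></pre></div></div>

<p>Dropping from the front keeps the system prompt and the latest turn, which are usually the parts the model needs most.</p>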

<h3 id="api-latency">API Latency</h3>

<p>The Bedrock gateway was configured to use <code class="language-plaintext highlighter-rouge">us-east-1</code> (Virginia). Every API call — and there are dozens per sub-agent workflow — was crossing the Atlantic and back. With the server physically located in the UK, this added 100-200ms per request, compounding across multi-turn conversations.</p>

<hr />

<h2 id="️-the-fixes">🛠️ The Fixes</h2>

<h3 id="fix-1-increase-websocket-and-http-timeouts">Fix 1: Increase WebSocket and HTTP Timeouts</h3>

<p>Three environment variables on the Open WebUI container:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-e</span> <span class="nv">WEBSOCKET_SERVER_PING_TIMEOUT</span><span class="o">=</span>120    <span class="c"># Was 20s — prevents keepalive failures</span>
<span class="nt">-e</span> <span class="nv">WEBSOCKET_EVENT_CALLER_TIMEOUT</span><span class="o">=</span>600   <span class="c"># Was 300s — allows longer tool chains</span>
<span class="nt">-e</span> <span class="nv">AIOHTTP_CLIENT_TIMEOUT</span><span class="o">=</span>600           <span class="c"># Was 300s — prevents HTTP client timeouts</span>
</code></pre></div></div>

<p><strong>Why:</strong> The defaults assume short request-response cycles. Sub-agent workflows with tool calls, web searches, and code execution routinely exceed 5 minutes end-to-end.</p>

<p><strong>Drawback:</strong> Higher timeouts mean genuinely broken connections take longer to detect. A hung request will now sit for 10 minutes before timing out, consuming a server thread the entire time. On a resource-constrained system, this could become a problem under concurrent usage.</p>

<h3 id="fix-2-server-side-code-execution-with-jupyter">Fix 2: Server-Side Code Execution with Jupyter</h3>

<p>Replaced the browser-side Pyodide engine with a server-side Jupyter notebook container:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">services</span><span class="pi">:</span>
  <span class="na">jupyter</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">jupyter/scipy-notebook:latest</span>
    <span class="na">container_name</span><span class="pi">:</span> <span class="s">jupyter</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">always</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">JUPYTER_TOKEN=&lt;token&gt;</span>
    <span class="na">command</span><span class="pi">:</span> <span class="s">start-notebook.py --NotebookApp.allow_origin='*' --NotebookApp.ip='0.0.0.0'</span>
    <span class="na">networks</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">ai-services</span>
</code></pre></div></div>

<p>Open WebUI configured with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-e</span> <span class="nv">CODE_EXECUTION_ENGINE</span><span class="o">=</span>jupyter
<span class="nt">-e</span> <span class="nv">CODE_INTERPRETER_ENGINE</span><span class="o">=</span>jupyter
<span class="nt">-e</span> <span class="nv">CODE_EXECUTION_JUPYTER_URL</span><span class="o">=</span>http://jupyter:8888
<span class="nt">-e</span> <span class="nv">CODE_INTERPRETER_JUPYTER_URL</span><span class="o">=</span>http://jupyter:8888
<span class="nt">-e</span> <span class="nv">CODE_EXECUTION_JUPYTER_AUTH</span><span class="o">=</span>token
<span class="nt">-e</span> <span class="nv">CODE_INTERPRETER_JUPYTER_AUTH</span><span class="o">=</span>token
<span class="nt">-e</span> <span class="nv">CODE_EXECUTION_JUPYTER_AUTH_TOKEN</span><span class="o">=</span>&lt;token&gt;
<span class="nt">-e</span> <span class="nv">CODE_INTERPRETER_JUPYTER_AUTH_TOKEN</span><span class="o">=</span>&lt;token&gt;
<span class="nt">-e</span> <span class="nv">CODE_EXECUTION_JUPYTER_TIMEOUT</span><span class="o">=</span>60
<span class="nt">-e</span> <span class="nv">CODE_INTERPRETER_JUPYTER_TIMEOUT</span><span class="o">=</span>60
</code></pre></div></div>

<p>The execution path is now:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Server → Jupyter HTTP API → Server
</code></pre></div></div>

<p>No browser round-trip, no WebSocket dependency, and <code class="language-plaintext highlighter-rouge">scipy-notebook</code> ships with NumPy, pandas, matplotlib, and SciPy pre-installed.</p>

<p><strong>Why:</strong> Eliminates the browser round-trip entirely. Code execution drops from 10-30 seconds per block to 1-3 seconds. The Jupyter kernel persists state across code blocks within a session, so variables and imports carry over.</p>

<p><strong>Drawback:</strong> The <code class="language-plaintext highlighter-rouge">jupyter/scipy-notebook</code> image is ~1.5GB and uses significant RAM. On a memory-constrained system, this adds pressure. The Jupyter server also has full access to the Docker network — any code the LLM generates runs server-side with the same network access as every other container. This is a real security consideration for multi-user deployments.</p>
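
<p>A quick way to confirm that the Jupyter container is reachable with the configured token is to call Jupyter Server’s <code class="language-plaintext highlighter-rouge">/api/status</code> endpoint. This is a minimal sketch that assumes it runs from a container on the same Docker network; the helper names are hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import urllib.request


def build_status_request(base_url, token):
    """Build a GET request for Jupyter Server's /api/status endpoint."""
    url = base_url.rstrip("/") + "/api/status"
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})


def jupyter_is_up(base_url, token, timeout=5):
    """Return True if the server answers /api/status with valid JSON."""
    try:
        request = build_status_request(base_url, token)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            json.load(resp)
            return resp.status == 200
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are OSError; bad JSON is ValueError.
        return False
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">jupyter_is_up("http://jupyter:8888", token)</code> before enabling the engine separates network or token problems from Open WebUI configuration problems.</p>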

<h3 id="fix-3-move-bedrock-to-eu-west-2-london">Fix 3: Move Bedrock to eu-west-2 (London)</h3>

<p>Changed the gateway’s AWS region from <code class="language-plaintext highlighter-rouge">us-east-1</code> to <code class="language-plaintext highlighter-rouge">eu-west-2</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">environment</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">AWS_REGION=eu-west-2</span>
</code></pre></div></div>

<p>With cross-region inference enabled, <code class="language-plaintext highlighter-rouge">global.*</code> model prefixes automatically route to the nearest available capacity.</p>

<p><strong>Why:</strong> Reduces API latency by ~100-200ms per request. Over a 20-turn sub-agent workflow, that’s 2-4 seconds saved — and more importantly, fewer timeout-inducing delays.</p>

<p><strong>Drawback:</strong> If a specific model isn’t available in <code class="language-plaintext highlighter-rouge">eu-west-2</code>, the cross-region routing adds its own overhead. Model availability can vary by region, though with <code class="language-plaintext highlighter-rouge">global.*</code> prefixes this is mostly transparent.</p>

<hr />

<h2 id="-custom-code-patches">🔬 Custom Code Patches</h2>

<p>I maintain four patches across three files, all bind-mounted into the containers to override upstream code. Each patch exists to solve a specific problem, but they all come with maintenance costs.</p>

<h3 id="patch-1-empty-model-cache-guard-modelspy">Patch 1: Empty Model Cache Guard (<code class="language-plaintext highlighter-rouge">models.py</code>)</h3>

<p><strong>The problem:</strong> When the Bedrock gateway is temporarily unreachable, Open WebUI’s model list refresh returns empty. The upstream code caches this empty result, causing every subsequent request to fail with “Model not found” until the next successful refresh. During sub-agent workflows where the model list is re-checked between tool calls, this creates a cascade of failures.</p>

<p><strong>The fix:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Only update the cache if we got a non-empty model list
</span><span class="k">if</span> <span class="n">models_dict</span><span class="p">:</span>
    <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="n">app</span><span class="p">.</span><span class="n">state</span><span class="p">.</span><span class="n">MODELS</span><span class="p">,</span> <span class="n">RedisDict</span><span class="p">):</span>
        <span class="n">request</span><span class="p">.</span><span class="n">app</span><span class="p">.</span><span class="n">state</span><span class="p">.</span><span class="n">MODELS</span><span class="p">.</span><span class="nf">set</span><span class="p">(</span><span class="n">models_dict</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">request</span><span class="p">.</span><span class="n">app</span><span class="p">.</span><span class="n">state</span><span class="p">.</span><span class="n">MODELS</span> <span class="o">=</span> <span class="n">models_dict</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">log</span><span class="p">.</span><span class="nf">warning</span><span class="p">(</span><span class="sh">'</span><span class="s">get_all_models() returned empty model list, keeping previous cache</span><span class="sh">'</span><span class="p">)</span>
</code></pre></div></div>

<p>Same pattern applied to <code class="language-plaintext highlighter-rouge">BASE_MODELS</code>.</p>

<p><strong>Drawback:</strong> If a model is genuinely removed from Bedrock, the stale cache will continue serving it until a successful refresh eventually returns the updated list. This could cause confusing errors if a user selects a model that exists in cache but no longer exists upstream.</p>
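
<p>Stripped of the Open WebUI specifics, the guard is a cache that refuses to overwrite good data with an empty result. A self-contained sketch of that pattern, where <code class="language-plaintext highlighter-rouge">ModelCache</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">request.app.state.MODELS</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging

log = logging.getLogger(__name__)


class ModelCache:
    """Keeps the last non-empty model list (hypothetical stand-in for app.state.MODELS)."""

    def __init__(self):
        self.models = {}

    def update(self, fresh):
        """Replace the cache only when the refresh returned something."""
        if fresh:
            self.models = dict(fresh)
            return True
        log.warning("refresh returned no models, keeping %d cached", len(self.models))
        return False
</code></pre></div></div>

<p>Returning a boolean lets the caller count consecutive empty refreshes, which is one way to notice that the cache has been stale for a long time.</p>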

<h3 id="patch-2-default-feature-flags-middlewarepy">Patch 2: Default Feature Flags (<code class="language-plaintext highlighter-rouge">middleware.py</code>)</h3>

<p><strong>The problem:</strong> Open WebUI requires users to manually enable web search and memory recall per-chat. For a single-user setup where you always want these features, this is friction.</p>

<p><strong>The fix:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">features</span> <span class="o">=</span> <span class="n">form_data</span><span class="p">.</span><span class="nf">pop</span><span class="p">(</span><span class="sh">'</span><span class="s">features</span><span class="sh">'</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span> <span class="ow">or</span> <span class="p">{}</span>
<span class="n">features</span><span class="p">.</span><span class="nf">setdefault</span><span class="p">(</span><span class="sh">'</span><span class="s">web_search</span><span class="sh">'</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">features</span><span class="p">.</span><span class="nf">setdefault</span><span class="p">(</span><span class="sh">'</span><span class="s">memory</span><span class="sh">'</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Drawback:</strong> Every single chat now triggers a web search — even for simple “hello” messages. This adds 2-5 seconds of latency to every response, increases API costs (SearXNG queries + RAG processing), and occasionally returns irrelevant search results that confuse the model. Memory retrieval runs on every message too, adding its own overhead.</p>
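
<p>One detail of <code class="language-plaintext highlighter-rouge">setdefault</code> matters here: it only fills in keys that are absent, so an explicit <code class="language-plaintext highlighter-rouge">False</code> in the request is preserved. The sketch below wraps the same three lines in a hypothetical function to show that behaviour. Whether the Open WebUI frontend sends an explicit <code class="language-plaintext highlighter-rouge">False</code> when a toggle is off is an assumption here, not something the patch guarantees.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def apply_feature_defaults(form_data):
    """Same logic as the patch: fill in defaults without overriding explicit choices."""
    features = form_data.pop("features", None) or {}
    features.setdefault("web_search", True)
    features.setdefault("memory", True)
    return features


# No features sent: both default to True.
print(apply_feature_defaults({}))
# Explicit False for web_search is kept; memory still defaults to True.
print(apply_feature_defaults({"features": {"web_search": False}}))
</code></pre></div></div>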

<h3 id="patch-3-default-max_tokens-middlewarepy">Patch 3: Default max_tokens (<code class="language-plaintext highlighter-rouge">middleware.py</code>)</h3>

<p><strong>The problem:</strong> Without an explicit <code class="language-plaintext highlighter-rouge">max_tokens</code>, some Bedrock models default to very low token limits, causing truncated responses. This is particularly harmful for tool-use scenarios where the model needs to output complete JSON for function call arguments.</p>

<p><strong>The fix:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="sh">'</span><span class="s">max_tokens</span><span class="sh">'</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">form_data</span><span class="p">:</span>
    <span class="n">form_data</span><span class="p">[</span><span class="sh">'</span><span class="s">max_tokens</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">16384</span>
</code></pre></div></div>

<p><strong>Drawback:</strong> A higher cap raises the worst-case cost per request. Output is billed per token actually generated, so a short yes/no answer costs no more than before, but a rambling or runaway response can now run to 16K tokens before it is cut off. The cost impact is real but manageable for single-user usage.</p>
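
<p>The membership test in the patch only fires when the key is missing entirely. If a client sends <code class="language-plaintext highlighter-rouge">max_tokens</code> as <code class="language-plaintext highlighter-rouge">null</code> (an assumption, not confirmed for Open WebUI), the key is present and the default is skipped. A slightly stricter variant, written as a hypothetical helper, treats both cases the same:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DEFAULT_MAX_TOKENS = 16384


def ensure_max_tokens(form_data, default=DEFAULT_MAX_TOKENS):
    """Set max_tokens when it is absent or None; leave explicit values alone."""
    if form_data.get("max_tokens") is None:
        form_data["max_tokens"] = default
    return form_data
</code></pre></div></div>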

<h3 id="patch-4-bedrock-gateway-model-caching-model_patchedpy">Patch 4: Bedrock Gateway Model Caching (<code class="language-plaintext highlighter-rouge">model_patched.py</code>)</h3>

<p><strong>The problem:</strong> The upstream Bedrock gateway calls AWS’s <code class="language-plaintext highlighter-rouge">ListFoundationModels</code> and <code class="language-plaintext highlighter-rouge">ListInferenceProfiles</code> APIs on every single <code class="language-plaintext highlighter-rouge">/models</code> request. These are synchronous boto3 calls that block the async event loop and take 1-3 seconds each.</p>

<p><strong>The fix:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">time</span>

<span class="n">_cached_models</span> <span class="o">=</span> <span class="bp">None</span>
<span class="n">_cache_timestamp</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">_CACHE_TTL</span> <span class="o">=</span> <span class="mi">300</span>  <span class="c1"># 5 minutes
</span>
<span class="k">def</span> <span class="nf">_get_models_cached</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">_cached_models</span><span class="p">,</span> <span class="n">_cache_timestamp</span>
    <span class="n">now</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">_cached_models</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="p">(</span><span class="n">now</span> <span class="o">-</span> <span class="n">_cache_timestamp</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">_CACHE_TTL</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">_cached_models</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">models</span> <span class="o">=</span> <span class="n">chat_model</span><span class="p">.</span><span class="nf">list_models</span><span class="p">()</span>
        <span class="n">_cached_models</span> <span class="o">=</span> <span class="n">models</span>
        <span class="n">_cache_timestamp</span> <span class="o">=</span> <span class="n">now</span>
        <span class="k">return</span> <span class="n">models</span>
    <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">_cached_models</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">_cached_models</span>  <span class="c1"># Stale cache on error
</span>        <span class="k">raise</span>
</code></pre></div></div>

<p>Also wrapped in <code class="language-plaintext highlighter-rouge">run_in_threadpool</code> to prevent event loop blocking.</p>

<p><strong>Drawback:</strong> New models deployed to Bedrock won’t appear for up to 5 minutes. There’s no cache invalidation mechanism — the only way to force a refresh is to restart the gateway container. The global mutable state could theoretically have race conditions under high concurrency.</p>
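
<p>Both the race condition and the missing invalidation hook could be handled with a small wrapper. This is a minimal sketch, not the gateway’s actual code: <code class="language-plaintext highlighter-rouge">TTLCache</code> and <code class="language-plaintext highlighter-rouge">get_async</code> are hypothetical names, a <code class="language-plaintext highlighter-rouge">threading.Lock</code> serialises refreshes, <code class="language-plaintext highlighter-rouge">force=True</code> bypasses the TTL, and <code class="language-plaintext highlighter-rouge">asyncio.to_thread</code> (Python 3.9+) stands in for <code class="language-plaintext highlighter-rouge">run_in_threadpool</code> so the example needs only the standard library.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import asyncio
import threading
import time


class TTLCache:
    """Caches one value for ttl seconds and falls back to stale data on error."""

    def __init__(self, loader, ttl=300.0):
        self._loader = loader
        self._ttl = ttl
        self._lock = threading.Lock()
        self._value = None
        self._expires = 0.0

    def get(self, force=False):
        with self._lock:  # only one refresh runs at a time
            fresh = self._value is not None and time.monotonic() &lt; self._expires
            if fresh and not force:
                return self._value
            try:
                self._value = self._loader()
                self._expires = time.monotonic() + self._ttl
            except Exception:
                if self._value is None:
                    raise  # nothing cached to fall back on
            return self._value


async def get_async(cache, force=False):
    """Run the blocking lookup off the event loop."""
    return await asyncio.to_thread(cache.get, force)
</code></pre></div></div>

<p>Holding the lock across the loader call means concurrent callers wait for a single refresh instead of each hitting AWS, at the price of blocking those threads for the duration of the call.</p>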

<hr />

<h2 id="️-the-cost-of-custom-patches">⚠️ The Cost of Custom Patches</h2>

<p>All four patches are applied via Docker bind mounts — the patched files are stored on the host and mounted over the container’s originals at startup. This means:</p>

<ol>
  <li><strong>Watchtower updates don’t break the patches</strong> — the bind mounts persist across image updates</li>
  <li><strong>But upstream API changes can break everything</strong> — if an Open WebUI update changes internal function signatures that the patches depend on, the container will crash on startup</li>
  <li><strong>Version drift accumulates</strong> — the longer you maintain patches, the harder it becomes to merge upstream improvements</li>
</ol>

<p>I originally maintained a fully pinned <code class="language-plaintext highlighter-rouge">middleware.py</code> (all 4,887 lines), but the drift became unsustainable. The pinned version was missing over a dozen upstream fixes including <code class="language-plaintext highlighter-rouge">strip_empty_content_blocks()</code> (which prevents Claude/Gemini errors), <code class="language-plaintext highlighter-rouge">merge_system_messages()</code> (which prevents template parsing failures), and proper <code class="language-plaintext highlighter-rouge">done: True</code> completion markers.</p>

<p>The current approach is better: <strong>start from the latest upstream, apply minimal targeted patches.</strong> The four patches above total ~20 lines of actual changes. When upstream updates, re-extracting the base files and re-applying the patches takes minutes, not hours.</p>
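
<p>The post does not say how the patches are re-applied, but the step lends itself to a small script. One possible shape, using a hypothetical <code class="language-plaintext highlighter-rouge">apply_after_anchor</code> helper: find a single anchor line in the fresh upstream file, insert the patch lines after it, and refuse to continue if the anchor is missing or ambiguous.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def apply_after_anchor(source, anchor, patch_lines):
    """Insert patch_lines after the one line containing anchor, or fail loudly."""
    lines = source.splitlines()
    hits = [i for i, line in enumerate(lines) if anchor in line]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one anchor match, found {len(hits)}")
    at = hits[0]
    # Reuse the anchor line's indentation so the inserted code stays in the same block.
    indent = lines[at][: len(lines[at]) - len(lines[at].lstrip())]
    patched = lines[: at + 1] + [indent + p for p in patch_lines] + lines[at + 1 :]
    return "\n".join(patched) + "\n"
</code></pre></div></div>

<p>Failing at patch time is the point: a missing anchor is caught before the file is mounted, rather than showing up as a container that crashes on startup.</p>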

<hr />

<h2 id="-results">📊 Results</h2>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Before</th>
      <th>After</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Sub-agent success rate</td>
      <td>~60% (intermittent drops)</td>
      <td>~100%</td>
    </tr>
    <tr>
      <td>Code execution time</td>
      <td>10-30s per block (Pyodide)</td>
      <td>1-3s per block (Jupyter)</td>
    </tr>
    <tr>
      <td>WebSocket “keepalive ping failed”</td>
      <td>Every few minutes</td>
      <td>Rare (idle connections only)</td>
    </tr>
    <tr>
      <td>Bedrock API latency</td>
      <td>~200ms (us-east-1)</td>
      <td>~50ms (eu-west-2)</td>
    </tr>
    <tr>
      <td>Custom patch maintenance</td>
      <td>4,887-line pinned file</td>
      <td>~20 lines across 3 files</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="-lessons-learned">💡 Lessons Learned</h2>

<h3 id="1-pyodide-is-the-wrong-tool-for-server-side-ai-workflows">1. Pyodide Is the Wrong Tool for Server-Side AI Workflows</h3>

<p>Browser-based code execution makes sense for interactive notebooks. For autonomous AI agents running multi-step code workflows, the WebSocket round-trip is a dealbreaker. Jupyter is heavier but eliminates an entire class of failure modes.</p>

<h3 id="2-default-timeouts-assume-simple-conversations">2. Default Timeouts Assume Simple Conversations</h3>

<p>Most AI UIs are designed for single-turn Q&amp;A. When you add sub-agents, tool calls, web search, code execution, and RAG — all in a single conversation turn — the default 20-second WebSocket ping timeout is laughably short. Know your workload and set timeouts accordingly.</p>

<h3 id="3-maintain-patches-not-forks">3. Maintain Patches, Not Forks</h3>

<p>Pinning an entire 5,000-line file to avoid upstream breakage feels safe, but it’s a trap. You lose every upstream bugfix and improvement. Minimal, targeted patches that can be re-applied to fresh upstream files are far more sustainable.</p>

<h3 id="4-every-customisation-has-a-cost">4. Every Customisation Has a Cost</h3>

<p>Defaulting web search to “always on” sounds great until every trivial question adds 3 seconds of latency. Setting <code class="language-plaintext highlighter-rouge">max_tokens=16384</code> prevents truncation but increases API costs. Server-side Jupyter execution is fast but widens the attack surface. <strong>Document the trade-offs, not just the benefits.</strong></p>

<h3 id="5-cache-defensively">5. Cache Defensively</h3>

<p>Never replace good data with empty data. Whether it’s model lists, DNS caches, or configuration stores — if the upstream source is temporarily unavailable, serving stale data is almost always better than serving nothing.</p>]]></content><author><name>Joshua Mein</name></author><category term="Cloud" /><category term="DevOps" /><category term="docker" /><category term="aws" /><category term="bedrock" /><category term="openwebui" /><category term="python" /><category term="jupyter" /><summary type="html"><![CDATA[How I diagnosed and fixed timeout failures, slow code execution, and WebSocket drops in my self-hosted Open WebUI + AWS Bedrock setup — including custom patches, a Jupyter code execution server, and the trade-offs of maintaining upstream forks.]]></summary></entry><entry><title type="html">I Audited Every VM in My Homelab — Here’s What I Found (and Fixed)</title><link href="https://joshwaamein.github.io/posts/i-audited-every-vm-in-my-homelab/" rel="alternate" type="text/html" title="I Audited Every VM in My Homelab — Here’s What I Found (and Fixed)" /><published>2026-03-26T12:00:00+00:00</published><updated>2026-03-26T12:00:00+00:00</updated><id>https://joshwaamein.github.io/posts/i-audited-every-vm-in-my-homelab</id><content type="html" xml:base="https://joshwaamein.github.io/posts/i-audited-every-vm-in-my-homelab/"><![CDATA[<p>My homelab has been running for a couple of years now. Three Proxmox hosts, 27 VMs, a mix of blockchain validators, DNS, monitoring, backup servers, and various projects I’ve spun up and half-forgotten about. It works, mostly. But I’d never actually sat down and audited <em>everything</em> — checking what’s over-provisioned, what’s under-monitored, what’s running outdated software, and what’s one bad day away from a disk-full meltdown.</p>

<p>So I did exactly that. And it wasn’t pretty.</p>

<h2 id="the-audit">The Audit</h2>

<p>The audit covered every running VM across all three hosts, pulling data from Proxmox configs (<code class="language-plaintext highlighter-rouge">qm config</code>), Zabbix API metrics, and in-VM checks via <code class="language-plaintext highlighter-rouge">qm guest exec</code>. For each VM, I checked CPU and RAM utilisation against what was allocated, disk usage, backup coverage, monitoring status, guest agent health, OS version, and hardware configuration.</p>

<h2 id="the-scary-findings">The Scary Findings</h2>

<h3 id="a-disk-about-to-explode">A Disk About to Explode</h3>

<p>My AI server was sitting at <strong>81% disk usage with only 5GB free</strong> on a 32GB disk. It was one large model download away from grinding to a halt.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># The fix was straightforward</span>
qm disk resize 3465 scsi0 +16G                    <span class="c"># Proxmox side</span>
qm guest <span class="nb">exec </span>3465 <span class="nt">--</span> growpart /dev/sda 3          <span class="c"># Grow partition</span>
qm guest <span class="nb">exec </span>3465 <span class="nt">--</span> pvresize /dev/sda3           <span class="c"># Extend PV</span>
qm guest <span class="nb">exec </span>3465 <span class="nt">--</span> lvextend <span class="nt">-l</span> +100%FREE /dev/ubuntu-vg/ubuntu-lv
qm guest <span class="nb">exec </span>3465 <span class="nt">--</span> resize2fs /dev/ubuntu-vg/ubuntu-lv
</code></pre></div></div>

<p>32GB to 48GB. Usage dropped from 81% to 49%. Crisis averted.</p>

<h3 id="four-vms-with-broken-monitoring">Four VMs With Broken Monitoring</h3>

<p>I had 4 VMs registered in Zabbix that were returning <strong>zero for every metric</strong>. They showed as “monitored” in the dashboard, but the Zabbix agent wasn’t actually running inside any of them. The hosts existed in Zabbix, the agent was installed, but the service was dead — so every graph was a flat line at zero.</p>

<p>If any of those VMs had a problem, I’d have had no alert. The fix was reinstalling zabbix-agent2 v7.0 from the official repo on all four, configuring the Zabbix server address, restarting the service, and confirming through the Zabbix API that data had started flowing.</p>
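
<p>A check like this can be automated so that a “monitored” host with a dead agent gets flagged. The sketch below assumes the items come from the Zabbix API’s <code class="language-plaintext highlighter-rouge">item.get</code> method with the <code class="language-plaintext highlighter-rouge">lastclock</code> field requested. The helper names and the 15-minute threshold are hypothetical, and hosts with no items at all will not show up in the result.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time


def item_get_payload(host_ids):
    """JSON-RPC body for item.get, asking only for the fields used below."""
    return {
        "jsonrpc": "2.0",
        "method": "item.get",
        "params": {"hostids": host_ids, "output": ["hostid", "name", "lastclock"]},
        "id": 1,
    }


def silent_hosts(items, max_age_seconds=900, now=None):
    """Return host IDs where no item has reported within max_age_seconds."""
    now = time.time() if now is None else now
    newest = {}
    for item in items:
        clock = int(item["lastclock"])  # the API returns numbers as strings
        newest[item["hostid"]] = max(newest.get(item["hostid"], 0), clock)
    return sorted(h for h, clock in newest.items() if now - clock &gt; max_age_seconds)
</code></pre></div></div>

<p>Judging a host by its newest item avoids false alarms from items that are polled infrequently by design.</p>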

<h3 id="ghost-vms">Ghost VMs</h3>

<p>Two VMs had no QEMU guest agent at all, meaning Proxmox couldn’t cleanly shut them down, couldn’t run commands inside them, and couldn’t even see their IP addresses. One of them was stopped and had no <code class="language-plaintext highlighter-rouge">onboot</code> flag, so it wouldn’t come back up after a host reboot either.</p>

<p>Getting the guest agent onto a stopped VM with no SSH access required mounting its raw disk on the host, injecting an SSH key, starting it, and then installing the agent. Not fun, but it worked.</p>

<h3 id="vms-still-on-ubuntu-2204">VMs Still on Ubuntu 22.04</h3>

<p>Seven of my Ubuntu VMs were still on 22.04 Jammy. Not end-of-life yet, but approaching the end of standard support. I’d been putting off the upgrades because doing them one-by-one is tedious and risky if something breaks. More on how I batched these later.</p>

<h3 id="pbs-servers-without-unattended-upgrades">PBS Servers Without Unattended-Upgrades</h3>

<p>My three Proxmox Backup Server instances — the systems responsible for protecting everything else — didn’t have <code class="language-plaintext highlighter-rouge">unattended-upgrades</code> configured. Now, these servers <em>are</em> patched regularly by my Ansible update playbook, so they weren’t actually unpatched. But Ansible runs on a schedule, and there’s always a gap between a critical CVE dropping and the next playbook run. Adding <code class="language-plaintext highlighter-rouge">unattended-upgrades</code> as a safety net means security patches get applied daily regardless of when Ansible runs next — belt and suspenders.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># The key is not just installing the package, but configuring it to actually run</span>
<span class="nb">cat</span> <span class="o">&gt;</span> /etc/apt/apt.conf.d/20auto-upgrades <span class="o">&lt;&lt;</span> <span class="sh">'</span><span class="no">EOF</span><span class="sh">'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
APT::Periodic::Download-Upgradeable-Packages "1";
APT::Periodic::AutocleanInterval "7";
</span><span class="no">EOF
</span></code></pre></div></div>
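
<p>Since the failure mode is a package that is installed but never runs, it is worth checking the file’s contents rather than just its existence. A minimal sketch that parses the <code class="language-plaintext highlighter-rouge">APT::Periodic</code> lines shown above; the function names are hypothetical, and it only handles the simple one-setting-per-line form used in this file.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

PERIODIC = re.compile(r'^\s*APT::Periodic::([\w-]+)\s+"([^"]*)"\s*;', re.MULTILINE)


def periodic_settings(text):
    """Parse APT::Periodic::Name "value"; lines into a dict."""
    return dict(PERIODIC.findall(text))


def auto_upgrades_enabled(text):
    """True when both the list refresh and the upgrade run are switched on."""
    settings = periodic_settings(text)
    keys = ("Update-Package-Lists", "Unattended-Upgrade")
    return all(settings.get(key, "0") not in ("", "0") for key in keys)
</code></pre></div></div>

<p>Feeding it the contents of <code class="language-plaintext highlighter-rouge">/etc/apt/apt.conf.d/20auto-upgrades</code> from each host gives a yes/no answer that can be collected across hosts.</p>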

<h3 id="vms-on-directory-storage">VMs on Directory Storage</h3>

<p>Six VMs had their disks stored as qcow2/raw files on directory storage instead of LVM-thin. This means an extra filesystem layer under every disk, which costs I/O performance and adds overhead compared with a thin-provisioned logical volume. Most of my other VMs were already on LVM-thin; these were just stragglers from older deployments.</p>

<h3 id="a-backup-job-pointing-to-a-deleted-vm">A Backup Job Pointing to a Deleted VM</h3>

<p>One of the backup jobs was referencing a VMID that doesn’t exist anymore. Meanwhile, a VM that I’d recently created wasn’t in <em>any</em> backup job. Classic.</p>
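
<p>Both problems are the two halves of a set difference between the VMs that exist and the VMs named in backup jobs. A minimal sketch with a hypothetical <code class="language-plaintext highlighter-rouge">backup_gaps</code> helper and made-up VMIDs; how the two lists are gathered from Proxmox is left out.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def backup_gaps(existing_vmids, job_vmids):
    """Compare the VMs that exist with the VMs named in backup jobs."""
    existing, covered = set(existing_vmids), set(job_vmids)
    return {
        "missing_from_backups": sorted(existing - covered),
        "stale_in_jobs": sorted(covered - existing),
    }


# Hypothetical IDs for illustration only.
print(backup_gaps([101, 102, 103], [101, 102, 999]))
# {'missing_from_backups': [103], 'stale_in_jobs': [999]}
</code></pre></div></div>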

<h2 id="the-remediation">The Remediation</h2>

<p>Here’s what I actually did, roughly in order of priority:</p>

<h3 id="critical">Critical</h3>

<ul>
  <li><strong>Expanded the AI server disk</strong> from 32GB to 48GB (live, no downtime)</li>
  <li><strong>Fixed 4 broken Zabbix agents</strong> — reinstalled zabbix-agent2 v7.0, configured the Zabbix server address, verified data flow through the API</li>
  <li><strong>Installed guest agents</strong> on 2 VMs that were previously unmanageable</li>
  <li><strong>Added an unmonitored VM to Zabbix</strong> — it had no monitoring at all</li>
</ul>

<h3 id="high-priority">High Priority</h3>

<ul>
  <li><strong>Fixed backup jobs</strong> — added the missing VM and removed the ghost VMID</li>
  <li><strong>Configured unattended-upgrades</strong> on all 3 PBS VMs as a safety net alongside Ansible</li>
</ul>

<h3 id="medium-priority">Medium Priority</h3>

<ul>
  <li><strong>Fixed boot orders</strong> on 5 VMs — removed unnecessary PXE boot entries that were slowing down startup</li>
  <li>
    <p><strong>Reduced CPU allocations</strong> on 3 over-provisioned VMs (one had 6 cores on an 8-core host at 3% usage)</p>
  </li>
  <li><strong>Added iothread</strong> to 2 VMs that were missing it. In Proxmox, enabling <code class="language-plaintext highlighter-rouge">iothread</code> on a disk gives it a dedicated I/O thread instead of handling its I/O in QEMU’s main loop (for SCSI disks this requires the VirtIO SCSI single controller). This can reduce latency and improve throughput, especially under heavy disk load. It’s close to a free performance win; the only catch is that it requires a brief VM restart to apply:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qm <span class="nb">set </span>401 <span class="nt">--scsi0</span> &lt;storage&gt;:vm-401-disk-0,iothread<span class="o">=</span>1,size<span class="o">=</span>32G,ssd<span class="o">=</span>1
qm <span class="nb">set </span>555 <span class="nt">--scsi1</span> &lt;storage&gt;:vm-555-disk-0,iothread<span class="o">=</span>1,size<span class="o">=</span>100G
</code></pre></div></div>
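
<p>Before relying on the flag, it’s worth checking the controller type, since <code class="language-plaintext highlighter-rouge">iothread</code> on a SCSI disk only takes effect with the VirtIO SCSI single controller. A sketch using VM 401 from the example above:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Show the controller type and the disk options for VM 401
qm config 401 | grep -E '^(scsihw|scsi0):'

# If scsihw is not virtio-scsi-single, switch it (applied on the next full stop/start)
qm set 401 --scsihw virtio-scsi-single
</code></pre></div></div>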

<ul>
  <li><strong>Migrated 6 disks</strong> from directory storage to LVM-thin (live migration, no downtime):</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Move from local qcow2 to LVM-thin — live, no VM shutdown needed</span>
qm disk move 239 scsi0 CRUCIAL_SSD1 <span class="nt">--delete</span> 1
qm disk move 404 scsi0 CRUCIAL_SSD1 <span class="nt">--delete</span> 1
qm disk move 4070 scsi0 usb-crucial-ssd-1 <span class="nt">--delete</span> 1
</code></pre></div></div>
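
<p>To find stragglers in the first place, the disk lines in each VM config show which storage every disk lives on. This is a rough sketch; <code class="language-plaintext highlighter-rouge">qm list</code> only covers the node it runs on, and CD-ROM entries will show up in the output too:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Print "VMID disk: storage:volume,options" for every disk on this node.
# The storage name is the part between the disk key and the second colon.
for vmid in $(qm list | awk 'NR&gt;1 {print $1}'); do
    qm config "$vmid" | grep -E '^(scsi|virtio|sata|ide)[0-9]+:' | sed "s/^/$vmid /"
done
</code></pre></div></div>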

<h3 id="the-big-one-batch-os-upgrades">The Big One: Batch OS Upgrades</h3>

<p>Seven VMs needed upgrading from Ubuntu 22.04 to 24.04. Rather than doing them one at a time over several weeks, I decided to batch them all at once. The reasoning: if the upgrade process has a systemic issue (like a broken package or incompatible config), I’d rather find out across all VMs simultaneously and fix it once, than discover it seven separate times.</p>

<p>The trick was deploying an upgrade script to each VM via <code class="language-plaintext highlighter-rouge">qm guest exec</code> (base64 encoded to avoid quoting hell), then launching it as a systemd transient service so it persists after the guest exec connection drops:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Deploy script via guest agent (base64 avoids shell quoting nightmares)</span>
qm guest <span class="nb">exec</span> <span class="nv">$vmid</span> <span class="nt">--</span> bash <span class="nt">-c</span> <span class="s1">'echo &lt;base64_script&gt; | base64 -d &gt; /root/do-upgrade.sh &amp;&amp; chmod +x /root/do-upgrade.sh'</span>

<span class="c"># Launch as a transient systemd service that survives the guest exec timeout</span>
qm guest <span class="nb">exec</span> <span class="nv">$vmid</span> <span class="nt">--</span> systemd-run <span class="nt">--unit</span><span class="o">=</span>os-upgrade /root/do-upgrade.sh
</code></pre></div></div>
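
<p>The <code class="language-plaintext highlighter-rouge">&lt;base64_script&gt;</code> placeholder is just the script encoded on the Proxmox host. A minimal sketch of the host side, assuming GNU <code class="language-plaintext highlighter-rouge">base64</code>; the VMID list and the local script path are hypothetical:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# Hypothetical VMIDs; replace with the guests to upgrade
VMIDS=(101 102 103)

# -w0 keeps the encoded script on one line so it survives being passed as an argument
payload=$(base64 -w0 ./do-upgrade.sh)

for vmid in "${VMIDS[@]}"; do
    # Decode inside the guest, then start the script as a transient systemd unit
    qm guest exec "$vmid" -- bash -c "echo $payload | base64 -d &gt; /root/do-upgrade.sh &amp;&amp; chmod +x /root/do-upgrade.sh"
    qm guest exec "$vmid" -- systemd-run --unit=os-upgrade /root/do-upgrade.sh
done
</code></pre></div></div>

<p>Base64 output only contains letters, digits, <code class="language-plaintext highlighter-rouge">+</code>, <code class="language-plaintext highlighter-rouge">/</code> and <code class="language-plaintext highlighter-rouge">=</code>, so it needs no further escaping inside the quoted command.</p>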

<p>The script itself:</p>
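<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Suppress interactive debconf prompts</span>
<span class="nb">export </span><span class="nv">DEBIAN_FRONTEND</span><span class="o">=</span>noninteractive
<span class="c"># Refresh package lists quietly before the release upgrade</span>
apt-get update <span class="nt">-qq</span>
<span class="c"># Run the release upgrade with the non-interactive frontend</span>
do-release-upgrade <span class="nt">-f</span> DistUpgradeViewNonInteractive
<span class="c"># Reboot into the new release (runs regardless of the upgrade's exit status)</span>
reboot
</code></pre></div></div>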
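
<p>A slightly more defensive variant would skip the reboot when the upgrade step fails. This is a hypothetical alternative, not the script I ran, and it assumes <code class="language-plaintext highlighter-rouge">do-release-upgrade</code> signals failure through a non-zero exit status:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# Abort on the first failing command, unset variable, or failed pipeline
set -euo pipefail
export DEBIAN_FRONTEND=noninteractive

apt-get update -qq
# With set -e, a non-zero exit here stops the script before the reboot
do-release-upgrade -f DistUpgradeViewNonInteractive
reboot
</code></pre></div></div>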

<p>I launched all seven simultaneously across both hosts and monitored for failures. Six of seven upgraded cleanly. One needed a dpkg repair via chroot after the upgrade was interrupted mid-package-install, but it was nothing a <code class="language-plaintext highlighter-rouge">dpkg --configure -a</code> couldn’t fix.</p>
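
<p>Monitoring can be as simple as polling each guest through the agent. A sketch under a few assumptions: the VMID list is hypothetical, the unit name matches the <code class="language-plaintext highlighter-rouge">--unit=os-upgrade</code> used at launch, and calls will fail while a guest is rebooting or its agent is down:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# Hypothetical VMIDs; replace with the guests being upgraded
VMIDS=(101 102 103)

for vmid in "${VMIDS[@]}"; do
    echo "== $vmid =="
    # "active" while the upgrade is still running; the transient unit is gone after the reboot
    qm guest exec "$vmid" -- systemctl is-active os-upgrade
    # Release description the guest currently reports
    qm guest exec "$vmid" -- lsb_release -ds
done
</code></pre></div></div>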

<p>After the upgrades, one VM triggered a Zabbix disk space alert: the OS upgrade consumed enough extra space to push it over 80%. It turned out that only 15GB of the 32GB virtual disk was allocated to the LVM logical volume, with the other 17GB sitting unused. A quick <code class="language-plaintext highlighter-rouge">lvextend</code> and <code class="language-plaintext highlighter-rouge">resize2fs</code> sorted it without even needing to resize the virtual disk.</p>
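
<p>For reference, the fix looks roughly like this inside the guest. The volume group and logical volume names are an assumption (they are the Ubuntu installer defaults), as is the ext4 root filesystem:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># VFree shows how much space in the volume group is still unallocated
vgs

# Give all remaining free space to the root logical volume
lvextend -l +100%FREE /dev/ubuntu-vg/ubuntu-lv

# Grow the ext4 filesystem to fill the larger volume (works while mounted)
resize2fs /dev/ubuntu-vg/ubuntu-lv

# Confirm the new size
df -h /
</code></pre></div></div>

<p>Passing <code class="language-plaintext highlighter-rouge">-r</code> to <code class="language-plaintext highlighter-rouge">lvextend</code> resizes the filesystem in the same step.</p>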

<h3 id="cleanup">Cleanup</h3>

<ul>
  <li><strong>Archived 20 stale Zabbix hosts</strong> — old VMs, deleted devices, test entries. Tagged them <code class="language-plaintext highlighter-rouge">archived=true</code> via the API rather than deleting, in case I need to reference the historical data.</li>
  <li><strong>Added missing tags</strong> to VM configs for consistency</li>
  <li><strong>Fixed backup job references</strong> — removed non-existent VMIDs and added newly created VMs</li>
</ul>
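
<p>Tagging a host through the Zabbix API is a single <code class="language-plaintext highlighter-rouge">host.update</code> call. The sketch below assumes a Zabbix version that accepts API tokens in the <code class="language-plaintext highlighter-rouge">Authorization</code> header; the URL, token and host ID are placeholders. Note that the <code class="language-plaintext highlighter-rouge">tags</code> parameter replaces the host’s existing tags, so any tags worth keeping need to be included in the same call:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# Placeholders: point these at your own Zabbix frontend and API token
ZABBIX_URL="https://zabbix.example.com/api_jsonrpc.php"
ZABBIX_TOKEN="replace-with-api-token"

# Set archived=true on one host (hostid 10001 is a made-up example)
curl -s -X POST "$ZABBIX_URL" \
  -H 'Content-Type: application/json-rpc' \
  -H "Authorization: Bearer $ZABBIX_TOKEN" \
  -d '{
        "jsonrpc": "2.0",
        "method": "host.update",
        "params": {
          "hostid": "10001",
          "tags": [{"tag": "archived", "value": "true"}]
        },
        "id": 1
      }'
</code></pre></div></div>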

<h2 id="lessons-learned">Lessons Learned</h2>

<ol>
  <li>
    <p><strong>Audit regularly.</strong> Technical debt compounds silently. Four VMs with broken monitoring could have meant months of invisible outages.</p>
  </li>
  <li>
    <p><strong>Don’t put VM disks on directory storage.</strong> LVM-thin is better in almost every way — thin provisioning, better I/O, proper snapshot support. Reserve <code class="language-plaintext highlighter-rouge">local</code> for ISOs and templates.</p>
  </li>
  <li>
    <p><strong><code class="language-plaintext highlighter-rouge">systemd-run</code> is your friend.</strong> When you need to launch a long-running process via <code class="language-plaintext highlighter-rouge">qm guest exec</code> that would otherwise time out, <code class="language-plaintext highlighter-rouge">systemd-run --unit=name /path/to/script</code> creates a transient service that keeps running after the connection drops.</p>
  </li>
  <li>
    <p><strong>Unattended-upgrades needs configuration, not just installation.</strong> The package alone does nothing — you need the <code class="language-plaintext highlighter-rouge">20auto-upgrades</code> and <code class="language-plaintext highlighter-rouge">50unattended-upgrades</code> config files with the right origins. Even if you have Ansible handling updates, it’s worth having as a safety net.</p>
  </li>
  <li>
    <p><strong>Batch your upgrades and monitor for failures.</strong> Doing OS upgrades one-by-one across weeks means you discover the same issues seven times. Batching them lets you catch systemic problems early and fix them once.</p>
  </li>
  <li>
    <p><strong>Base64 encode scripts</strong> when passing them through multiple layers of SSH/shell quoting. Saves hours of escaping hell.</p>
  </li>
</ol>

<h2 id="whats-left">What’s Left</h2>

<ul>
  <li><strong>PBS-S3 optimisation</strong> — My S3-backed PBS datastore kept dropping connections under load during the pre-flight backups. Needs a separate deep dive into cache management and retention policies.</li>
</ul>

<h2 id="final-state">Final State</h2>

<p>27 VMs audited. 15 remediation steps executed and verified. 7 OS upgrades. 6 disk migrations. 4 monitoring fixes. 20 stale hosts archived. Zero data lost.</p>]]></content><author><name>Joshua Mein</name></author><category term="Homelab" /><category term="DevOps" /><category term="proxmox" /><category term="linux" /><category term="automation" /><category term="zabbix" /><category term="monitoring" /><summary type="html"><![CDATA[A comprehensive audit of 27 VMs across 3 Proxmox hosts revealed critical storage issues, broken monitoring, outdated operating systems, and years of accumulated tech debt. Here's how I fixed it all in one session.]]></summary></entry><entry><title type="html">Outlook Classic Not Syncing New Gmail Folders</title><link href="https://joshwaamein.github.io/posts/outlook-classic-not-syncing-new-gmail-folders-the-ost-fix/" rel="alternate" type="text/html" title="Outlook Classic Not Syncing New Gmail Folders" /><published>2026-03-26T12:00:00+00:00</published><updated>2026-03-26T12:00:00+00:00</updated><id>https://joshwaamein.github.io/posts/outlook-classic-not-syncing-new-gmail-folders-the-ost-fix</id><content type="html" xml:base="https://joshwaamein.github.io/posts/outlook-classic-not-syncing-new-gmail-folders-the-ost-fix/"><![CDATA[<p>A friend had their Gmail account set up in Outlook Classic on Windows using IMAP/SMTP. The problem: whenever they created new folders or labels in Gmail’s web UI, they’d show up on their iPhone and iPad immediately, but never in Outlook. I’d previously fixed it for them by manually editing the IMAP subscribed folders list, but didn’t want to keep doing that every time they created a new label.</p>

<h2 id="what-i-tried">What I Tried</h2>

<p>First, the obvious: unchecking <strong>“When displaying hierarchy in Outlook, show only subscribed folders”</strong> in the IMAP Folders dialog. Didn’t help on its own.</p>

<p>Querying folders in the IMAP Folders dialog confirmed the missing folder existed on the server — Outlook could see it was there. But none of the usual tricks worked:</p>

<ul>
  <li>Send/Receive — no change</li>
  <li>Collapsing and expanding the folder tree — no change</li>
  <li>Restarting Outlook — no change</li>
  <li>Unsubscribing and resubscribing — no change</li>
</ul>

<p>The folder was on the server. Outlook knew it was there. It just refused to display it.</p>

<h2 id="the-fix">The Fix</h2>

<p>Renaming the Gmail OST file with a <code class="language-plaintext highlighter-rouge">.bak</code> extension and relaunching Outlook forced a complete resync from the IMAP server. When Outlook starts and can’t find its OST file, it creates a new one and pulls everything down fresh. This was the only thing that reliably brought in the new folders.</p>

<p>The OST file lives at:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%localappdata%\Microsoft\Outlook\
</code></pre></div></div>

<h2 id="automating-it">Automating It</h2>

<p>Rather than having them manually rename the file every time, I wrote a batch script that deletes the OST on startup instead, which forces the same rebuild:</p>

<div class="language-batch highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@echo <span class="na">off</span>
<span class="c">:: Kill Outlook if running</span>
<span class="nb">taskkill</span> <span class="na">/f /im </span><span class="kd">OUTLOOK</span><span class="err">.EXE</span> <span class="o">&gt;</span><span class="kr">nul</span> <span class="m">2</span><span class="o">&gt;&amp;</span><span class="m">1</span>

<span class="c">:: Wait for the file to be released</span>
<span class="nb">timeout</span> <span class="na">/t </span><span class="m">3</span> <span class="na">/nobreak </span><span class="o">&gt;</span><span class="kr">nul</span>

<span class="c">:: Delete every OST file in the Outlook data folder (this includes the Gmail one)</span>
<span class="nb">del</span> <span class="s2">"</span><span class="nv">%localappdata%</span><span class="s2">\Microsoft\Outlook\*.ost"</span> <span class="na">/q </span><span class="o">&gt;</span><span class="kr">nul</span> <span class="m">2</span><span class="o">&gt;&amp;</span><span class="m">1</span>

<span class="c">:: Relaunch Outlook</span>
<span class="nb">start</span> <span class="s2">""</span> <span class="s2">"C:\Program Files (x86)\Microsoft Office\root\Office16\OUTLOOK.EXE"</span>
</code></pre></div></div>

<p>Saved as <code class="language-plaintext highlighter-rouge">OutlookFresh.bat</code> and dropped into the Windows startup folder (<code class="language-plaintext highlighter-rouge">shell:startup</code>). Now every time they log in, Outlook starts fresh with a full resync from Gmail’s servers. The OST rebuild takes a minute or two depending on mailbox size, but after that everything — including any new folders created on other devices — is there.</p>

<h2 id="why-this-works">Why This Works</h2>

<p>The OST file is Outlook’s local cache of the IMAP mailbox. When the folder structure changes server-side, Outlook is supposed to pick it up during sync. In practice, it sometimes doesn’t — especially with Gmail’s label-as-folder IMAP mapping, which has always been a bit odd. Deleting the cache and forcing a rebuild from scratch bypasses whatever state Outlook has gotten itself into.</p>

<p>It’s not elegant, but it’s reliable. And for a non-technical user who just wants their folders to appear, a startup script they never have to think about is the right solution.</p>

<p><strong>Environment:</strong> Windows, Outlook Classic (32-bit), Gmail via IMAP/SMTP.</p>]]></content><author><name>Joshua Mein</name></author><category term="Code" /><category term="windows" /><category term="outlook" /><category term="gmail" /><category term="imap" /><category term="troubleshooting" /><summary type="html"><![CDATA[When new Gmail labels wouldn't appear in Outlook Classic despite existing on the IMAP server, the only reliable fix was deleting the OST file to force a full resync. Here's the batch script that automates it on every startup.]]></summary></entry></feed>