Optimizing Proxmox Backup Server with S3: Regional Migration and Fixing a Glacier Misconfiguration
How I investigated and resolved PBS S3 connection issues, migrated to a closer regional endpoint, and properly optimized backups after a Glacier lifecycle misconfiguration.
The Most Important Thing First: Glacier is Incompatible with PBS
Before anything else — if you’re running Proxmox Backup Server with an S3 backend, do not use Glacier lifecycle policies. This includes Glacier Instant Retrieval, Glacier Flexible Retrieval, and Glacier Deep Archive.
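PBS needs immediate, on-demand access to chunks for garbage collection, verification, deduplication, and restores. Glacier Flexible Retrieval and Glacier Deep Archive only return an object after an explicit restore request, which can take from minutes to 48 hours depending on tier and retrieval option. Glacier Instant Retrieval answers in milliseconds, but it adds per-GB retrieval fees and a 90-day minimum storage duration, and in my policy it was only the first step of a cascade into the archive tiers. The moment PBS tries to read an archived chunk, the request fails. This breaks GC, verification, and restores, either silently or with cryptic errors.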
The correct storage class for PBS on S3 is S3 Intelligent-Tiering. It automatically moves infrequently accessed data to cheaper tiers, and as long as the optional Archive Access and Deep Archive Access tiers are left disabled (the default), everything remains immediately accessible with no retrieval delays or fees.
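To make the "check this first" advice concrete, here is a minimal sketch that scans a lifecycle configuration for Glacier transitions. It assumes the input is a dict shaped like the response of boto3's `get_bucket_lifecycle_configuration`; the function name and the example rule ID are hypothetical, and the example values mirror the cascade described in this article.

```python
# Storage classes this article treats as unsuitable for a PBS datastore.
GLACIER_CLASSES = {"GLACIER_IR", "GLACIER", "DEEP_ARCHIVE"}


def find_glacier_transitions(lifecycle_config):
    """Return (rule_id, days, storage_class) for each enabled Glacier transition."""
    hits = []
    for rule in lifecycle_config.get("Rules", []):
        if rule.get("Status") != "Enabled":
            continue  # disabled rules never move objects
        for transition in rule.get("Transitions", []):
            if transition.get("StorageClass") in GLACIER_CLASSES:
                hits.append(
                    (rule.get("ID", "<no-id>"), transition.get("Days"), transition["StorageClass"])
                )
    return hits


# Hypothetical config mirroring the cascade described in this article.
bad_policy = {
    "Rules": [{
        "ID": "archive-cascade",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {},
        "Transitions": [
            {"Days": 14, "StorageClass": "GLACIER_IR"},
            {"Days": 104, "StorageClass": "GLACIER"},
            {"Days": 194, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }]
}

for rule_id, days, storage_class in find_glacier_transitions(bad_policy):
    print(f"{rule_id}: day {days} -> {storage_class}")
```

An empty result means no enabled rule moves objects into a Glacier class. The sketch does not look at noncurrent-version transitions, which would need the same check on a versioned bucket.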
Background
I run a Proxmox homelab with multiple PVE nodes and PBS servers. One of my PBS servers uses AWS S3 as a backend for offsite backups. PBS 4.x supports S3 as a “technology preview” feature — it uses a local cache disk and syncs chunks to S3.
The setup had been running for several months and had accumulated a number of issues:
- Intermittent connection errors (“bytes remaining on stream”, “Transport endpoint not connected”)
- The S3 cache disk was growing without bound
- S3 costs were higher than expected due to Glacier retrieval fees
I decided to do a thorough investigation and fix everything properly.
The Investigation
Infrastructure Overview
| Component | Details |
|---|---|
| PBS Server | VM on Proxmox |
| S3 Backend | AWS S3 |
| Cache Disk | 850 GB ext4 |
Key Findings
1. Wrong Regional Endpoint
The PBS server and the S3 bucket were in different regions, so every S3 API call incurred unnecessary cross-region latency. With millions of small chunk objects, this latency compounds significantly, because PBS on S3 is a high-request-count workload.
2. Glacier Lifecycle Disaster
A lifecycle policy was transitioning objects through Glacier tiers:
- Day 14 → Glacier Instant Retrieval
- Day 104 → Glacier Flexible Retrieval
- Day 194 → Glacier Deep Archive
As covered above, this is fundamentally incompatible with PBS. It was silently breaking GC and verification, and would have made restores impossible for older backups.
3. Unbounded Cache Growth
The 850 GB cache disk was 65% full with 1.67M chunk files across 65,536 subdirectories. PBS docs recommend only 64–128 GiB for the cache.
Cache breakdown:
- ~71% of chunks were 0-byte marker files (cache index markers)
- ~29% contained actual cached data
- Chunks from months ago were still in the cache
- No automatic cache eviction exists in this PBS version
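A breakdown like the one above can be produced by walking the cache directory and separating zero-byte files from files that hold data. This is a minimal sketch rather than the exact tooling I used; the function name is hypothetical, and the path in the usage note assumes the cache is mounted at /mnt/S3BackupCache with chunks under a `.chunks` directory.

```python
import os


def cache_census(cache_root):
    """Count zero-byte marker files versus files that hold cached data."""
    markers = 0
    data_files = 0
    data_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            # lstat avoids following symlinks out of the cache tree
            size = os.lstat(os.path.join(dirpath, name)).st_size
            if size == 0:
                markers += 1
            else:
                data_files += 1
                data_bytes += size
    return {"markers": markers, "data_files": data_files, "data_bytes": data_bytes}
```

Calling something like `cache_census("/mnt/S3BackupCache/.chunks")` returns the three counters. On a tree with well over a million files the walk itself takes a while, so it is best run outside backup windows.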
4. TCP Keepalive Too Slow
Default tcp_keepalive_time was 7200 seconds (2 hours). Dead S3 connections weren’t detected for hours, causing the “Transport endpoint not connected” errors. High latency to a distant S3 region made this worse — more connections timing out silently.
5. Ext4 Wasted Space
The cache disk had 4.18% reserved blocks, about 37 GB wasted on a disk where root reservation serves no purpose.
6. GC Schedule Needed Review
Garbage collection frequency needs careful consideration with S3 backends: every GC run makes a large number of LIST and HEAD API calls against S3, which cost money. Running GC too frequently wastes money; too infrequently leaves orphaned chunks accumulating. Weekly is a reasonable balance for most setups.
7. S3 Endpoint Style
The endpoint was configured with path-style addressing (s3.amazonaws.com/bucket/key) instead of the recommended vhost-style (bucket.s3.region.amazonaws.com/key).
The Optimizations
Phase 1: No-Downtime Changes
1. GC Schedule
```bash
proxmox-backup-manager datastore update pbs-s3 --gc-schedule "sat 02:00"
```
Weekly GC on Saturday at 2am. Frequent enough to keep orphaned chunks in check, infrequent enough to keep S3 API costs reasonable.
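For a rough sense of scale, S3's ListObjectsV2 returns at most 1,000 keys per request, so just enumerating the bucket once takes one LIST call per thousand objects. The sketch below works that out from the approximate object counts in the storage-class table later in this article. It covers listing only; how many additional per-chunk requests a PBS GC run issues is not something I measured, so treat the result as a lower bound.

```python
def list_requests_needed(object_count, keys_per_page=1000):
    """LIST calls needed to enumerate object_count keys (ceiling division)."""
    return -(-object_count // keys_per_page)


# Approximate per-class object counts from the storage-class table in this article.
total_objects = 828_000 + 212_000 + 433_000 + 430_000

print(total_objects)                        # 1903000
print(list_requests_needed(total_objects))  # 1903
```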
2. Ext4 Reserved Blocks: 4.18% → 1%
```bash
tune2fs -m 1 /dev/sdc
```
Freed ~28 GB immediately. No reason to reserve 37 GB for root on a cache disk.
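The ~28 GB figure follows from the two percentages. A quick check, assuming the 37 GB reservation corresponds to the 4.18% setting (the function name is hypothetical):

```python
def freed_gb(reserved_gb_before, pct_before, pct_after):
    """Space released when the reserved-block percentage is lowered."""
    return reserved_gb_before * (pct_before - pct_after) / pct_before


print(round(freed_gb(37, 4.18, 1.0), 1))  # 28.1
```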
3. TCP Keepalive Tuning
```bash
cat > /etc/sysctl.d/99-s3-tuning.conf << EOF
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6
net.ipv4.tcp_fin_timeout = 30
EOF
sysctl -p /etc/sysctl.d/99-s3-tuning.conf
```
Dead S3 connections are now detected in ~2 minutes (60 seconds idle, then 6 probes at 10-second intervals) instead of roughly 2 hours. Essential when connection latency is non-trivial.
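The detection time is the idle period plus one probe interval per unanswered probe. The sketch below compares the stock Linux values (7200 s idle, 75 s interval, 9 probes) with the tuned ones. Note that these sysctls only apply to sockets that enable SO_KEEPALIVE; I am assuming the PBS S3 client does, since the errors stopped after the change.

```python
def keepalive_detection_seconds(idle_s, interval_s, probes):
    """Approximate time from last traffic until the kernel gives up on a dead peer."""
    return idle_s + interval_s * probes


linux_default = keepalive_detection_seconds(7200, 75, 9)
tuned = keepalive_detection_seconds(60, 10, 6)

print(linux_default, tuned)  # 7875 120
```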
4. Ext4 Mount Options
Updated /etc/fstab:
```
UUID=<disk-uuid> /mnt/S3BackupCache ext4 noatime,commit=120 0 2
```
- `noatime`: eliminates metadata writes on every access across 1.67M files
- `commit=120`: reduces journal commit frequency (cache is reconstructible from S3)
- UUID-based mount for stability across disk reorders
5. S3 Endpoint: Path-style → Vhost-style
```bash
proxmox-backup-manager s3 endpoint update pbs-s3 \
  --endpoint '{{bucket}}.s3.{{region}}.amazonaws.com' \
  --region <your-region> \
  --delete path-style
```
Direct regional routing rather than the global endpoint.
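To illustrate the difference between the two addressing styles, this small sketch builds both URL forms for the same object. The helper names, bucket, region, and key are hypothetical examples, not values from my setup.

```python
def path_style_url(bucket, key, region=None):
    """Bucket appears in the path; without a region this hits the global endpoint."""
    host = f"s3.{region}.amazonaws.com" if region else "s3.amazonaws.com"
    return f"https://{host}/{bucket}/{key}"


def vhost_style_url(bucket, key, region):
    """Bucket appears in the hostname, so DNS already points at the right region."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


# Hypothetical values for illustration only.
print(path_style_url("example-bucket", ".chunks/0000/abc"))
print(vhost_style_url("example-bucket", ".chunks/0000/abc", "eu-central-1"))
```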
Phase 2: Restart Required
6. RAM: Increased
More RAM means better filesystem caching for 1.67M chunk files — the OS page cache can hold more of the chunk index in memory.
The Migration: Moving to a Closer Region
The Problem
The PBS server and S3 bucket were in different regions. Every backup chunk upload and every GC/verification API call was crossing region boundaries. This was the root cause of the elevated latency and connection instability.
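A back-of-the-envelope sketch shows why round-trip time matters at this object count. The two RTT values are hypothetical, not measurements from my setup, and using one request per cached chunk (1.67M) is only an illustrative stand-in for a real workload.

```python
def total_wait_hours(request_count, rtt_ms, concurrency=1):
    """Time spent purely waiting on round trips, ignoring transfer and server time."""
    return request_count * rtt_ms / 1000 / 3600 / concurrency


requests = 1_670_000  # illustrative: one request per cached chunk

for label, rtt_ms in (("distant region", 150), ("closer region", 20)):
    hours = total_wait_hours(requests, rtt_ms)
    print(f"{label}: {hours:.1f} h of cumulative round-trip wait")
```

Parallel requests divide the wall-clock time (the `concurrency` parameter), but the ratio between the two regions stays the same.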
Step 1: Create New Bucket in the Correct Region
```bash
aws s3api create-bucket \
  --bucket <your-new-bucket-name> \
  --region <closer-region> \
  --create-bucket-configuration LocationConstraint=<closer-region>
```
Configured with:
- S3 Intelligent-Tiering lifecycle (no Glacier)
- Server-side encryption
- Randomized bucket name for security
Step 2: Restore Glacier Objects
The biggest challenge — over 860,000 objects were in Glacier or Deep Archive and needed to be restored before they could be copied.
| Storage Class | Objects |
|---|---|
| STANDARD | ~828,000 |
| GLACIER_IR | ~212,000 |
| GLACIER | ~433,000 |
| DEEP_ARCHIVE | ~430,000 |
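A per-class count like the table above can be built by tallying the `StorageClass` field of a bucket listing. This is a minimal sketch, not the exact script I ran; it takes any iterable of ListObjectsV2-style pages so it can be tried without AWS access, and the sample pages are made up.

```python
from collections import Counter


def tally_storage_classes(pages):
    """Count objects per storage class across ListObjectsV2-style result pages."""
    counts = Counter()
    for page in pages:
        for obj in page.get("Contents", []):
            counts[obj.get("StorageClass", "STANDARD")] += 1
    return counts


# Made-up sample pages for illustration.
sample_pages = [
    {"Contents": [
        {"Key": ".chunks/0000/a", "StorageClass": "GLACIER"},
        {"Key": ".chunks/0000/b", "StorageClass": "STANDARD"},
    ]},
    {"Contents": [{"Key": ".chunks/0001/c", "StorageClass": "DEEP_ARCHIVE"}]},
]
print(dict(tally_storage_classes(sample_pages)))

# With boto3 installed and credentials configured, real pages could come from:
#   import boto3
#   paginator = boto3.client("s3").get_paginator("list_objects_v2")
#   counts = tally_storage_classes(paginator.paginate(Bucket="<source-bucket>"))
```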
First Attempt: Individual API Calls (Too Slow)
Started with parallel aws s3api restore-object calls. At ~1–2 seconds per call with 860K objects, this would have taken days.
Solution: S3 Batch Operations
Used S3 Batch Operations to restore all Glacier objects server-side:
- Generated a CSV manifest of all Glacier objects
- Created an IAM role for batch operations
- Submitted the batch job via the AWS console
Result: ~810,000 succeeded, ~51,000 “failed” with RestoreAlreadyInProgress (left over from my earlier individual attempts, so not real failures). Completed in ~2 hours entirely on AWS infrastructure.
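The CSV manifest for S3 Batch Operations is one `bucket,key` row per object, with the key URL-encoded. The sketch below writes that format; the function name and sample keys are hypothetical, and leaving `/` unencoded is my assumption about what the service accepts.

```python
import csv
import io
from urllib.parse import quote


def write_manifest(bucket, keys, out):
    """Write one 'bucket,key' row per object, URL-encoding each key."""
    writer = csv.writer(out, lineterminator="\n")
    for key in keys:
        writer.writerow([bucket, quote(key, safe="/")])


# Hypothetical bucket and keys for illustration.
buf = io.StringIO()
write_manifest("example-bucket", [".chunks/0000/abc", "dir/with space"], buf)
print(buf.getvalue(), end="")
```

For a real run, `out` would be a file opened with `newline=""`, and the keys would come from a listing filtered to the GLACIER and DEEP_ARCHIVE classes.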
Step 3: Copy Data to New Region
Standard Objects (aws s3 sync)
```bash
aws s3 sync s3://<source-bucket> s3://<dest-bucket> \
  --region <dest-region> \
  --source-region <source-region> \
  --storage-class INTELLIGENT_TIERING
```
However, aws s3 sync skips objects with the GLACIER storage class by default, even after they’ve been restored (the CLI's --force-glacier-transfer flag exists to override this).
Glacier Objects (boto3)
Used Python boto3 to copy the restored Glacier objects:
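```python
from concurrent.futures import ThreadPoolExecutor

import boto3

# Destination-region client; copy_object copies server-side within S3.
s3_dst = boto3.client('s3', region_name='<dest-region>')

# Keys of the restored Glacier objects (e.g. read from the batch manifest).
glacier_keys = []

def copy_one(key):
    s3_dst.copy_object(
        Bucket='<dest-bucket>',
        Key=key,
        CopySource={'Bucket': '<source-bucket>', 'Key': key},
        StorageClass='INTELLIGENT_TIERING',
    )

with ThreadPoolExecutor(max_workers=20) as executor:
    # Consume the iterator so an exception from any copy is raised here.
    for _ in executor.map(copy_one, glacier_keys):
        pass
```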
Result: ~833,000 objects copied, 0 failures. ✅
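A copy of a Glacier-class object only succeeds once its restore has finished, so it can help to check the restore status first. S3 reports it in the `Restore` field of a HEAD response, as a string containing `ongoing-request="true"` or `ongoing-request="false"`. This is a minimal sketch of parsing that field; the helper name is hypothetical and it was not part of my original script.

```python
import re


def restore_finished(restore_header):
    """True if the Restore header says the restore is no longer in progress."""
    if not restore_header:
        return False  # no header: no restore was requested (or not an archived object)
    match = re.search(r'ongoing-request="(true|false)"', restore_header)
    return bool(match) and match.group(1) == "false"


print(restore_finished('ongoing-request="true"'))
print(restore_finished('ongoing-request="false", expiry-date="Fri, 21 Dec 2012 00:00:00 GMT"'))

# With boto3, the header could be read like this:
#   head = s3.head_object(Bucket="<source-bucket>", Key=key)
#   ready = restore_finished(head.get("Restore"))
```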
Step 4: Switch PBS to New Bucket
```bash
# Maintenance mode
proxmox-backup-manager datastore update pbs-s3 \
  --maintenance-mode 'type=offline,message="Migrating region"'

# Update endpoint region
proxmox-backup-manager s3 endpoint update pbs-s3 --region <new-region>

# Update bucket name in config
sed -i 's/bucket=<old-bucket>/bucket=<new-bucket>/' \
  /etc/proxmox-backup/datastore.cfg

# Verify connectivity
proxmox-backup-manager s3 check pbs-s3 <new-bucket>

# Remove maintenance mode
proxmox-backup-manager datastore update pbs-s3 --delete maintenance-mode
```
Step 5: Full Verification
```bash
proxmox-backup-manager verify-job update <verify-job-id> --ignore-verified false
proxmox-backup-manager verify-job run <verify-job-id>
```
Lifecycle Policy: The Right Way
❌ Wrong (What I Had)
Day 0 → S3 Standard
Day 14 → Glacier Instant Retrieval
Day 104 → Glacier Flexible Retrieval
Day 194 → Glacier Deep Archive
This breaks PBS completely — GC, verification, dedup, and restores all require immediate chunk access.
✅ Correct
```json
{
  "Rules": [{
    "ID": "pbs-intelligent-tiering",
    "Status": "Enabled",
    "Filter": {},
    "Transitions": [{
      "Days": 1,
      "StorageClass": "INTELLIGENT_TIERING"
    }]
  }]
}
```
S3 Intelligent-Tiering automatically moves infrequently accessed data to cheaper tiers, but everything remains immediately accessible with no retrieval fees or delays.
Cache Disk Shrink
After migration, the cache disk was shrunk from 850 GB to 128 GiB:
- Add new smaller disk to the VM
- Put datastore in maintenance mode, stop proxy
- Format new disk: `mkfs.ext4 -L S3BackupCache /dev/sdX && tune2fs -m 1 /dev/sdX`
- Update `/etc/fstab` with UUID of new disk
- Mount, start proxy, remove maintenance mode
- Run `proxmox-backup-manager datastore s3-refresh pbs-s3`, which pulls all manifest/index files from S3 so existing backups become visible in the new cache
- Remove old disk
Important: After replacing the cache disk, run `s3-refresh`. The new disk starts empty, so PBS won’t know about existing S3 backups until the manifests are downloaded. This is a one-time operation.
Before & After
| Metric | Before | After |
|---|---|---|
| S3 Region | Distant region | Closer regional endpoint |
| API Latency | High | Low |
| Endpoint Style | path-style | vhost-style |
| Lifecycle | Glacier cascade | Intelligent-Tiering |
| GC Frequency | Monthly | Weekly |
| TCP Keepalive | 2 hours | 60 seconds |
| Mount Options | defaults | noatime,commit=120 |
| Reserved Blocks | 4.18% (37 GB wasted) | 1% |
| Cache Disk | 850 GB (unbounded) | 128 GiB |
| Connection Errors | Frequent | Gone |
| Backup Performance | Unoptimised | Optimised |
Lessons Learned
- Never use Glacier lifecycle policies with PBS S3. PBS needs immediate access to all chunks. Use Intelligent-Tiering instead. Check this before doing anything else.
- S3 region matters. Put the bucket in the same or closest available region to the PBS server. Cross-region latency compounds badly with high object counts.
- GC frequency vs. S3 API cost is a real tradeoff. Every GC run makes thousands of API calls. Don’t run it more frequently than necessary; weekly is a good default for most homelab setups.
- TCP keepalive tuning is critical for S3. The default 2-hour timeout means dead connections go undetected. With any meaningful latency, this causes intermittent backup failures.
- The PBS S3 cache needs deliberate sizing. 64–128 GiB is recommended. An oversized cache disk just fills with stale data that is never evicted.
- After replacing the cache disk, run `s3-refresh`. The new disk starts empty, so existing S3 backups won’t be visible until manifests are downloaded.
- `aws s3 sync` won’t copy GLACIER-class objects by default, even when restored. Use boto3 `copy_object()` for those (or the CLI's `--force-glacier-transfer` flag).
- ext4 `noatime` is essential with millions of small files. Reads can otherwise trigger access-time metadata updates; eliminating this overhead makes a noticeable difference.
Tags: proxmox, pbs, s3, aws, glacier, backup, optimization, homelab