Aller au contenu
EdgeServers
Blog

postgresql

The Postgres backup that nobody tests — and the 90-minute restore drill we run quarterly

Most production Postgres backups have never been restored. Here's the WAL-based PITR drill we run on every customer database, and what we always find.

16 mai 2026 · 9 min · par Sudhanshu K.

The Postgres backup that nobody tests — and the 90-minute restore drill we run quarterly

The single most common finding when we audit a new customer's managed Postgres setup: backups are running, backups are completing, nobody has ever tested a restore. The dashboard shows green ticks for 18 months. The DBA's confidence in their disaster recovery is high. The actual confidence anyone should have, in the absence of a restore drill, is zero.

This post walks through the Postgres restore drill we run quarterly on every customer database. It takes about 90 minutes, costs roughly $1 of cloud compute, and consistently surfaces problems that would otherwise only appear during a real outage.

What a Postgres backup actually needs to be

Before the drill, the prerequisites. A serious Postgres backup setup in 2026 has:

  • Continuous WAL archiving to a remote object store (S3, GCS, Azure Blob, Spaces)
  • Base backups at regular intervals (typically nightly), also to object storage
  • A tested retention policy — daily backups for 14 days, weekly for 3 months, monthly for 1 year
  • A demonstrated RPO and RTO — not a stated one, a demonstrated one
  • Encryption at rest with keys not stored alongside the backup
  • A WAL archive that actually has every WAL segment, with no gaps

The last item is the one that quietly breaks. WAL archiving works by Postgres calling archive_command on each WAL segment. If archive_command fails (network blip, bucket policy change, expired credentials) and Postgres isn't configured to retry forever, you get a gap. The base backup is fine; the WAL chain after the gap is unusable. Point-in-time recovery (PITR) is then constrained to "before the gap" — which might be days before "now."

We use pgBackRest for the customers who want feature richness, and WAL-G for the customers who want simplicity. Both correctly handle the WAL chain semantics; we don't recommend rolling your own with pg_basebackup and a cron job for anything production.

The drill: nine steps, ninety minutes

Step 1: Pick a recovery target (5 minutes)

The restore drill is more useful if it's point-in-time, not "restore the latest base backup." Pick a target timestamp from yesterday — somewhere in the middle of the business day, when there were active writes happening.

DRILL_TARGET="2026-05-15 14:23:00 UTC"

This forces the restore to use base backup + WAL replay up to a specific point. It exercises the parts of the system that matter most.

Step 2: Provision a clean target host (10 minutes)

Spin up a fresh VM of the same size and Postgres version as production. Same OS, same Postgres major version, same kernel. Do not restore onto the production host — the whole point is to verify that the backup is self-contained.

For managed customers on AWS, this is usually a one-line Terraform run that creates a one-off instance, runs the drill, and destroys itself when done.

Step 3: Install the same backup tool, same version (5 minutes)

sudo apt-get install postgresql-16 pgbackrest
# Configuration: copy from production
sudo cp /etc/pgbackrest/pgbackrest.conf /etc/pgbackrest/

Critical detail: the configuration must point at the same backup repository as production was writing to. We're not testing whether a different repo works; we're testing whether the production repo, today, can produce a working database.

Step 4: List the available backups (2 minutes)

pgbackrest --stanza=main info

This confirms the tool can authenticate, the bucket is reachable, and the backup metadata is present. Surprisingly often, this step fails — typically because the credentials we're using for the drill don't have read access to the bucket, only write. Add a "drill credentials" IAM role with read-only access and store it separately.

Step 5: Restore to the recovery target (30-50 minutes — the longest step)

sudo -u postgres pgbackrest --stanza=main \
  --type=time --target="2026-05-15 14:23:00 UTC" \
  --target-action=promote \
  restore

This is where the actual work happens:

  1. pgBackRest pulls the latest base backup before the target time
  2. It pulls all WAL segments needed to replay from that base to the target
  3. It writes them to the local Postgres data directory
  4. When Postgres starts, it replays WAL up to the target timestamp and promotes

Time for this step depends on database size. A 100GB database with steady write volume takes 30-45 minutes. A 1TB database can take 4-8 hours. Knowing this number is critical — it's your real RTO, and most teams have never measured it.

Step 6: Start Postgres and verify (10 minutes)

sudo systemctl start postgresql
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
# Should be: f
sudo -u postgres psql -c "SELECT now();"
sudo -u postgres psql -c "SELECT MAX(created_at) FROM orders;"
# Should be at or just before 14:23:00 UTC

The third query is the proof. It confirms WAL was actually replayed and the database state matches the target time, not the base backup time.

Step 7: Run application-level smoke tests (10 minutes)

This is what separates a "Postgres works" check from a real recovery validation:

  • Query the 5 highest-volume tables and check row counts against staging snapshots
  • Run a small set of application-level integration tests (READ-only)
  • If the app has a "verify integrity" report, run it
  • Check that any non-default extensions (pg_trgm, postgis, etc.) loaded correctly

Things this catches that the previous step misses:

  • Extension version mismatches (your base backup was made with PostGIS 3.4, your drill host has PostGIS 3.5)
  • Tablespace path differences (your prod had a custom tablespace, your drill host doesn't)
  • Replication slot drift (slots get restored but reference WAL that's no longer relevant)

Step 8: Tear down (10 minutes)

Destroy the drill host. Keep nothing. The drill is finished, and leaving the drill host around invites someone to accidentally point production traffic at it.

Step 9: Document findings (15 minutes)

Even a successful drill yields findings. Things we typically log:

  • Actual measured time for each phase (so the next drill has a baseline)
  • Any errors or warnings during restore
  • Any drift between the drill-restored data and a snapshot of production
  • Improvements identified (e.g., "the WAL pull phase was network-bound; consider running the drill on an instance type with more network bandwidth")

Every drill is a small report that goes to the customer. Over a year, the trends in those reports tell you a lot more about your DR posture than any single audit.

What the drill consistently finds

Across the managed Postgres databases we run, these are the issues we surface during routine drills:

WAL chain gaps. The most common finding. The archive_command failed silently months ago. New backups are fine, but PITR to anything older than the gap is impossible. Fix: monitor pg_stat_archiver actively; alert on failed_count > 0 or on last_archived_time being more than 60 seconds old.

The "I can restore but I can't connect" failure. Postgres comes up, but pg_hba.conf and certificates aren't part of the backup, and the drill host's defaults reject the connection. Fix: include the cluster config in a separate backup (or, better, in your IaC).

Missing extensions. Restore succeeds, application connects, application explodes because pg_partman isn't installed on the drill host. Fix: have the drill host's provisioning script install everything the database expects.

Tablespace path baked into the backup. Production used /data/postgres-fast; drill host has /var/lib/postgresql. Restore fails partway through. Fix: relocate tablespaces, or use the --tablespace-map flag to rewrite paths on restore.

Performance after restore is unexpectedly bad. This one's subtle — the restored database is missing the planner statistics, which pg_stat_statements was carrying. The first query against any large table takes 5x longer than in production. Fix: ANALYZE everything as part of the post-restore checklist.

Credentials expired. Backup credentials rotate, drill credentials don't. The drill fails at step 4. Fix: tie drill credentials to the same rotation as backup credentials, or use short-lived role-based credentials everywhere.

How often is "regularly"?

Our standard:

  • Monthly: automated PITR drill (fully scripted, runs in CI, reports to dashboard)
  • Quarterly: full restore + application smoke tests, with a human reading the output
  • Annually: full cross-region DR drill, including DNS failover and stakeholder communication

The monthly automation is the load-bearing item. It's the difference between "we tested in March and have no idea if it still works in October" and "the last drill was 18 hours ago and it was green."

What we ship by default

For managed Postgres customers on any cloud, every database gets:

  • pgBackRest or WAL-G configured against object storage (encrypted, lifecycle policies set)
  • A monitoring rule alerting on archive_command failure within 60 seconds
  • A monthly automated PITR drill that posts results to a customer-shared dashboard
  • A quarterly hand-driven drill with a written report
  • A documented RPO (typically 60 seconds) and RTO (typically 30 minutes for <100GB, scaling up from there)

If your team has been running Postgres for years and has never restored from backup as an exercise, you're not unusual. But the gap matters more in 2026 than it did before — database compromise and ransomware events are routine enough that "we have backups" is no longer a defence; "we have tested backups" is.

Reach out if you'd like us to run a drill against your existing setup. It usually takes a day and produces a remediation list that's small, concrete, and worth doing.

Sudhanshu K. is a Staff DevOps engineer at EdgeServers (RemotIQ Pty Ltd, ABN 91 682 628 128). He has personally restored from backup in anger four times, which is four times more than most DBAs ever have to.