Run a PostgreSQL PITR restore drill every week (here's our runbook)
May 10, 2026 · 1 min read · by Sudhanshu K.
A backup that has never been restored is a hope, not a backup. Every managed Postgres database we run goes through a weekly point-in-time recovery (PITR) drill, automated end-to-end, with the result published to your monthly health report.
What the drill actually does
- Pick a random timestamp within the last 7 days of WAL retention.
- Provision a sandbox instance in a cheaper instance class than production.
- Restore the latest base backup into the sandbox.
- Replay WAL up to the chosen timestamp.
- Run an integrity probe: a schema diff against prod, a row-count delta within tolerance, and a handful of integrity SQL checks.
- Tear down the sandbox. Charge ~5 minutes of compute. Report success or failure.
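The timestamp pick and the row-count part of the probe can be sketched roughly as below. This is a minimal illustration, not the code from our runbook: the function names, the 1% default tolerance, and the shape of the count dictionaries are all assumptions made for the example. The only PostgreSQL-specific names are the `recovery_target_time` and `recovery_target_action` settings.

```python
import random
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # WAL retention window described above


def pick_recovery_target(now, rng=random):
    """Return a uniformly random timestamp inside the retention window."""
    offset_seconds = rng.uniform(0, RETENTION.total_seconds())
    return now - timedelta(seconds=offset_seconds)


def recovery_settings(target):
    """Render the PITR target settings (PostgreSQL 12+ reads these from
    postgresql.conf when a recovery.signal file is present)."""
    return (
        f"recovery_target_time = '{target.isoformat()}'\n"
        "recovery_target_action = 'promote'\n"
    )


def row_count_drift(prod_counts, sandbox_counts, tolerance=0.01):
    """Return tables whose restored row count is off by more than `tolerance`.

    Both arguments are hypothetical {table_name: row_count} dictionaries;
    the 1% default is an illustrative value, not the runbook's threshold.
    """
    drifted = {}
    for table, prod_rows in prod_counts.items():
        restored = sandbox_counts.get(table)
        if restored is None:
            drifted[table] = "missing in sandbox"
            continue
        delta = abs(prod_rows - restored) / max(prod_rows, 1)
        if delta > tolerance:
            drifted[table] = f"{delta:.1%} off"
    return drifted


if __name__ == "__main__":
    target = pick_recovery_target(datetime.now(timezone.utc))
    print(recovery_settings(target))
    print(row_count_drift({"orders": 1000}, {"orders": 940}))
```

An empty dictionary from `row_count_drift` means every table is within tolerance. A real drill also needs a `restore_command` that can fetch archived WAL, which is left out of this sketch.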
What it catches
The four things this drill catches that "the backup ran" alerts don't:
- Backups silently switching from WAL+base to "just snapshots" because someone changed the retention policy
- WAL gaps caused by archive_command failures that nobody noticed because the database kept running
- Permission drift that lets backups happen but blocks the restore role from reading them
- Encryption key rotation that bricked the older snapshots
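To make the WAL-gap case concrete, here is a rough sketch of how a gap can be spotted from archived segment file names alone. It is an illustration under stated assumptions, not the check the drill runs: it assumes a single timeline and the default 16 MB `wal_segment_size` (256 segments per log file), and the helper names are invented for the example. The `last_failed_time` and `last_archived_time` keys mirror columns of the `pg_stat_archiver` view.

```python
import re

SEGMENTS_PER_LOG = 256  # assumption: default 16 MB wal_segment_size
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")  # timeline + log + segment, 8 hex digits each


def segment_number(name):
    """Map a 24-hex-digit WAL file name to a linear segment counter."""
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return log * SEGMENTS_PER_LOG + seg


def find_wal_gaps(archived_names):
    """Return (before, after) pairs of archived segments with a hole between them.

    Non-segment files such as .history or .backup are ignored.
    Assumes every segment is on the same timeline.
    """
    segments = sorted(
        (n for n in archived_names if WAL_NAME.match(n)), key=segment_number
    )
    gaps = []
    for prev, cur in zip(segments, segments[1:]):
        if segment_number(cur) - segment_number(prev) > 1:
            gaps.append((prev, cur))
    return gaps


def archiver_is_failing(stats):
    """True when the most recent archive attempt failed.

    `stats` is a dict holding one row of pg_stat_archiver.
    """
    last_failed = stats.get("last_failed_time")
    last_ok = stats.get("last_archived_time")
    if last_failed is None:
        return False
    return last_ok is None or last_failed > last_ok
```

A check like `archiver_is_failing` only sees the current state. The restore drill is what proves that the archived segments are contiguous and readable all the way to the target time.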
What we don't automate
We don't automate the disaster declaration. The runbook explicitly requires a human to declare that we are doing a real restore rather than a drill. That guardrail has saved us from "automation runs amok and restores production from yesterday" more than once.
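In code, a guardrail of this kind can be as small as the sketch below. It is a hypothetical illustration of the idea, not our actual implementation: the function name, the `"production"` target label, and the `declared_by` parameter are assumptions made for the example.

```python
class RestoreRefused(Exception):
    """Raised when a restore request fails the guardrail."""


def authorize_restore(target, declared_by=None, automated=True):
    """Allow drills anywhere except production; a production restore
    needs a named human and must not come from an automated caller."""
    if target != "production":
        return "drill"
    if automated or not declared_by:
        raise RestoreRefused(
            "production restore needs a human disaster declaration"
        )
    return f"real restore declared by {declared_by}"
```

The defaults are chosen so that a caller who passes nothing extra is treated as automation, which means the safe path is the one that requires no effort.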
The full runbook on Medium has the Terraform we use to spin up the sandbox, the Bash that compares schemas, and the alert rules that escalate when the drill fails twice in a row.