Run a PostgreSQL PITR restore drill every week (here's our runbook)

A backup that has never been restored is a hope, not a backup. Every managed Postgres database we run goes through a weekly point-in-time recovery (PITR) drill, automated end-to-end, with the result published to your monthly health report.

What the drill actually does

Pick a random timestamp within the 7-day WAL retention window.
Provision a sandbox instance in a cheaper instance class than production.
Restore the most recent base backup taken before the chosen timestamp into the sandbox.
Replay WAL up to the chosen timestamp.
Run an integrity probe: a schema diff against prod, a row-count delta within tolerance, and a handful of integrity SQL checks.
Tear down the sandbox. Charge ~5 minutes of compute. Report success or failure.
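
To make the steps above concrete, here is a minimal Python sketch of the planning and checking logic. It is not our production code. The function names, the 5% tolerance, and the restore_command string are assumptions for illustration. The recovery_target_time, recovery_target_action, and restore_command settings are standard PostgreSQL recovery parameters. The sketch picks a base backup that finished before the target because WAL replay only moves forward.

```python
import random
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # WAL retention window described above


def pick_target_time(now, earliest_backup_end, rng=random):
    """Pick a random recovery target inside the restorable window."""
    window_start = max(now - RETENTION, earliest_backup_end)
    if window_start >= now:
        raise ValueError("no restorable window")
    span = (now - window_start).total_seconds()
    return window_start + timedelta(seconds=rng.uniform(0, span))


def pick_base_backup(backup_end_times, target):
    """Return the most recent base backup that finished at or before the target."""
    candidates = [t for t in backup_end_times if t <= target]
    if not candidates:
        raise ValueError("no base backup precedes the target time")
    return max(candidates)


def recovery_settings(target, restore_command):
    """Render recovery settings for the sandbox (restore_command is a placeholder)."""
    stamp = target.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    escaped = restore_command.replace("'", "''")  # double single quotes for the conf file
    return "\n".join([
        f"restore_command = '{escaped}'",
        f"recovery_target_time = '{stamp}'",
        "recovery_target_action = 'promote'",
    ])


def tables_outside_tolerance(prod_counts, restored_counts, tolerance=0.05):
    """List tables whose restored row count differs from prod by more than tolerance."""
    failing = []
    for table, prod_rows in prod_counts.items():
        restored_rows = restored_counts.get(table)
        if restored_rows is None:
            failing.append(table)  # table missing from the restore
            continue
        if abs(prod_rows - restored_rows) / max(prod_rows, 1) > tolerance:
            failing.append(table)
    return sorted(failing)
```

A drill would call pick_target_time first, then pick_base_backup, write the rendered settings into the sandbox configuration, and run tables_outside_tolerance once recovery finishes. An empty list means the row-count part of the probe passed.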

What it catches

The four things this drill catches that "the backup ran" alerts don't:

Backups silently switching from WAL+base to "just snapshots" because someone changed the retention policy
WAL gaps caused by archive_command failures that nobody noticed because the database kept running
Permission drift that lets backups happen but blocks the restore role from reading them
Encryption key rotation that bricked the older snapshots
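
As an illustration of the WAL gap case, the sketch below scans a list of archived segment file names and reports missing segment numbers. It assumes uncompressed 24-character hexadecimal WAL file names, a single timeline, and the default 16 MB segment size (256 segments per log file). The function names are hypothetical, and a real archive with compression suffixes or multiple timelines would need more handling.

```python
import re

SEGMENTS_PER_LOG = 0x100  # assumption: default 16 MB wal_segment_size

_WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def segment_number(name):
    """Convert a WAL file name to a linear segment number (timeline ignored)."""
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return log * SEGMENTS_PER_LOG + seg


def find_wal_gaps(archived_names):
    """Return (first_missing, last_missing) segment-number ranges in the archive."""
    numbers = sorted(
        {segment_number(n) for n in archived_names if _WAL_NAME.match(n)}
    )
    gaps = []
    for prev, cur in zip(numbers, numbers[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps
```

Any non-empty result means a restore cannot replay past the first missing segment, even though the primary kept running without complaint.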

What we don't automate

We don't automate the disaster declaration. The runbook explicitly requires a human to decide we are doing a real restore, not the drill. That guardrail has saved us from "automation runs amok and restores production from yesterday" more than once.
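
One way to encode that guardrail is sketched below. This is a hypothetical illustration, not the check from our runbook: the mode names, the "sandbox-" naming prefix, and the typed-confirmation step are all assumptions. The idea is that the drill path can only ever target a sandbox, and the real path refuses to run without a named human and an explicit confirmation of the target.

```python
class RestoreRefused(Exception):
    """Raised when a restore request fails the guardrail."""


def authorize_restore(mode, target, approved_by=None, typed_target=None):
    """Allow drills only into sandboxes; require a human for real restores."""
    if mode == "drill":
        # Assumption: sandbox instances share a recognizable name prefix.
        if not target.startswith("sandbox-"):
            raise RestoreRefused("drills may only restore into sandbox instances")
        return "drill"
    if mode == "disaster":
        if not approved_by:
            raise RestoreRefused("a named human must declare the disaster")
        if typed_target != target:
            raise RestoreRefused("typed confirmation does not match the target")
        return f"disaster restore approved by {approved_by}"
    raise RestoreRefused(f"unknown mode: {mode!r}")
```

The automation calls this with mode "drill" only, so a bug in target selection raises an error instead of touching production.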

The full runbook on Medium has the Terraform we use to spin up the sandbox, the Bash that compares schemas, and the alert rules that escalate when the drill fails twice in a row.
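
For a rough sense of the "twice in a row" escalation logic, here is a small Python sketch. It is not the alert rule from the runbook. The function name and the shape of the results list are assumptions.

```python
def should_escalate(results, threshold=2):
    """Return True when the last `threshold` drills all failed.

    `results` is a list of booleans, oldest first, where True means the drill passed.
    """
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A single failed drill stays a warning. Two consecutive failures page someone.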