Celery in production — broker choice, retry semantics, and what Flower actually tells you

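Celery is one of those tools whose defaults are almost right. Almost. The default retry policy has no backoff: a retried task waits a fixed 180 seconds between attempts and gives up after 3 retries, and setting max_retries=None makes it retry forever. The default for acks_late is False, so the message is acknowledged before the task runs and a worker crash mid-task drops the job. A quick-start Redis broker is often run without durable persistence (AOF is off by default), so a Redis restart can lose queued tasks.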

This is the production Celery setup we run for managed Python customers — broker choice, worker config, retry policy, and the monitoring that actually surfaces problems.

A baseline task with the safe defaults

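from celery import Celery, shared_task

app = Celery('app', broker='redis://redis:6379/0', backend='redis://redis:6379/1')
app.conf.update(
    task_acks_late=True,                      # ack after the task runs, not before
    task_reject_on_worker_lost=True,          # requeue if the worker process dies mid-task
    task_track_started=True,                  # report a STARTED state to the result backend
    worker_prefetch_multiplier=1,             # reserve one message per worker process
    broker_connection_retry_on_startup=True,  # keep retrying if the broker is down at boot
)

# Retry only on I/O errors, with exponential backoff and jitter, at most 5 times.
@shared_task(bind=True, autoretry_for=(IOError,), retry_backoff=True,
             retry_jitter=True, retry_kwargs={'max_retries': 5})
def send_email(self, to, subject):
    ...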

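task_acks_late=True and worker_prefetch_multiplier=1 together mean: each worker process reserves one task at a time and acknowledges it only after the task has finished running, whether it returned or raised. task_reject_on_worker_lost=True puts the message back on the queue if the worker process is killed mid-task. Combined with retry_backoff and a max_retries cap, jobs neither vanish on a crash nor spin in a retry loop.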
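To make the retry schedule concrete, here is a minimal sketch of the delays that retry_backoff=True and retry_jitter=True should produce. It assumes Celery's documented defaults (a 1-second base factor and a retry_backoff_max cap of 600 seconds). The helper name backoff_delay is hypothetical and reimplements the arithmetic in plain Python rather than calling into Celery.

```python
import random


def backoff_delay(retries, factor=1, maximum=600, jitter=False):
    """Delay in seconds before retry number `retries` (0-based).

    Hypothetical helper: mirrors the documented exponential backoff rule,
    it is not Celery's own function.
    """
    # Double the delay on every retry, but never exceed the cap.
    delay = min(maximum, factor * (2 ** retries))
    if jitter:
        # Full jitter: pick uniformly from 0..delay so that many failing
        # tasks do not all retry at the same instant.
        delay = random.randrange(delay + 1)
    return max(0, delay)


# Schedule for max_retries=5 with jitter switched off.
schedule = [backoff_delay(n) for n in range(5)]
print(schedule)  # [1, 2, 4, 8, 16]
```

With jitter on, each value becomes an upper bound rather than an exact delay, so the five retries above finish within about 31 seconds in the worst case. If the downstream service needs longer to recover, raising the base factor (retry_backoff accepts a number as well as True) is likely a better lever than raising max_retries.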

The full write-up covers:

Redis vs RabbitMQ: when each is the right broker
Idempotency — the property tasks should have but rarely do
Routing tasks to dedicated queues for prioritization and isolation
Flower: the metrics that matter (queue depth, task latency, retry rate)
Celery beat for scheduled tasks — and the leader-election story that nobody warns you about
Worker lifecycle: max-tasks-per-child, memory leaks, the OOM-killer interaction
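Idempotency matters here because late acknowledgement means a task can be delivered more than once after a worker crash. The following is a minimal sketch of a dedupe guard, not the approach from the full write-up. InMemoryDedupeStore, send_email_once, and the message_id argument are all hypothetical names, and the in-memory set stands in for a shared store such as Redis.

```python
class InMemoryDedupeStore:
    """Hypothetical stand-in for a shared store (e.g. Redis) visible to all workers."""

    def __init__(self):
        self._seen = set()

    def claim(self, key):
        """Return True only for the first caller to claim this key."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


store = InMemoryDedupeStore()
sent = []  # records side effects so the example can be inspected


def send_email_once(message_id, to, subject):
    # Skip the side effect if this message_id was already handled.
    if not store.claim(f"email:{message_id}"):
        return "skipped"
    sent.append((to, subject))  # placeholder for the real send
    return "sent"


first = send_email_once("m-1", "a@example.com", "Welcome")
second = send_email_once("m-1", "a@example.com", "Welcome")  # simulated redelivery
print(first, second, len(sent))  # sent skipped 1
```

Claiming the key before sending trades a possible duplicate for a possible lost send if the worker dies between the two steps. Which side to err on depends on the task, and a real store would also need the claim to be atomic across workers.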

We ship this Celery pattern on every managed Python stack with background work.