Gunicorn and Uvicorn in production — the worker tuning we actually apply

The default Gunicorn config that ships in most Python projects is gunicorn app:app. No worker count, no worker class, no timeout, no keepalive — the defaults plus whatever the framework's Procfile suggested in 2017. For a tutorial app, fine. For a service taking real traffic, it is the difference between a load test that hits 4,000 RPS and one that wedges at 200 with a queue of timed-out requests.

This post walks through the tuning we apply to every managed Python service we run — what each knob does, what we set it to, and the failure modes that drove us to those values.

The four-question decision tree

Before any tuning, four questions:

Is the workload sync or async? WSGI (Flask, Django pre-async, older FastAPI mounts) vs ASGI (FastAPI, Starlette, Django Channels).
Is the workload CPU-bound or I/O-bound? Image processing, ML inference, large JSON serialisation are CPU. Most HTTP CRUD against a database is I/O.
What is the request profile? Short and uniform (50ms median, 200ms p99), or long-tailed (some requests take 30 seconds)?
What is the memory ceiling per worker? A worker with a 500MB resident set is a different math problem from one with 80MB.

The answers determine worker class, worker count, threads per worker, and timeouts. Get them wrong and the symptoms are remarkably consistent: 502s under load, CPU stuck at 40% while the request queue grows, OOM-killer events at 3am.

Worker class: pick exactly one model

Gunicorn ships several worker classes. The ones we use in production:

sync — one request per worker at a time. The default. Boringly correct for WSGI apps where requests are short and CPU-light per request. The fork model means one slow request blocks only its worker, not the process group.
gthread — a pool of threads inside each worker process. Useful for WSGI apps with I/O-heavy requests (database calls, downstream HTTP) where you want concurrent requests to share a process's memory instead of paying for a whole process per in-flight request.
uvicorn.workers.UvicornWorker — ASGI worker, the canonical choice for FastAPI / Starlette / async Django. This is what we run for nearly every new service.

We do not use gevent or eventlet workers in production any more. Monkey-patching the standard library to make sync code "magically async" has too many sharp edges with modern dependencies (anything using native extensions, anything using asyncio internally), and in our experience the result is no faster than a properly tuned ASGI stack and often slower.
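The choice above can be condensed into a few lines. This is a minimal sketch of our own rule of thumb, not anything Gunicorn provides: the function name and its two boolean parameters are hypothetical, and real services sometimes need a judgment call that two flags cannot capture.

```python
def pick_worker_class(is_asgi: bool, io_heavy: bool) -> str:
    """Map the sync/async and CPU/I/O questions to a Gunicorn worker class."""
    if is_asgi:
        # FastAPI, Starlette, async Django
        return "uvicorn.workers.UvicornWorker"
    # WSGI: threads help when requests spend most of their time waiting on I/O,
    # otherwise the plain sync worker is the simplest correct option.
    return "gthread" if io_heavy else "sync"


print(pick_worker_class(is_asgi=False, io_heavy=True))
```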

The Gunicorn + Uvicorn workers pattern

Our default for ASGI services:

gunicorn app.main:app \
  --worker-class uvicorn.workers.UvicornWorker \
  --workers 4 \
  --bind 0.0.0.0:8000 \
  --timeout 60 \
  --graceful-timeout 30 \
  --keep-alive 5 \
  --max-requests 2000 \
  --max-requests-jitter 200 \
  --access-logfile - \
  --error-logfile - \
  --log-level info

Why this combination and not just uvicorn app.main:app --workers 4?

Gunicorn handles process supervision better. Worker lifecycle, signal handling, graceful reloads, --max-requests recycling. Uvicorn's standalone multi-worker mode is fine but lacks the operational hooks we want.
Uvicorn handles ASGI better than any Gunicorn-native worker. It is the de facto standard ASGI server, from the same maintainers as Starlette. Inside a Gunicorn-managed process, it does what it does best.
The combination is the one FastAPI's deployment docs long described for running multiple worker processes; check the current docs for your version before copying it blindly.

For sync WSGI apps (Django, Flask), the equivalent is --worker-class gthread --threads 4.
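If you would rather keep these settings in version control than in a long command line, the ASGI command above translates directly into a Gunicorn config file, which is plain Python. A sketch, assuming the file is named gunicorn.conf.py and loaded with gunicorn -c gunicorn.conf.py app.main:app; the values simply mirror the flags shown earlier.

```python
# gunicorn.conf.py: config-file form of the ASGI command line shown above.
worker_class = "uvicorn.workers.UvicornWorker"
workers = 4
bind = "0.0.0.0:8000"

timeout = 60            # --timeout
graceful_timeout = 30   # --graceful-timeout
keepalive = 5           # --keep-alive

max_requests = 2000         # --max-requests
max_requests_jitter = 200   # --max-requests-jitter

accesslog = "-"   # "-" writes access logs to stdout
errorlog = "-"    # "-" writes error logs to stderr
loglevel = "info"
```

For a WSGI app on gthread, the same file would set worker_class = "gthread" and add threads = 4.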

Worker count math

The rule of thumb you see everywhere is (2 * CPU) + 1. That comes from Gunicorn's own docs and applies to sync workers on a single-core-per-worker assumption. It is correct for that case, and incorrect for almost every other case.

How we actually pick worker count:

ASGI (Uvicorn workers): one worker per CPU core, sometimes one less to leave a core for the kernel and sidecars. An async worker can serve many concurrent connections inside its event loop; adding more processes than cores just causes scheduler thrash and per-process memory overhead.

# What we put in our entrypoint scripts
import os
workers = int(os.environ.get("WEB_CONCURRENCY", os.cpu_count() or 1))

Sync WSGI: (2 * CPU) + 1 if requests are short and uniform; lower (just CPU + 1) if requests are long-tailed, because each long request blocks an entire worker and you want to keep tail latency under control.

gthread WSGI: one worker per CPU core and 8-16 threads per worker for I/O-heavy workloads. The total concurrency is workers × threads.

Memory budget always wins over CPU math. If your worker resident set is 400MB and the box has 4GB RAM, you cannot run 16 workers no matter what (2 * CPU) + 1 says. We measure RSS under realistic load and pick worker count as min(cpu_math, available_ram / worker_rss * 0.7). The 0.7 leaves headroom for the kernel, log buffers, and the inevitable memory growth between deploys.
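The arithmetic is easy to get wrong under pressure, so here is a minimal sketch of it. The function names and parameters are ours, invented for illustration; the formulas are the ones described above, with the memory cap rounded down and a floor of one worker.

```python
import math


def cpu_based_workers(worker_class: str, cpus: int, long_tailed: bool = False) -> int:
    """CPU-side worker count for the worker classes discussed above."""
    if worker_class == "sync":
        # (2 * CPU) + 1 for short uniform requests, CPU + 1 for long-tailed ones
        return cpus + 1 if long_tailed else 2 * cpus + 1
    # ASGI (Uvicorn) and gthread: one process per core
    return cpus


def pick_workers(cpu_math: int, available_ram_mb: float, worker_rss_mb: float,
                 headroom: float = 0.7) -> int:
    """min(cpu_math, available_ram / worker_rss * headroom), never below 1."""
    memory_cap = math.floor(available_ram_mb / worker_rss_mb * headroom)
    return max(1, min(cpu_math, memory_cap))


# 8 cores of sync workers suggests 17, but 400MB workers on a 4GB box cap it.
cpu_math = cpu_based_workers("sync", cpus=8)
print(cpu_math, pick_workers(cpu_math, available_ram_mb=4096, worker_rss_mb=400))
```

With those inputs the CPU rule asks for 17 workers and the memory budget allows 7, so 7 is what ships.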

Timeouts: every single one matters

The Gunicorn default timeout is 30 seconds. We almost always change it, and we always set the related timeouts explicitly so the chain is coherent.

--timeout 60              # kill a worker that has been silent for >60s (sync: a request running >60s; async: a blocked event loop)
--graceful-timeout 30     # on SIGTERM, give workers 30s to drain
--keep-alive 5            # how long to hold idle keepalive connections

The interplay with the load balancer matters more than the absolute numbers. We set:

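Gunicorn keep-alive > LB idle timeout whenever the LB reuses backend connections (for an ALB at 60s, that means raising --keep-alive from the 5 shown above to around 65). If Gunicorn closes an idle connection first, the LB can send the next request down a socket the backend has already closed, which surfaces as 502s on the next request reusing that pool slot. A keep-alive of 5s only makes sense when clients connect to Gunicorn directly.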
LB request timeout >= Gunicorn --timeout (typically both at 60s). If the LB times out first, the client sees a 504 but the Python worker keeps grinding away for another N seconds.
Pod terminationGracePeriodSeconds > --graceful-timeout (typically 45s vs 30s). The pod must outlive Gunicorn's drain, or Kubernetes SIGKILLs in the middle of in-flight requests.

For managed Python services on AWS running behind an ALB, the ALB defaults to 60s; we usually leave that and pin Gunicorn to match. For Cloudflare in front of a service, we drop the Gunicorn timeout to 30s and rely on Cloudflare's 100s edge timeout being the long pole.
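Two of these orderings, the request timeout and the pod grace period, are simple enough to assert in CI before a deploy. A minimal sketch, assuming you can get the four numbers out of your own deployment config; the function and parameter names are hypothetical and nothing here talks to a real load balancer or cluster.

```python
def check_timeout_chain(lb_request_timeout: int, gunicorn_timeout: int,
                        graceful_timeout: int,
                        termination_grace_period: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means coherent."""
    problems = []
    if lb_request_timeout < gunicorn_timeout:
        problems.append(
            f"LB request timeout ({lb_request_timeout}s) is shorter than "
            f"Gunicorn --timeout ({gunicorn_timeout}s)"
        )
    if termination_grace_period <= graceful_timeout:
        problems.append(
            f"terminationGracePeriodSeconds ({termination_grace_period}s) does not "
            f"exceed --graceful-timeout ({graceful_timeout}s)"
        )
    return problems


# The typical values quoted above pass cleanly.
print(check_timeout_chain(60, 60, 30, 45))
```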

--max-requests: the underrated knob

--max-requests 2000 --max-requests-jitter 200

Every worker is recycled after handling 2000 requests (plus a random jitter of up to 200 to avoid all workers recycling at once). This is the single highest-leverage flag for keeping a Python service stable over weeks of uptime.

Why? Python is not great at returning memory to the OS. CPython's allocator holds onto pages, native extensions leak, large objects fragment the heap. A worker that has handled 200,000 requests will have measurably more RSS than a freshly-forked one, even if the actual working set is the same. --max-requests puts a ceiling on this by ensuring no worker lives long enough to accumulate problems.

The jitter matters: without it, workers that started together and share traffic evenly reach the limit at roughly the same moment and recycle together, and capacity drops sharply until the replacements finish booting. With jitter, the recycling staggers.
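To see the staggering, it helps to model the per-worker limit. The sketch below assumes each worker's limit is max_requests plus a random integer between 0 and the jitter, which is our understanding of how Gunicorn applies it; the function name and the seed parameter are ours, added so the illustration is repeatable.

```python
import random
from typing import Optional


def recycle_limits(workers: int, max_requests: int, jitter: int,
                   seed: Optional[int] = None) -> list[int]:
    """Simulate the request count at which each worker would be recycled."""
    rng = random.Random(seed)
    return [max_requests + rng.randint(0, jitter) for _ in range(workers)]


# Without jitter every worker shares one limit; with jitter they spread out.
print(recycle_limits(4, 2000, 0))
print(recycle_limits(4, 2000, 200, seed=1))
```

Every simulated limit falls between 2000 and 2200, so under even traffic the workers retire at different times instead of all at once.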

Preload, or not?

--preload forks workers after the application has been imported in the master process. Pros: workers share read-only pages via copy-on-write, lower per-worker memory. Cons: any code with side effects at import time (database connection pools, gRPC channels, background threads) runs in the master and gets inherited oddly by workers — usually fatally.

We default to --preload off for ASGI services (most modern apps initialise async resources in lifespan handlers, which run in each worker after the fork). We turn it on for older Django monoliths with cold-import-heavy modules where the memory saving is real and the side-effects are auditable.

Health checks and graceful shutdown

# app/main.py
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException

_shutting_down = False

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup work would go before the yield; the code after it runs on shutdown.
    # (Replaces the deprecated @app.on_event("shutdown") hook.)
    global _shutting_down
    yield
    _shutting_down = True

app = FastAPI(lifespan=lifespan)

@app.get("/healthz")
async def healthz():
    if _shutting_down:
        # 503 so the LB stops sending traffic before SIGTERM completes
        raise HTTPException(status_code=503, detail="shutting down")
    return {"status": "ok"}

@app.get("/readyz")
async def readyz():
    # Check downstream deps here: DB ping, cache ping, etc.
    return {"status": "ready"}

The pattern: separate liveness (/healthz) from readiness (/readyz). Liveness flips to unhealthy the moment shutdown fires, so the load balancer drains before the worker actually stops accepting connections. Without this, every rolling deploy drops a handful of in-flight requests.
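The readiness handler above leaves the dependency checks as a comment. One way to fill that in is a small helper that runs each check with a timeout and reports per-dependency status. This is a sketch, not our production code: run_readiness_checks and the two example checks are hypothetical, and a real check would ping your actual database or cache client.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def run_readiness_checks(
    checks: Dict[str, Callable[[], Awaitable[None]]],
    timeout: float = 1.0,
) -> Dict[str, str]:
    """Run each named check; a check passes if it returns without raising."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            await asyncio.wait_for(check(), timeout=timeout)
            results[name] = "ok"
        except Exception as exc:  # includes timeouts
            results[name] = f"failed: {type(exc).__name__}"
    return results


async def fake_db_ping() -> None:
    """Hypothetical check that succeeds."""


async def fake_cache_ping() -> None:
    """Hypothetical check that fails."""
    raise ConnectionError("cache unreachable")


print(asyncio.run(run_readiness_checks({"db": fake_db_ping, "cache": fake_cache_ping})))
```

Inside /readyz you would await the helper and return a 503 if any value is not "ok", so the orchestrator stops routing to a pod whose dependencies are down.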

What we ship by default

For managed Python services we operate, every deployment gets:

An entrypoint that reads WEB_CONCURRENCY, WEB_TIMEOUT, and WEB_MAX_REQUESTS from env, with sensible defaults derived from CPU and memory limits.
Prometheus-format metrics on /metrics via prometheus-client, including per-worker request duration histograms and active request gauges.
A pre-stop hook that flips /healthz to 503 and sleeps for LB_DRAIN_SECONDS before Gunicorn even receives SIGTERM.
An init container that runs pip check against the installed environment and fails the pod if dependencies are inconsistent — a cheap guard against half-baked image builds.
Dashboards that plot active workers, request queue depth, p50/p95/p99 latency, and worker recycling rate. The recycling rate is the one nobody else watches and it tells you a lot about memory pressure.
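The first item on that list is small enough to show. A minimal sketch of the env parsing, assuming the three variable names above; the function name is ours, and the defaults here come only from the CPU count and the values used earlier in this post, whereas the real entrypoint also looks at memory limits.

```python
import os
from typing import Mapping


def settings_from_env(env: Mapping[str, str] = os.environ) -> dict:
    """Read worker settings from the environment with fallback defaults."""
    # Caveat: os.cpu_count() may report the host's cores, not the
    # container's CPU limit, so set WEB_CONCURRENCY explicitly in containers.
    cpus = os.cpu_count() or 1
    return {
        "workers": int(env.get("WEB_CONCURRENCY", cpus)),
        "timeout": int(env.get("WEB_TIMEOUT", 60)),
        "max_requests": int(env.get("WEB_MAX_REQUESTS", 2000)),
    }


print(settings_from_env({"WEB_CONCURRENCY": "3"}))
```

The returned dict maps one-to-one onto Gunicorn's workers, timeout, and max_requests settings, so it can feed either a config file or a generated command line.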

The Gunicorn defaults are not malicious; they are just generic. Five minutes of tuning per service, applied consistently, gets you a Python tier that is boring under load — which is the only kind worth running. We are happy to review your current config if you suspect yours is doing more harm than good.

Sudhanshu K. is a Senior SRE at EdgeServers (RemotIQ Pty Ltd, ABN 91 682 628 128). She has spent more hours staring at Gunicorn worker state transitions than she will admit at dinner parties.