FastAPI on Kubernetes — the production deployment we ship by default

FastAPI has won. Across the managed Python services we run, it is now the default for any new HTTP API, and it has replaced enough Flask and DRF deployments that "legacy" is starting to mean "Django pre-async." The wins are real: type-safe request/response models, automatic OpenAPI documentation, async-native request handling, and a development experience that genuinely accelerates new-service ramp-up.

The catch is that "starts a FastAPI service" and "runs FastAPI in production" are different problems. The framework's defaults are excellent for the former and silent about the latter. This post is the deployment template we apply to every new FastAPI service that lands on a managed Kubernetes cluster.

ASGI server: Uvicorn behind Gunicorn

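FastAPI's docs have historically recommended Uvicorn standalone for development and "Uvicorn behind Gunicorn" for production. Newer versions of the docs lean towards Uvicorn's built-in --workers flag, or a single process per container on Kubernetes, but the Gunicorn setup has held up for us.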

# Dockerfile (excerpt)
CMD ["gunicorn", "app.main:app", \
     "--worker-class", "uvicorn.workers.UvicornWorker", \
     "--workers", "4", \
     "--bind", "0.0.0.0:8000", \
     "--timeout", "60", \
     "--graceful-timeout", "30", \
     "--keep-alive", "5", \
     "--max-requests", "2000", \
     "--max-requests-jitter", "200", \
     "--access-logfile", "-", \
     "--error-logfile", "-"]

The worker count shown is a placeholder: in practice the entrypoint script templates it from the environment (derived from the Kubernetes CPU request, typically one worker per available CPU). Gunicorn also falls back to the WEB_CONCURRENCY environment variable when --workers is omitted; an explicit --workers flag overrides it. The other knobs are constants we apply across services. The reasoning behind each knob (keep-alive, max-requests, graceful-timeout) is the same as for any Gunicorn deployment, and we wrote about it in detail elsewhere; the short version is "the defaults are for dev, not for boxes that take real traffic."
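The "one worker per available CPU" rule can be sketched roughly as follows. This is a minimal illustration, not the actual entrypoint script: the CPU_REQUEST variable name and both helper functions are hypothetical, and it assumes the CPU quantity reaches the container in Kubernetes notation, either millicores ("500m") or cores ("2").

```python
import math
import os


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("500m", "2", "1.5") to cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def workers_for(quantity: str) -> int:
    """One worker per whole CPU, and never fewer than one."""
    return max(1, math.floor(parse_cpu(quantity)))


if __name__ == "__main__":
    # CPU_REQUEST is a hypothetical name; inject it however your manifest prefers.
    print(workers_for(os.environ.get("CPU_REQUEST", "1")))
```

An entrypoint script could export the printed value as WEB_CONCURRENCY before starting Gunicorn, provided the --workers flag is left off the command line.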

Hypercorn is the usual alternative if you need HTTP/2 or HTTP/3 termination at the application layer. We almost never do: termination happens at the ingress or load balancer, and the application speaks HTTP/1.1 behind it. If your situation is different, note that Hypercorn is a standalone ASGI server with its own worker management, so it replaces the Gunicorn-plus-Uvicorn pair rather than slotting in as a Gunicorn worker class.

Pydantic v2: turn on every performance feature

The Pydantic v1 → v2 migration is largely complete in the ecosystem now, and the v2 performance characteristics are dramatically better — request validation that used to be 30% of CPU time on hot endpoints is closer to 5%. But there are still configuration knobs that affect performance materially.

from pydantic import BaseModel, ConfigDict

# OrderItem is another Pydantic model, defined elsewhere in the service.

class CreateOrderRequest(BaseModel):
    model_config = ConfigDict(
        # Reject extra fields. Default is "ignore", which silently drops them.
        extra="forbid",
        # Default, stated explicitly: do not re-validate on attribute assignment.
        validate_assignment=False,
        # Store the enum's value on the model rather than the enum member.
        use_enum_values=True,
        # Default, stated explicitly: leave leading/trailing whitespace untouched.
        str_strip_whitespace=False,
    )

    customer_id: str
    items: list[OrderItem]
    notes: str | None = None

extra="forbid" is the one we argue about with customers most often. It means a request with a typo'd field name ({"customer_id": ..., "items": [...], "note": "..."}) is rejected with a 422 rather than accepted with the misspelled notes field silently dropped. (A typo in a required field such as items fails validation either way, because the field is then missing.) We think rejection is correct: it surfaces client bugs at the boundary. The pushback is "but clients might evolve", to which the answer is "version your API, do not let clients smuggle unknown fields through validation."
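The difference is easy to see outside FastAPI. The sketch below uses two cut-down, hypothetical models (LenientRequest and StrictRequest are not from our codebase) and feeds both the same payload with a misspelled optional field.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class LenientRequest(BaseModel):
    # Pydantic's default behaviour: unknown fields are dropped without complaint.
    model_config = ConfigDict(extra="ignore")
    customer_id: str
    notes: Optional[str] = None


class StrictRequest(BaseModel):
    model_config = ConfigDict(extra="forbid")
    customer_id: str
    notes: Optional[str] = None


# The client meant "notes" but sent "note".
payload = {"customer_id": "c-1", "note": "leave at the door"}

# Accepted, and the text the client intended to send is gone.
lenient = LenientRequest.model_validate(payload)

try:
    StrictRequest.model_validate(payload)
    error_types = []
except ValidationError as exc:
    # FastAPI turns this validation failure into a 422 response.
    error_types = [err["type"] for err in exc.errors()]
```

With the lenient model the request succeeds and notes is None; with the strict model the client gets told which field was not recognised.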

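For response models, the equivalent best practice is to set response_model= on the route rather than rely on the return annotation alone. Recent FastAPI versions (0.89 and later) also derive the response model from the return annotation, but the explicit argument takes precedence and keeps the filtering in place when the function actually returns something else, such as a database object: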

@app.post("/orders", response_model=OrderResponse, status_code=201)
async def create_order(payload: CreateOrderRequest) -> OrderResponse:
    order = await order_service.create(payload)
    return order

This guarantees the response is filtered through OrderResponse, so a database object that grew an internal_admin_notes field last week does not accidentally leak into the API.
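The filtering step can be approximated with Pydantic alone. In this sketch OrderRecord is a hypothetical stand-in for a database object, and the two-field OrderResponse is an assumed shape; FastAPI's real pipeline does more, but the effect on undeclared attributes is the same.

```python
from pydantic import BaseModel


class OrderRecord:
    """Hypothetical stand-in for whatever the persistence layer returns."""

    def __init__(self, id: str, status: str, internal_admin_notes: str) -> None:
        self.id = id
        self.status = status
        self.internal_admin_notes = internal_admin_notes


class OrderResponse(BaseModel):
    id: str
    status: str


record = OrderRecord(
    id="o-1",
    status="created",
    internal_admin_notes="flagged by fraud team",
)

# Validate the object against the response model by attribute access,
# then serialise only the fields the model declares.
public = OrderResponse.model_validate(record, from_attributes=True).model_dump()
```

The internal attribute never reaches the serialised output because the response model does not declare it.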

OpenAPI: commit the spec, lint it in CI

FastAPI generates OpenAPI automatically from the route signatures. The mistake is to treat this as a runtime convenience rather than a versioned artefact.

Our pattern: generate the OpenAPI JSON at build time, commit it to the repo, and lint it in CI.

# scripts/dump_openapi.py
import json
from app.main import app
print(json.dumps(app.openapi(), indent=2, sort_keys=True))

# .github/workflows/openapi.yml
name: openapi
on: [pull_request]
jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install -r requirements.lock
      - run: python scripts/dump_openapi.py > generated.json
      - run: diff generated.json openapi.json
      - run: npx @stoplight/spectral-cli lint openapi.json

What this catches:

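A route gets renamed or removed unintentionally: the diff against the committed spec flags it before merge.
A response schema changes incompatibly: the diff forces the change into the PR, where a reviewer can see it. Spectral lints a single document against style rules and does not compare versions, so automatic breaking-change detection needs a spec-diff tool (oasdiff is one option) alongside it.
A new endpoint is added without documentation: the schema-change PR is the moment to ask "does this have a summary and description?"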

Clients (mobile, web, third-party integrations) generate code from openapi.json in the repo, not from the running service. The running service is a moving target; the committed spec is a contract.
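A plain diff flags every change equally. If you want the CI output to call out removed or renamed routes specifically, a small comparison along these lines is one way to do it. This is a sketch under stated assumptions: both function names are hypothetical, and it only looks at path and method, not at schemas.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def operations(spec: dict) -> set:
    """Return every (METHOD, path) pair declared in an OpenAPI document."""
    return {
        (method.upper(), path)
        for path, item in spec.get("paths", {}).items()
        for method in item
        if method in HTTP_METHODS
    }


def removed_operations(committed: dict, generated: dict) -> list:
    """Operations present in the committed spec but missing from the generated one."""
    return sorted(operations(committed) - operations(generated))


committed = {
    "paths": {
        "/orders": {"get": {}, "post": {}},
        "/orders/{id}": {"get": {}},
    }
}
generated = {
    "paths": {
        "/orders": {"post": {}},
        "/orders/{order_id}": {"get": {}},
    }
}

removed = removed_operations(committed, generated)
```

Here the dropped GET /orders and the renamed path parameter both show up as removals, which is how a generated client would experience them.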

Health checks: liveness vs readiness, properly separated

A common mistake is /healthz that checks the database. When the database briefly hiccups, every pod fails its liveness probe, Kubernetes restarts them all, none of them can connect to the recovering database, and you have escalated a 15-second blip into a 10-minute outage.

The pattern:

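from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException, status

# db and cache are the service's own async clients, defined elsewhere.
_state = {"shutting_down": False, "ready": False}

# lifespan replaces the deprecated @app.on_event("startup") / ("shutdown") hooks.
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: warm connection pools, load models, etc.
    await db.connect()
    await cache.connect()
    _state["ready"] = True
    yield
    # Shutdown: runs once the server has stopped taking new requests.
    _state["shutting_down"] = True
    await db.disconnect()
    await cache.disconnect()

app = FastAPI(lifespan=lifespan)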
 
@app.get("/healthz")
async def healthz():
    # Liveness: am I running? Restart me if not.
    # Does NOT check downstream — those are someone else's problem.
    if _state["shutting_down"]:
        raise HTTPException(status.HTTP_503_SERVICE_UNAVAILABLE)
    return {"status": "ok"}
 
@app.get("/readyz")
async def readyz():
    # Readiness: should I receive traffic right now?
    # Checks downstream. If DB is down, stop receiving traffic.
    if _state["shutting_down"] or not _state["ready"]:
        raise HTTPException(status.HTTP_503_SERVICE_UNAVAILABLE)
    try:
        await db.ping(timeout=1.0)
    except Exception:
        raise HTTPException(status.HTTP_503_SERVICE_UNAVAILABLE)
    return {"status": "ready"}

And in the manifest:

# deployment.yaml (excerpt)
livenessProbe:
  httpGet: { path: /healthz, port: 8000 }
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet: { path: /readyz, port: 8000 }
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2

Liveness restarts the pod. Readiness only removes it from the Service endpoint set. They are not the same probe, and they should not check the same things.
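The timeout on the downstream check in /readyz matters as much as the check itself: a Kubernetes probe times out after 1 second by default (timeoutSeconds), so a ping that hangs must fail fast rather than stall the handler. For clients whose ping call takes no timeout argument, a wrapper like the following is one option. The dependency_ready name is hypothetical, and the three fake pings exist only to exercise it.

```python
import asyncio
from typing import Awaitable, Callable


async def dependency_ready(
    ping: Callable[[], Awaitable[object]], timeout: float = 1.0
) -> bool:
    """Return True only if ping() finishes without error inside the timeout."""
    try:
        await asyncio.wait_for(ping(), timeout=timeout)
    except Exception:
        # Covers the timeout as well as connection errors raised by the ping.
        return False
    return True


async def _demo() -> list:
    async def healthy() -> str:
        return "pong"

    async def hung() -> None:
        await asyncio.sleep(0.5)

    async def refused() -> None:
        raise ConnectionError("database unavailable")

    return [
        await dependency_ready(healthy, timeout=0.1),
        await dependency_ready(hung, timeout=0.05),
        await dependency_ready(refused, timeout=0.1),
    ]


results = asyncio.run(_demo())
```

A readiness handler would return 503 whenever the helper returns False, which takes the pod out of rotation without restarting it.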

The Kubernetes manifest we ship by default

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels: { app: orders-api }
  template:
    metadata:
      labels: { app: orders-api }
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: app
          image: registry.example/orders-api:abc123
          ports: [{ containerPort: 8000 }]
          env:
            - name: WEB_CONCURRENCY
              value: "4"
          resources:
            requests: { cpu: 500m, memory: 512Mi }
            limits:   { cpu: 2000m, memory: 1Gi }
          livenessProbe:
            httpGet: { path: /healthz, port: 8000 }
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /readyz, port: 8000 }
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]
          securityContext:
            runAsNonRoot: true
            runAsUser: 10001
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities: { drop: [ALL] }
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api
spec:
  minAvailable: 2
  selector:
    matchLabels: { app: orders-api }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65

Highlights:

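maxUnavailable: 0 during rolling updates, so a deploy never takes an old pod down until its replacement is ready, and all 3 replicas keep serving. The PDB (minAvailable: 2) covers the other case, voluntary disruptions such as node drains, where it keeps at least 2 pods up.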
preStop: sleep 10 delays SIGTERM so that the pod's removal from the Service endpoints has time to propagate to the ingress and load balancer before Gunicorn starts shutting down. Without this, requests still being routed to the pod get cut off during deploys.
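terminationGracePeriodSeconds: 45 has to cover the preStop sleep (10) plus Gunicorn's --graceful-timeout (30), because the grace period clock starts when termination begins, before preStop runs. 10 + 30 = 40, which leaves 5 seconds of headroom. The chain has to line up.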
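readOnlyRootFilesystem: true and dropped capabilities. Python apps almost never need to write to root; if yours does, mount a tmpfs for the specific paths that need it. Gunicorn itself is the usual exception: it writes worker heartbeat files to a temp directory, so point --worker-tmp-dir at /dev/shm or mount an emptyDir at /tmp.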
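HPA on CPU at 65%. Lower than 80% because Python latency under load grows non-linearly past 75%, and scaling reactively from 80% means you spend a minute in pain before the new pods are ready. Note that the HPA measures utilization against the CPU request (500m here), not the limit.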
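Two of these numbers are worth sanity-checking whenever the template is adjusted: the shutdown budget and the replica count the HPA will ask for. The sketch below encodes both. The function names are hypothetical, and the HPA calculation is the core ratio formula only; it ignores the controller's tolerance band and stabilization windows.

```python
def shutdown_budget_ok(
    grace_period: int, pre_stop_sleep: int, graceful_timeout: int
) -> bool:
    """The grace period includes preStop, so it must cover both phases."""
    return grace_period >= pre_stop_sleep + graceful_timeout


def desired_replicas(
    current: int, utilization: int, target: int, minimum: int, maximum: int
) -> int:
    """ceil(current * utilization / target), clamped to the HPA bounds.

    utilization and target are percentages of the CPU request.
    """
    wanted = -(-current * utilization // target)  # integer ceiling division
    return max(minimum, min(maximum, wanted))


# Values from the manifest above: 45s grace, 10s preStop, 30s graceful timeout.
budget_ok = shutdown_budget_ok(45, 10, 30)

# Three pods averaging 90% of their CPU request, against the 65% target.
scaled = desired_replicas(current=3, utilization=90, target=65, minimum=3, maximum=12)
```

With the manifest's values the budget holds, and three pods at 90% utilization scale to five.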

Observability: not optional

Every FastAPI service we deploy gets:

Structured JSON logs (via structlog or python-json-logger), with trace ID propagation from the incoming request.
OpenTelemetry instrumentation for traces, with the auto-instrumentation for FastAPI, the database driver, and the HTTP client.
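/metrics endpoint via prometheus-fastapi-instrumentator, scraped by Prometheus, with histograms (not summaries) for request duration so percentiles are reaggregatable. With several Gunicorn workers per pod, the Prometheus client needs its multiprocess mode (PROMETHEUS_MULTIPROC_DIR) so that a scrape reflects all workers rather than whichever one answered.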
A standard Grafana dashboard — request rate, error rate, p50/p95/p99 latency, by route, per pod, with the standard golden-signals layout.

This is the "boring observability" baseline. Without it, debugging a slow endpoint in production becomes a guessing game. With it, the first question after an alert is "which route, which downstream, what changed" and you can answer all three in a minute.
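The mechanism behind "structured logs with trace ID propagation" is small enough to show with the standard library. This sketch is a stand-in for structlog or python-json-logger, not a replacement for them; trace_id_var, JsonFormatter, and build_logger are hypothetical names, and in a real service a middleware would set the trace ID from the incoming request headers.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object tagged with the trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "trace_id": trace_id_var.get(),
            }
        )


def build_logger(stream) -> logging.Logger:
    """Attach the JSON formatter to a logger writing to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("orders-api")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# In a service, middleware would set this per request.
trace_id_var.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("order created")

entry = json.loads(buffer.getvalue())
```

Because the trace ID lives in a context variable, every log line emitted while handling a request carries it without being passed through function arguments.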

What we ship by default

For every FastAPI service deployed onto managed clusters (whether DOKS on DigitalOcean, EKS on AWS, AKS on Azure, or GKE on GCP):

A multi-stage Docker build producing a non-root, minimal-base image with the locked dependencies installed.
Gunicorn-with-Uvicorn-workers, tuned per-service from observed CPU and memory usage.
Pydantic v2 with extra="forbid" on request models, response_model= on every route.
OpenAPI committed to the repo, diffed in CI against the generated spec.
Separated liveness and readiness probes that check different things.
A standard Kubernetes manifest with PDB, HPA, maxUnavailable: 0, preStop hooks, and security context locked down.
OpenTelemetry, structured logs, Prometheus metrics, and the golden-signals dashboard, plumbed in before the service serves its first production request.

FastAPI is a good framework. None of the above is exotic; it is the operational scaffolding that turns a good framework into a service you can run for years without surprise. We apply it once per service and it keeps paying off.

Sudhanshu K. is a senior platform engineer at EdgeServers (RemotIQ Pty Ltd, ABN 91 682 628 128). He has shipped enough FastAPI services to have strong opinions about Pydantic config and zero patience for extra="ignore".