Scaling Django past the easy plateau.

Prelude

Django scales further than its critics admit and further than its fans pretend. The interesting failure mode isn't the framework. It's the gradual drift of operational discipline as the team grows. The patterns below are what we put in place before a Django app becomes the system the business depends on.

Pooling, or "where did all my connections go?"

The first scaling wall most teams hit is connections. Every Gunicorn worker thread and every Celery worker process can hold its own connection, so even a modest fleet will exceed your Postgres max_connections at the worst possible moment.

The fix is rarely "raise max_connections." It's a layered pooling strategy:

PgBouncer in transaction mode in front of Postgres. Sized to your Postgres connection limit, not your fleet.
Django's CONN_MAX_AGE set to a sensible non-zero value for HTTP workers, zero for short-lived tasks.
Connection accounting exposed as a metric. Alert when in-use connections approach 80% of pool size.
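
As a rough sketch of the layering above, the settings side might look like the following. The helper names, the role values, the host name, the 600-second age, and the fleet numbers are all assumptions made for illustration. CONN_MAX_AGE and DISABLE_SERVER_SIDE_CURSORS are real Django database settings; the latter matters because transaction-mode pooling cannot keep a server-side cursor pinned to one backend connection.

```python
# Sketch only: role names, host, port, and numbers are illustrative assumptions.

def database_config(role: str) -> dict:
    """Build a Django DATABASES["default"] entry that points at PgBouncer."""
    return {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "pgbouncer.internal",  # hypothetical PgBouncer address
        "PORT": "6432",                # PgBouncer's conventional port
        "NAME": "app",                 # hypothetical database name
        # HTTP workers keep a connection for up to 10 minutes;
        # 0 closes the connection after each unit of work.
        "CONN_MAX_AGE": 600 if role == "web" else 0,
        # Needed when PgBouncer runs in transaction mode.
        "DISABLE_SERVER_SIDE_CURSORS": True,
    }


def peak_client_connections(hosts, gunicorn_workers, threads, celery_workers):
    """Worst case: every web thread and every Celery process holds a connection."""
    return hosts * (gunicorn_workers * threads + celery_workers)


def pool_alert(in_use: int, pool_size: int, threshold: float = 0.8) -> bool:
    """True once in-use connections reach the alert threshold (80% by default)."""
    return in_use >= threshold * pool_size
```

With six hosts, each running eight Gunicorn workers of four threads plus ten Celery processes, peak_client_connections(6, 8, 4, 10) gives 252 client connections, well past Postgres's default max_connections of 100. That gap is what PgBouncer absorbs.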

It's a pooling problem

If your Postgres dashboard shows connection count climbing in lockstep with deploys, you don't have a Postgres problem. You have a pooling problem.

Per-request query budgets.

N+1 queries are the most reliable way to slow a Django app, and the most reliable way to miss them is to wait until they're a problem. We set a per-endpoint query budget on every project (typically 10–20 queries) and ship a middleware that warns in dev and logs in production whenever a request exceeds it.

middleware.py

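# Django middleware enforcing per-request query budgets.
import logging

from django.conf import settings
from django.db import connection

logger = logging.getLogger(__name__)


class QueryBudgetMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        count = 0

        def count_query(execute, sql, params, many, context):
            # Counts every query, even with DEBUG off (connection.queries is debug-only).
            nonlocal count
            count += 1
            return execute(sql, params, many, context)

        with connection.execute_wrapper(count_query):
            response = self.get_response(request)

        # resolver_match is set during URL resolution inside get_response,
        # so read it afterwards; it stays None when no URL matched.
        match = request.resolver_match
        view = match.view_name if match else "unresolved"
        budget = settings.QUERY_BUDGETS.get(view, 20)
        if count > budget:
            logger.warning("query_budget_exceeded",
                extra={"view": view, "count": count, "budget": budget})
        return response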

The point isn't to make every endpoint fast. It's to make the cost of every endpoint visible. A 40-query endpoint with an explicit budget of 50 is a known cost. A 40-query endpoint with no budget is a slow regression waiting to happen.

Async work, kept boring.

Celery is the default for a reason. Most "we hit a Celery wall" stories we've audited are actually shape problems, not throughput problems. The patterns that hold up:

Task idempotency, enforced by the database. A unique constraint per (entity, operation) is the cheapest, most reliable form of "exactly-once."
Short tasks. If a task takes more than ~60 seconds, it's almost always doing two things. Split it.
Per-queue worker pools. One ill-behaved task category should not starve everything else. Slow queues, fast queues, and time-bounded queues are different rooms.
Retry policies that don't lie. Exponential backoff with a hard upper bound and a dead-letter destination. Forever-retry is a bug.
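
Two of the patterns above, database-enforced idempotency and bounded retries, can be sketched in a few lines. This is a minimal illustration, not a Celery integration: SQLite from the standard library stands in for Postgres so it runs anywhere, and the table, the task name, and the retry constants are all hypothetical.

```python
import sqlite3

# SQLite stands in for Postgres; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task_runs (
        entity_id INTEGER NOT NULL,
        operation TEXT NOT NULL,
        UNIQUE (entity_id, operation)
    )
    """
)


def claim(entity_id: int, operation: str) -> bool:
    """Return True only for the first caller to claim (entity, operation)."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO task_runs (entity_id, operation) VALUES (?, ?)",
                (entity_id, operation),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the unique constraint rejected it


def send_invoice(entity_id: int, sent: list) -> None:
    """Hypothetical task body: the side effect runs once per entity."""
    if claim(entity_id, "send_invoice"):
        sent.append(entity_id)


MAX_RETRIES = 5    # assumed hard upper bound on attempts
BASE_DELAY = 2.0   # seconds before the first retry
MAX_DELAY = 30.0   # cap so the backoff cannot grow without limit


def next_action(attempt: int):
    """Return ("retry", delay_seconds) or ("dead_letter", None)."""
    if attempt >= MAX_RETRIES:
        return ("dead_letter", None)
    return ("retry", min(BASE_DELAY * 2 ** attempt, MAX_DELAY))
```

In a real task the claim and the task's own writes would normally share one transaction, so a crash rolls both back. Celery's task options autoretry_for, retry_backoff, retry_backoff_max, and max_retries express the same retry policy declaratively; the dead-letter step is something you would wire up yourself.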

We replaced 'add another worker' with 'understand the task shape' and threw out half our infrastructure.

Lean on Postgres.

The single biggest performance unlock in any Django app we've worked on is moving work into Postgres that was being done in Python. The framework makes it slightly too easy to write the wrong-shaped query.
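
To make "wrong-shaped" concrete, here is a small comparison of summing in Python versus summing in the database. It is a sketch under loud assumptions: SQLite stands in for Postgres so the example is self-contained, and the orders table and its data are invented. In Django, with a hypothetical Order model, the second shape would be roughly Order.objects.values("customer_id").annotate(total=Sum("amount")).

```python
import sqlite3

# SQLite stands in for Postgres; the table and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 10), (1, 25), (2, 40), (2, 5), (2, 15)],
)

# Wrong shape: transfer every row, then add them up in Python.
totals_in_python = {}
for customer_id, amount in conn.execute("SELECT customer_id, amount FROM orders"):
    totals_in_python[customer_id] = totals_in_python.get(customer_id, 0) + amount

# Right shape: the database does the arithmetic and returns one row per customer.
totals_in_sql = dict(
    conn.execute("SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")
)
```

Both dictionaries hold the same totals. The difference is how many rows cross the wire, which is invisible with five rows and decisive with five million.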

Aggregations belong in Postgres. Prefer annotate() and aggregate() to Python loops, and subqueries to a second filtering pass in Python.
JSONB for high-cardinality optional fields. Beats EAV tables on every dimension that matters.
Window functions for running totals, ranks, lag/lead. Don't reach for Python; reach for SQL.
Read replicas with explicit routing. Django's database router is enough. Don't reinvent it.
Partial & expression indexes for the queries that actually matter, not blanket indexing.
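
For the read-replica item, a Django database router is a plain class with four methods and no base class, so a minimal one fits on a screen. The sketch below assumes two replica aliases named replica_1 and replica_2, which are hypothetical and would have to match keys in settings.DATABASES.

```python
import random


class ReplicaRouter:
    """Send reads to a replica and everything else to the primary.

    The aliases are hypothetical; they must match keys in settings.DATABASES.
    """

    primary = "default"
    replicas = ("replica_1", "replica_2")

    def db_for_read(self, model, **hints):
        # Spread reads across replicas at random.
        return random.choice(self.replicas)

    def db_for_write(self, model, **hints):
        return self.primary

    def allow_relation(self, obj1, obj2, **hints):
        # Every alias serves the same data, so relations are always allowed.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Migrate the primary only; replicas receive changes via replication.
        return db == self.primary
```

The router is registered through the DATABASE_ROUTERS setting. Because replicas lag, any code path that must read its own writes should still pin itself to the primary explicitly, for example with .using("default") on the queryset.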

Deploy discipline.

The last bucket isn't code, it's process. The teams that scale Django smoothly all share the same operational habits:

Expand-then-contract migrations over single-step schema changes. Add the new column, write to both, backfill, switch reads, drop the old one. Days, not seconds.
Migration plans reviewed alongside the PR. Locking behavior surfaced explicitly. Backfill costs estimated.
Feature flags for risky paths. Not for everything, only where a regression would be a revenue event.
Read-shadowing. A new query path runs alongside the old one and is diffed before it serves traffic.
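
Read-shadowing from the list above is simple enough to sketch as a single helper. The function name, logger name, and log messages are assumptions for illustration; the shape is what matters: the old path always serves the response, and the new path can only produce a log line.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("read_shadow")  # hypothetical logger name


def shadow_read(name: str, old: Callable[[], T], new: Callable[[], T]) -> T:
    """Serve the old path's result; run the new path only to compare."""
    expected = old()
    try:
        candidate = new()
    except Exception:
        # A failure in the shadow path must never reach the caller.
        logger.exception("shadow_read_error name=%s", name)
        return expected
    if candidate != expected:
        logger.warning("shadow_read_mismatch name=%s", name)
    return expected
```

A call site might look like shadow_read("invoice_totals", old_query, new_query), where both arguments are zero-argument callables. This version runs the shadow inline, so it doubles the read cost for that path; sampling a fraction of requests is a common way to keep that bounded.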

None of this is novel. It's the discipline that lets Django carry a real business, and the same discipline that lets you stay on it rather than embarking on a rewrite that rarely fixes anything fundamental.