# Rose Ash Scalability Plan
## Context
The coop runs 6 Quart microservices (blog, market, cart, events, federation, account) + Art-DAG (L1/L2 FastAPI) on a single-node Docker Swarm. Current architecture handles light traffic but has concrete bottlenecks: single Postgres, single Redis (256MB), no replicas, single Hypercorn worker per container, sequential AP delivery, no circuit breakers, and 6 competing EventProcessors. The decoupling work (contracts, HTTP data/actions, fragment composition) is complete — the code is structurally ready to scale, but the deployment and runtime aren't.
This plan covers everything from quick config wins to massive federation scale, organized in 4 tiers. Each tier unlocks roughly an order of magnitude.
---
## TIER 0 — Deploy Existing Code + Quick Config
**Target: low thousands concurrent. Effort: hours.**
### T0.1: Separate Auth Redis
**Why:** Auth keys (`grant:*`, `did_auth:*`, `prompt:*`) on DB 15 share a single Redis instance (256MB, `allkeys-lru`). Cache pressure from fragment/page caching can silently evict auth state, causing spurious logouts.
**Files:**
- `docker-compose.yml` — add `redis-auth` service: `redis:7-alpine`, `--maxmemory 64mb --maxmemory-policy noeviction`
- `docker-compose.yml` — update `REDIS_AUTH_URL: redis://redis-auth:6379/0` in `x-app-env`
**No code changes:** `shared/infrastructure/auth_redis.py` already reads `REDIS_AUTH_URL` from the environment.
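A minimal compose sketch of the two changes (the memory flags are the ones above; the list-form `command` and restart policy are stylistic assumptions):

```yaml
x-app-env: &app-env
  # ...existing shared env vars...
  REDIS_AUTH_URL: redis://redis-auth:6379/0

services:
  redis-auth:
    image: redis:7-alpine
    # Small, never-evicting instance: losing auth state is worse than an OOM error.
    command: ["redis-server", "--maxmemory", "64mb", "--maxmemory-policy", "noeviction"]
    restart: unless-stopped
```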
### T0.2: Bump Redis Memory
**File:** `docker-compose.yml` (line 168)
- Change `--maxmemory 256mb` to `--maxmemory 1gb` (or 512mb minimum)
- Keep `allkeys-lru` for the data Redis (fragments + page cache)
### T0.3: Deploy the Database Split
**What exists:** `_config/init-databases.sql` (creates 6 DBs) and `_config/split-databases.sh` (migrates table groups). Code in `session.py` lines 46-101 already creates separate engines when URLs differ. `bus.py`, `user_loader.py`, and `factory.py` all have conditional cross-DB paths.
**Files:**
- `docker-compose.yml` — per-service `DATABASE_URL` overrides (blog→`db_blog`, market→`db_market`, etc.) plus `DATABASE_URL_ACCOUNT`→`db_account` and `DATABASE_URL_FEDERATION`→`db_federation`
- `_config/split-databases.sh` — add `menu_nodes` + `container_relations` to ALL target DBs (small read-only tables needed by `get_navigation_tree()` in non-blog apps until T1.6 replaces this)
**Deployment:** run `init-databases.sql`, stop services, run `split-databases.sh`, update compose env, redeploy.
### T0.4: Add PgBouncer
**Why:** 6 apps × (pool_size 5 + max_overflow 10) = up to 90 connections to one Postgres. After the DB split and worker bump, this multiplies further. Postgres's default `max_connections=100` will be hit.
**Files:**
- `docker-compose.yml` — add `pgbouncer` service (transaction-mode pooling, `default_pool_size=20`, `max_client_conn=300`)
- `docker-compose.yml` — change all `DATABASE_URL` values from `@db:5432` to `@pgbouncer:6432`
- `shared/db/session.py` (lines 13-20) — add `pool_timeout=10`, `pool_recycle=1800`. Consider reducing `pool_size` to 3 since PgBouncer handles pooling.
### T0.5: Hypercorn Workers
**Why:** Single async event loop per container. CPU-bound work (Jinja2 rendering, RSA signing) blocks everything.
**Files:**
- All 6 `{app}/entrypoint.sh` — change Hypercorn command to:
```
exec hypercorn "${APP_MODULE:-app:app}" --bind 0.0.0.0:${PORT:-8000} --workers ${WORKERS:-2} --keep-alive 75
```
**No code changes needed** — `EventProcessor` uses `SKIP LOCKED` so multiple workers competing is safe. Each worker gets its own httpx client singleton and SQLAlchemy pool (correct fork behavior).
**Depends on:** T0.4 (PgBouncer) — doubling workers doubles connection count.
---
## TIER 1 — Fix Hot Paths
**Target: tens of thousands concurrent. Effort: days.**
### T1.1: Concurrent AP Delivery
**The single biggest bottleneck.** `ap_delivery_handler.py` delivers to inboxes sequentially (lines 230-246). 100 followers = up to 100 x 15s = 25 minutes.
**File:** `shared/events/handlers/ap_delivery_handler.py`
- Add `DELIVERY_CONCURRENCY = 10` semaphore
- Replace sequential inbox loop with `asyncio.gather()` bounded by semaphore
- Reuse a single `httpx.AsyncClient` per `on_any_activity` call with explicit pool limits: `httpx.Limits(max_connections=20, max_keepalive_connections=10)`
- Batch `APDeliveryLog` inserts (one flush per activity, not per inbox)
**Result:** delivery to 100 followers drops from ~25 min to ~2.5 min at 10-way concurrency, often less.
### T1.2: Fragment Circuit Breaker + Stale-While-Revalidate
**Why:** Every page render makes 3-4 internal HTTP calls (`fetch_fragments()`). A slow/down service blocks all page renders for up to 2s per fragment. No graceful degradation.
**File:** `shared/infrastructure/fragments.py`
**Circuit breaker** (add near line 27):
- Per-app `_CircuitState` tracking consecutive failures
- Threshold: 3 failures → circuit opens for 30s
- When open: skip HTTP, fall through to stale cache
**Stale-while-revalidate** (modify `fetch_fragment_cached()`, lines 246-288):
- Store fragments in Redis as `{"html": "...", "ts": 1234567890.0}` instead of plain string
- Soft TTL = normal TTL (30s). Hard TTL = soft + 300s (5 min stale window)
- Within soft TTL: return cached. Between soft and hard: return cached + background revalidate. Past hard: block on fetch. On fetch failure: return stale if available, empty string if not.
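The soft/hard TTL branch can be sketched as a pure decision function (the function name is illustrative; the constants match the TTLs above):

```python
import time

SOFT_TTL = 30.0       # normal fragment TTL (seconds)
STALE_WINDOW = 300.0  # serve-stale window beyond the soft TTL

def swr_decision(entry, now=None):
    """Classify a cached fragment entry of shape {"html": ..., "ts": ...}.

    Returns one of:
      "fresh"      -- within soft TTL: serve cached HTML as-is
      "revalidate" -- between soft and hard TTL: serve cached, refresh in background
      "miss"       -- past hard TTL (or no entry): block on a real fetch
    """
    if entry is None:
        return "miss"
    now = time.time() if now is None else now
    age = now - entry["ts"]
    if age <= SOFT_TTL:
        return "fresh"
    if age <= SOFT_TTL + STALE_WINDOW:
        return "revalidate"
    return "miss"
```

On fetch failure the "miss" path would fall back to any stale entry still in Redis, per the rule above.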
### T1.3: Partition Event Processors
**Why:** All 6 apps register the wildcard AP delivery handler via `register_shared_handlers()`. All 6 `EventProcessor` instances compete for every activity with `SKIP LOCKED`. 5 of 6 do wasted work on every public activity.
**Files:**
- `shared/events/handlers/__init__.py` — add `app_name` param to `register_shared_handlers()`. Only import `ap_delivery_handler` and `external_delivery_handler` when `app_name == "federation"`.
- `shared/infrastructure/factory.py` (line 277) — pass `name` to `register_shared_handlers(name)`
- `shared/infrastructure/factory.py` (line 41) — add `event_processor_all_origins: bool = False` param to `create_base_app()`
- `shared/infrastructure/factory.py` (line 271) — `EventProcessor(app_name=None if event_processor_all_origins else name)`
- `federation/app.py` — pass `event_processor_all_origins=True`
**Result:** Federation processes ALL activities (no origin filter) for delivery. Other apps process only their own origin activities for domain-specific handlers. No wasted lock contention.
### T1.4: httpx Client Pool Limits
**Why:** All three HTTP clients (`fragments.py`, `data_client.py`, `actions.py`) have no `limits` parameter. Default is 100 connections per client. With workers + replicas this fans out unbounded.
**Files:**
- `shared/infrastructure/fragments.py` (lines 38-45) — add `limits=httpx.Limits(max_connections=20, max_keepalive_connections=10)`
- `shared/infrastructure/data_client.py` (lines 36-43) — same
- `shared/infrastructure/actions.py` (lines 37-44) — `max_connections=10, max_keepalive_connections=5`
### T1.5: Data Client Caching
**Why:** `fetch_data()` has no caching. `cart-summary` is called on every page load across 4 apps. Blog post lookups by slug happen on every market/events page.
**File:** `shared/infrastructure/data_client.py`
- Add `fetch_data_cached()` following the `fetch_fragment_cached()` pattern
- Redis key: `data:{app}:{query}:{sorted_params}`, default TTL=10s
- Same circuit breaker + SWR as T1.2
**Callers updated:** all `app.py` context functions that call `fetch_data()` for repeated reads.
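A sketch of the key builder (function name is illustrative; the key shape follows the scheme above, with params canonicalized via sorted JSON as an assumed serialization):

```python
import json

def data_cache_key(app, query, params=None):
    """Build the Redis key for a cached cross-app data read.

    Params are serialized sorted-by-name so {"a": 1, "b": 2} and
    {"b": 2, "a": 1} produce the same key (and thus share one cache entry).
    """
    canonical = json.dumps(params or {}, sort_keys=True, separators=(",", ":"))
    return f"data:{app}:{query}:{canonical}"
```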
### T1.6: Navigation via HTTP Data Endpoint
**Why:** After DB split, `get_navigation_tree()` queries `menu_nodes` via `g.s`. But `menu_nodes` lives in `db_blog`. The T0.3 workaround (replicate table) works short-term; this is the proper fix.
**Files:**
- `shared/contracts/dtos.py` — add `MenuNodeDTO`
- `blog/bp/data/routes.py` — add `nav-tree` handler returning `[dto_to_dict(node) for node in nodes]`
- All non-blog `app.py` context functions — replace `get_navigation_tree(g.s)` with `fetch_data_cached("blog", "nav-tree", ttl=60)`
- Remove `menu_nodes` from non-blog DBs in `split-databases.sh` (no longer needed)
**Depends on:** T0.3 (DB split), T1.5 (data caching).
### T1.7: Fix Fragment Batch Parser O(n^2)
**File:** `shared/infrastructure/fragments.py` `_parse_fragment_markers()` (lines 217-243)
- Replace nested loop with single-pass: find all `<!-- fragment:{key} -->` markers in one scan, extract content between consecutive markers
- O(n) instead of O(n^2)
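The single-pass replacement might look like this (the marker format is the one above; the key character class is an assumption):

```python
import re

_MARKER = re.compile(r"<!-- fragment:([\w-]+) -->")

def parse_fragment_markers(body):
    """Single-pass split of a batched fragment response.

    Finds every `<!-- fragment:{key} -->` marker in one scan, then slices the
    content between consecutive markers: O(n) in the body length, versus a
    nested loop that rescans the body once per key.
    """
    matches = list(_MARKER.finditer(body))
    out = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        out[m.group(1)] = body[start:end].strip()
    return out
```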
### T1.8: Read Replicas
**Why:** After DB split, each domain DB has one writer. Read-heavy pages (listings, calendars, product pages) can saturate it.
**Files:**
- `docker-compose.yml` — add read replicas for high-traffic domains (blog, events)
- `shared/db/session.py` — add `DATABASE_URL_RO` env var, create read-only engine, add `get_read_session()` context manager
- All `bp/data/routes.py` and `bp/fragments/routes.py` — use read session (these endpoints are inherently read-only)
**Depends on:** T0.3 (DB split), T0.4 (PgBouncer).
---
## TIER 2 — Decouple the Runtime
**Target: hundreds of thousands concurrent. Effort: ~1 week.**
### T2.1: Edge-Side Fragment Composition (Nginx SSI)
**Why:** Currently every Quart app fetches 3-4 fragments per request via HTTP (`fetch_fragments()` in context processors). This adds latency and creates liveness coupling. SSI moves fragment assembly to Nginx, which caches each fragment independently.
**Changes:**
- `shared/infrastructure/jinja_setup.py` — change `_fragment()` Jinja global to emit SSI directives: `<!--#include virtual="/_ssi/{app}/{type}?{params}" -->`
- New Nginx config — map `/_ssi/{app}/{path}` to `http://{app}:8000/internal/fragments/{path}`, enable `ssi on`, proxy_cache with short TTL
- All fragment blueprint routes — add `Cache-Control: public, max-age={ttl}, stale-while-revalidate=300` headers
- Remove `fetch_fragments()` calls from all `app.py` context processors (templates emit SSI directly)
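A hedged sketch of the Nginx side (cache zone name, the upstream host, and the location regex are assumptions; the `/_ssi/` → `/internal/fragments/` mapping is the one above):

```nginx
proxy_cache_path /var/cache/nginx/fragments keys_zone=fragments:10m inactive=10m;

server {
    listen 80;
    resolver 127.0.0.11 valid=30s;  # Docker's embedded DNS; required when proxy_pass uses variables

    location / {
        ssi on;                     # expand <!--#include virtual="..."--> emitted by templates
        proxy_pass http://blog:8000;
    }

    location ~ ^/_ssi/(?<app>[a-z_]+)/(?<path>.+)$ {
        internal;                   # reachable only via SSI subrequests, not from clients
        proxy_pass http://$app:8000/internal/fragments/$path;
        proxy_cache fragments;
        proxy_cache_valid 200 30s;
        proxy_cache_use_stale error timeout updating;  # approximates stale-while-revalidate
    }
}
```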
### T2.2: Replace Outbox Polling with Redis Streams
**Why:** `EventProcessor` polls `ap_activities` every 2s with `SELECT FOR UPDATE SKIP LOCKED`. This creates persistent DB load even when idle. LISTEN/NOTIFY helps but is fragile.
**Changes:**
- `shared/events/bus.py` `emit_activity()` — after writing to DB, also `XADD coop:activities:pending` with `{activity_id, origin_app, type}`
- `shared/events/processor.py` — replace `_poll_loop` + `_listen_for_notify` with `XREADGROUP` (blocking read, no polling). Consumer groups handle partitioning. `XPENDING` + `XCLAIM` replace the stuck activity reaper.
- Redis Stream config: `MAXLEN ~10000` to cap memory
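The consumer side could be shaped like this, assuming redis-py-style `xreadgroup`/`xack` signatures; the client is injected (so the loop is testable without a server) and `handle(fields)` stands in for the existing activity dispatch:

```python
STREAM = "coop:activities:pending"
GROUP = "processors"

async def consume_pending(client, consumer, handle, batch=10, block_ms=5000):
    """One blocking XREADGROUP pass over the pending-activity stream.

    `client` is any object exposing async `xreadgroup`/`xack` with redis-py's
    signatures. Entries are acked only after `handle` succeeds, so crashed
    deliveries remain in XPENDING where the XCLAIM reaper can reclaim them.
    """
    entries = await client.xreadgroup(GROUP, consumer, {STREAM: ">"},
                                      count=batch, block=block_ms)
    for _stream, messages in entries or []:
        for msg_id, fields in messages:
            await handle(fields)
            await client.xack(STREAM, GROUP, msg_id)
```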
### T2.3: CDN for Static Assets
- Route `*.rose-ash.com/static/*` through CDN (Cloudflare, BunnyCDN)
- `_asset_url()` already adds `?v={hash}` fingerprint — CDN can cache with `max-age=31536000, immutable`
- No code changes, just DNS + CDN config
### T2.4: Horizontal Scaling (Docker Swarm Replicas)
**Changes:**
- `docker-compose.yml` — add `replicas: 2` (or 3) to blog, market, events, cart. Keep federation at 1 (handles all AP delivery).
- `blog/entrypoint.sh` — wrap Alembic in PostgreSQL advisory lock (`SELECT pg_advisory_lock(42)`) so only one replica runs migrations
- `docker-compose.yml` — add health checks per service
- `shared/db/session.py` — make pool sizes configurable via env vars (`DB_POOL_SIZE`, `DB_MAX_OVERFLOW`) so replicas can use smaller pools
**Depends on:** T0.4 (PgBouncer), T0.5 (workers).
---
## TIER 3 — Federation Scale
**Target: millions (federated network effects). Effort: weeks.**
### T3.1: Dedicated AP Delivery Service
**Why:** AP delivery is CPU-intensive (RSA signing) and I/O-intensive (HTTP to remote servers). Running inside federation's web worker blocks request processing.
**Changes:**
- New `delivery/` service — standalone asyncio app (no Quart/web server)
- Reads from Redis Stream `coop:delivery:pending` (from T2.2)
- Loads activity from federation DB, loads followers, delivers with semaphore (from T1.1)
- `ap_delivery_handler.py` `on_any_activity` → enqueues to stream instead of delivering inline
- `docker-compose.yml` — add `delivery-worker` service with `replicas: 2`, no port binding
**Depends on:** T2.2 (Redis Streams), T1.1 (concurrent delivery).
### T3.2: Per-Domain Health Tracking + Backoff
**Why:** Dead remote servers waste delivery slots. Current code retries 5 times with no backoff.
**Changes:**
- New `shared/events/domain_health.py` — Redis hash `domain:health:{domain}` tracking consecutive failures, exponential backoff schedule (30s → 1min → 5min → 15min → 1hr → 6hr → 24hr)
- Delivery worker checks domain health before attempting delivery; skips domains in backoff
- On success: reset. On failure: increment + extend backoff.
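The schedule above as a simple lookup (names are illustrative):

```python
# Backoff schedule from the plan, in seconds, capped at the last entry.
BACKOFF = [30, 60, 300, 900, 3600, 21600, 86400]

def backoff_seconds(consecutive_failures):
    """How long to skip a domain after N consecutive delivery failures.

    Zero failures means the domain is healthy (no backoff); beyond the table
    the 24h cap applies.
    """
    if consecutive_failures <= 0:
        return 0
    return BACKOFF[min(consecutive_failures, len(BACKOFF)) - 1]
```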
### T3.3: Shared Inbox Optimization
**Why:** 100 Mastodon followers from one instance = 100 POSTs to the same server. Mastodon supports `sharedInbox` — one POST covers all followers on that instance.
**Changes:**
- `shared/models/federation.py` `APFollower` — add `shared_inbox_url` column
- Migration to backfill from `ap_remote_actors.shared_inbox_url`
- `ap_delivery_handler.py` — group followers by domain, prefer shared inbox when available
- **Impact:** 100 followers on one instance → 1 HTTP POST instead of 100
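The grouping step can be sketched as follows; the input shape `(inbox_url, shared_inbox_url_or_None)` is an assumption about how `APFollower` rows would be projected:

```python
from urllib.parse import urlparse

def delivery_targets(followers):
    """Collapse followers into the minimal set of inbox URLs to POST to.

    Per domain: if any follower carries a shared inbox, one POST to it covers
    everyone on that instance; otherwise fall back to each individual inbox.
    """
    by_domain = {}
    for inbox, shared in followers:
        info = by_domain.setdefault(urlparse(inbox).netloc,
                                    {"shared": None, "inboxes": []})
        if shared:
            info["shared"] = shared
        info["inboxes"].append(inbox)

    targets = set()
    for info in by_domain.values():
        if info["shared"]:
            targets.add(info["shared"])
        else:
            targets.update(info["inboxes"])
    return targets
```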
### T3.4: Table Partitioning for `ap_activities`
**Why:** At millions of activities, the `ap_activities` table becomes the query bottleneck. The `EventProcessor` query orders by `created_at` — native range partitioning fits perfectly.
**Changes:**
- Alembic migration — convert `ap_activities` to `PARTITION BY RANGE (created_at)` with monthly partitions
- Add a cron job or startup hook to create future partitions
- No application code changes needed (transparent to SQLAlchemy)
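A sketch of the target DDL (columns abbreviated and assumed; the real migration renames the existing table, creates this partitioned parent, then attaches or copies the old rows):

```sql
CREATE TABLE ap_activities_new (
    id         BIGINT GENERATED ALWAYS AS IDENTITY,
    origin_app TEXT NOT NULL,
    payload    JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);

-- One partition per month; the cron job / startup hook creates future ones ahead of time.
CREATE TABLE ap_activities_2026_03 PARTITION OF ap_activities_new
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
```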
### T3.5: Read-Through DTO Cache
**Why:** Hot cross-app reads (`cart-summary`, `post-by-slug`) go through HTTP even with T1.5 caching. A Redis-backed DTO cache with event-driven invalidation eliminates HTTP for repeated reads entirely.
**Changes:**
- New `shared/infrastructure/dto_cache.py` — `get(app, query, params)` / `set(...)` backed by Redis
- Integrate into `fetch_data_cached()` as L1 cache (check before HTTP)
- Action endpoints invalidate relevant cache keys after successful writes
**Depends on:** T1.5 (data caching), T2.2 (event-driven invalidation via streams).
---
## Implementation Order
```
TIER 0 (hours):
T0.1 Auth Redis ──┐
T0.2 Redis memory ├── all independent, do in parallel
T0.3 DB split ────┘
T0.4 PgBouncer ────── after T0.3
T0.5 Workers ──────── after T0.4
TIER 1 (days):
T1.1 Concurrent AP ──┐
T1.3 Partition procs ├── independent, do in parallel
T1.4 Pool limits ────┤
T1.7 Parse fix ──────┘
T1.2 Circuit breaker ── after T1.4
T1.5 Data caching ───── after T1.2 (same pattern)
T1.6 Nav endpoint ───── after T1.5
T1.8 Read replicas ──── after T0.3 + T0.4
TIER 2 (~1 week):
T2.3 CDN ──────────── independent, anytime
T2.2 Redis Streams ─── foundation for T3.1
T2.1 Nginx SSI ────── after T1.2 (fragments stable)
T2.4 Replicas ──────── after T0.4 + T0.5
TIER 3 (weeks):
T3.1 Delivery service ── after T2.2 + T1.1
T3.2 Domain health ───── after T3.1
T3.3 Shared inbox ────── after T3.1
T3.4 Table partitioning ─ independent
T3.5 DTO cache ────────── after T1.5 + T2.2
```
## Verification
Each tier has a clear "it works" test:
- **Tier 0:** All 6 apps respond. Login works across apps. `pg_stat_activity` shows connections through PgBouncer. `redis-cli -p 6380 INFO memory` shows auth Redis separate.
- **Tier 1:** Emit a public AP activity with 50+ followers — delivery completes in seconds not minutes. Stop account service — blog pages still render with stale auth-menu. Only federation's processor handles delivery.
- **Tier 2:** Nginx access log shows SSI fragment cache HITs. `XINFO GROUPS coop:activities:pending` shows active consumer groups. CDN cache-status headers show HITs on static assets. Multiple replicas serve traffic.
- **Tier 3:** Delivery worker scales independently. Domain health Redis hash shows backoff state for unreachable servers. `EXPLAIN` on `ap_activities` shows partition pruning. Shared inbox delivery logs show 1 POST per domain.