# Rose Ash Scalability Plan

## Context

The coop runs 6 Quart microservices (blog, market, cart, events, federation, account) + Art-DAG (L1/L2 FastAPI) on a single-node Docker Swarm. The current architecture handles light traffic but has concrete bottlenecks: a single Postgres, a single Redis (256MB), no replicas, a single Hypercorn worker per container, sequential AP delivery, no circuit breakers, and 6 competing EventProcessors.

The decoupling work (contracts, HTTP data/actions, fragment composition) is complete — the code is structurally ready to scale, but the deployment and runtime aren't.

This plan covers everything from quick config wins to massive federation scale, organized in 4 tiers. Each tier unlocks roughly an order of magnitude.

---

## TIER 0 — Deploy Existing Code + Quick Config

**Target: low thousands concurrent. Effort: hours.**

### T0.1: Separate Auth Redis

**Why:** Auth keys (`grant:*`, `did_auth:*`, `prompt:*`) on DB 15 share a single Redis instance (256MB, `allkeys-lru`). Cache pressure from fragment/page caching can silently evict auth state, causing spurious logouts.

**Files:**
- `docker-compose.yml` — add `redis-auth` service: `redis:7-alpine`, `--maxmemory 64mb --maxmemory-policy noeviction`
- `docker-compose.yml` — update `REDIS_AUTH_URL: redis://redis-auth:6379/0` in `x-app-env`

**No code changes** — `shared/infrastructure/auth_redis.py` already reads `REDIS_AUTH_URL` from env.

### T0.2: Bump Redis Memory

**File:** `docker-compose.yml` (line 168)
- Change `--maxmemory 256mb` to `--maxmemory 1gb` (or 512mb minimum)
- Keep `allkeys-lru` for the data Redis (fragments + page cache)

### T0.3: Deploy the Database Split

**What exists:** `_config/init-databases.sql` (creates 6 DBs) and `_config/split-databases.sh` (migrates table groups). Code in `session.py` lines 46-101 already creates separate engines when URLs differ. `bus.py`, `user_loader.py`, and `factory.py` all have conditional cross-DB paths.
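The engine-per-URL behavior that `session.py` already implements can be sketched as follows. This is a simplified stand-in — a dict replaces the real SQLAlchemy engine, and `get_engine` is a hypothetical name — but it shows the key property: services whose URLs match share one engine, and role-specific URLs get their own.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=None)
def _engine_for(url: str) -> dict:
    # Stand-in for sqlalchemy's create_async_engine(url, ...).
    # lru_cache guarantees exactly one engine (and pool) per distinct URL.
    return {"url": url}


def get_engine(role: str = "DATABASE_URL") -> dict:
    """Resolve a role-specific URL (e.g. DATABASE_URL_ACCOUNT), falling
    back to the shared DATABASE_URL when the role isn't set."""
    url = os.environ.get(role) or os.environ["DATABASE_URL"]
    return _engine_for(url)
```

Before the split, every role resolves to the same URL and all services share one engine; after T0.3, the per-service compose overrides make the URLs differ and the engines separate automatically.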
**Files:**
- `docker-compose.yml` — per-service `DATABASE_URL` overrides (blog→`db_blog`, market→`db_market`, etc.) plus `DATABASE_URL_ACCOUNT`→`db_account` and `DATABASE_URL_FEDERATION`→`db_federation`
- `_config/split-databases.sh` — add `menu_nodes` + `container_relations` to ALL target DBs (small read-only tables needed by `get_navigation_tree()` in non-blog apps until T1.6 replaces this)

**Deployment:** run `init-databases.sql`, stop services, run `split-databases.sh`, update compose env, redeploy.

### T0.4: Add PgBouncer

**Why:** 6 apps x (pool_size 5 + overflow 10) = 90 connections to one Postgres. After the DB split and extra workers, this multiplies, and the Postgres default `max_connections=100` will be hit.

**Files:**
- `docker-compose.yml` — add `pgbouncer` service (transaction-mode pooling, `default_pool_size=20`, `max_client_conn=300`)
- `docker-compose.yml` — change all `DATABASE_URL` values from `@db:5432` to `@pgbouncer:6432`
- `shared/db/session.py` (lines 13-20) — add `pool_timeout=10`, `pool_recycle=1800`. Consider reducing `pool_size` to 3 since PgBouncer handles pooling.

### T0.5: Hypercorn Workers

**Why:** Single async event loop per container. CPU-bound work (Jinja2 rendering, RSA signing) blocks everything.

**Files:**
- All 6 `{app}/entrypoint.sh` — change the Hypercorn command to:

```
exec hypercorn "${APP_MODULE:-app:app}" \
  --bind 0.0.0.0:${PORT:-8000} \
  --workers ${WORKERS:-2} \
  --keep-alive 75
```

**No code changes needed** — `EventProcessor` uses `SKIP LOCKED`, so multiple workers competing is safe. Each worker gets its own httpx client singleton and SQLAlchemy pool (correct fork behavior).

**Depends on:** T0.4 (PgBouncer) — doubling workers doubles connection count.

---

## TIER 1 — Fix Hot Paths

**Target: tens of thousands concurrent. Effort: days.**

### T1.1: Concurrent AP Delivery

**The single biggest bottleneck.** `ap_delivery_handler.py` delivers to inboxes sequentially (lines 230-246). At up to 15s per inbox, 100 followers = up to 100 x 15s = 25 minutes.
**File:** `shared/events/handlers/ap_delivery_handler.py`
- Add a `DELIVERY_CONCURRENCY = 10` semaphore
- Replace the sequential inbox loop with `asyncio.gather()` bounded by the semaphore
- Reuse a single `httpx.AsyncClient` per `on_any_activity` call with explicit pool limits: `httpx.Limits(max_connections=20, max_keepalive_connections=10)`
- Batch `APDeliveryLog` inserts (one flush per activity, not per inbox)

**Result:** 100 followers drops from 25 min → ~2.5 min (10 concurrent) or less.

### T1.2: Fragment Circuit Breaker + Stale-While-Revalidate

**Why:** Every page render makes 3-4 internal HTTP calls (`fetch_fragments()`). A slow or down service blocks all page renders for up to 2s per fragment. No graceful degradation.

**File:** `shared/infrastructure/fragments.py`

**Circuit breaker** (add near line 27):
- Per-app `_CircuitState` tracking consecutive failures
- Threshold: 3 failures → circuit opens for 30s
- While open: skip HTTP, fall through to stale cache

**Stale-while-revalidate** (modify `fetch_fragment_cached()`, lines 246-288):
- Store fragments in Redis as `{"html": "...", "ts": 1234567890.0}` instead of a plain string
- Soft TTL = normal TTL (30s). Hard TTL = soft + 300s (5-minute stale window)
- Within soft TTL: return cached. Between soft and hard: return cached + revalidate in the background. Past hard: block on the fetch. On fetch failure: return stale if available, an empty string if not.

### T1.3: Partition Event Processors

**Why:** All 6 apps register the wildcard AP delivery handler via `register_shared_handlers()`. All 6 `EventProcessor` instances compete for every activity with `SKIP LOCKED`; 5 of the 6 do wasted work on every public activity.

**Files:**
- `shared/events/handlers/__init__.py` — add an `app_name` param to `register_shared_handlers()`. Only import `ap_delivery_handler` and `external_delivery_handler` when `app_name == "federation"`.
- `shared/infrastructure/factory.py` (line 277) — pass `name` to `register_shared_handlers(name)`
- `shared/infrastructure/factory.py` (line 41) — add an `event_processor_all_origins: bool = False` param to `create_base_app()`
- `shared/infrastructure/factory.py` (line 271) — `EventProcessor(app_name=None if event_processor_all_origins else name)`
- `federation/app.py` — pass `event_processor_all_origins=True`

**Result:** Federation processes ALL activities (no origin filter) for delivery. Other apps process only their own origin's activities for domain-specific handlers. No wasted lock contention.

### T1.4: httpx Client Pool Limits

**Why:** None of the three HTTP clients (`fragments.py`, `data_client.py`, `actions.py`) sets a `limits` parameter. The default is 100 connections per client. With workers + replicas this fans out unbounded.

**Files:**
- `shared/infrastructure/fragments.py` (lines 38-45) — add `limits=httpx.Limits(max_connections=20, max_keepalive_connections=10)`
- `shared/infrastructure/data_client.py` (lines 36-43) — same
- `shared/infrastructure/actions.py` (lines 37-44) — `max_connections=10, max_keepalive_connections=5`

### T1.5: Data Client Caching

**Why:** `fetch_data()` has no caching. `cart-summary` is called on every page load across 4 apps, and blog post lookups by slug happen on every market/events page.

**File:** `shared/infrastructure/data_client.py`
- Add `fetch_data_cached()` following the `fetch_fragment_cached()` pattern
- Redis key: `data:{app}:{query}:{sorted_params}`, default TTL=10s
- Same circuit breaker + SWR as T1.2

**Callers updated:** all `app.py` context functions that call `fetch_data()` for repeated reads.

### T1.6: Navigation via HTTP Data Endpoint

**Why:** After the DB split, `get_navigation_tree()` queries `menu_nodes` via `g.s`, but `menu_nodes` lives in `db_blog`. The T0.3 workaround (replicating the table) works short-term; this is the proper fix.
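The shape of the fix, as a minimal sketch: the blog-side `nav-tree` handler returns plain, JSON-serializable dicts, so non-blog apps never need a connection to `db_blog`. The field set on `MenuNodeDTO` here is an assumption — the real DTO goes in `shared/contracts/dtos.py` and may differ.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class MenuNodeDTO:
    # Assumed fields; the real contract may carry more (ordering, icon, ...)
    id: int
    parent_id: Optional[int]
    title: str
    url: str


def nav_tree_payload(nodes: List[MenuNodeDTO]) -> List[dict]:
    """What a blog 'nav-tree' data handler would return: plain dicts that
    survive JSON serialization across the HTTP data endpoint."""
    return [asdict(n) for n in nodes]
```

Consumers then call `fetch_data_cached("blog", "nav-tree", ttl=60)` (per the files list below) instead of querying `menu_nodes` directly.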
**Files:**
- `shared/contracts/dtos.py` — add `MenuNodeDTO`
- `blog/bp/data/routes.py` — add a `nav-tree` handler returning `[dto_to_dict(node) for node in nodes]`
- All non-blog `app.py` context functions — replace `get_navigation_tree(g.s)` with `fetch_data_cached("blog", "nav-tree", ttl=60)`
- Remove `menu_nodes` from non-blog DBs in `split-databases.sh` (no longer needed)

**Depends on:** T0.3 (DB split), T1.5 (data caching).

### T1.7: Fix Fragment Batch Parser O(n^2)

**File:** `shared/infrastructure/fragments.py` `_parse_fragment_markers()` (lines 217-243)
- Replace the nested loop with a single pass: find all fragment markers in one scan, then extract the content between consecutive markers
- O(n) instead of O(n^2)

### T1.8: Read Replicas

**Why:** After the DB split, each domain DB has one writer. Read-heavy pages (listings, calendars, product pages) can saturate it.

**Files:**
- `docker-compose.yml` — add read replicas for high-traffic domains (blog, events)
- `shared/db/session.py` — add a `DATABASE_URL_RO` env var, create a read-only engine, add a `get_read_session()` context manager
- All `bp/data/routes.py` and `bp/fragments/routes.py` — use the read session (these endpoints are inherently read-only)

**Depends on:** T0.3 (DB split), T0.4 (PgBouncer).

---

## TIER 2 — Decouple the Runtime

**Target: hundreds of thousands concurrent. Effort: ~1 week.**

### T2.1: Edge-Side Fragment Composition (Nginx SSI)

**Why:** Currently every Quart app fetches 3-4 fragments per request via HTTP (`fetch_fragments()` in context processors). This adds latency and creates liveness coupling. SSI moves fragment assembly to Nginx, which caches each fragment independently.
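A sketch of what the Nginx side could look like. The `fragments` cache zone (which would be declared with `proxy_cache_path` in the `http{}` block), the resolver needed for the variable upstream, and the exact location names are all assumptions, not settled config:

```nginx
# SSI subrequest target: one cacheable location per fragment URL.
location ~ ^/_ssi/(?<app>[a-z]+)/(?<rest>.+)$ {
    internal;                                  # only reachable via SSI subrequests
    proxy_pass http://$app:8000/internal/fragments/$rest;
    proxy_cache fragments;                     # zone assumed declared in http{}
    proxy_cache_valid 200 30s;                 # short TTL, matching fragment soft TTL
    proxy_cache_use_stale error timeout updating;
}

location / {
    ssi on;                                    # expand <!--#include --> directives
    proxy_pass http://blog:8000;               # the owning app renders the page shell
}
```

Because each fragment is its own cache entry, one slow fragment service degrades to a stale include rather than stalling whole-page renders.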
**Changes:**
- `shared/infrastructure/jinja_setup.py` — change the `_fragment()` Jinja global to emit SSI directives: `<!--#include virtual="/_ssi/{app}/{path}" -->`
- New Nginx config — map `/_ssi/{app}/{path}` to `http://{app}:8000/internal/fragments/{path}`, enable `ssi on`, proxy_cache with a short TTL
- All fragment blueprint routes — add `Cache-Control: public, max-age={ttl}, stale-while-revalidate=300` headers
- Remove `fetch_fragments()` calls from all `app.py` context processors (templates emit SSI directly)

### T2.2: Replace Outbox Polling with Redis Streams

**Why:** `EventProcessor` polls `ap_activities` every 2s with `SELECT FOR UPDATE SKIP LOCKED`. This creates persistent DB load even when idle. LISTEN/NOTIFY helps but is fragile.

**Changes:**
- `shared/events/bus.py` `emit_activity()` — after writing to the DB, also `XADD coop:activities:pending` with `{activity_id, origin_app, type}`
- `shared/events/processor.py` — replace `_poll_loop` + `_listen_for_notify` with `XREADGROUP` (blocking read, no polling). Consumer groups handle partitioning; `XPENDING` + `XCLAIM` replace the stuck-activity reaper.
- Redis Stream config: `MAXLEN ~10000` to cap memory

### T2.3: CDN for Static Assets

- Route `*.rose-ash.com/static/*` through a CDN (Cloudflare, BunnyCDN)
- `_asset_url()` already adds a `?v={hash}` fingerprint — the CDN can cache with `max-age=31536000, immutable`
- No code changes, just DNS + CDN config

### T2.4: Horizontal Scaling (Docker Swarm Replicas)

**Changes:**
- `docker-compose.yml` — add `replicas: 2` (or 3) to blog, market, events, cart. Keep federation at 1 (it handles all AP delivery).
- `blog/entrypoint.sh` — wrap Alembic in a PostgreSQL advisory lock (`SELECT pg_advisory_lock(42)`) so only one replica runs migrations
- `docker-compose.yml` — add health checks per service
- `shared/db/session.py` — make pool sizes configurable via env vars (`DB_POOL_SIZE`, `DB_MAX_OVERFLOW`) so replicas can use smaller pools

**Depends on:** T0.4 (PgBouncer), T0.5 (workers).
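The T2.4 migration guard might be sketched like this in `blog/entrypoint.sh`, ahead of the existing `exec hypercorn` line. `DIRECT_DATABASE_URL` is a hypothetical variable: the one real caveat is that session-level advisory locks do not survive PgBouncer's transaction-mode pooling, so this must connect straight to Postgres, not through `@pgbouncer:6432`.

```sh
#!/bin/sh
# Sketch only: serialize migrations across replicas with an advisory lock.
# The lock is held for the lifetime of this psql session, so replicas
# queue up and run `alembic upgrade head` one at a time.
psql "$DIRECT_DATABASE_URL" <<'SQL'
SELECT pg_advisory_lock(42);   -- blocks until no other replica holds it
\! alembic upgrade head
SELECT pg_advisory_unlock(42);
SQL
```

Replicas that arrive second block on `pg_advisory_lock(42)`, then run Alembic against an already-migrated schema, which is a no-op.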
---

## TIER 3 — Federation Scale

**Target: millions (federated network effects). Effort: weeks.**

### T3.1: Dedicated AP Delivery Service

**Why:** AP delivery is CPU-intensive (RSA signing) and I/O-intensive (HTTP to remote servers). Running it inside federation's web worker blocks request processing.

**Changes:**
- New `delivery/` service — a standalone asyncio app (no Quart/web server)
- Reads from the Redis Stream `coop:delivery:pending` (from T2.2)
- Loads the activity from the federation DB, loads followers, delivers with a semaphore (from T1.1)
- `ap_delivery_handler.py` `on_any_activity` → enqueues to the stream instead of delivering inline
- `docker-compose.yml` — add a `delivery-worker` service with `replicas: 2`, no port binding

**Depends on:** T2.2 (Redis Streams), T1.1 (concurrent delivery).

### T3.2: Per-Domain Health Tracking + Backoff

**Why:** Dead remote servers waste delivery slots. Current code retries 5 times with no backoff.

**Changes:**
- New `shared/events/domain_health.py` — Redis hash `domain:health:{domain}` tracking consecutive failures, exponential backoff schedule (30s → 1min → 5min → 15min → 1hr → 6hr → 24hr)
- The delivery worker checks domain health before attempting delivery and skips domains in backoff
- On success: reset. On failure: increment + extend the backoff.

### T3.3: Shared Inbox Optimization

**Why:** 100 Mastodon followers from one instance = 100 POSTs to the same server. Mastodon supports `sharedInbox` — one POST covers all followers on that instance.

**Changes:**
- `shared/models/federation.py` `APFollower` — add a `shared_inbox_url` column
- Migration to backfill from `ap_remote_actors.shared_inbox_url`
- `ap_delivery_handler.py` — group followers by domain, prefer the shared inbox when available
- **Impact:** 100 followers on one instance → 1 HTTP POST instead of 100

### T3.4: Table Partitioning for `ap_activities`

**Why:** At millions of activities, the `ap_activities` table becomes the query bottleneck.
The `EventProcessor` query orders by `created_at`, so native range partitioning fits perfectly.

**Changes:**
- Alembic migration — convert `ap_activities` to `PARTITION BY RANGE (created_at)` with monthly partitions
- Add a cron job or startup hook to create future partitions
- No application code changes needed (transparent to SQLAlchemy)

### T3.5: Read-Through DTO Cache

**Why:** Hot cross-app reads (`cart-summary`, `post-by-slug`) go through HTTP even with T1.5 caching. A Redis-backed DTO cache with event-driven invalidation eliminates HTTP for repeated reads entirely.

**Changes:**
- New `shared/infrastructure/dto_cache.py` — `get(app, query, params)` / `set(...)` backed by Redis
- Integrate into `fetch_data_cached()` as an L1 cache (check before HTTP)
- Action endpoints invalidate the relevant cache keys after successful writes

**Depends on:** T1.5 (data caching), T2.2 (event-driven invalidation via streams).

---

## Implementation Order

```
TIER 0 (hours):
  T0.1 Auth Redis ──┐
  T0.2 Redis memory ├── all independent, do in parallel
  T0.3 DB split ────┘
  T0.4 PgBouncer ────── after T0.3
  T0.5 Workers ──────── after T0.4

TIER 1 (days):
  T1.1 Concurrent AP ──┐
  T1.3 Partition procs ├── independent, do in parallel
  T1.4 Pool limits ────┤
  T1.7 Parse fix ──────┘
  T1.2 Circuit breaker ── after T1.4
  T1.5 Data caching ───── after T1.2 (same pattern)
  T1.6 Nav endpoint ───── after T1.5
  T1.8 Read replicas ──── after T0.3 + T0.4

TIER 2 (~1 week):
  T2.3 CDN ───────────── independent, anytime
  T2.2 Redis Streams ─── foundation for T3.1
  T2.1 Nginx SSI ─────── after T1.2 (fragments stable)
  T2.4 Replicas ──────── after T0.4 + T0.5

TIER 3 (weeks):
  T3.1 Delivery service ── after T2.2 + T1.1
  T3.2 Domain health ───── after T3.1
  T3.3 Shared inbox ────── after T3.1
  T3.4 Table partitioning ─ independent
  T3.5 DTO cache ────────── after T1.5 + T2.2
```

## Verification

Each tier has a clear "it works" test:

- **Tier 0:** All 6 apps respond. Login works across apps. `pg_stat_activity` shows connections through PgBouncer. `redis-cli -p 6380 INFO memory` shows the auth Redis separate.
- **Tier 1:** Emit a public AP activity with 50+ followers — delivery completes in seconds, not minutes. Stop the account service — blog pages still render with a stale auth-menu. Only federation's processor handles delivery.
- **Tier 2:** The Nginx access log shows SSI fragment cache HITs. `XINFO GROUPS coop:activities:pending` shows active consumer groups. CDN cache-status headers show HITs on static assets. Multiple replicas serve traffic.
- **Tier 3:** The delivery worker scales independently. The domain-health Redis hash shows backoff state for unreachable servers. `EXPLAIN` on `ap_activities` shows partition pruning. Shared-inbox delivery logs show 1 POST per domain.
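The backoff state asserted in the Tier 3 check follows the T3.2 schedule, which reduces to a lookup table. A sketch (`next_retry_delay` is a hypothetical name; the real logic would live in `shared/events/domain_health.py`):

```python
# T3.2 schedule: 30s -> 1min -> 5min -> 15min -> 1hr -> 6hr -> 24hr
BACKOFF_SECONDS = [30, 60, 300, 900, 3600, 21600, 86400]


def next_retry_delay(consecutive_failures: int) -> int:
    """Seconds to wait before the next delivery attempt to a domain.

    Zero failures means deliver immediately; the delay caps at the
    last step (24h) no matter how many failures accumulate.
    """
    if consecutive_failures <= 0:
        return 0
    return BACKOFF_SECONDS[min(consecutive_failures, len(BACKOFF_SECONDS)) - 1]
```

The delivery worker would store the failure count in the `domain:health:{domain}` Redis hash and skip any domain whose next-retry timestamp is still in the future.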