mono/docs/scalability-plan.md
giles 094b6c55cd Fix AP blueprint cross-DB queries + harden Ghost sync init
AP blueprints (activitypub.py, ap_social.py) were querying federation
tables (ap_actor_profiles etc.) on g.s which points to the app's own DB
after the per-app split. Now uses g._ap_s backed by get_federation_session()
for non-federation apps.

Also hardens Ghost sync before_app_serving to catch/rollback on failure
instead of crashing the Hypercorn worker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 14:06:42 +00:00


Rose Ash Scalability Plan

Context

The coop runs 6 Quart microservices (blog, market, cart, events, federation, account) + Art-DAG (L1/L2 FastAPI) on a single-node Docker Swarm. Current architecture handles light traffic but has concrete bottlenecks: single Postgres, single Redis (256MB), no replicas, single Hypercorn worker per container, sequential AP delivery, no circuit breakers, and 6 competing EventProcessors. The decoupling work (contracts, HTTP data/actions, fragment composition) is complete — the code is structurally ready to scale, but the deployment and runtime aren't.

This plan covers everything from quick config wins to massive federation scale, organized in 4 tiers. Each tier unlocks roughly an order of magnitude.


TIER 0 — Deploy Existing Code + Quick Config

Target: low thousands concurrent. Effort: hours.

T0.1: Separate Auth Redis

Why: Auth keys (grant:*, did_auth:*, prompt:*) on DB 15 share a single Redis instance (256MB, allkeys-lru). Cache pressure from fragment/page caching can silently evict auth state, causing spurious logouts.

Files:

  • docker-compose.yml — add redis-auth service: redis:7-alpine, --maxmemory 64mb --maxmemory-policy noeviction
  • docker-compose.yml — update REDIS_AUTH_URL: redis://redis-auth:6379/0 in x-app-env

No code changes needed: shared/infrastructure/auth_redis.py already reads REDIS_AUTH_URL from the environment.

T0.2: Bump Redis Memory

File: docker-compose.yml (line 168)

  • Change --maxmemory 256mb to --maxmemory 1gb (or 512mb minimum)
  • Keep allkeys-lru for the data Redis (fragments + page cache)

T0.3: Deploy the Database Split

What exists: _config/init-databases.sql (creates 6 DBs) and _config/split-databases.sh (migrates table groups). Code in session.py lines 46-101 already creates separate engines when URLs differ. bus.py, user_loader.py, and factory.py all have conditional cross-DB paths.

Files:

  • docker-compose.yml — per-service DATABASE_URL overrides (blog→db_blog, market→db_market, etc.) plus DATABASE_URL_ACCOUNT→db_account and DATABASE_URL_FEDERATION→db_federation
  • _config/split-databases.sh — add menu_nodes + container_relations to ALL target DBs (small read-only tables needed by get_navigation_tree() in non-blog apps until T1.6 replaces this)

Deployment: run init-databases.sql, stop services, run split-databases.sh, update compose env, redeploy.

T0.4: Add PgBouncer

Why: 6 apps × (pool_size 5 + max_overflow 10) = 90 potential connections to one Postgres. After the DB split and multiple workers this multiplies further, and Postgres's default max_connections=100 will be hit.

Files:

  • docker-compose.yml — add pgbouncer service (transaction-mode pooling, default_pool_size=20, max_client_conn=300)
  • docker-compose.yml — change all DATABASE_URL values from @db:5432 to @pgbouncer:6432
  • shared/db/session.py (lines 13-20) — add pool_timeout=10, pool_recycle=1800. Consider reducing pool_size to 3 since PgBouncer handles pooling.

T0.5: Hypercorn Workers

Why: Single async event loop per container. CPU-bound work (Jinja2 rendering, RSA signing) blocks everything.

Files:

  • All 6 {app}/entrypoint.sh — change Hypercorn command to:
    exec hypercorn "${APP_MODULE:-app:app}" --bind 0.0.0.0:${PORT:-8000} --workers ${WORKERS:-2} --keep-alive 75
    

No code changes needed: EventProcessor uses SKIP LOCKED, so multiple workers competing for activities is safe. Each worker gets its own httpx client singleton and SQLAlchemy pool (correct fork behavior).

Depends on: T0.4 (PgBouncer) — doubling workers doubles connection count.


TIER 1 — Fix Hot Paths

Target: tens of thousands concurrent. Effort: days.

T1.1: Concurrent AP Delivery

The single biggest bottleneck. ap_delivery_handler.py delivers to inboxes sequentially (lines 230-246). With a 15 s per-inbox timeout, 100 followers can take up to 100 × 15 s = 25 minutes per activity.

File: shared/events/handlers/ap_delivery_handler.py

  • Add DELIVERY_CONCURRENCY = 10 semaphore
  • Replace sequential inbox loop with asyncio.gather() bounded by semaphore
  • Reuse a single httpx.AsyncClient per on_any_activity call with explicit pool limits: httpx.Limits(max_connections=20, max_keepalive_connections=10)
  • Batch APDeliveryLog inserts (one flush per activity, not per inbox)

Result: 100 followers drops from 25 min → ~2.5 min (10 concurrent) or less.

T1.2: Fragment Circuit Breaker + Stale-While-Revalidate

Why: Every page render makes 3-4 internal HTTP calls (fetch_fragments()). A slow/down service blocks all page renders for up to 2s per fragment. No graceful degradation.

File: shared/infrastructure/fragments.py

Circuit breaker (add near line 27):

  • Per-app _CircuitState tracking consecutive failures
  • Threshold: 3 failures → circuit opens for 30s
  • When open: skip HTTP, fall through to stale cache

Stale-while-revalidate (modify fetch_fragment_cached(), lines 246-288):

  • Store fragments in Redis as {"html": "...", "ts": 1234567890.0} instead of plain string
  • Soft TTL = normal TTL (30s). Hard TTL = soft + 300s (5 min stale window)
  • Within soft TTL: return cached. Between soft and hard: return cached + background revalidate. Past hard: block on fetch. On fetch failure: return stale if available, empty string if not.

T1.3: Partition Event Processors

Why: All 6 apps register the wildcard AP delivery handler via register_shared_handlers(). All 6 EventProcessor instances compete for every activity with SKIP LOCKED. 5 of 6 do wasted work on every public activity.

Files:

  • shared/events/handlers/__init__.py — add app_name param to register_shared_handlers(). Only import ap_delivery_handler and external_delivery_handler when app_name == "federation".
  • shared/infrastructure/factory.py (line 277) — pass name to register_shared_handlers(name)
  • shared/infrastructure/factory.py (line 41) — add event_processor_all_origins: bool = False param to create_base_app()
  • shared/infrastructure/factory.py (line 271) — EventProcessor(app_name=None if event_processor_all_origins else name)
  • federation/app.py — pass event_processor_all_origins=True

Result: Federation processes ALL activities (no origin filter) for delivery. Other apps process only their own origin activities for domain-specific handlers. No wasted lock contention.
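
The gating itself reduces to two pure decisions, sketched here with illustrative function names (the real change threads app_name through register_shared_handlers() and create_base_app()):

```python
from typing import List, Optional

def delivery_handlers_for(app_name: str) -> List[str]:
    """Only federation imports the wildcard delivery handlers (T1.3).

    Module names come from the plan; everything else registers only its
    own domain-specific handlers.
    """
    if app_name == "federation":
        return ["ap_delivery_handler", "external_delivery_handler"]
    return []

def processor_origin_filter(app_name: str, all_origins: bool = False) -> Optional[str]:
    """Origin filter handed to EventProcessor: None means 'process every
    activity' (federation); otherwise only this app's own activities."""
    return None if all_origins else app_name
```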

T1.4: httpx Client Pool Limits

Why: All three HTTP clients (fragments.py, data_client.py, actions.py) have no limits parameter. Default is 100 connections per client. With workers + replicas this fans out unbounded.

Files:

  • shared/infrastructure/fragments.py (lines 38-45) — add limits=httpx.Limits(max_connections=20, max_keepalive_connections=10)
  • shared/infrastructure/data_client.py (lines 36-43) — same
  • shared/infrastructure/actions.py (lines 37-44) — max_connections=10, max_keepalive_connections=5

T1.5: Data Client Caching

Why: fetch_data() has no caching. cart-summary is called on every page load across 4 apps. Blog post lookups by slug happen on every market/events page.

File: shared/infrastructure/data_client.py

  • Add fetch_data_cached() following the fetch_fragment_cached() pattern
  • Redis key: data:{app}:{query}:{sorted_params}, default TTL=10s
  • Same circuit breaker + SWR as T1.2

Callers updated: all app.py context functions that call fetch_data() for repeated reads.
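
The key scheme can be pinned down with a small helper (hypothetical name; the real logic would live inside fetch_data_cached(), and how param values are stringified is an assumption). Sorting the params makes the key independent of call-site argument order:

```python
def data_cache_key(app: str, query: str, params: dict = None) -> str:
    """Build the Redis key data:{app}:{query}:{sorted_params} (T1.5 sketch).

    Params are flattened as k=v pairs in sorted key order, so two callers
    passing the same params in different order hit the same cache entry.
    """
    params = params or {}
    flat = ",".join(f"{k}={params[k]}" for k in sorted(params))
    return f"data:{app}:{query}:{flat}"
```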

T1.6: Navigation via HTTP Data Endpoint

Why: After DB split, get_navigation_tree() queries menu_nodes via g.s. But menu_nodes lives in db_blog. The T0.3 workaround (replicate table) works short-term; this is the proper fix.

Files:

  • shared/contracts/dtos.py — add MenuNodeDTO
  • blog/bp/data/routes.py — add nav-tree handler returning [dto_to_dict(node) for node in nodes]
  • All non-blog app.py context functions — replace get_navigation_tree(g.s) with fetch_data_cached("blog", "nav-tree", ttl=60)
  • Remove menu_nodes from non-blog DBs in split-databases.sh (no longer needed)

Depends on: T0.3 (DB split), T1.5 (data caching).

T1.7: Fix Fragment Batch Parser O(n^2)

File: shared/infrastructure/fragments.py _parse_fragment_markers() (lines 217-243)

  • Replace nested loop with single-pass: find all <!-- fragment:{key} --> markers in one scan, extract content between consecutive markers
  • O(n) instead of O(n^2)
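
A single-pass version, assuming the <!-- fragment:{key} --> marker syntax above and that each fragment's HTML runs from its marker to the next marker (or to the end of the body) — a sketch, not the existing _parse_fragment_markers() signature:

```python
import re

_MARKER = re.compile(r"<!-- fragment:([\w.-]+) -->")

def parse_fragment_markers(body: str) -> dict:
    """Parse a batch fragment response in one scan: find every marker, then
    slice the content between consecutive markers. O(n) in the body size."""
    matches = list(_MARKER.finditer(body))
    fragments = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        fragments[m.group(1)] = body[m.end():end].strip()
    return fragments
```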

T1.8: Read Replicas

Why: After DB split, each domain DB has one writer. Read-heavy pages (listings, calendars, product pages) can saturate it.

Files:

  • docker-compose.yml — add read replicas for high-traffic domains (blog, events)
  • shared/db/session.py — add DATABASE_URL_RO env var, create read-only engine, add get_read_session() context manager
  • All bp/data/routes.py and bp/fragments/routes.py — use read session (these endpoints are inherently read-only)

Depends on: T0.3 (DB split), T0.4 (PgBouncer).


TIER 2 — Decouple the Runtime

Target: hundreds of thousands concurrent. Effort: ~1 week.

T2.1: Edge-Side Fragment Composition (Nginx SSI)

Why: Currently every Quart app fetches 3-4 fragments per request via HTTP (fetch_fragments() in context processors). This adds latency and creates liveness coupling. SSI moves fragment assembly to Nginx, which caches each fragment independently.

Changes:

  • shared/infrastructure/jinja_setup.py — change _fragment() Jinja global to emit SSI directives: <!--#include virtual="/_ssi/{app}/{type}?{params}" -->
  • New Nginx config — map /_ssi/{app}/{path} to http://{app}:8000/internal/fragments/{path}, enable ssi on, proxy_cache with short TTL
  • All fragment blueprint routes — add Cache-Control: public, max-age={ttl}, stale-while-revalidate=300 headers
  • Remove fetch_fragments() calls from all app.py context processors (templates emit SSI directly)
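
What the reworked _fragment() Jinja global would emit, sketched as a standalone helper — the query-string encoding and parameter sorting are assumptions, not the existing jinja_setup.py behavior:

```python
from urllib.parse import urlencode

def ssi_directive(app: str, fragment_type: str, params: dict = None) -> str:
    """Render the Nginx SSI include for a fragment (T2.1 sketch). Nginx
    resolves it server-side against /_ssi/{app}/{type} and caches each
    fragment independently."""
    qs = urlencode(sorted((params or {}).items()))
    virtual = f"/_ssi/{app}/{fragment_type}"
    if qs:
        virtual += f"?{qs}"
    return f'<!--#include virtual="{virtual}" -->'
```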

T2.2: Replace Outbox Polling with Redis Streams

Why: EventProcessor polls ap_activities every 2s with SELECT FOR UPDATE SKIP LOCKED. This creates persistent DB load even when idle. LISTEN/NOTIFY helps but is fragile.

Changes:

  • shared/events/bus.py emit_activity() — after writing to DB, also XADD coop:activities:pending with {activity_id, origin_app, type}
  • shared/events/processor.py — replace _poll_loop + _listen_for_notify with XREADGROUP (blocking read, no polling). Consumer groups handle partitioning. XPENDING + XCLAIM replace the stuck activity reaper.
  • Redis Stream config: MAXLEN ~10000 to cap memory
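
The reclaim half of the new reaper is a pure decision over XPENDING rows — the extended XPENDING form returns (entry id, consumer, idle ms, delivery count) per message — and can be sketched as follows (thresholds illustrative):

```python
from typing import List, Tuple

def entries_to_claim(pending, min_idle_ms: int = 60_000, max_deliveries: int = 5) -> Tuple[List[str], List[str]]:
    """Split XPENDING rows into entries to XCLAIM (stuck but retryable) and
    entries to dead-letter (redelivered too many times). Rows are
    (entry_id, consumer, idle_ms, delivery_count) tuples, as returned by
    the extended form of XPENDING."""
    claim, dead = [], []
    for entry_id, _consumer, idle_ms, deliveries in pending:
        if idle_ms < min_idle_ms:
            continue  # a live consumer is still working on it
        (dead if deliveries >= max_deliveries else claim).append(entry_id)
    return claim, dead
```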

T2.3: CDN for Static Assets

  • Route *.rose-ash.com/static/* through CDN (Cloudflare, BunnyCDN)
  • _asset_url() already adds ?v={hash} fingerprint — CDN can cache with max-age=31536000, immutable
  • No code changes, just DNS + CDN config

T2.4: Horizontal Scaling (Docker Swarm Replicas)

Changes:

  • docker-compose.yml — add replicas: 2 (or 3) to blog, market, events, cart. Keep federation at 1 (handles all AP delivery).
  • blog/entrypoint.sh — wrap Alembic in PostgreSQL advisory lock (SELECT pg_advisory_lock(42)) so only one replica runs migrations
  • docker-compose.yml — add health checks per service
  • shared/db/session.py — make pool sizes configurable via env vars (DB_POOL_SIZE, DB_MAX_OVERFLOW) so replicas can use smaller pools

Depends on: T0.4 (PgBouncer), T0.5 (workers).


TIER 3 — Federation Scale

Target: millions (federated network effects). Effort: weeks.

T3.1: Dedicated AP Delivery Service

Why: AP delivery is CPU-intensive (RSA signing) and I/O-intensive (HTTP to remote servers). Running inside federation's web worker blocks request processing.

Changes:

  • New delivery/ service — standalone asyncio app (no Quart/web server)
  • Reads from Redis Stream coop:delivery:pending (from T2.2)
  • Loads activity from federation DB, loads followers, delivers with semaphore (from T1.1)
  • ap_delivery_handler.py on_any_activity → enqueues to stream instead of delivering inline
  • docker-compose.yml — add delivery-worker service with replicas: 2, no port binding

Depends on: T2.2 (Redis Streams), T1.1 (concurrent delivery).

T3.2: Per-Domain Health Tracking + Backoff

Why: Dead remote servers waste delivery slots. Current code retries 5 times with no backoff.

Changes:

  • New shared/events/domain_health.py — Redis hash domain:health:{domain} tracking consecutive failures, exponential backoff schedule (30s → 1min → 5min → 15min → 1hr → 6hr → 24hr)
  • Delivery worker checks domain health before attempting delivery; skips domains in backoff
  • On success: reset. On failure: increment + extend backoff.
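
The schedule is a pure mapping from consecutive failures to a delay; a sketch with the intervals from the plan (function name illustrative):

```python
# Backoff schedule from the plan: 30s, 1min, 5min, 15min, 1hr, 6hr, 24hr
_BACKOFF = [30, 60, 300, 900, 3600, 21600, 86400]  # seconds

def backoff_seconds(consecutive_failures: int) -> int:
    """Delay before the next delivery attempt to a failing domain (T3.2).
    Zero failures means deliver immediately; past the end of the schedule
    the domain stays parked at 24h between probes."""
    if consecutive_failures <= 0:
        return 0
    return _BACKOFF[min(consecutive_failures, len(_BACKOFF)) - 1]
```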

T3.3: Shared Inbox Optimization

Why: 100 Mastodon followers from one instance = 100 POSTs to the same server. Mastodon supports sharedInbox — one POST covers all followers on that instance.

Changes:

  • shared/models/federation.py APFollower — add shared_inbox_url column
  • Migration to backfill from ap_remote_actors.shared_inbox_url
  • ap_delivery_handler.py — group followers by domain, prefer shared inbox when available
  • Impact: 100 followers on one instance → 1 HTTP POST instead of 100
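
The grouping step is pure, sketched here over plain follower dicts — the field names follow the proposed shared_inbox_url column, while the real code would read APFollower rows:

```python
from urllib.parse import urlparse

def delivery_targets(followers) -> set:
    """Collapse a follower list into the minimal set of inbox URLs (T3.3).

    One shared inbox replaces every per-actor inbox on that domain (a single
    POST there reaches all local recipients); actors on domains with no
    known shared inbox keep their personal inbox.
    """
    shared_by_domain = {}
    personal = []
    for f in followers:
        domain = urlparse(f["inbox_url"]).netloc
        if f.get("shared_inbox_url"):
            shared_by_domain[domain] = f["shared_inbox_url"]
        else:
            personal.append((domain, f["inbox_url"]))
    targets = set(shared_by_domain.values())
    # keep personal inboxes only where no shared inbox is known for the domain
    targets.update(url for domain, url in personal if domain not in shared_by_domain)
    return targets
```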

T3.4: Table Partitioning for ap_activities

Why: At millions of activities, the ap_activities table becomes the query bottleneck. The EventProcessor query orders by created_at — native range partitioning fits perfectly.

Changes:

  • Alembic migration — convert ap_activities to PARTITION BY RANGE (created_at) with monthly partitions
  • Add a cron job or startup hook to create future partitions
  • No application code changes needed (transparent to SQLAlchemy)

T3.5: Read-Through DTO Cache

Why: Hot cross-app reads (cart-summary, post-by-slug) go through HTTP even with T1.5 caching. A Redis-backed DTO cache with event-driven invalidation eliminates HTTP for repeated reads entirely.

Changes:

  • New shared/infrastructure/dto_cache.py exposing get(app, query, params) / set(...) backed by Redis
  • Integrate into fetch_data_cached() as L1 cache (check before HTTP)
  • Action endpoints invalidate relevant cache keys after successful writes

Depends on: T1.5 (data caching), T2.2 (event-driven invalidation via streams).


Implementation Order

TIER 0 (hours):
  T0.1 Auth Redis ──┐
  T0.2 Redis memory ├── all independent, do in parallel
  T0.3 DB split ────┘
  T0.4 PgBouncer ────── after T0.3
  T0.5 Workers ──────── after T0.4

TIER 1 (days):
  T1.1 Concurrent AP ───┐
  T1.3 Partition procs ─├── independent, do in parallel
  T1.4 Pool limits ─────┤
  T1.7 Parse fix ───────┘
  T1.2 Circuit breaker ── after T1.4
  T1.5 Data caching ───── after T1.2 (same pattern)
  T1.6 Nav endpoint ───── after T1.5
  T1.8 Read replicas ──── after T0.3 + T0.4

TIER 2 (~1 week):
  T2.3 CDN ──────────── independent, anytime
  T2.2 Redis Streams ─── foundation for T3.1
  T2.1 Nginx SSI ────── after T1.2 (fragments stable)
  T2.4 Replicas ──────── after T0.4 + T0.5

TIER 3 (weeks):
  T3.1 Delivery service ── after T2.2 + T1.1
  T3.2 Domain health ───── after T3.1
  T3.3 Shared inbox ────── after T3.1
  T3.4 Table partitioning ─ independent
  T3.5 DTO cache ────────── after T1.5 + T2.2

Verification

Each tier has a clear "it works" test:

  • Tier 0: All 6 apps respond. Login works across apps. pg_stat_activity shows connections through PgBouncer. redis-cli -p 6380 INFO memory shows auth Redis separate.
  • Tier 1: Emit a public AP activity with 50+ followers — delivery completes in seconds not minutes. Stop account service — blog pages still render with stale auth-menu. Only federation's processor handles delivery.
  • Tier 2: Nginx access log shows SSI fragment cache HITs. XINFO GROUPS coop:activities:pending shows active consumer groups. CDN cache-status headers show HITs on static assets. Multiple replicas serve traffic.
  • Tier 3: Delivery worker scales independently. Domain health Redis hash shows backoff state for unreachable servers. EXPLAIN on ap_activities shows partition pruning. Shared inbox delivery logs show 1 POST per domain.