mono/docs/scalability-plan.md
giles 094b6c55cd Fix AP blueprint cross-DB queries + harden Ghost sync init
AP blueprints (activitypub.py, ap_social.py) were querying federation
tables (ap_actor_profiles etc.) on g.s which points to the app's own DB
after the per-app split. Now uses g._ap_s backed by get_federation_session()
for non-federation apps.

Also hardens Ghost sync before_app_serving to catch/rollback on failure
instead of crashing the Hypercorn worker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 14:06:42 +00:00


Rose Ash Scalability Plan

Context

The coop runs 6 Quart microservices (blog, market, cart, events, federation, account) + Art-DAG (L1/L2 FastAPI) on a single-node Docker Swarm. Current architecture handles light traffic but has concrete bottlenecks: single Postgres, single Redis (256MB), no replicas, single Hypercorn worker per container, sequential AP delivery, no circuit breakers, and 6 competing EventProcessors. The decoupling work (contracts, HTTP data/actions, fragment composition) is complete — the code is structurally ready to scale, but the deployment and runtime aren't.

This plan covers everything from quick config wins to massive federation scale, organized in 4 tiers. Each tier unlocks roughly an order of magnitude.


TIER 0 — Deploy Existing Code + Quick Config

Target: low thousands concurrent. Effort: hours.

T0.1: Separate Auth Redis

Why: Auth keys (grant:*, did_auth:*, prompt:*) on DB 15 share a single Redis instance (256MB, allkeys-lru). Cache pressure from fragment/page caching can silently evict auth state, causing spurious logouts.

Files:

  • docker-compose.yml — add redis-auth service: redis:7-alpine, --maxmemory 64mb --maxmemory-policy noeviction
  • docker-compose.yml — update REDIS_AUTH_URL: redis://redis-auth:6379/0 in x-app-env

No code changes needed: shared/infrastructure/auth_redis.py already reads REDIS_AUTH_URL from the environment.

T0.2: Bump Redis Memory

File: docker-compose.yml (line 168)

  • Change --maxmemory 256mb to --maxmemory 1gb (or 512mb minimum)
  • Keep allkeys-lru for the data Redis (fragments + page cache)

T0.3: Deploy the Database Split

What exists: _config/init-databases.sql (creates 6 DBs) and _config/split-databases.sh (migrates table groups). Code in session.py lines 46-101 already creates separate engines when URLs differ. bus.py, user_loader.py, and factory.py all have conditional cross-DB paths.

Files:

  • docker-compose.yml — per-service DATABASE_URL overrides (blog→db_blog, market→db_market, etc.) plus DATABASE_URL_ACCOUNT→db_account and DATABASE_URL_FEDERATION→db_federation
  • _config/split-databases.sh — add menu_nodes + container_relations to ALL target DBs (small read-only tables needed by get_navigation_tree() in non-blog apps until T1.6 replaces this)

Deployment: run init-databases.sql, stop services, run split-databases.sh, update compose env, redeploy.

T0.4: Add PgBouncer

Why: 6 apps × (pool_size 5 + max_overflow 10) = 90 potential connections to one Postgres. After the DB split and multiple workers this multiplies further, and Postgres's default max_connections=100 will be hit.

Files:

  • docker-compose.yml — add pgbouncer service (transaction-mode pooling, default_pool_size=20, max_client_conn=300)
  • docker-compose.yml — change all DATABASE_URL values from @db:5432 to @pgbouncer:6432
  • shared/db/session.py (lines 13-20) — add pool_timeout=10, pool_recycle=1800. Consider reducing pool_size to 3 since PgBouncer handles pooling.

T0.5: Hypercorn Workers

Why: Single async event loop per container. CPU-bound work (Jinja2 rendering, RSA signing) blocks everything.

Files:

  • All 6 {app}/entrypoint.sh — change Hypercorn command to:
    exec hypercorn "${APP_MODULE:-app:app}" --bind 0.0.0.0:${PORT:-8000} --workers ${WORKERS:-2} --keep-alive 75
    

No code changes needed: EventProcessor uses SKIP LOCKED, so multiple workers competing for activities is safe. Each worker gets its own httpx client singleton and SQLAlchemy pool (correct fork behavior).

Depends on: T0.4 (PgBouncer) — doubling workers doubles connection count.


TIER 1 — Fix Hot Paths

Target: tens of thousands concurrent. Effort: days.

T1.1: Concurrent AP Delivery

The single biggest bottleneck. ap_delivery_handler.py delivers to inboxes sequentially (lines 230-246). With a 15 s per-inbox timeout, 100 followers can take up to 100 × 15 s = 25 minutes per activity.

File: shared/events/handlers/ap_delivery_handler.py

  • Add DELIVERY_CONCURRENCY = 10 semaphore
  • Replace sequential inbox loop with asyncio.gather() bounded by semaphore
  • Reuse a single httpx.AsyncClient per on_any_activity call with explicit pool limits: httpx.Limits(max_connections=20, max_keepalive_connections=10)
  • Batch APDeliveryLog inserts (one flush per activity, not per inbox)

Result: 100 followers drops from 25 min → ~2.5 min (10 concurrent) or less.

T1.2: Fragment Circuit Breaker + Stale-While-Revalidate

Why: Every page render makes 3-4 internal HTTP calls (fetch_fragments()). A slow/down service blocks all page renders for up to 2s per fragment. No graceful degradation.

File: shared/infrastructure/fragments.py

Circuit breaker (add near line 27):

  • Per-app _CircuitState tracking consecutive failures
  • Threshold: 3 failures → circuit opens for 30s
  • When open: skip HTTP, fall through to stale cache

Stale-while-revalidate (modify fetch_fragment_cached(), lines 246-288):

  • Store fragments in Redis as {"html": "...", "ts": 1234567890.0} instead of plain string
  • Soft TTL = normal TTL (30s). Hard TTL = soft + 300s (5 min stale window)
  • Within soft TTL: return cached. Between soft and hard: return cached + background revalidate. Past hard: block on fetch. On fetch failure: return stale if available, empty string if not.

T1.3: Partition Event Processors

Why: All 6 apps register the wildcard AP delivery handler via register_shared_handlers(). All 6 EventProcessor instances compete for every activity with SKIP LOCKED. 5 of 6 do wasted work on every public activity.

Files:

  • shared/events/handlers/__init__.py — add app_name param to register_shared_handlers(). Only import ap_delivery_handler and external_delivery_handler when app_name == "federation".
  • shared/infrastructure/factory.py (line 277) — pass name to register_shared_handlers(name)
  • shared/infrastructure/factory.py (line 41) — add event_processor_all_origins: bool = False param to create_base_app()
  • shared/infrastructure/factory.py (line 271) — EventProcessor(app_name=None if event_processor_all_origins else name)
  • federation/app.py — pass event_processor_all_origins=True

Result: Federation processes ALL activities (no origin filter) for delivery. Other apps process only their own origin activities for domain-specific handlers. No wasted lock contention.
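
The gating itself reduces to two pure decisions, sketched here with illustrative function names (the real change threads app_name through register_shared_handlers() and create_base_app()):

```python
from typing import List, Optional

def delivery_handlers_for(app_name: str) -> List[str]:
    """Only federation imports the wildcard delivery handlers (T1.3).

    Module names come from the plan; everything else registers only its
    own domain-specific handlers.
    """
    if app_name == "federation":
        return ["ap_delivery_handler", "external_delivery_handler"]
    return []

def processor_origin_filter(app_name: str, all_origins: bool = False) -> Optional[str]:
    """Origin filter handed to EventProcessor: None means 'process every
    activity' (federation); otherwise only this app's own activities."""
    return None if all_origins else app_name
```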

T1.4: httpx Client Pool Limits

Why: All three HTTP clients (fragments.py, data_client.py, actions.py) have no limits parameter. Default is 100 connections per client. With workers + replicas this fans out unbounded.

Files:

  • shared/infrastructure/fragments.py (lines 38-45) — add limits=httpx.Limits(max_connections=20, max_keepalive_connections=10)
  • shared/infrastructure/data_client.py (lines 36-43) — same
  • shared/infrastructure/actions.py (lines 37-44) — max_connections=10, max_keepalive_connections=5

T1.5: Data Client Caching

Why: fetch_data() has no caching. cart-summary is called on every page load across 4 apps. Blog post lookups by slug happen on every market/events page.

File: shared/infrastructure/data_client.py

  • Add fetch_data_cached() following the fetch_fragment_cached() pattern
  • Redis key: data:{app}:{query}:{sorted_params}, default TTL=10s
  • Same circuit breaker + SWR as T1.2

Callers updated: all app.py context functions that call fetch_data() for repeated reads.
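
The key scheme can be pinned down with a small helper (hypothetical name; the real logic would live inside fetch_data_cached(), and how param values are stringified is an assumption). Sorting the params makes the key independent of call-site argument order:

```python
def data_cache_key(app: str, query: str, params: dict = None) -> str:
    """Build the Redis key data:{app}:{query}:{sorted_params} (T1.5 sketch).

    Params are flattened as k=v pairs in sorted key order, so two callers
    passing the same params in different order hit the same cache entry.
    """
    params = params or {}
    flat = ",".join(f"{k}={params[k]}" for k in sorted(params))
    return f"data:{app}:{query}:{flat}"
```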

T1.6: Navigation via HTTP Data Endpoint

Why: After DB split, get_navigation_tree() queries menu_nodes via g.s. But menu_nodes lives in db_blog. The T0.3 workaround (replicate table) works short-term; this is the proper fix.

Files:

  • shared/contracts/dtos.py — add MenuNodeDTO
  • blog/bp/data/routes.py — add nav-tree handler returning [dto_to_dict(node) for node in nodes]
  • All non-blog app.py context functions — replace get_navigation_tree(g.s) with fetch_data_cached("blog", "nav-tree", ttl=60)
  • Remove menu_nodes from non-blog DBs in split-databases.sh (no longer needed)

Depends on: T0.3 (DB split), T1.5 (data caching).

T1.7: Fix Fragment Batch Parser O(n^2)

File: shared/infrastructure/fragments.py _parse_fragment_markers() (lines 217-243)

  • Replace nested loop with single-pass: find all <!-- fragment:{key} --> markers in one scan, extract content between consecutive markers
  • O(n) instead of O(n^2)
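
A single-pass version, assuming the <!-- fragment:{key} --> marker syntax above and that each fragment's HTML runs from its marker to the next marker (or to the end of the body) — a sketch, not the existing _parse_fragment_markers() signature:

```python
import re

_MARKER = re.compile(r"<!-- fragment:([\w.-]+) -->")

def parse_fragment_markers(body: str) -> dict:
    """Parse a batch fragment response in one scan: find every marker, then
    slice the content between consecutive markers. O(n) in the body size."""
    matches = list(_MARKER.finditer(body))
    fragments = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        fragments[m.group(1)] = body[m.end():end].strip()
    return fragments
```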

T1.8: Read Replicas

Why: After DB split, each domain DB has one writer. Read-heavy pages (listings, calendars, product pages) can saturate it.

Files:

  • docker-compose.yml — add read replicas for high-traffic domains (blog, events)
  • shared/db/session.py — add DATABASE_URL_RO env var, create read-only engine, add get_read_session() context manager
  • All bp/data/routes.py and bp/fragments/routes.py — use read session (these endpoints are inherently read-only)

Depends on: T0.3 (DB split), T0.4 (PgBouncer).


TIER 2 — Decouple the Runtime

Target: hundreds of thousands concurrent. Effort: ~1 week.

T2.1: Edge-Side Fragment Composition (Nginx SSI)

Why: Currently every Quart app fetches 3-4 fragments per request via HTTP (fetch_fragments() in context processors). This adds latency and creates liveness coupling. SSI moves fragment assembly to Nginx, which caches each fragment independently.

Changes:

  • shared/infrastructure/jinja_setup.py — change _fragment() Jinja global to emit SSI directives: <!--#include virtual="/_ssi/{app}/{type}?{params}" -->
  • New Nginx config — map /_ssi/{app}/{path} to http://{app}:8000/internal/fragments/{path}, enable ssi on, proxy_cache with short TTL
  • All fragment blueprint routes — add Cache-Control: public, max-age={ttl}, stale-while-revalidate=300 headers
  • Remove fetch_fragments() calls from all app.py context processors (templates emit SSI directly)
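
What the reworked _fragment() Jinja global would emit, sketched as a standalone helper — the query-string encoding and parameter sorting are assumptions, not the existing jinja_setup.py behavior:

```python
from urllib.parse import urlencode

def ssi_directive(app: str, fragment_type: str, params: dict = None) -> str:
    """Render the Nginx SSI include for a fragment (T2.1 sketch). Nginx
    resolves it server-side against /_ssi/{app}/{type} and caches each
    fragment independently."""
    qs = urlencode(sorted((params or {}).items()))
    virtual = f"/_ssi/{app}/{fragment_type}"
    if qs:
        virtual += f"?{qs}"
    return f'<!--#include virtual="{virtual}" -->'
```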

T2.2: Replace Outbox Polling with Redis Streams

Why: EventProcessor polls ap_activities every 2s with SELECT FOR UPDATE SKIP LOCKED. This creates persistent DB load even when idle. LISTEN/NOTIFY helps but is fragile.

Changes:

  • shared/events/bus.py emit_activity() — after writing to DB, also XADD coop:activities:pending with {activity_id, origin_app, type}
  • shared/events/processor.py — replace _poll_loop + _listen_for_notify with XREADGROUP (blocking read, no polling). Consumer groups handle partitioning. XPENDING + XCLAIM replace the stuck activity reaper.
  • Redis Stream config: MAXLEN ~10000 to cap memory
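
The reclaim half of the new reaper is a pure decision over XPENDING rows — the extended XPENDING form returns (entry id, consumer, idle ms, delivery count) per message — and can be sketched as follows (thresholds illustrative):

```python
from typing import List, Tuple

def entries_to_claim(pending, min_idle_ms: int = 60_000, max_deliveries: int = 5) -> Tuple[List[str], List[str]]:
    """Split XPENDING rows into entries to XCLAIM (stuck but retryable) and
    entries to dead-letter (redelivered too many times). Rows are
    (entry_id, consumer, idle_ms, delivery_count) tuples, as returned by
    the extended form of XPENDING."""
    claim, dead = [], []
    for entry_id, _consumer, idle_ms, deliveries in pending:
        if idle_ms < min_idle_ms:
            continue  # a live consumer is still working on it
        (dead if deliveries >= max_deliveries else claim).append(entry_id)
    return claim, dead
```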

T2.3: CDN for Static Assets

  • Route *.rose-ash.com/static/* through CDN (Cloudflare, BunnyCDN)
  • _asset_url() already adds ?v={hash} fingerprint — CDN can cache with max-age=31536000, immutable
  • No code changes, just DNS + CDN config

T2.4: Horizontal Scaling (Docker Swarm Replicas)

Changes:

  • docker-compose.yml — add replicas: 2 (or 3) to blog, market, events, cart. Keep federation at 1 (handles all AP delivery).
  • blog/entrypoint.sh — wrap Alembic in PostgreSQL advisory lock (SELECT pg_advisory_lock(42)) so only one replica runs migrations
  • docker-compose.yml — add health checks per service
  • shared/db/session.py — make pool sizes configurable via env vars (DB_POOL_SIZE, DB_MAX_OVERFLOW) so replicas can use smaller pools

Depends on: T0.4 (PgBouncer), T0.5 (workers).


TIER 3 — Federation Scale

Target: millions (federated network effects). Effort: weeks.

T3.1: Dedicated AP Delivery Service

Why: AP delivery is CPU-intensive (RSA signing) and I/O-intensive (HTTP to remote servers). Running inside federation's web worker blocks request processing.

Changes:

  • New delivery/ service — standalone asyncio app (no Quart/web server)
  • Reads from Redis Stream coop:delivery:pending (from T2.2)
  • Loads activity from federation DB, loads followers, delivers with semaphore (from T1.1)
  • ap_delivery_handler.py on_any_activity → enqueues to stream instead of delivering inline
  • docker-compose.yml — add delivery-worker service with replicas: 2, no port binding

Depends on: T2.2 (Redis Streams), T1.1 (concurrent delivery).

T3.2: Per-Domain Health Tracking + Backoff

Why: Dead remote servers waste delivery slots. Current code retries 5 times with no backoff.

Changes:

  • New shared/events/domain_health.py — Redis hash domain:health:{domain} tracking consecutive failures, exponential backoff schedule (30s → 1min → 5min → 15min → 1hr → 6hr → 24hr)
  • Delivery worker checks domain health before attempting delivery; skips domains in backoff
  • On success: reset. On failure: increment + extend backoff.
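
The schedule is a pure mapping from consecutive failures to a delay; a sketch with the intervals from the plan (function name illustrative):

```python
# Backoff schedule from the plan: 30s, 1min, 5min, 15min, 1hr, 6hr, 24hr
_BACKOFF = [30, 60, 300, 900, 3600, 21600, 86400]  # seconds

def backoff_seconds(consecutive_failures: int) -> int:
    """Delay before the next delivery attempt to a failing domain (T3.2).
    Zero failures means deliver immediately; past the end of the schedule
    the domain stays parked at 24h between probes."""
    if consecutive_failures <= 0:
        return 0
    return _BACKOFF[min(consecutive_failures, len(_BACKOFF)) - 1]
```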

T3.3: Shared Inbox Optimization

Why: 100 Mastodon followers from one instance = 100 POSTs to the same server. Mastodon supports sharedInbox — one POST covers all followers on that instance.

Changes:

  • shared/models/federation.py APFollower — add shared_inbox_url column
  • Migration to backfill from ap_remote_actors.shared_inbox_url
  • ap_delivery_handler.py — group followers by domain, prefer shared inbox when available
  • Impact: 100 followers on one instance → 1 HTTP POST instead of 100
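
The grouping step is pure, sketched here over plain follower dicts — the field names follow the proposed shared_inbox_url column, while the real code would read APFollower rows:

```python
from urllib.parse import urlparse

def delivery_targets(followers) -> set:
    """Collapse a follower list into the minimal set of inbox URLs (T3.3).

    One shared inbox replaces every per-actor inbox on that domain (a single
    POST there reaches all local recipients); actors on domains with no
    known shared inbox keep their personal inbox.
    """
    shared_by_domain = {}
    personal = []
    for f in followers:
        domain = urlparse(f["inbox_url"]).netloc
        if f.get("shared_inbox_url"):
            shared_by_domain[domain] = f["shared_inbox_url"]
        else:
            personal.append((domain, f["inbox_url"]))
    targets = set(shared_by_domain.values())
    # keep personal inboxes only where no shared inbox is known for the domain
    targets.update(url for domain, url in personal if domain not in shared_by_domain)
    return targets
```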

T3.4: Table Partitioning for ap_activities

Why: At millions of activities, the ap_activities table becomes the query bottleneck. The EventProcessor query orders by created_at — native range partitioning fits perfectly.

Changes:

  • Alembic migration — convert ap_activities to PARTITION BY RANGE (created_at) with monthly partitions
  • Add a cron job or startup hook to create future partitions
  • No application code changes needed (transparent to SQLAlchemy)

T3.5: Read-Through DTO Cache

Why: Hot cross-app reads (cart-summary, post-by-slug) go through HTTP even with T1.5 caching. A Redis-backed DTO cache with event-driven invalidation eliminates HTTP for repeated reads entirely.

Changes:

  • New shared/infrastructure/dto_cache.py exposing get(app, query, params) / set(...) backed by Redis
  • Integrate into fetch_data_cached() as L1 cache (check before HTTP)
  • Action endpoints invalidate relevant cache keys after successful writes

Depends on: T1.5 (data caching), T2.2 (event-driven invalidation via streams).


Implementation Order

TIER 0 (hours):
  T0.1 Auth Redis ──┐
  T0.2 Redis memory ├── all independent, do in parallel
  T0.3 DB split ────┘
  T0.4 PgBouncer ────── after T0.3
  T0.5 Workers ──────── after T0.4

TIER 1 (days):
  T1.1 Concurrent AP ───┐
  T1.3 Partition procs ─├── independent, do in parallel
  T1.4 Pool limits ─────┤
  T1.7 Parse fix ───────┘
  T1.2 Circuit breaker ── after T1.4
  T1.5 Data caching ───── after T1.2 (same pattern)
  T1.6 Nav endpoint ───── after T1.5
  T1.8 Read replicas ──── after T0.3 + T0.4

TIER 2 (~1 week):
  T2.3 CDN ──────────── independent, anytime
  T2.2 Redis Streams ─── foundation for T3.1
  T2.1 Nginx SSI ────── after T1.2 (fragments stable)
  T2.4 Replicas ──────── after T0.4 + T0.5

TIER 3 (weeks):
  T3.1 Delivery service ── after T2.2 + T1.1
  T3.2 Domain health ───── after T3.1
  T3.3 Shared inbox ────── after T3.1
  T3.4 Table partitioning ─ independent
  T3.5 DTO cache ────────── after T1.5 + T2.2

Verification

Each tier has a clear "it works" test:

  • Tier 0: All 6 apps respond. Login works across apps. pg_stat_activity shows connections through PgBouncer. redis-cli -p 6380 INFO memory shows auth Redis separate.
  • Tier 1: Emit a public AP activity with 50+ followers — delivery completes in seconds not minutes. Stop account service — blog pages still render with stale auth-menu. Only federation's processor handles delivery.
  • Tier 2: Nginx access log shows SSI fragment cache HITs. XINFO GROUPS coop:activities:pending shows active consumer groups. CDN cache-status headers show HITs on static assets. Multiple replicas serve traffic.
  • Tier 3: Delivery worker scales independently. Domain health Redis hash shows backoff state for unreachable servers. EXPLAIN on ap_activities shows partition pruning. Shared inbox delivery logs show 1 POST per domain.