Fix AP blueprint cross-DB queries + harden Ghost sync init
All checks were successful
Build and Deploy / build-and-deploy (push) Successful in 1m10s
AP blueprints (activitypub.py, ap_social.py) were querying federation tables (ap_actor_profiles etc.) on g.s, which points to the app's own DB after the per-app split. They now use g._ap_s, backed by get_federation_session(), for non-federation apps. Also hardens the Ghost sync before_app_serving hook to catch and roll back on failure instead of crashing the Hypercorn worker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
docs/decoupling-plan.md (new file, 418 lines)

# Rose Ash Decoupling Plan

## Context

The four Rose Ash apps (blog, market, cart, events) are tightly coupled through:
- A shared model layer (`blog/shared_lib/models/`) containing ALL models for ALL apps
- Cross-app foreign keys (calendars→posts, cart_items→market_places, calendar_entries→orders, etc.)
- `Post` as the universal parent — calendars, markets, page_configs all hang off `post_id`
- Internal HTTP calls for menu items, cart summaries, and login adoption

This makes it impossible to attach services to anything other than a Post, and means apps can't have independent databases. The goal is to decouple so apps are independently deployable, new services can be added easily, and the composition of "what's attached to what" is defined in a separate glue layer.

---

## Phase 1: Extract `shared_lib` out of `blog/`

**What:** Move shared infrastructure into a top-level `shared/` package. Split models by ownership.

### New structure

```
/root/rose-ash/
  shared/                            # Extracted from blog/shared_lib/
    db/base.py, session.py           # Unchanged
    models/                          # ONLY shared models:
      user.py                        # User (used by all apps)
      kv.py                          # KV (settings)
      magic_link.py                  # MagicLink (auth)
      ghost_membership_entities.py   # Ghost labels/newsletters/tiers/subscriptions
      menu_item.py                   # MenuItem (temporary, moves to glue in Phase 4)
    infrastructure/                  # Renamed from shared/
      factory.py                     # create_base_app()
      internal_api.py                # HTTP client for inter-app calls
      context.py                     # base_context()
      user_loader.py, jinja_setup.py, cart_identity.py, cart_loader.py, urls.py, http_utils.py
    browser/                         # Renamed from suma_browser/
      (middleware, templates, csrf, errors, filters, redis, payments, authz)
    config.py, config/
    alembic/, static/, editor/

  blog/models/                       # Blog-owned models
    ghost_content.py                 # Post, Author, Tag, PostAuthor, PostTag, PostLike
    snippet.py                       # Snippet
    tag_group.py                     # TagGroup, TagGroupTag

  market/models/                     # Market-owned models
    market.py                        # Product, CartItem, NavTop, NavSub, Listing, etc.
    market_place.py                  # MarketPlace

  cart/models/                       # Cart-owned models
    order.py                         # Order, OrderItem
    page_config.py                   # PageConfig

  events/models/                     # Events-owned models
    calendars.py                     # Calendar, CalendarEntry, CalendarSlot, Ticket, TicketType
```

### Key changes
- Update `path_setup.py` in each app to add the project root to `sys.path`
- Update all `from models import X` → `from blog.models import X` / `from shared.models import X` etc.
- Update `from db.base import Base` → `from shared.db.base import Base` in every model file
- Update `from shared.factory import` → `from shared.infrastructure.factory import` in each `app.py`
- Alembic `env.py` imports from all locations so `Base.metadata` sees every table
- Add a transitional compat layer in the old location that re-exports everything (remove later)
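
A minimal sketch of what each app's `path_setup.py` might contain (the module name comes from the plan; its exact contents are an assumption). The only job is to put the project root on `sys.path` before any `shared.*` or `blog.models` import runs:

```python
# path_setup.py -- hypothetical sketch; import this before any shared/app imports.
import sys
from pathlib import Path

# Each app lives one level below the project root (e.g. /root/rose-ash/blog/),
# so the root is the grandparent of this file.
PROJECT_ROOT = Path(__file__).resolve().parent.parent

if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))
```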

### Critical files to modify
- `blog/app.py` (line 9: `from shared.factory`), `market/app.py`, `cart/app.py`, `events/app.py`
- `blog/shared_lib/shared/factory.py` → `shared/infrastructure/factory.py`
- Every model file (Base import)
- `blog/shared_lib/alembic/env.py` → `shared/alembic/env.py`
- Each app's `path_setup.py`

### Verify
- All four apps start without import errors
- `alembic check` produces no diff (schema unchanged)
- All routes return correct responses
- Internal API calls between apps still work

---

## Phase 2: Event Infrastructure + Logging

**What:** Add the durable event system (transactional outbox) and shared structured logging.

### 2a. DomainEvent model (the outbox)

New file: `shared/models/domain_event.py`

```python
class DomainEvent(Base):
    __tablename__ = "domain_events"

    id = Integer, primary_key
    event_type = String(128), indexed        # "calendar.created", "order.completed"
    aggregate_type = String(64)              # "calendar", "order"
    aggregate_id = Integer                   # ID of the thing that changed
    payload = JSONB                          # Event-specific data
    state = String(20), default "pending"    # pending → processing → completed | failed
    attempts = Integer, default 0
    max_attempts = Integer, default 5
    last_error = Text, nullable
    created_at = DateTime, server_default now()
    processed_at = DateTime, nullable
```

The critical property: `emit_event()` writes to this table **in the same DB transaction** as the domain change. If the app crashes after commit, the event is already persisted. If it crashes before commit, neither the domain change nor the event exists. This is atomic.
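
That atomicity can be demonstrated end to end with raw `sqlite3` standing in for the real SQLAlchemy/Postgres session (table shapes follow the plan; everything else here is illustrative):

```python
import json
import sqlite3

# Stand-in schema for the outbox demo (the real code uses SQLAlchemy/Postgres).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calendars (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE domain_events (id INTEGER PRIMARY KEY, event_type TEXT,"
    " aggregate_type TEXT, aggregate_id INTEGER, payload TEXT,"
    " state TEXT DEFAULT 'pending')"
)
conn.commit()

def emit_event(cur, event_type, aggregate_type, aggregate_id, payload):
    # Same cursor, same open transaction as the domain change -- no commit here.
    cur.execute(
        "INSERT INTO domain_events (event_type, aggregate_type, aggregate_id, payload)"
        " VALUES (?, ?, ?, ?)",
        (event_type, aggregate_type, aggregate_id, json.dumps(payload)),
    )

cur = conn.cursor()
# Domain change + event commit together...
cur.execute("INSERT INTO calendars (name) VALUES ('yoga')")
emit_event(cur, "calendar.created", "calendar", cur.lastrowid, {"name": "yoga"})
conn.commit()

# ...and vanish together if the transaction rolls back before commit.
cur.execute("INSERT INTO calendars (name) VALUES ('ghost')")
emit_event(cur, "calendar.created", "calendar", 2, {"name": "ghost"})
conn.rollback()

n_calendars = conn.execute("SELECT COUNT(*) FROM calendars").fetchone()[0]
n_events = conn.execute("SELECT COUNT(*) FROM domain_events").fetchone()[0]
print(n_calendars, n_events)  # 1 1
```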

### 2b. Event bus

New directory: `shared/events/`

```
shared/events/
  __init__.py    # exports emit_event, register_handler, EventProcessor
  bus.py         # emit_event(session, event_type, aggregate_type, aggregate_id, payload)
                 # register_handler(event_type, async_handler_fn)
  processor.py   # EventProcessor: polls domain_events table, dispatches to handlers
```

**`emit_event(session, ...)`** — called within service functions, writes to outbox in current transaction
**`register_handler(event_type, fn)`** — called at app startup (by glue layer) to register handlers
**`EventProcessor`** — background polling loop:
1. `SELECT ... FROM domain_events WHERE state='pending' FOR UPDATE SKIP LOCKED`
2. Run all registered handlers for that event_type
3. Mark completed or retry on failure
4. Runs as an `asyncio.create_task` within each app process (started in `factory.py`)
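
Steps 2–3 can be sketched with a plain in-process registry (handlers are async in the plan; shown synchronously here for brevity, and the `FOR UPDATE SKIP LOCKED` claim step is elided):

```python
# Hypothetical in-process handler registry (the real one lives in shared/events/).
HANDLERS = {}

def register_handler(event_type, fn):
    HANDLERS.setdefault(event_type, []).append(fn)

def process_one(event):
    """Run handlers for one claimed outbox row; mark completed or schedule retry."""
    try:
        for fn in HANDLERS.get(event["event_type"], []):
            fn(event["payload"])
        event["state"] = "completed"
    except Exception as exc:
        event["attempts"] += 1
        event["last_error"] = str(exc)
        # Back to pending until max_attempts is exhausted, then failed.
        event["state"] = ("failed" if event["attempts"] >= event["max_attempts"]
                          else "pending")
    return event

seen = []
register_handler("calendar.created", lambda payload: seen.append(payload["name"]))
evt = {"event_type": "calendar.created", "payload": {"name": "yoga"},
       "state": "processing", "attempts": 0, "max_attempts": 5}
process_one(evt)
print(evt["state"], seen)  # completed ['yoga']
```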

### 2c. Structured logging

New directory: `shared/logging/`

```
shared/logging/
  __init__.py
  setup.py   # configure_logging(app_name), get_logger(name)
```

- JSON-structured output to stdout (timestamp, level, app, message, plus optional fields: event_type, user_id, request_id, duration_ms)
- `configure_logging(app_name)` called in `create_base_app()`
- All apps get consistent log format; in production these go to a log aggregator
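
One possible shape for the formatter behind `configure_logging()` (field names come from the list above; the class name and wiring are assumptions):

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the shared field set."""

    OPTIONAL = ("event_type", "user_id", "request_id", "duration_ms")

    def __init__(self, app_name):
        super().__init__()
        self.app_name = app_name

    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "app": self.app_name,
            "message": record.getMessage(),
        }
        # Optional structured fields arrive via logging's `extra=` mechanism.
        for key in self.OPTIONAL:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

def configure_logging(app_name):
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(app_name))
    logger = logging.getLogger(app_name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    return logger

log = configure_logging("blog")
log.info("post published", extra={"event_type": "post.published", "user_id": 7})
```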

### 2d. Integration

Update `shared/infrastructure/factory.py`:
- Call `configure_logging(name)` at app creation
- Start `EventProcessor` as background task in `@app.before_serving`
- Stop it in `@app.after_serving`

### Verify
- `domain_events` table exists after migration
- Call `emit_event()` in a test, verify row appears in table
- `EventProcessor` picks up pending events and marks them completed
- JSON logs appear on stdout with correct structure
- No behavioral changes — this is purely additive infrastructure

---

## Phase 3: Generic Container Concept

**What:** Replace cross-app `post_id` FKs with `container_type` + `container_id` soft references.

### Models to change

**Calendar** (`events/models/calendars.py`):
```python
# REMOVE: post_id = Column(Integer, ForeignKey("posts.id"), ...)
# REMOVE: post = relationship("Post", ...)
# ADD:
container_type = Column(String(32), nullable=False)  # "page", "market", etc.
container_id = Column(Integer, nullable=False)
```

**MarketPlace** (`market/models/market_place.py`):
```python
# Same pattern: remove post_id FK, add container_type + container_id
```

**PageConfig** (`cart/models/page_config.py`):
```python
# Same pattern
```

**CalendarEntryPost** → rename to **CalendarEntryContent**:
```python
# REMOVE: post_id FK
# ADD: content_type + content_id (generic reference)
```

### From Post model (`blog/models/ghost_content.py`), remove:
- `calendars` relationship
- `markets` relationship
- `page_config` relationship
- `calendar_entries` relationship (via CalendarEntryPost)
- `menu_items` relationship (moves to glue in Phase 4)

### Helper in `shared/containers.py`:
```python
class ContainerType:
    PAGE = "page"
    # Future: MARKET = "market", GROUP = "group", etc.

def container_filter(model, container_type, container_id):
    """Return SQLAlchemy filter clauses."""
    return [model.container_type == container_type, model.container_id == container_id]
```

### Three-step migration (non-breaking)
1. **Add columns** (nullable) — keeps old post_id FK intact
2. **Backfill** — `UPDATE calendars SET container_type='page', container_id=post_id`; make NOT NULL
3. **Drop old FK** — remove post_id column and FK constraint

### Update all queries
Key files that reference `Calendar.post_id`, `MarketPlace.post_id`, `PageConfig.post_id`:
- `events/app.py` (~line 108)
- `market/app.py` (~line 119)
- `cart/app.py` (~line 131)
- `cart/bp/cart/services/checkout.py` (lines 77-85, 160-163) — `resolve_page_config()` and `create_order_from_cart()`
- `cart/bp/cart/services/page_cart.py`
- `cart/bp/cart/api.py`

All change from `X.post_id == post.id` to `X.container_type == "page", X.container_id == post.id`.
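
In SQL terms, the query change reduces to swapping one WHERE clause for the pair produced by `container_filter()` (illustrative schema and data, sketched with stdlib `sqlite3` rather than the project's SQLAlchemy models):

```python
import sqlite3

# Illustrative table matching the new soft-reference columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calendars (id INTEGER PRIMARY KEY,"
    " container_type TEXT NOT NULL, container_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO calendars (container_type, container_id) VALUES (?, ?)",
    [("page", 7), ("market", 7), ("page", 8)],
)

# Old: WHERE post_id = :post_id
# New: the pair of clauses from container_filter(Calendar, "page", post_id)
rows = conn.execute(
    "SELECT id FROM calendars WHERE container_type = ? AND container_id = ?",
    ("page", 7),
).fetchall()
print(rows)  # [(1,)]
```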

### Verify
- Creating a calendar/market/page_config uses container_type + container_id
- Cart checkout still resolves correct page config via container references
- No cross-app FKs remain for these three models
- Alembic migration is clean

---

## Phase 4: Glue Layer

**What:** New top-level `glue/` package that owns container relationships, navigation, and event handlers.

### Structure

```
/root/rose-ash/glue/
  __init__.py
  models/
    container_relation.py   # Parent-child container relationships
    menu_node.py            # Navigation tree (replaces MenuItem)
  services/
    navigation.py           # Build menu from relationship tree
    relationships.py        # attach_child(), get_children(), detach_child()
  handlers/
    calendar_handlers.py    # on calendar attached → rebuild nav
    market_handlers.py      # on market attached → rebuild nav
    order_handlers.py       # on order completed → confirm calendar entries
    login_handlers.py       # on login → adopt anonymous cart/calendar items
  setup.py                  # Registers all handlers at app startup
```

### ContainerRelation model
```python
class ContainerRelation(Base):
    __tablename__ = "container_relations"
    id, parent_type, parent_id, child_type, child_id, sort_order, label, created_at, deleted_at
    # Unique constraint: (parent_type, parent_id, child_type, child_id)
```

This is the central truth about "what's attached to what." A page has calendars and markets attached to it — defined here, not by FKs on the calendar/market tables.

### MenuNode model (replaces MenuItem)
```python
class MenuNode(Base):
    __tablename__ = "menu_nodes"
    id, container_type, container_id,
    parent_id (self-referential tree), sort_order, depth,
    label, slug, href, icon, feature_image,
    created_at, updated_at, deleted_at
```

This is a **cached navigation tree** built FROM ContainerRelations. A page doesn't know it has markets — but its MenuNode has child MenuNodes for the market because the glue layer put them there.

### Navigation service (`glue/services/navigation.py`)
- `get_navigation_tree(session)` → nested dict for templates (replaces `/internal/menu-items` API)
- `rebuild_navigation(session)` → reads ContainerRelations, creates/updates MenuNodes
- Called by event handlers when relationships change
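
One plausible shape for the nested dict that `get_navigation_tree()` returns: fold flat MenuNode rows into a children tree via `parent_id` (row fields here are a subset of the MenuNode columns; the function name `build_tree` is an assumption):

```python
def build_tree(rows):
    """Fold flat menu_nodes rows into a nested children tree for templates."""
    nodes = {row["id"]: {**row, "children": []} for row in rows}
    roots = []
    for node in sorted(nodes.values(), key=lambda n: n["sort_order"]):
        parent = nodes.get(node["parent_id"])
        (parent["children"] if parent is not None else roots).append(node)
    return roots

rows = [
    {"id": 1, "parent_id": None, "sort_order": 0, "label": "Shop", "href": "/shop"},
    {"id": 2, "parent_id": 1, "sort_order": 0, "label": "Tea", "href": "/shop/tea"},
    {"id": 3, "parent_id": None, "sort_order": 1, "label": "Events", "href": "/events"},
]
tree = build_tree(rows)
print([n["label"] for n in tree])                 # ['Shop', 'Events']
print([c["label"] for c in tree[0]["children"]])  # ['Tea']
```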

### Relationship service (`glue/services/relationships.py`)
- `attach_child(session, parent_type, parent_id, child_type, child_id)` → creates ContainerRelation + emits `container.child_attached` event
- `get_children(session, parent_type, parent_id, child_type=None)` → query children
- `detach_child(...)` → soft delete + emit `container.child_detached` event

### Event handlers (the "real code" in the glue layer)
```python
# glue/handlers/calendar_handlers.py
@handler("container.child_attached")
async def on_child_attached(payload, session):
    if payload["child_type"] in ("calendar", "market"):
        await rebuild_navigation(session)

# glue/handlers/order_handlers.py (Phase 5 but registered here)
@handler("order.created")
async def on_order_created(payload, session):
    # Confirm calendar entries for this order
    ...

# glue/handlers/login_handlers.py (Phase 5 but registered here)
@handler("user.logged_in")
async def on_user_logged_in(payload, session):
    # Adopt anonymous cart items and calendar entries
    ...
```

### Replace menu_items flow
**Old:** Each app calls `GET /internal/menu-items` → coop queries MenuItem → returns JSON
**New:** Each app calls `glue.services.navigation.get_navigation_tree(g.s)` → direct DB query of MenuNode

Update context functions in all four `app.py` files:
```python
# REMOVE: menu_data = await api_get("coop", "/internal/menu-items")
# ADD:    from glue.services.navigation import get_navigation_tree
#         ctx["menu_items"] = await get_navigation_tree(g.s)
```

### Data migration
- Backfill `menu_nodes` from existing `menu_items` + posts
- Backfill `container_relations` from existing calendar/market/page_config container references
- Deprecate (then remove) old MenuItem model and `/internal/menu-items` endpoint
- Update menu admin UI (`blog/bp/menu_items/`) to manage ContainerRelations + MenuNodes

### Verify
- Navigation renders correctly in all four apps without HTTP calls
- Adding a market to a page (via ContainerRelation) triggers nav rebuild and market appears in menu
- Adding a calendar to a page does the same
- Menu admin UI works with new models

---

## Phase 5: Event-Driven Cross-App Workflows

**What:** Replace remaining cross-app FKs and HTTP calls with event-driven flows.

### 5a. Replace `CalendarEntry.order_id` FK with soft reference

```python
# REMOVE: order_id = Column(Integer, ForeignKey("orders.id"), ...)
# ADD: order_ref_id = Column(Integer, nullable=True, index=True)
# (No FK constraint — just stores the order ID as an integer)
```

Same for `Ticket.order_id`. Three-step migration (add, backfill, drop FK).

### 5b. Replace `CartItem.market_place_id` FK with soft reference
```python
# REMOVE: market_place_id = ForeignKey("market_places.id")
# ADD: market_ref_id = Column(Integer, nullable=True, index=True)
```

### 5c. Event-driven order completion
Currently `cart/bp/cart/services/checkout.py` line 166 directly writes `CalendarEntry` rows (cross-domain). Replace:

```python
# In create_order_from_cart(), instead of direct UPDATE on CalendarEntry:
await emit_event(session, "order.created", "order", order.id, {
    "order_id": order.id,
    "user_id": user_id,
    "calendar_entry_ids": [...],
})
```

The glue handler picks it up and updates calendar entries via events-domain code.

### 5d. Event-driven login adoption
Currently `blog/bp/auth/routes.py` line 265 calls `POST /internal/cart/adopt`. Replace:

```python
# In magic() route, instead of api_post("cart", "/internal/cart/adopt"):
await emit_event(session, "user.logged_in", "user", user_id, {
    "user_id": user_id,
    "anonymous_session_id": anon_session_id,
})
```

The glue handler adopts cart items and calendar entries.

### 5e. Remove cross-domain ORM relationships
From models, remove:
- `Order.calendar_entries` (relationship to CalendarEntry)
- `CalendarEntry.order` (relationship to Order)
- `Ticket.order` (relationship to Order)
- `CartItem.market_place` (relationship to MarketPlace)

### 5f. Move cross-domain queries to glue services
`cart/bp/cart/services/checkout.py` currently imports `CalendarEntry`, `Calendar`, `MarketPlace` directly. Move these queries to glue service functions that bridge the domains:
- `glue/services/cart_calendar.py` — query calendar entries for a cart identity
- `glue/services/page_resolution.py` — determine which page/container a cart belongs to using ContainerRelation

### Final FK audit after Phase 5
All remaining FKs are either:
- **Within the same app domain** (Order→OrderItem, Calendar→CalendarSlot, etc.)
- **To shared models** (anything→User)
- **One pragmatic exception:** `OrderItem.product_id → products.id` (cross cart→market, but OrderItem already snapshots title/price, so this FK is just for reporting)

### Verify
- Login triggers `user.logged_in` event → cart/calendar adoption happens via glue handler
- Order creation triggers `order.created` event → calendar entries confirmed via glue handler
- No cross-app FKs remain (except the pragmatic OrderItem→Product)
- All apps could theoretically point at separate databases
- Event processor reliably processes and retries all events

---

## Execution Order

Each phase leaves the system fully working. No big-bang migration.

| Phase | Risk | Size | Depends On |
|-------|------|------|------------|
| 1. Extract shared_lib | Low (mechanical refactor) | Medium | Nothing |
| 2. Event infra + logging | Low (purely additive) | Small | Phase 1 |
| 3. Generic containers | Medium (schema + query changes) | Medium | Phase 1 |
| 4. Glue layer | Medium (new subsystem, menu migration) | Large | Phases 2 + 3 |
| 5. Event-driven workflows | Medium (behavioral change in checkout/login) | Medium | Phase 4 |

Phases 2 and 3 can run in parallel after Phase 1. Phase 4 needs both. Phase 5 needs Phase 4.
docs/scalability-plan.md (new file, 302 lines)

# Rose Ash Scalability Plan

## Context

The coop runs 6 Quart microservices (blog, market, cart, events, federation, account) + Art-DAG (L1/L2 FastAPI) on a single-node Docker Swarm. Current architecture handles light traffic but has concrete bottlenecks: single Postgres, single Redis (256MB), no replicas, single Hypercorn worker per container, sequential AP delivery, no circuit breakers, and 6 competing EventProcessors. The decoupling work (contracts, HTTP data/actions, fragment composition) is complete — the code is structurally ready to scale, but the deployment and runtime aren't.

This plan covers everything from quick config wins to massive federation scale, organized in 4 tiers. Each tier unlocks roughly an order of magnitude.

---

## TIER 0 — Deploy Existing Code + Quick Config

**Target: low thousands concurrent. Effort: hours.**

### T0.1: Separate Auth Redis

**Why:** Auth keys (`grant:*`, `did_auth:*`, `prompt:*`) on DB 15 share a single Redis instance (256MB, `allkeys-lru`). Cache pressure from fragment/page caching can silently evict auth state, causing spurious logouts.

**Files:**
- `docker-compose.yml` — add `redis-auth` service: `redis:7-alpine`, `--maxmemory 64mb --maxmemory-policy noeviction`
- `docker-compose.yml` — update `REDIS_AUTH_URL: redis://redis-auth:6379/0` in `x-app-env`

**No code changes** — `shared/infrastructure/auth_redis.py` already reads `REDIS_AUTH_URL` from env.

### T0.2: Bump Redis Memory

**File:** `docker-compose.yml` (line 168)
- Change `--maxmemory 256mb` to `--maxmemory 1gb` (or 512mb minimum)
- Keep `allkeys-lru` for the data Redis (fragments + page cache)

### T0.3: Deploy the Database Split

**What exists:** `_config/init-databases.sql` (creates 6 DBs) and `_config/split-databases.sh` (migrates table groups). Code in `session.py` lines 46-101 already creates separate engines when URLs differ. `bus.py`, `user_loader.py`, and `factory.py` all have conditional cross-DB paths.

**Files:**
- `docker-compose.yml` — per-service `DATABASE_URL` overrides (blog→`db_blog`, market→`db_market`, etc.) plus `DATABASE_URL_ACCOUNT`→`db_account`, `DATABASE_URL_FEDERATION`→`db_federation`
- `_config/split-databases.sh` — add `menu_nodes` + `container_relations` to ALL target DBs (small read-only tables needed by `get_navigation_tree()` in non-blog apps until T1.6 replaces this)

**Deployment:** run `init-databases.sql`, stop services, run `split-databases.sh`, update compose env, redeploy.

### T0.4: Add PgBouncer

**Why:** 6 apps × (pool_size 5 + overflow 10) = 90 connections to one Postgres. After the DB split + workers, this multiplies. Postgres default `max_connections=100` will be hit.
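
The arithmetic, made explicit (pool defaults from `shared/db/session.py` as stated above; the worker and replica counts are the values suggested later in T0.5 and T2.4):

```python
# Per-engine connection budget.
pool_size, max_overflow = 5, 10
apps = 6

print(apps * (pool_size + max_overflow))   # 90 -- already close to max_connections=100

# After T0.5 (2 Hypercorn workers) and T2.4 (2 replicas), each with its own pool:
workers, replicas = 2, 2
print(apps * workers * replicas * (pool_size + max_overflow))  # 360 without PgBouncer
```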

**Files:**
- `docker-compose.yml` — add `pgbouncer` service (transaction-mode pooling, `default_pool_size=20`, `max_client_conn=300`)
- `docker-compose.yml` — change all `DATABASE_URL` values from `@db:5432` to `@pgbouncer:6432`
- `shared/db/session.py` (lines 13-20) — add `pool_timeout=10`, `pool_recycle=1800`. Consider reducing `pool_size` to 3 since PgBouncer handles pooling.

### T0.5: Hypercorn Workers

**Why:** Single async event loop per container. CPU-bound work (Jinja2 rendering, RSA signing) blocks everything.

**Files:**
- All 6 `{app}/entrypoint.sh` — change Hypercorn command to:
  ```
  exec hypercorn "${APP_MODULE:-app:app}" --bind 0.0.0.0:${PORT:-8000} --workers ${WORKERS:-2} --keep-alive 75
  ```

**No code changes needed** — `EventProcessor` uses `SKIP LOCKED` so multiple workers competing is safe. Each worker gets its own httpx client singleton and SQLAlchemy pool (correct fork behavior).

**Depends on:** T0.4 (PgBouncer) — doubling workers doubles connection count.

---

## TIER 1 — Fix Hot Paths

**Target: tens of thousands concurrent. Effort: days.**

### T1.1: Concurrent AP Delivery

**The single biggest bottleneck.** `ap_delivery_handler.py` delivers to inboxes sequentially (lines 230-246). 100 followers = up to 100 × 15s = 25 minutes.

**File:** `shared/events/handlers/ap_delivery_handler.py`
- Add `DELIVERY_CONCURRENCY = 10` semaphore
- Replace sequential inbox loop with `asyncio.gather()` bounded by semaphore
- Reuse a single `httpx.AsyncClient` per `on_any_activity` call with explicit pool limits: `httpx.Limits(max_connections=20, max_keepalive_connections=10)`
- Batch `APDeliveryLog` inserts (one flush per activity, not per inbox)

**Result:** 100 followers drops from 25 min → ~2.5 min (10 concurrent) or less.
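
The bounded fan-out can be sketched as a semaphore around `asyncio.gather()` (the delivery function here is a stub; the real one performs a signed httpx POST):

```python
import asyncio

DELIVERY_CONCURRENCY = 10  # from the plan

async def deliver_all(inboxes, deliver_one):
    """Deliver to every inbox with at most DELIVERY_CONCURRENCY in flight."""
    sem = asyncio.Semaphore(DELIVERY_CONCURRENCY)

    async def bounded(inbox):
        async with sem:
            return await deliver_one(inbox)

    # return_exceptions=True: one dead inbox must not sink the whole batch.
    return await asyncio.gather(*(bounded(i) for i in inboxes),
                                return_exceptions=True)

async def fake_deliver(inbox):
    await asyncio.sleep(0.01)  # stands in for the signed HTTP POST
    return inbox

inboxes = [f"https://remote{i}.example/inbox" for i in range(25)]
results = asyncio.run(deliver_all(inboxes, fake_deliver))
print(len(results))  # 25
```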

### T1.2: Fragment Circuit Breaker + Stale-While-Revalidate

**Why:** Every page render makes 3-4 internal HTTP calls (`fetch_fragments()`). A slow/down service blocks all page renders for up to 2s per fragment. No graceful degradation.

**File:** `shared/infrastructure/fragments.py`

**Circuit breaker** (add near line 27):
- Per-app `_CircuitState` tracking consecutive failures
- Threshold: 3 failures → circuit opens for 30s
- When open: skip HTTP, fall through to stale cache

**Stale-while-revalidate** (modify `fetch_fragment_cached()`, lines 246-288):
- Store fragments in Redis as `{"html": "...", "ts": 1234567890.0}` instead of a plain string
- Soft TTL = normal TTL (30s). Hard TTL = soft + 300s (5 min stale window)
- Within soft TTL: return cached. Between soft and hard: return cached + background revalidate. Past hard: block on fetch. On fetch failure: return stale if available, empty string if not.
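
The TTL decision reduces to a tiny classifier over the stored entry (TTL values from the bullets above; the function name is an assumption):

```python
import time

SOFT_TTL = 30.0       # normal fragment TTL
STALE_WINDOW = 300.0  # serve-stale allowance past the soft TTL

def classify(entry, now=None):
    """Decide what to do with a cached fragment entry {"html": ..., "ts": ...}."""
    now = time.time() if now is None else now
    age = now - entry["ts"]
    if age <= SOFT_TTL:
        return "fresh"        # return cached HTML
    if age <= SOFT_TTL + STALE_WINDOW:
        return "revalidate"   # return cached HTML, refresh in the background
    return "expired"          # block on a fetch (stale copy only as a fallback)

now = 1_000_000.0
print(classify({"html": "<nav/>", "ts": now - 10}, now))   # fresh
print(classify({"html": "<nav/>", "ts": now - 60}, now))   # revalidate
print(classify({"html": "<nav/>", "ts": now - 900}, now))  # expired
```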

### T1.3: Partition Event Processors

**Why:** All 6 apps register the wildcard AP delivery handler via `register_shared_handlers()`. All 6 `EventProcessor` instances compete for every activity with `SKIP LOCKED`. 5 of 6 do wasted work on every public activity.

**Files:**
- `shared/events/handlers/__init__.py` — add `app_name` param to `register_shared_handlers()`. Only import `ap_delivery_handler` and `external_delivery_handler` when `app_name == "federation"`.
- `shared/infrastructure/factory.py` (line 277) — pass `name` to `register_shared_handlers(name)`
- `shared/infrastructure/factory.py` (line 41) — add `event_processor_all_origins: bool = False` param to `create_base_app()`
- `shared/infrastructure/factory.py` (line 271) — `EventProcessor(app_name=None if event_processor_all_origins else name)`
- `federation/app.py` — pass `event_processor_all_origins=True`

**Result:** Federation processes ALL activities (no origin filter) for delivery. Other apps process only their own origin activities for domain-specific handlers. No wasted lock contention.

### T1.4: httpx Client Pool Limits

**Why:** All three HTTP clients (`fragments.py`, `data_client.py`, `actions.py`) have no `limits` parameter. Default is 100 connections per client. With workers + replicas this fans out unbounded.

**Files:**
- `shared/infrastructure/fragments.py` (lines 38-45) — add `limits=httpx.Limits(max_connections=20, max_keepalive_connections=10)`
- `shared/infrastructure/data_client.py` (lines 36-43) — same
- `shared/infrastructure/actions.py` (lines 37-44) — `max_connections=10, max_keepalive_connections=5`

### T1.5: Data Client Caching

**Why:** `fetch_data()` has no caching. `cart-summary` is called on every page load across 4 apps. Blog post lookups by slug happen on every market/events page.

**File:** `shared/infrastructure/data_client.py`
- Add `fetch_data_cached()` following the `fetch_fragment_cached()` pattern
- Redis key: `data:{app}:{query}:{sorted_params}`, default TTL=10s
- Same circuit breaker + SWR as T1.2

**Callers updated:** all `app.py` context functions that call `fetch_data()` for repeated reads.

### T1.6: Navigation via HTTP Data Endpoint

**Why:** After the DB split, `get_navigation_tree()` queries `menu_nodes` via `g.s`. But `menu_nodes` lives in `db_blog`. The T0.3 workaround (replicate the table) works short-term; this is the proper fix.

**Files:**
- `shared/contracts/dtos.py` — add `MenuNodeDTO`
- `blog/bp/data/routes.py` — add `nav-tree` handler returning `[dto_to_dict(node) for node in nodes]`
- All non-blog `app.py` context functions — replace `get_navigation_tree(g.s)` with `fetch_data_cached("blog", "nav-tree", ttl=60)`
- Remove `menu_nodes` from non-blog DBs in `split-databases.sh` (no longer needed)

**Depends on:** T0.3 (DB split), T1.5 (data caching).

### T1.7: Fix Fragment Batch Parser O(n^2)

**File:** `shared/infrastructure/fragments.py` `_parse_fragment_markers()` (lines 217-243)
- Replace nested loop with single-pass: find all `<!-- fragment:{key} -->` markers in one scan, extract content between consecutive markers
- O(n) instead of O(n^2)
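
A sketch of the single-pass version (the marker format is the one named in the bullet above; the function and variable names are assumptions):

```python
import re

MARKER = re.compile(r"<!-- fragment:([\w.-]+) -->")

def parse_fragment_markers(body):
    """Map fragment key -> HTML between its marker and the next one. O(n)."""
    matches = list(MARKER.finditer(body))  # one scan of the batch response
    out = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        out[match.group(1)] = body[match.end():end].strip()
    return out

body = ("<!-- fragment:nav -->\n<nav>...</nav>\n"
        "<!-- fragment:cart-summary -->\n<div>2 items</div>\n")
print(parse_fragment_markers(body))
```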
|
||||
|
||||
### T1.8: Read Replicas
|
||||
|
||||
**Why:** After DB split, each domain DB has one writer. Read-heavy pages (listings, calendars, product pages) can saturate it.
|
||||
|
||||
**Files:**
|
||||
- `docker-compose.yml` — add read replicas for high-traffic domains (blog, events)
|
||||
- `shared/db/session.py` — add `DATABASE_URL_RO` env var, create read-only engine, add `get_read_session()` context manager
|
||||
- All `bp/data/routes.py` and `bp/fragments/routes.py` — use read session (these endpoints are inherently read-only)
|
||||
|
||||
**Depends on:** T0.3 (DB split), T0.4 (PgBouncer).
|
||||
|
||||
---
|
||||
|
||||
## TIER 2 — Decouple the Runtime
|
||||
|
||||
**Target: hundreds of thousands concurrent. Effort: ~1 week.**
|
||||
|
||||
### T2.1: Edge-Side Fragment Composition (Nginx SSI)
|
||||
|
||||
**Why:** Currently every Quart app fetches 3-4 fragments per request via HTTP (`fetch_fragments()` in context processors). This adds latency and creates liveness coupling. SSI moves fragment assembly to Nginx, which caches each fragment independently.
|
||||
|
||||
**Changes:**
|
||||
- `shared/infrastructure/jinja_setup.py` — change `_fragment()` Jinja global to emit SSI directives: `<!--#include virtual="/_ssi/{app}/{type}?{params}" -->`
|
||||
- New Nginx config — map `/_ssi/{app}/{path}` to `http://{app}:8000/internal/fragments/{path}`, enable `ssi on`, proxy_cache with short TTL
|
||||
- All fragment blueprint routes — add `Cache-Control: public, max-age={ttl}, stale-while-revalidate=300` headers
|
||||
- Remove `fetch_fragments()` calls from all `app.py` context processors (templates emit SSI directly)
|
||||
|
||||
### T2.2: Replace Outbox Polling with Redis Streams
|
||||
|
||||
**Why:** `EventProcessor` polls `ap_activities` every 2s with `SELECT FOR UPDATE SKIP LOCKED`. This creates persistent DB load even when idle. LISTEN/NOTIFY helps but is fragile.
|
||||
|
||||
**Changes:**
|
||||
- `shared/events/bus.py` `emit_activity()` — after writing to DB, also `XADD coop:activities:pending` with `{activity_id, origin_app, type}`
|
||||
- `shared/events/processor.py` — replace `_poll_loop` + `_listen_for_notify` with `XREADGROUP` (blocking read, no polling). Consumer groups handle partitioning. `XPENDING` + `XCLAIM` replace the stuck activity reaper.
|
||||
- Redis Stream config: `MAXLEN ~10000` to cap memory

### T2.3: CDN for Static Assets

- Route `*.rose-ash.com/static/*` through a CDN (Cloudflare, BunnyCDN)
- `_asset_url()` already adds a `?v={hash}` fingerprint — the CDN can cache with `max-age=31536000, immutable`
- No code changes, just DNS + CDN config

### T2.4: Horizontal Scaling (Docker Swarm Replicas)

**Changes:**

- `docker-compose.yml` — add `replicas: 2` (or 3) to blog, market, events, cart. Keep federation at 1 (it handles all AP delivery).
- `blog/entrypoint.sh` — wrap Alembic in a PostgreSQL advisory lock (`SELECT pg_advisory_lock(42)`) so only one replica runs migrations
- `docker-compose.yml` — add health checks per service
- `shared/db/session.py` — make pool sizes configurable via env vars (`DB_POOL_SIZE`, `DB_MAX_OVERFLOW`) so replicas can use smaller pools
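
The migration-lock bullet could be sketched in `blog/entrypoint.sh` like this — a sketch, where `DATABASE_URL` and the `alembic upgrade head` invocation are assumptions about this repo. The key detail is that `pg_advisory_lock` is session-scoped, so the migration has to run while the psql session stays open:

```shell
# Sketch: hold a session-level advisory lock for the whole migration.
# pg_advisory_lock(42) blocks until acquired, so extra replicas queue here
# and then find the schema already migrated. \! runs alembic from inside
# the psql session, keeping the lock held until it finishes.
psql "$DATABASE_URL" <<'SQL'
SELECT pg_advisory_lock(42);
\! alembic upgrade head
SELECT pg_advisory_unlock(42);
SQL
```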
**Depends on:** T0.4 (PgBouncer), T0.5 (workers).

---

## TIER 3 — Federation Scale

**Target: millions (federated network effects). Effort: weeks.**

### T3.1: Dedicated AP Delivery Service

**Why:** AP delivery is CPU-intensive (RSA signing) and I/O-intensive (HTTP to remote servers). Running it inside federation's web worker blocks request processing.

**Changes:**

- New `delivery/` service — a standalone asyncio app (no Quart/web server)
- Reads from Redis Stream `coop:delivery:pending` (from T2.2)
- Loads the activity from the federation DB, loads followers, delivers with a semaphore (from T1.1)
- `ap_delivery_handler.py` `on_any_activity` → enqueues to the stream instead of delivering inline
- `docker-compose.yml` — add a `delivery-worker` service with `replicas: 2`, no port binding
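
The worker's concurrency control can be sketched in plain asyncio — `deliver_one` stands in for the real signed HTTP POST, and the semaphore bound (from T1.1) is an assumed default:

```python
# Sketch for T3.1/T1.1: bounded-concurrency fan-out for one activity.
import asyncio

async def deliver_all(inbox_urls, deliver_one, max_concurrent=50):
    """Deliver one activity to many inboxes, at most max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:
            return await deliver_one(url)  # real impl: signed HTTP POST

    # return_exceptions=True keeps one dead server from failing the batch;
    # the exceptions come back as values and can feed the T3.2 health tracker.
    return await asyncio.gather(*(bounded(u) for u in inbox_urls),
                                return_exceptions=True)
```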
**Depends on:** T2.2 (Redis Streams), T1.1 (concurrent delivery).

### T3.2: Per-Domain Health Tracking + Backoff

**Why:** Dead remote servers waste delivery slots. The current code retries 5 times with no backoff.

**Changes:**

- New `shared/events/domain_health.py` — Redis hash `domain:health:{domain}` tracking consecutive failures, with an exponential backoff schedule (30s → 1min → 5min → 15min → 1hr → 6hr → 24hr)
- Delivery worker checks domain health before attempting delivery; skips domains in backoff
- On success: reset. On failure: increment + extend backoff.
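
The schedule itself is trivial to pin down; the Redis-hash plumbing is omitted so this stays a pure, testable function. The values come from the plan:

```python
# Sketch for T3.2: map consecutive failures to a skip duration, capped at 24h.
BACKOFF_SCHEDULE = [30, 60, 300, 900, 3600, 21600, 86400]  # seconds

def backoff_seconds(consecutive_failures: int) -> int:
    """How long to skip a domain after N consecutive delivery failures."""
    if consecutive_failures <= 0:
        return 0
    index = min(consecutive_failures, len(BACKOFF_SCHEDULE)) - 1
    return BACKOFF_SCHEDULE[index]
```

The worker would store `failures` and a `retry_after` timestamp in `domain:health:{domain}` and compare `retry_after` to the current time before attempting delivery.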

### T3.3: Shared Inbox Optimization

**Why:** 100 Mastodon followers from one instance = 100 POSTs to the same server. Mastodon supports `sharedInbox` — one POST covers all followers on that instance.

**Changes:**

- `shared/models/federation.py` `APFollower` — add a `shared_inbox_url` column
- Migration to backfill from `ap_remote_actors.shared_inbox_url`
- `ap_delivery_handler.py` — group followers by domain, prefer the shared inbox when available
- **Impact:** 100 followers on one instance → 1 HTTP POST instead of 100
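
The grouping step can be sketched with plain dicts standing in for `APFollower` rows (the `inbox_url`/`shared_inbox_url` field names are assumptions matching the plan), so the logic is testable without the ORM:

```python
# Sketch for T3.3: collapse followers to one inbox per domain when possible.
from collections import defaultdict
from urllib.parse import urlparse

def delivery_targets(followers: list[dict]) -> set[str]:
    """Return the minimal set of inbox URLs covering all followers."""
    by_domain: dict[str, list[dict]] = defaultdict(list)
    for f in followers:
        by_domain[urlparse(f["inbox_url"]).netloc].append(f)

    targets: set[str] = set()
    for domain_followers in by_domain.values():
        shared = next((f["shared_inbox_url"] for f in domain_followers
                       if f.get("shared_inbox_url")), None)
        if shared:
            targets.add(shared)  # one POST covers the whole instance
        else:
            targets.update(f["inbox_url"] for f in domain_followers)
    return targets
```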

### T3.4: Table Partitioning for `ap_activities`

**Why:** At millions of activities, the `ap_activities` table becomes the query bottleneck. The `EventProcessor` query orders by `created_at` — native range partitioning fits perfectly.

**Changes:**

- Alembic migration — convert `ap_activities` to `PARTITION BY RANGE (created_at)` with monthly partitions
- Add a cron job or startup hook to create future partitions
- No application code changes needed (transparent to SQLAlchemy)
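
The partition-creation hook only needs to build DDL strings, so it can be sketched (and tested) without a database — the table and column names come from the plan, the naming scheme is an assumption:

```python
# Sketch for T3.4: generate DDL for one monthly partition (half-open range).
from datetime import date

def monthly_partition_ddl(year: int, month: int) -> str:
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return (
        f"CREATE TABLE IF NOT EXISTS ap_activities_{start:%Y_%m} "
        f"PARTITION OF ap_activities "
        f"FOR VALUES FROM ('{start}') TO ('{end}')"
    )
```

A startup hook would run this for the current and next month, so inserts never hit a missing partition at month rollover.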

### T3.5: Read-Through DTO Cache

**Why:** Hot cross-app reads (`cart-summary`, `post-by-slug`) go through HTTP even with T1.5 caching. A Redis-backed DTO cache with event-driven invalidation eliminates HTTP for repeated reads entirely.

**Changes:**

- New `shared/infrastructure/dto_cache.py` — `get(app, query, params)` / `set(...)` backed by Redis
- Integrate into `fetch_data_cached()` as an L1 cache (check before HTTP)
- Action endpoints invalidate the relevant cache keys after successful writes
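
The key scheme is the part worth pinning down: params must hash deterministically so writers can invalidate by `(app, query)` prefix or exact key. The layout below is an assumption, not decided API:

```python
# Sketch for T3.5: deterministic cache keys for dto_cache.get/set.
import hashlib
import json

def cache_key(app: str, query: str, params: dict) -> str:
    # sort_keys ensures {"a": 1, "b": 2} and {"b": 2, "a": 1} share one entry.
    digest = hashlib.sha256(
        json.dumps(params, sort_keys=True, default=str).encode()
    ).hexdigest()[:16]
    return f"dto:{app}:{query}:{digest}"
```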
**Depends on:** T1.5 (data caching), T2.2 (event-driven invalidation via streams).

---

## Implementation Order

```
TIER 0 (hours):
T0.1 Auth Redis ──┐
T0.2 Redis memory ├── all independent, do in parallel
T0.3 DB split ────┘
T0.4 PgBouncer ────── after T0.3
T0.5 Workers ──────── after T0.4

TIER 1 (days):
T1.1 Concurrent AP ──┐
T1.3 Partition procs ├── independent, do in parallel
T1.4 Pool limits ────┤
T1.7 Parse fix ──────┘
T1.2 Circuit breaker ── after T1.4
T1.5 Data caching ───── after T1.2 (same pattern)
T1.6 Nav endpoint ───── after T1.5
T1.8 Read replicas ──── after T0.3 + T0.4

TIER 2 (~1 week):
T2.3 CDN ──────────── independent, anytime
T2.2 Redis Streams ─── foundation for T3.1
T2.1 Nginx SSI ─────── after T1.2 (fragments stable)
T2.4 Replicas ──────── after T0.4 + T0.5

TIER 3 (weeks):
T3.1 Delivery service ── after T2.2 + T1.1
T3.2 Domain health ───── after T3.1
T3.3 Shared inbox ────── after T3.1
T3.4 Table partitioning ─ independent
T3.5 DTO cache ────────── after T1.5 + T2.2
```

## Verification

Each tier has a clear "it works" test:

- **Tier 0:** All 6 apps respond. Login works across apps. `pg_stat_activity` shows connections coming through PgBouncer. `redis-cli -p 6380 INFO memory` shows the auth Redis is separate.
- **Tier 1:** Emit a public AP activity with 50+ followers — delivery completes in seconds, not minutes. Stop the account service — blog pages still render with a stale auth menu. Only federation's processor handles delivery.
- **Tier 2:** The Nginx access log shows SSI fragment cache HITs. `XINFO GROUPS coop:activities:pending` shows active consumer groups. CDN cache-status headers show HITs on static assets. Multiple replicas serve traffic.
- **Tier 3:** The delivery worker scales independently. The domain-health Redis hash shows backoff state for unreachable servers. `EXPLAIN` on `ap_activities` queries shows partition pruning. Shared-inbox delivery logs show 1 POST per domain.