diff --git a/plans/rose-ash-on-sx-migration.md b/plans/rose-ash-on-sx-migration.md
new file mode 100644
index 00000000..c2a04a33
--- /dev/null
+++ b/plans/rose-ash-on-sx-migration.md
@@ -0,0 +1,170 @@
+# Re-implementing rose-ash on SX: migration strategy
+
+Status: **strategy proposal** (drafted by the `radar` loop, 2026-06-07). Not a
+unilateral architecture decision, but a starting point for the fleet to refine. Radar's
+role here is detection: the `*-on-sx` subsystems have converged into a host-agnostic
+re-implementation of rose-ash's domain logic, so this doc proposes *when* and *how* to
+wire them to production.
+
+---
+
+## 1. Premise: we are ~70% into a re-implementation already
+
+The fleet of `lib/` SX subsystems is not a set of experiments; it is rose-ash's
+domain logic, re-expressed substrate-by-substrate, deliberately **host-agnostic**:
+
+| SX subsystem (`lib/`) | rose-ash production domain |
+|---|---|
+| content-on-sx (CRDT docs, versioning, `page.sx` HTML render) | **blog** |
+| commerce-on-sx (catalog, pricing, cart, order + refund sagas) | **market + cart + orders** |
+| events-on-sx (calendar, ticketing, booking) | **events** |
+| feed-on-sx (activity streams, AP-shaped, threading) | **federation** |
+| identity-on-sx (OAuth2, sessions, grants, membership) | **account** |
+| acl-on-sx (permissions) | cross-cutting authZ |
+| relations / likes | **relations / likes** (internal) |
+| persist-on-sx (log / kv / snapshot facets) | per-service Postgres layer |
+| flow-on-sx (durable sagas) | order/refund/delivery workflows |
+| mod-on-sx, search-on-sx | new capabilities |
+
+**The architectural enabler:** every core was built with *injected seams*: `permit?`,
+`send-fn`/`fetch-fn`, `transport`, `dispatch`, `backend`. That is ports-and-adapters
+(hexagonal) on purpose.
+Evidence from the radar backlog (`plans/abstractions.md`):
+W1 (7/7 federation modules inject the fed-sx transport), W4 (content/commerce/events run
+live on `persist/log`), W8 (events+commerce run sagas on `lib/flow`). **The cores do not
+depend on how they're hosted, persisted, or federated.**
+
+**Corollary that makes the whole migration tractable:** because logic is separated from
+rendering and storage, we can hold the **domain logic to parity** while **freely
+redesigning the presentation**: the two are different layers with different rules.
+
+---
+
+## 2. The gating insight: the cores are *ahead of the host*
+
+The domain logic is mature. What is *not* yet production-grade is the **host trio**, and
+that is the real critical path:
+
+- **host-on-sx**: HTTP / request-response / session host (briefing exists; the OCaml SX
+  HTTP server already serves `sx.rose-ash.com`).
+- **host-persist**: durable storage adapter (real disk/pg/ipfs) under `persist`'s
+  facets (content-addressed blob blocker recently closed).
+- **fed-sx**: the real ActivityPub transport every core injects (well into m2).
+
+> **So "when do we start?" answers itself: start when the host trio is production-grade,
+> not when the cores are done (they mostly already are).** Prioritise the host loops over
+> further domain features.
+
+---
+
+## 3. The model: duplicate → cut over → diverge (per slice)
+
+This is the "duplicate first, then change" approach, made precise. Each domain slice goes
+through three phases independently:
+
+**Phase A: Duplicate (hold logic to parity).** Stand the SX implementation of the slice
+up *in parallel*, behind the existing edge, serving no users yet. Get its **domain/data
+behaviour** to match Python (see §4 on how). Presentation can start as a rough port or an
+early new design; it doesn't have to match.
+
+**Phase B: Cut over (strangler flip).** Point the edge route for that slice at the SX
+host. Python stays in place as an instant rollback.
+The slice is now live on SX.
+
+**Phase C: Diverge (change freely).** With the slice live and validated, evolve the
+look/feel and functionality on the SX side. The validated domain logic underneath is
+untouched, so UX/feature changes can't silently corrupt data.
+
+You never rewrite the whole platform at once; you walk slices through A→B→C, oldest tree
+strangled last.
+
+---
+
+## 4. The two techniques, and how "we'll change things" reshapes them
+
+### Strangler edge
+The edge (Caddy) is the front door every request hits. Add routing rules so **one route
+at a time** goes to the SX host while everything else still goes to Python. Properties:
+the site is never half-broken; any single route flips back to Python instantly; the old
+app is strangled route-by-route. (The opposite of a big-bang swap, which is how rewrites die.)
+
+### Shadow diff: split by layer
+Run the new version on real traffic in the background, discard its output, and **log how
+it differs** from Python. Flip the edge only when the diffs are zero or intended.
+
+But because we *intend* to change look/feel + functionality, parity is a tool we apply
+**only where we want sameness**, not a straitjacket:
+
+| Layer | Want parity? | Oracle |
+|---|---|---|
+| **Domain/data** (totals, tax, permissions, what's stored, who-sees-what) | **YES: a silent difference = data corruption** | shadow-diff at the *core* boundary; deterministic cores → replay real request logs through the harness and diff |
+| **Presentation/UX** (HTML, layout, look, feel, flows) | **NO: this is what we're changing** | manual QA + design review; this is the Phase-C divergence |
+
+Practical shape: shadow-diff hits the **domain core's output** (the computed order, the
+visible-activity set, the permission decision), not the rendered HTML.
+The deterministic,
+harness-replayable cores are the single biggest advantage we have here; it's the same
+parity discipline that made the A1 conformance migration safe (one reference slice, hard
+parity gate, revert on mismatch).
+
+---
+
+## 5. Readiness gates (start the production migration when ALL hold)
+
+1. **Host trio production-grade**: host-on-sx (HTTP/session), host-persist (durable
+   adapter), fed-sx (AP transport), each conformance-green.
+2. **Data-migration story exists**: a way to get existing production Postgres state into
+   `persist` event streams (event-source the current state, or dual-write during overlap).
+   This is the honest long pole; it is *not* domain logic and nobody has built it yet.
+3. **One vertical slice proven end-to-end** at data-parity in production: the reference
+   migration, the way the conformance loop migrated one subsystem before the rest.
+
+---
+
+## 6. Sequencing
+
+1. **Host trio first** (critical path: it lags behind the cores).
+2. **Build the strangler edge + shadow-diff harness** as first-class tooling: edge routing
+   rules + a dual-run logger that diffs *core outputs* (not HTML) and stores discrepancies.
+3. **First slice = lowest risk × highest readiness × cleanest data oracle.**
+   Recommended: **the blog read path (content-on-sx)** or **the feed read path**:
+   read-heavy, no money, CRDT/versioning + `page.sx` HTML already exist, and the data
+   oracle is clean. *Avoid cart/orders/payments first* (transactional + SumUp webhooks =
+   highest blast radius).
+4. **Persistence-first, federation-last.** Land host-persist + migrate per-domain event
+   stores before any cutover. Do fed-sx federation as a *coordinated* cut near the end:
+   W1 shows all 7 cores light up federation together once the shared transport ships.
+5. **Walk the remaining slices A→B→C**, retiring Python routes as each cuts over.
+
+---
+
+## 7.
+The honest long tail (mostly host + adapters, not cores)
+
+The cores are pure domain logic; the production *tail* is not in them yet and is most of
+the remaining real effort:
+
+- Auth: first-party cookies / Safari-ITP, CSRF, silent SSO, grant caching.
+- Cross-cutting: rate limiting, observability/metrics, error pages, caching.
+- Integrations: SumUp payment + webhooks, Ghost CMS sync.
+- Presentation: the actual HTMX templates + CSS (this is also where the redesign happens).
+- **Live data migration**: the single biggest non-core workstream.
+
+---
+
+## 8. Concrete next steps
+
+1. Treat the **host trio** as the fleet's critical path; prioritise it over more domain features.
+2. Stand up the **strangler edge + core-level shadow-diff harness** as a tool.
+3. Prove **one slice** (blog/content read path) end-to-end in production as the reference.
+4. **Spec the Postgres → persist data migration** (the long pole nobody has started).
+5. Then walk slices through duplicate → cut over → diverge, redesigning UX in Phase C.
+
+---
+
+## 9. Why this is low-risk despite being a platform rewrite
+
+- It's **wiring host-agnostic cores to a host**, not rewriting domain logic from scratch.
+- The **strangler edge** means the site always works and any route reverts in seconds.
+- **Deterministic cores** make data-parity *mechanically checkable* (replay + diff), so
+  correctness isn't a matter of faith.
+- **Logic/presentation separation** lets us change look/feel + functionality (Phase C)
+  *without* re-risking the validated domain logic.
+- It's the **same discipline that just shipped A1**: one reference migration, a hard
+  parity gate, honest exclusions, verify-before-merge.
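
A minimal sketch of the core-level shadow diff described in §4 and §6 step 2, assuming each core can be called as a deterministic function from a logged request payload to a domain result. Every name here (`LoggedRequest`, `Discrepancy`, `shadow_diff`, `ready_to_cut_over`, and the two toy pricing cores) is hypothetical and not taken from the repository; Python is used only because it is the incumbent implementation language.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Callable, Iterable, List

# Hypothetical shapes: none of these names come from the rose-ash codebase.
Core = Callable[[dict], Any]


@dataclass(frozen=True)
class LoggedRequest:
    request_id: str
    payload: dict


@dataclass(frozen=True)
class Discrepancy:
    request_id: str
    reference: Any
    candidate: Any


def shadow_diff(
    requests: Iterable[LoggedRequest],
    reference_core: Core,
    candidate_core: Core,
) -> List[Discrepancy]:
    """Replay logged requests through both cores and collect differing outputs.

    Only domain outputs are compared (totals, permission decisions, ...),
    never rendered HTML. A candidate that raises counts as a discrepancy.
    """
    discrepancies = []
    for req in requests:
        expected = reference_core(req.payload)
        try:
            actual = candidate_core(req.payload)
        except Exception as exc:  # the shadow run must never abort the replay
            actual = f"error: {exc!r}"
        if actual != expected:
            discrepancies.append(Discrepancy(req.request_id, expected, actual))
    return discrepancies


def ready_to_cut_over(discrepancies, intended=frozenset()):
    """Parity gate: every remaining diff must be explicitly marked intended."""
    return all(d.request_id in intended for d in discrepancies)


# Toy stand-ins for the incumbent core and its SX port (illustration only).
def reference_total(payload):
    return sum(Decimal(price) * qty for price, qty in payload["lines"])


def candidate_total(payload):
    # Deliberate bug: ignores quantity, so multi-quantity carts differ.
    return sum(Decimal(price) for price, _qty in payload["lines"])


log = [
    LoggedRequest("r1", {"lines": [("2.50", 1), ("1.00", 1)]}),
    LoggedRequest("r2", {"lines": [("2.50", 2)]}),
]
diffs = shadow_diff(log, reference_total, candidate_total)
print([d.request_id for d in diffs])  # ['r2']
print(ready_to_cut_over(diffs))       # False
```

Comparing at the core boundary rather than on HTML is what would let Phase C change the presentation freely: the gate only fails when a domain result differs, and a diff can be waived only by listing its request id in `intended`.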