rose-ash/plans/rose-ash-on-sx-migration.md
giles b74eecfdd3
plans: rose-ash-on-sx migration strategy + radar abstraction backlog (from loops/radar)
Surgical add of the two radar-authored planning docs onto architecture (both new
files, no conflict). Migration strategy: duplicate->cutover->diverge, strangler edge
+ layer-split shadow-diff, host-trio critical path. abstractions.md is the evidence
base the strategy cites (A1 done, W1/W4/W8 substrate-adoption findings).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 15:09:37 +00:00


Re-implementing rose-ash on SX — migration strategy

Status: strategy proposal (drafted by the radar loop, 2026-06-07). Not a unilateral architecture decision — a starting point for the fleet to refine. Radar's role here is detection: the *-on-sx subsystems have converged into a host-agnostic re-implementation of rose-ash's domain logic, so this doc proposes when and how to wire them to production.


1. Premise: we are ~70% into a re-implementation already

The fleet of lib/<x> SX subsystems is not a set of experiments — it is rose-ash's domain logic, re-expressed substrate-by-substrate, deliberately host-agnostic:

| SX subsystem (lib/) | rose-ash production domain |
|---|---|
| content-on-sx (CRDT docs, versioning, page.sx HTML render) | blog |
| commerce-on-sx (catalog, pricing, cart, order + refund sagas) | market + cart + orders |
| events-on-sx (calendar, ticketing, booking) | events |
| feed-on-sx (activity streams, AP-shaped, threading) | federation |
| identity-on-sx (OAuth2, sessions, grants, membership) | account |
| acl-on-sx (permissions) | cross-cutting authZ |
| relations / likes | relations / likes (internal) |
| persist-on-sx (log / kv / snapshot facets) | per-service Postgres layer |
| flow-on-sx (durable sagas) | order/refund/delivery workflows |
| mod-on-sx, search-on-sx | new capabilities |

The architectural enabler: every core was built with injected seams: permit?, send-fn/fetch-fn, transport, dispatch, backend. That is ports-and-adapters (hexagonal) on purpose. Evidence from the radar backlog (plans/abstractions.md): W1 (7/7 federation modules inject the fed-sx transport), W4 (content/commerce/events run live on persist/log), W8 (events+commerce run sagas on lib/flow). The cores do not depend on how they're hosted, persisted, or federated.
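
To make the injected-seam idea concrete, here is a minimal Python sketch of a core that receives its ports as arguments. It is an illustration only: the real cores are written in SX, and the names FeedCore, Permit, and Send, along with their signatures, are assumptions made for this example rather than the actual permit? / send-fn interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical port signatures; the real SX seams (permit?, send-fn) may differ.
Permit = Callable[[str, str, str], bool]  # (actor, action, resource) -> allowed?
Send = Callable[[str, dict], None]        # (recipient, activity) -> None


@dataclass
class FeedCore:
    """Domain logic only: no HTTP, storage, or federation imports."""

    permit: Permit
    send: Send
    posts: list = field(default_factory=list)

    def publish(self, actor: str, text: str, followers: list[str]) -> bool:
        # The authorisation decision is delegated to the injected port.
        if not self.permit(actor, "publish", "feed"):
            return False
        activity = {"type": "Create", "actor": actor, "content": text}
        self.posts.append(activity)
        # Delivery is delegated too; the core never knows the transport.
        for follower in followers:
            self.send(follower, activity)
        return True


# Test adapters: an allow-list policy and an in-memory outbox.
outbox: list[tuple[str, dict]] = []
core = FeedCore(
    permit=lambda actor, action, resource: actor == "alice",
    send=lambda recipient, activity: outbox.append((recipient, activity)),
)
```

Swapping the in-memory outbox for a real transport adapter would change only the arguments passed to the constructor, which is the property the W1/W4/W8 findings describe.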

Corollary that makes the whole migration tractable: because logic is separated from rendering and storage, we can hold the domain logic to parity while freely redesigning the presentation — the two are different layers with different rules.


2. The gating insight: the cores are ahead of the host

The domain logic is mature. What is not yet production-grade is the host trio — and that is the real critical path:

  • host-on-sx — HTTP / request-response / session host (briefing exists; the OCaml SX HTTP server already serves sx.rose-ash.com).
  • host-persist — durable storage adapter (real disk/pg/ipfs) under persist's facets (content-addressed blob blocker recently closed).
  • fed-sx — the real ActivityPub transport every core injects (well into m2).

So "when do we start?" answers itself: start when the host trio is production-grade, not when the cores are done — they mostly already are. Prioritise the host loops over further domain features.


3. The model: duplicate → cut over → diverge (per slice)

This is the "duplicate first, then change" approach, made precise. Each domain slice goes through three phases independently:

Phase A — Duplicate (hold logic to parity). Stand the SX implementation of the slice up in parallel, behind the existing edge, serving no users yet. Get its domain/data behaviour to match Python (see §4 on how). Presentation can start as a rough port or an early new design — it doesn't have to match.

Phase B — Cut over (strangler flip). Point the edge route for that slice at the SX host. Python stays as instant rollback. The slice is now live on SX.

Phase C — Diverge (change freely). With the slice live and validated, evolve the look/feel and functionality on the SX side. The validated domain logic underneath is untouched, so UX/feature changes can't silently corrupt data.

You never rewrite the whole platform at once; you walk slices through A→B→C, oldest tree strangled last.
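
The per-slice phase model can be sketched as a small state machine. This is a hedged illustration, not existing tooling: the Phase names, the ALLOWED table, and the advance helper are hypothetical, and treating Phase C as having no automatic rollback is an assumption of this sketch (the text only promises instant rollback at cutover).

```python
from enum import Enum


class Phase(Enum):
    PYTHON_ONLY = "python-only"
    DUPLICATE = "A-duplicate"
    CUT_OVER = "B-cut-over"
    DIVERGE = "C-diverge"


# Allowed moves: forward one phase at a time, plus rollback from B to A.
# Assumption: once a slice has diverged there is no automatic way back.
ALLOWED = {
    Phase.PYTHON_ONLY: {Phase.DUPLICATE},
    Phase.DUPLICATE: {Phase.CUT_OVER},
    Phase.CUT_OVER: {Phase.DIVERGE, Phase.DUPLICATE},
    Phase.DIVERGE: set(),
}


def advance(slices: dict, name: str, target: Phase) -> dict:
    """Return a new slice map with `name` moved to `target`, if allowed."""
    current = slices[name]
    if target not in ALLOWED[current]:
        raise ValueError(f"{name}: {current.value} -> {target.value} not allowed")
    return {**slices, name: target}


# Each slice moves independently of the others.
slices = {"blog": Phase.PYTHON_ONLY, "cart": Phase.PYTHON_ONLY}
slices = advance(slices, "blog", Phase.DUPLICATE)
slices = advance(slices, "blog", Phase.CUT_OVER)
```

After these two calls the blog slice is live on SX while cart has not started, which is the "per slice" independence the model relies on.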


4. The two techniques, and how "we'll change things" reshapes them

Strangler edge

The edge (Caddy) is the front door every request hits. Add routing rules so one route at a time goes to the SX host while everything else still goes to Python. Properties: the site is never half-broken; any single route flips back to Python instantly; the old app is strangled route-by-route. (This is the opposite of a big-bang swap, which is how platform rewrites usually die.)
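
The routing decision itself is simple enough to sketch. The Python below models only the logic, not Caddy configuration; the upstream addresses and the pick_upstream helper are hypothetical names chosen for illustration.

```python
# Hypothetical upstream addresses; real values live in the edge config.
PYTHON_UPSTREAM = "python-app:8000"
SX_UPSTREAM = "sx-host:8013"


def pick_upstream(path: str, sx_prefixes: set[str]) -> str:
    """Send a request to SX only if its path falls under a flipped prefix.

    Matching is on whole path segments, so flipping "/blog" does not
    accidentally capture "/blogroll".
    """
    for prefix in sx_prefixes:
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            return SX_UPSTREAM
    # Default: everything not explicitly flipped stays on Python.
    return PYTHON_UPSTREAM


flipped = {"/blog"}
blog_target = pick_upstream("/blog/hello-world", flipped)
cart_target = pick_upstream("/cart", flipped)
```

Rollback in this model is removing a prefix from the flipped set: with an empty set every request goes to Python again.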

Shadow diff — split by layer

Run the new version on real traffic in the background, discard its output, and log how it differs from Python. Flip the edge only when every remaining diff is either gone or an intended change.

But because we intend to change look/feel + functionality, parity is a tool we apply only where we want sameness, not a straitjacket:

| Layer | Want parity? | Oracle |
|---|---|---|
| Domain/data (totals, tax, permissions, what's stored, who-sees-what) | YES: silent difference = data corruption | shadow-diff at the core boundary; deterministic cores → replay real request logs through the harness and diff |
| Presentation/UX (HTML, layout, look, feel, flows) | NO: this is what we're changing | manual QA + design review; this is the Phase-C divergence |

Practical shape: shadow-diff hits the domain core's output (the computed order, the visible-activity set, the permission decision) — not the rendered HTML. The deterministic, harness-replayable cores are the single biggest advantage we have here; it's the same parity discipline that made the A1 conformance migration safe (one reference slice, hard parity gate, revert on mismatch).
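
A core-level shadow diff might look roughly like the sketch below: replay logged requests through both implementations and record only the cases where the structured outputs disagree. Everything here is assumed for illustration, including the request shape, the two stand-in order-total functions, and the shadow_diff helper; the SX stand-in is deliberately wrong so that a discrepancy shows up.

```python
import json
from typing import Callable, Iterable

Core = Callable[[dict], dict]


def canonical(value) -> str:
    """Serialise with sorted keys so key order never counts as a diff."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def shadow_diff(requests: Iterable[dict], reference: Core, candidate: Core) -> list:
    """Replay requests through both cores; return the discrepancies."""
    discrepancies = []
    for request in requests:
        expected = reference(request)
        try:
            actual = candidate(request)
        except Exception as exc:  # a crash in the shadow is also a diff
            actual = {"error": repr(exc)}
        if canonical(expected) != canonical(actual):
            discrepancies.append(
                {"request": request, "python": expected, "sx": actual}
            )
    return discrepancies


def python_order_total(request: dict) -> dict:
    # Stand-in for the production Python core (integer pence, no floats).
    total = sum(i["qty"] * i["unit_pence"] for i in request["items"])
    return {"total_pence": total}


def sx_order_total(request: dict) -> dict:
    # Stand-in for the SX core, deliberately ignoring qty to show a diff.
    return {"total_pence": sum(i["unit_pence"] for i in request["items"])}


request_log = [
    {"items": [{"qty": 1, "unit_pence": 500}]},
    {"items": [{"qty": 2, "unit_pence": 300}]},
]
diffs = shadow_diff(request_log, python_order_total, sx_order_total)
```

Only the second request is reported, because the two stand-ins agree whenever every quantity is 1. Comparing structured outputs rather than HTML is what lets the presentation change freely without producing noise in the diff log.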


5. Readiness gates (start the production migration when ALL hold)

  1. Host trio production-grade — host-on-sx (HTTP/session), host-persist (durable adapter), fed-sx (AP transport) — each conformance-green.
  2. Data-migration story exists — a way to get existing production Postgres state into persist event streams (event-source the current state, or dual-write during overlap). This is the honest long-pole; it is not domain logic and nobody has built it yet.
  3. One vertical slice proven end-to-end at data-parity in production — the reference migration, the way the conformance loop migrated one subsystem before the rest.
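
For gate 2, the "event-source the current state" option can be sketched as follows: emit one synthetic import event per existing row, then fold the resulting stream back into state and check that it reproduces the source rows. The event shape, the "imported" and "updated" type names, and the helper functions are all assumptions for illustration; the actual persist log facet API is not shown in this document.

```python
def rows_to_events(table: str, rows: list) -> list:
    """One synthetic 'imported' event per existing row, in id order."""
    return [
        {"stream": f"{table}/{row['id']}", "type": "imported", "data": dict(row)}
        for row in sorted(rows, key=lambda r: r["id"])
    ]


def fold(events: list) -> dict:
    """Rebuild current state by applying events in order."""
    state = {}
    for event in events:
        if event["type"] == "imported":
            state[event["stream"]] = dict(event["data"])
        elif event["type"] == "updated":
            state[event["stream"]].update(event["data"])
    return state


def verify(table: str, rows: list, events: list) -> bool:
    """Migration check: folding the events must reproduce the rows."""
    expected = {f"{table}/{row['id']}": dict(row) for row in rows}
    return fold(events) == expected


# Hypothetical rows standing in for an exported Postgres table.
posts = [
    {"id": 2, "title": "Second", "status": "draft"},
    {"id": 1, "title": "First", "status": "published"},
]
events = rows_to_events("posts", posts)
migrated_ok = verify("posts", posts, events)
```

The same fold-and-compare check would apply to a dual-write overlap: events appended during the overlap go through fold as well, and the result is compared against a fresh export.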

6. Sequencing

  1. Host trio first (critical path — it's behind the cores).
  2. Build the strangler edge + shadow-diff harness as first-class tooling: edge routing rules + a dual-run logger that diffs core outputs (not HTML) and stores discrepancies.
  3. First slice = lowest risk × highest readiness × cleanest data oracle. Recommended: the blog read path (content-on-sx) or the feed read path — read-heavy, no money, CRDT/versioning + page.sx HTML already exist, and the data oracle is clean. Avoid cart/orders/payments first (transactional + SumUp webhooks = highest blast radius).
  4. Persistence-first, federation-last. Land host-persist + migrate per-domain event stores before any cutover. Do fed-sx federation as a coordinated cut near the end — W1 shows all 7 cores light up federation together once the shared transport ships.
  5. Walk the remaining slices A→B→C, retiring Python routes as each cuts over.

7. The honest long tail (mostly host + adapters, not cores)

The cores are pure domain logic; the production tail is not in them yet and is most of the remaining real effort:

  • Auth: first-party cookies / Safari-ITP, CSRF, silent SSO, grant caching.
  • Cross-cutting: rate limiting, observability/metrics, error pages, caching.
  • Integrations: SumUp payment + webhooks, Ghost CMS sync.
  • Presentation: the actual HTMX templates + CSS (this is also where the redesign happens).
  • Live data migration — the single biggest non-core workstream.

8. Concrete next steps

  1. Treat the host trio as the fleet's critical path; prioritise over more domain features.
  2. Stand up the strangler edge + core-level shadow-diff harness as a tool.
  3. Prove one slice (blog/content read path) end-to-end in production as the reference.
  4. Spec the Postgres → persist data migration (the long-pole nobody has started).
  5. Then walk slices through duplicate → cut over → diverge, redesigning UX in Phase C.

9. Why this is low-risk despite being a platform rewrite

  • It's wiring host-agnostic cores to a host, not rewriting domain logic from scratch.
  • The strangler edge means the site always works and any route reverts in seconds.
  • Deterministic cores make data-parity mechanically checkable (replay + diff), so correctness isn't a matter of faith.
  • Logic/presentation separation lets us change look/feel + functionality (Phase C) without re-risking the validated domain logic.
  • It's the same discipline that just shipped A1: one reference migration, a hard parity gate, honest exclusions, verify-before-merge.