radar: rose-ash-on-sx migration strategy — duplicate→cutover→diverge, strangler edge + layer-split shadow-diff, host-trio critical path
Some checks failed
Test, Build, and Deploy / test-build-deploy (push) Failing after 49s

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-07 15:04:04 +00:00
parent c0a0d29a65
commit 58aa9b64bf

View File

@@ -0,0 +1,170 @@
# Re-implementing rose-ash on SX — migration strategy
Status: **strategy proposal** (drafted by the `radar` loop, 2026-06-07). Not a
unilateral architecture decision — a starting point for the fleet to refine. Radar's
role here is detection: the `*-on-sx` subsystems have converged into a host-agnostic
re-implementation of rose-ash's domain logic, so this doc proposes *when* and *how* to
wire them to production.
---
## 1. Premise: we are ~70% into a re-implementation already
The fleet of `lib/<x>` SX subsystems is not a set of experiments — it is rose-ash's
domain logic, re-expressed substrate-by-substrate, deliberately **host-agnostic**:
| SX subsystem (`lib/`) | rose-ash production domain |
|---|---|
| content-on-sx (CRDT docs, versioning, `page.sx` HTML render) | **blog** |
| commerce-on-sx (catalog, pricing, cart, order + refund sagas) | **market + cart + orders** |
| events-on-sx (calendar, ticketing, booking) | **events** |
| feed-on-sx (activity streams, AP-shaped, threading) | **federation** |
| identity-on-sx (OAuth2, sessions, grants, membership) | **account** |
| acl-on-sx (permissions) | cross-cutting authZ |
| relations / likes | **relations / likes** (internal) |
| persist-on-sx (log / kv / snapshot facets) | per-service Postgres layer |
| flow-on-sx (durable sagas) | order/refund/delivery workflows |
| mod-on-sx, search-on-sx | new capabilities |
**The architectural enabler:** every core was built with *injected seams*`permit?`,
`send-fn`/`fetch-fn`, `transport`, `dispatch`, `backend`. That is ports-and-adapters
(hexagonal) on purpose. Evidence from the radar backlog (`plans/abstractions.md`):
W1 (7/7 federation modules inject the fed-sx transport), W4 (content/commerce/events run
live on `persist/log`), W8 (events+commerce run sagas on `lib/flow`). **The cores do not
depend on how they're hosted, persisted, or federated.**
**Corollary that makes the whole migration tractable:** because logic is separated from
rendering and storage, we can hold the **domain logic to parity** while **freely
redesigning the presentation** — the two are different layers with different rules.
---
## 2. The gating insight: the cores are *ahead of the host*
The domain logic is mature. What is *not* yet production-grade is the **host trio** — and
that is the real critical path:
- **host-on-sx** — HTTP / request-response / session host (briefing exists; the OCaml SX
HTTP server already serves `sx.rose-ash.com`).
- **host-persist** — durable storage adapter (real disk/pg/ipfs) under `persist`'s
facets (content-addressed blob blocker recently closed).
- **fed-sx** — the real ActivityPub transport every core injects (well into m2).
> **So "when do we start?" answers itself: start when the host trio is production-grade,
> not when the cores are done — they mostly already are.** Prioritise the host loops over
> further domain features.
---
## 3. The model: duplicate → cut over → diverge (per slice)
This is the "duplicate first, then change" approach, made precise. Each domain slice goes
through three phases independently:
**Phase A — Duplicate (hold logic to parity).** Stand the SX implementation of the slice
up *in parallel*, behind the existing edge, serving no users yet. Get its **domain/data
behaviour** to match Python (see §4 on how). Presentation can start as a rough port or an
early new design — it doesn't have to match.
**Phase B — Cut over (strangler flip).** Point the edge route for that slice at the SX
host. Python stays as instant rollback. The slice is now live on SX.
**Phase C — Diverge (change freely).** With the slice live and validated, evolve the
look/feel and functionality on the SX side. The validated domain logic underneath is
untouched, so UX/feature changes can't silently corrupt data.
You never rewrite the whole platform at once; you walk slices through A→B→C, oldest tree
strangled last.
---
## 4. The two techniques, and how "we'll change things" reshapes them
### Strangler edge
The edge (Caddy) is the front door every request hits. Add routing rules so **one route
at a time** goes to the SX host while everything else still goes to Python. Properties:
the site is never half-broken; any single route flips back to Python instantly; the old
app is strangled route-by-route. (Opposite of big-bang swap, which is how these die.)
### Shadow diff — split by layer
Run the new version on real traffic in the background, discard its output, and **log how
it differs** from Python. Flip the edge only when diffs are zero/intended.
But because we *intend* to change look/feel + functionality, parity is a tool we apply
**only where we want sameness**, not a straitjacket:
| Layer | Want parity? | Oracle |
|---|---|---|
| **Domain/data** (totals, tax, permissions, what's stored, who-sees-what) | **YES — silent difference = data corruption** | shadow-diff at the *core* boundary; deterministic cores → replay real request logs through the harness and diff |
| **Presentation/UX** (HTML, layout, look, feel, flows) | **NO — this is what we're changing** | manual QA + design review; this is the Phase-C divergence |
Practical shape: shadow-diff hits the **domain core's output** (the computed order, the
visible-activity set, the permission decision) — not the rendered HTML. The deterministic,
harness-replayable cores are the single biggest advantage we have here; it's the same
parity discipline that made the A1 conformance migration safe (one reference slice, hard
parity gate, revert on mismatch).
---
## 5. Readiness gates (start the production migration when ALL hold)
1. **Host trio production-grade** — host-on-sx (HTTP/session), host-persist (durable
adapter), fed-sx (AP transport) — each conformance-green.
2. **Data-migration story exists** — a way to get existing production Postgres state into
`persist` event streams (event-source the current state, or dual-write during overlap).
This is the honest long-pole; it is *not* domain logic and nobody has built it yet.
3. **One vertical slice proven end-to-end** at data-parity in production — the reference
migration, the way the conformance loop migrated one subsystem before the rest.
---
## 6. Sequencing
1. **Host trio first** (critical path — it's behind the cores).
2. **Build the strangler edge + shadow-diff harness** as first-class tooling: edge routing
rules + a dual-run logger that diffs *core outputs* (not HTML) and stores discrepancies.
3. **First slice = lowest risk × highest readiness × cleanest data oracle.**
Recommended: **the blog read path (content-on-sx)** or **the feed read path**
— read-heavy, no money, CRDT/versioning + `page.sx` HTML already exist, and the data
oracle is clean. *Avoid cart/orders/payments first* (transactional + SumUp webhooks =
highest blast radius).
4. **Persistence-first, federation-last.** Land host-persist + migrate per-domain event
stores before any cutover. Do fed-sx federation as a *coordinated* cut near the end —
W1 shows all 7 cores light up federation together once the shared transport ships.
5. **Walk the remaining slices A→B→C**, retiring Python routes as each cuts over.
---
## 7. The honest long tail (mostly host + adapters, not cores)
The cores are pure domain logic; the production *tail* is not in them yet and is most of
the remaining real effort:
- Auth: first-party cookies / Safari-ITP, CSRF, silent SSO, grant caching.
- Cross-cutting: rate limiting, observability/metrics, error pages, caching.
- Integrations: SumUp payment + webhooks, Ghost CMS sync.
- Presentation: the actual HTMX templates + CSS (this is also where the redesign happens).
- **Live data migration** — the single biggest non-core workstream.
---
## 8. Concrete next steps
1. Treat the **host trio** as the fleet's critical path; prioritise over more domain features.
2. Stand up the **strangler edge + core-level shadow-diff harness** as a tool.
3. Prove **one slice** (blog/content read path) end-to-end in production as the reference.
4. **Spec the Postgres → persist data migration** (the long-pole nobody has started).
5. Then walk slices through duplicate → cut over → diverge, redesigning UX in Phase C.
---
## 9. Why this is low-risk despite being a platform rewrite
- It's **wiring host-agnostic cores to a host**, not rewriting domain logic from scratch.
- The **strangler edge** means the site always works and any route reverts in seconds.
- **Deterministic cores** make data-parity *mechanically checkable* (replay + diff), so
correctness isn't a matter of faith.
- **Logic/presentation separation** lets us change look/feel + functionality (Phase C)
*without* re-risking the validated domain logic.
- It's the **same discipline that just shipped A1**: one reference migration, a hard
parity gate, honest exclusions, verify-before-merge.