Files
rose-ash/plans/rose-ash-on-sx-migration.md
giles b74eecfdd3
Some checks failed
Test, Build, and Deploy / test-build-deploy (push) Failing after 51s
plans: rose-ash-on-sx migration strategy + radar abstraction backlog (from loops/radar)
Surgical add of the two radar-authored planning docs onto architecture (both new
files, no conflict). Migration strategy: duplicate->cutover->diverge, strangler edge
+ layer-split shadow-diff, host-trio critical path. abstractions.md is the evidence
base the strategy cites (A1 done, W1/W4/W8 substrate-adoption findings).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 15:09:37 +00:00

171 lines
8.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Re-implementing rose-ash on SX — migration strategy
Status: **strategy proposal** (drafted by the `radar` loop, 2026-06-07). Not a
unilateral architecture decision — a starting point for the fleet to refine. Radar's
role here is detection: the `*-on-sx` subsystems have converged into a host-agnostic
re-implementation of rose-ash's domain logic, so this doc proposes *when* and *how* to
wire them to production.
---
## 1. Premise: we are ~70% into a re-implementation already
The fleet of `lib/<x>` SX subsystems is not a set of experiments — it is rose-ash's
domain logic, re-expressed substrate-by-substrate, deliberately **host-agnostic**:
| SX subsystem (`lib/`) | rose-ash production domain |
|---|---|
| content-on-sx (CRDT docs, versioning, `page.sx` HTML render) | **blog** |
| commerce-on-sx (catalog, pricing, cart, order + refund sagas) | **market + cart + orders** |
| events-on-sx (calendar, ticketing, booking) | **events** |
| feed-on-sx (activity streams, AP-shaped, threading) | **federation** |
| identity-on-sx (OAuth2, sessions, grants, membership) | **account** |
| acl-on-sx (permissions) | cross-cutting authZ |
| relations / likes | **relations / likes** (internal) |
| persist-on-sx (log / kv / snapshot facets) | per-service Postgres layer |
| flow-on-sx (durable sagas) | order/refund/delivery workflows |
| mod-on-sx, search-on-sx | new capabilities |
**The architectural enabler:** every core was built with *injected seams*`permit?`,
`send-fn`/`fetch-fn`, `transport`, `dispatch`, `backend`. That is ports-and-adapters
(hexagonal) on purpose. Evidence from the radar backlog (`plans/abstractions.md`):
W1 (7/7 federation modules inject the fed-sx transport), W4 (content/commerce/events run
live on `persist/log`), W8 (events+commerce run sagas on `lib/flow`). **The cores do not
depend on how they're hosted, persisted, or federated.**
**Corollary that makes the whole migration tractable:** because logic is separated from
rendering and storage, we can hold the **domain logic to parity** while **freely
redesigning the presentation** — the two are different layers with different rules.
---
## 2. The gating insight: the cores are *ahead of the host*
The domain logic is mature. What is *not* yet production-grade is the **host trio** — and
that is the real critical path:
- **host-on-sx** — HTTP / request-response / session host (briefing exists; the OCaml SX
HTTP server already serves `sx.rose-ash.com`).
- **host-persist** — durable storage adapter (real disk/pg/ipfs) under `persist`'s
facets (content-addressed blob blocker recently closed).
- **fed-sx** — the real ActivityPub transport every core injects (well into m2).
> **So "when do we start?" answers itself: start when the host trio is production-grade,
> not when the cores are done — they mostly already are.** Prioritise the host loops over
> further domain features.
---
## 3. The model: duplicate → cut over → diverge (per slice)
This is the "duplicate first, then change" approach, made precise. Each domain slice goes
through three phases independently:
**Phase A — Duplicate (hold logic to parity).** Stand the SX implementation of the slice
up *in parallel*, behind the existing edge, serving no users yet. Get its **domain/data
behaviour** to match Python (see §4 on how). Presentation can start as a rough port or an
early new design — it doesn't have to match.
**Phase B — Cut over (strangler flip).** Point the edge route for that slice at the SX
host. Python stays as instant rollback. The slice is now live on SX.
**Phase C — Diverge (change freely).** With the slice live and validated, evolve the
look/feel and functionality on the SX side. The validated domain logic underneath is
untouched, so UX/feature changes can't silently corrupt data.
You never rewrite the whole platform at once; you walk slices through A→B→C, oldest tree
strangled last.
---
## 4. The two techniques, and how "we'll change things" reshapes them
### Strangler edge
The edge (Caddy) is the front door every request hits. Add routing rules so **one route
at a time** goes to the SX host while everything else still goes to Python. Properties:
the site is never half-broken; any single route flips back to Python instantly; the old
app is strangled route-by-route. (Opposite of big-bang swap, which is how these die.)
### Shadow diff — split by layer
Run the new version on real traffic in the background, discard its output, and **log how
it differs** from Python. Flip the edge only when diffs are zero/intended.
But because we *intend* to change look/feel + functionality, parity is a tool we apply
**only where we want sameness**, not a straitjacket:
| Layer | Want parity? | Oracle |
|---|---|---|
| **Domain/data** (totals, tax, permissions, what's stored, who-sees-what) | **YES — silent difference = data corruption** | shadow-diff at the *core* boundary; deterministic cores → replay real request logs through the harness and diff |
| **Presentation/UX** (HTML, layout, look, feel, flows) | **NO — this is what we're changing** | manual QA + design review; this is the Phase-C divergence |
Practical shape: shadow-diff hits the **domain core's output** (the computed order, the
visible-activity set, the permission decision) — not the rendered HTML. The deterministic,
harness-replayable cores are the single biggest advantage we have here; it's the same
parity discipline that made the A1 conformance migration safe (one reference slice, hard
parity gate, revert on mismatch).
---
## 5. Readiness gates (start the production migration when ALL hold)
1. **Host trio production-grade** — host-on-sx (HTTP/session), host-persist (durable
adapter), fed-sx (AP transport) — each conformance-green.
2. **Data-migration story exists** — a way to get existing production Postgres state into
`persist` event streams (event-source the current state, or dual-write during overlap).
This is the honest long-pole; it is *not* domain logic and nobody has built it yet.
3. **One vertical slice proven end-to-end** at data-parity in production — the reference
migration, the way the conformance loop migrated one subsystem before the rest.
---
## 6. Sequencing
1. **Host trio first** (critical path — it's behind the cores).
2. **Build the strangler edge + shadow-diff harness** as first-class tooling: edge routing
rules + a dual-run logger that diffs *core outputs* (not HTML) and stores discrepancies.
3. **First slice = lowest risk × highest readiness × cleanest data oracle.**
Recommended: **the blog read path (content-on-sx)** or **the feed read path**
— read-heavy, no money, CRDT/versioning + `page.sx` HTML already exist, and the data
oracle is clean. *Avoid cart/orders/payments first* (transactional + SumUp webhooks =
highest blast radius).
4. **Persistence-first, federation-last.** Land host-persist + migrate per-domain event
stores before any cutover. Do fed-sx federation as a *coordinated* cut near the end —
W1 shows all 7 cores light up federation together once the shared transport ships.
5. **Walk the remaining slices A→B→C**, retiring Python routes as each cuts over.
---
## 7. The honest long tail (mostly host + adapters, not cores)
The cores are pure domain logic; the production *tail* is not in them yet and is most of
the remaining real effort:
- Auth: first-party cookies / Safari-ITP, CSRF, silent SSO, grant caching.
- Cross-cutting: rate limiting, observability/metrics, error pages, caching.
- Integrations: SumUp payment + webhooks, Ghost CMS sync.
- Presentation: the actual HTMX templates + CSS (this is also where the redesign happens).
- **Live data migration** — the single biggest non-core workstream.
---
## 8. Concrete next steps
1. Treat the **host trio** as the fleet's critical path; prioritise over more domain features.
2. Stand up the **strangler edge + core-level shadow-diff harness** as a tool.
3. Prove **one slice** (blog/content read path) end-to-end in production as the reference.
4. **Spec the Postgres → persist data migration** (the long-pole nobody has started).
5. Then walk slices through duplicate → cut over → diverge, redesigning UX in Phase C.
---
## 9. Why this is low-risk despite being a platform rewrite
- It's **wiring host-agnostic cores to a host**, not rewriting domain logic from scratch.
- The **strangler edge** means the site always works and any route reverts in seconds.
- **Deterministic cores** make data-parity *mechanically checkable* (replay + diff), so
correctness isn't a matter of faith.
- **Logic/presentation separation** lets us change look/feel + functionality (Phase C)
*without* re-risking the validated domain logic.
- It's the **same discipline that just shipped A1**: one reference migration, a hard
parity gate, honest exclusions, verify-before-merge.