radar: rose-ash-on-sx migration strategy — duplicate→cutover→diverge, strangler edge + layer-split shadow-diff, host-trio critical path
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
plans/rose-ash-on-sx-migration.md
# Re-implementing rose-ash on SX: migration strategy

Status: **strategy proposal** (drafted by the `radar` loop, 2026-06-07). This is not a
unilateral architecture decision but a starting point for the fleet to refine. Radar's
role here is detection: the `*-on-sx` subsystems have converged into a host-agnostic
re-implementation of rose-ash's domain logic, so this doc proposes *when* and *how* to
wire them to production.

---

## 1. Premise: we are ~70% into a re-implementation already

The fleet of `lib/<x>` SX subsystems is not a set of experiments. It is rose-ash's
domain logic, re-expressed substrate by substrate and deliberately **host-agnostic**:

| SX subsystem (`lib/`) | rose-ash production domain |
|---|---|
| content-on-sx (CRDT docs, versioning, `page.sx` HTML render) | **blog** |
| commerce-on-sx (catalog, pricing, cart, order + refund sagas) | **market + cart + orders** |
| events-on-sx (calendar, ticketing, booking) | **events** |
| feed-on-sx (activity streams, AP-shaped, threading) | **federation** |
| identity-on-sx (OAuth2, sessions, grants, membership) | **account** |
| acl-on-sx (permissions) | cross-cutting authZ |
| relations / likes | **relations / likes** (internal) |
| persist-on-sx (log / kv / snapshot facets) | per-service Postgres layer |
| flow-on-sx (durable sagas) | order/refund/delivery workflows |
| mod-on-sx, search-on-sx | new capabilities |

**The architectural enabler:** every core was built with *injected seams* (`permit?`,
`send-fn`/`fetch-fn`, `transport`, `dispatch`, `backend`). That is ports-and-adapters
(hexagonal architecture) on purpose. Evidence from the radar backlog (`plans/abstractions.md`):
W1 (7/7 federation modules inject the fed-sx transport), W4 (content/commerce/events run
live on `persist/log`), W8 (events+commerce run sagas on `lib/flow`). **The cores do not
depend on how they're hosted, persisted, or federated.**
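The following minimal Python sketch makes the seam idea concrete: a core with injected ports. It is an analogue, not a transcription. The real cores are SX, and `LikeCore`, `Permit`, `Transport` and the test adapters are hypothetical names that only loosely mirror the `permit?` and `transport` seams.

```python
from dataclasses import dataclass, field
from typing import Callable

# Ports: the core only ever sees these two callables, never a concrete host.
Permit = Callable[[str, str, str], bool]   # (actor, action, resource) -> allowed?
Transport = Callable[[str, dict], None]    # (destination, activity) -> delivery side effect


@dataclass
class LikeCore:
    """Hypothetical likes core: records likes and announces them through injected ports."""
    permit: Permit
    transport: Transport
    likes: dict = field(default_factory=dict)  # resource -> set of actors

    def like(self, actor: str, resource: str) -> bool:
        if not self.permit(actor, "like", resource):
            return False
        self.likes.setdefault(resource, set()).add(actor)
        # The core cannot tell whether this is ActivityPub, a queue, or a test stub.
        self.transport(f"followers:{actor}", {"type": "Like", "actor": actor, "object": resource})
        return True


# Test-harness adapters: a deny-list permit and a transport that only records deliveries.
sent: list = []
core = LikeCore(
    permit=lambda actor, action, resource: actor != "banned",
    transport=lambda dest, activity: sent.append((dest, activity)),
)
```

With these adapters, `core.like("alice", "post:1")` returns `True` and records one `Like` delivery in `sent`. `core.like("banned", "post:1")` returns `False` and sends nothing. Wiring the same core to production would mean passing a different `transport`, for example one backed by fed-sx, without touching `LikeCore`.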

**Corollary that makes the whole migration tractable:** because logic is separated from
rendering and storage, we can hold the **domain logic to parity** while **freely
redesigning the presentation**. The two are different layers with different rules.

---

## 2. The gating insight: the cores are *ahead of the host*

The domain logic is mature. What is *not* yet production-grade is the **host trio**, and
that is the real critical path:

- **host-on-sx**: HTTP / request-response / session host (briefing exists; the OCaml SX
  HTTP server already serves `sx.rose-ash.com`).
- **host-persist**: durable storage adapter (real disk/pg/ipfs) under `persist`'s
  facets (the content-addressed blob blocker was recently closed).
- **fed-sx**: the real ActivityPub transport every core injects (well into m2).

> **So "when do we start?" answers itself: start when the host trio is production-grade,
> not when the cores are done (they mostly already are).** Prioritise the host loops over
> further domain features.

---

## 3. The model: duplicate → cut over → diverge (per slice)

This is the "duplicate first, then change" approach, made precise. Each domain slice goes
through three phases independently:

**Phase A: Duplicate (hold logic to parity).** Stand up the SX implementation of the slice
*in parallel*, behind the existing edge, serving no users yet. Get its **domain/data
behaviour** to match Python (see §4 for how). Presentation can start as a rough port or an
early new design; it doesn't have to match.

**Phase B: Cut over (strangler flip).** Point the edge route for that slice at the SX
host. Python stays up as instant rollback. The slice is now live on SX.

**Phase C: Diverge (change freely).** With the slice live and validated, evolve the
look/feel and functionality on the SX side. The validated domain logic underneath is
untouched, so UX/feature changes can't silently corrupt data.
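The per-slice lifecycle can be read as a small state machine. The sketch below assumes two things the text does not spell out. First, cut-over is blocked while any domain-level diff is unexplained. Second, rollback to Python is only offered in Phase B. `Slice`, `Phase` and `open_domain_diffs` are hypothetical names.

```python
from enum import Enum


class Phase(Enum):
    DUPLICATE = "A"  # SX runs in parallel behind the edge, serves no users
    CUT_OVER = "B"   # edge routes the slice to SX; Python kept running as rollback
    DIVERGE = "C"    # look/feel and features evolve on the SX side


class Slice:
    """Hypothetical per-slice lifecycle tracker; names and rules are illustrative."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.phase = Phase.DUPLICATE
        self.open_domain_diffs = 0  # unexplained domain-level shadow-diff mismatches

    def cut_over(self) -> None:
        if self.phase is not Phase.DUPLICATE:
            raise ValueError(f"{self.name}: can only cut over from Phase A")
        if self.open_domain_diffs:
            raise ValueError(f"{self.name}: {self.open_domain_diffs} domain diffs still open")
        self.phase = Phase.CUT_OVER

    def roll_back(self) -> None:
        # Assumption: rollback is only offered in Phase B, while Python still matches.
        if self.phase is not Phase.CUT_OVER:
            raise ValueError(f"{self.name}: rollback only applies in Phase B")
        self.phase = Phase.DUPLICATE

    def diverge(self) -> None:
        if self.phase is not Phase.CUT_OVER:
            raise ValueError(f"{self.name}: must be live on SX before diverging")
        self.phase = Phase.DIVERGE
```

In this model the parity gate lives inside `cut_over`, so no slice can reach Phase B, and therefore Phase C, while its domain diffs are still open.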

You never rewrite the whole platform at once; you walk slices through A→B→C, with the
oldest parts of the old tree strangled last.

---

## 4. The two techniques, and how "we'll change things" reshapes them

### Strangler edge

The edge (Caddy) is the front door every request hits. Add routing rules so that **one route
at a time** goes to the SX host while everything else still goes to Python. Properties:
the site is never half-broken; any single route flips back to Python instantly; the old
app is strangled route by route. This is the opposite of a big-bang swap, which is how
rewrites like this usually die.
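In Caddy, each flipped route would be a path-matched `reverse_proxy` rule. The Python sketch below only models the decision the edge makes, to show why a flip or a rollback is a one-line change. The upstream names `sx-host` and `python-app` and the longest-prefix rule are assumptions for illustration.

```python
class StranglerEdge:
    """Hypothetical model of edge routing: per-prefix upstream, Python by default."""

    def __init__(self, default_upstream: str = "python-app") -> None:
        self.default_upstream = default_upstream
        self.routes: dict[str, str] = {}  # path prefix -> upstream

    def flip(self, prefix: str, upstream: str = "sx-host") -> None:
        self.routes[prefix] = upstream

    def revert(self, prefix: str) -> None:
        # Removing the rule sends the route straight back to the default (Python) upstream.
        self.routes.pop(prefix, None)

    def upstream_for(self, path: str) -> str:
        # A prefix matches itself or anything below it ("/blog" matches "/blog/x", not "/blogroll").
        matches = [p for p in self.routes
                   if path == p or path.startswith(p.rstrip("/") + "/")]
        if not matches:
            return self.default_upstream
        # Longest matching prefix wins, so a sub-route can be pinned back to Python.
        return self.routes[max(matches, key=len)]
```

For example, after `edge.flip("/blog")`, requests to `/blog/hello` go to `sx-host` while `/blogroll` stays on Python. `edge.revert("/blog")` sends the route straight back.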

### Shadow diff: split by layer

Run the new version on real traffic in the background, discard its output, and **log how
it differs** from Python. Flip the edge only when the diffs are zero or intended.
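The following is a minimal sketch of the dual-run step. It assumes both implementations can be called in-process as plain functions of the same request. In production the shadow call would more likely run asynchronously so it cannot add latency. `shadow_call` and the in-memory `diff_log` are hypothetical.

```python
from typing import Any, Callable


def shadow_call(primary: Callable[[Any], Any],
                shadow: Callable[[Any], Any],
                request: Any,
                diff_log: list) -> Any:
    """Serve the primary (Python) result; run the shadow (SX) and record any mismatch."""
    result = primary(request)
    try:
        candidate = shadow(request)
    except Exception as exc:  # a crashing shadow must never affect the user
        diff_log.append({"request": request, "error": repr(exc)})
        return result
    if candidate != result:
        diff_log.append({"request": request, "primary": result, "shadow": candidate})
    return result  # the shadow output is always discarded
```

The caller always gets the Python result. The SX side can only ever add entries to `diff_log`.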

But because we *intend* to change look/feel and functionality, parity is a tool we apply
**only where we want sameness**, not a straitjacket:

| Layer | Want parity? | Oracle |
|---|---|---|
| **Domain/data** (totals, tax, permissions, what's stored, who-sees-what) | **YES: a silent difference is data corruption** | shadow-diff at the *core* boundary; deterministic cores → replay real request logs through the harness and diff |
| **Presentation/UX** (HTML, layout, look, feel, flows) | **NO: this is what we're changing** | manual QA + design review; this is the Phase-C divergence |

Practical shape: shadow-diff targets the **domain core's output** (the computed order, the
visible-activity set, the permission decision), not the rendered HTML. The deterministic,
harness-replayable cores are the single biggest advantage we have here. It's the same
parity discipline that made the A1 conformance migration safe (one reference slice, hard
parity gate, revert on mismatch).
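Because the cores are deterministic, the replay side of the harness can be as simple as feeding logged requests to both implementations and diffing the structured outputs field by field. This sketch assumes outputs are JSON-like dicts and lists and that the request log is already parsed. `field_diff` and `replay` are hypothetical. A real harness would also need to normalise fields that may legitimately differ, such as generated IDs or timestamps.

```python
from typing import Any, Callable, Iterable


def field_diff(a: Any, b: Any, path: str = "") -> list[str]:
    """Return the paths at which two JSON-like values differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        diffs = []
        for key in sorted(set(a) | set(b), key=str):
            sub = f"{path}.{key}" if path else str(key)
            if key not in a or key not in b:
                diffs.append(sub)  # field present on one side only
            else:
                diffs.extend(field_diff(a[key], b[key], sub))
        return diffs
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return [path or "<root>"]
        diffs = []
        for i, (x, y) in enumerate(zip(a, b)):
            diffs.extend(field_diff(x, y, f"{path}[{i}]"))
        return diffs
    return [] if a == b else [path or "<root>"]


def replay(requests: Iterable[Any],
           python_core: Callable[[Any], Any],
           sx_core: Callable[[Any], Any]) -> dict[int, list[str]]:
    """Replay logged requests through both cores; map request index -> differing fields."""
    report = {}
    for i, req in enumerate(requests):
        diffs = field_diff(python_core(req), sx_core(req))
        if diffs:
            report[i] = diffs
    return report
```

A report such as `{1: ["total"]}` points straight at the request and the field that diverged. That is the information needed to decide whether a difference is a bug or intended.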

---

## 5. Readiness gates (start the production migration when ALL hold)

1. **Host trio production-grade**: host-on-sx (HTTP/session), host-persist (durable
   adapter) and fed-sx (AP transport), each conformance-green.
2. **Data-migration story exists**: a way to get existing production Postgres state into
   `persist` event streams (event-source the current state, or dual-write during overlap).
   This is the honest long pole; it is *not* domain logic and nobody has built it yet.
3. **One vertical slice proven end-to-end** at data parity in production: the reference
   migration, the way the conformance loop migrated one subsystem before the rest.
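Gate 2 is the least specified, so here is one possible shape for the "event-source the current state" option, as a sketch only. Every existing row becomes a synthetic `imported` event in a per-entity stream. The migration is then checked by folding each stream back into state and comparing it with the source row. The event shape and the `backfill`, `fold` and `verify` helpers are hypothetical and do not describe `persist`'s actual log API.

```python
from typing import Any

Row = dict[str, Any]
Event = dict[str, Any]


def backfill(rows: list[Row], entity: str, key: str = "id") -> dict[str, list[Event]]:
    """Turn current-state rows into one synthetic 'imported' event per entity stream."""
    streams: dict[str, list[Event]] = {}
    for row in rows:
        stream_id = f"{entity}/{row[key]}"
        streams.setdefault(stream_id, []).append(
            {"type": "imported", "seq": 0, "data": dict(row)}
        )
    return streams


def fold(events: list[Event]) -> Row:
    """Rebuild current state from a stream: 'imported' seeds it, 'updated' patches it."""
    state: Row = {}
    for ev in sorted(events, key=lambda e: e["seq"]):
        if ev["type"] == "imported":
            state = dict(ev["data"])
        elif ev["type"] == "updated":
            state.update(ev["data"])
    return state


def verify(rows: list[Row], streams: dict[str, list[Event]],
           entity: str, key: str = "id") -> list[str]:
    """Return stream ids whose folded state no longer matches the source row."""
    return [f"{entity}/{r[key]}" for r in rows
            if fold(streams.get(f"{entity}/{r[key]}", [])) != r]
```

During a dual-write overlap, the same `verify` check could be rerun periodically against fresh rows. Any stream id it returns marks an entity where the two stores have drifted.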

---

## 6. Sequencing

1. **Host trio first** (critical path: it lags behind the cores).
2. **Build the strangler edge + shadow-diff harness** as first-class tooling: edge routing
   rules + a dual-run logger that diffs *core outputs* (not HTML) and stores discrepancies.
3. **First slice = lowest risk × highest readiness × cleanest data oracle.**
   Recommended: **the blog read path (content-on-sx)** or **the feed read path**. Both are
   read-heavy and involve no money; content-on-sx already has CRDT/versioning and `page.sx`
   HTML; and the data oracle is clean. *Avoid cart/orders/payments first* (transactional +
   SumUp webhooks = highest blast radius).
4. **Persistence-first, federation-last.** Land host-persist and migrate per-domain event
   stores before any cutover. Do fed-sx federation as a *coordinated* cut near the end:
   W1 shows all 7 federation modules light up together once the shared transport ships.
5. **Walk the remaining slices A→B→C**, retiring Python routes as each cuts over.

---

## 7. The honest long tail (mostly host + adapters, not cores)

The cores are pure domain logic; the production *tail* is not in them yet and makes up most
of the remaining real effort:

- Auth: first-party cookies / Safari ITP, CSRF, silent SSO, grant caching.
- Cross-cutting: rate limiting, observability/metrics, error pages, caching.
- Integrations: SumUp payment + webhooks, Ghost CMS sync.
- Presentation: the actual HTMX templates + CSS (this is also where the redesign happens).
- **Live data migration**: the single biggest non-core workstream.

---

## 8. Concrete next steps

1. Treat the **host trio** as the fleet's critical path; prioritise it over more domain features.
2. Stand up the **strangler edge + core-level shadow-diff harness** as a tool.
3. Prove **one slice** (blog/content read path) end-to-end in production as the reference.
4. **Spec the Postgres → persist data migration** (the long pole nobody has started).
5. Then walk slices through duplicate → cut over → diverge, redesigning UX in Phase C.

---

## 9. Why this is low-risk despite being a platform rewrite

- It's **wiring host-agnostic cores to a host**, not rewriting domain logic from scratch.
- The **strangler edge** means the site always works and any route reverts in seconds.
- **Deterministic cores** make data parity *mechanically checkable* (replay + diff), so
  correctness isn't a matter of faith.
- **Logic/presentation separation** lets us change look/feel + functionality (Phase C)
  *without* re-risking the validated domain logic.
- It's the **same discipline that just shipped A1**: one reference migration, a hard
  parity gate, honest exclusions, verify-before-merge.