From 1d771aedea095b6a80e5609e9fd671b6123c9cce Mon Sep 17 00:00:00 2001
From: giles
Date: Sun, 7 Jun 2026 15:21:18 +0000
Subject: [PATCH] fed-sx-m2: Pattern B from fed-prims diagnosis fails on reproducer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

loops/fed-prims commit bf8d0bf2 (merged as 94f6ab9f) diagnosed
Blockers #4 as Erlang-substrate scope and sketched a Pattern B fix
purely in er-bif-http-listen: wrap the handler call in er-spawn-fun +
er-sched-run-all! and read the spawned process's :exit-result.

Tried it on lib/erlang/runtime.sx — does not work. Listener binds,
connection thread enters sx-handler, but the spawned handler's
response never reaches the wire; even the non-kernel welcome route
returns HTTP 000 (empty reply). Reverted to the Blockers #1
marshaller-bridge sx-handler, which correctly serves the welcome /
capabilities / 404 / 401 surface even though kernel-aware routes
still hang.

Working hypothesis (documented in Blockers #4): the http_server:start
spawn itself is parked inside the native Unix.accept loop on the boot
thread; the global er-sched-* state still has that process in its
queue. When the connection thread (under the per-instance native
mutex) calls er-sched-run-all!, it re-enters the SAME global
scheduler — the boot thread's er-sched-step! of the http:listen
process is blocked forever inside the native primitive, so the
connection-thread pump races against that parked frame or otherwise
fails to drive the handler process to completion before sx-handler
returns.

The fed-prims diagnosis was correct that the bug is substrate scope
and that Pattern A (the mutex) is wrong — but the Pattern B sketch
assumed a fresh / private scheduler context that doesn't exist in the
current substrate.
Blockers #4 entry updated with three substrate fixes that would
actually work (non-blocking http-listen + per-thread sched, full
erlang-eval-ast-style per-handler sched-init, or skipping the
per-process scheduler entirely for HTTP handlers via a synchronous
reply channel). m2 stays at 11/12 steps done; Step 12 remains gated.
Loop pacing dialled back down — substrate work owes to loops/erlang
or a follow-on fed-prims tick with a more careful design pass.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 plans/fed-sx-milestone-2.md | 94 ++++++++++++++++++++++++++++++++-----
 1 file changed, 83 insertions(+), 11 deletions(-)

diff --git a/plans/fed-sx-milestone-2.md b/plans/fed-sx-milestone-2.md
index a945ada4..c8f79d6e 100644
--- a/plans/fed-sx-milestone-2.md
+++ b/plans/fed-sx-milestone-2.md
@@ -1105,19 +1105,60 @@ proceed.
 the first GET /actors//outbox (or any /actors/ with
 `Accept: application/vnd.fed-sx.actor-doc`).

- Belongs on `loops/erlang` or `loops/fed-prims`. Two fix
- patterns:
- - Release the mutex around the `gen_server:call` reply wait
- (substrate change in http-listen's handler-call code).
- - Run the handler in a fresh er-spawn'd process so the
- gen_server runs on a different scheduler frame.
+ **2026-06-07 update:** `loops/fed-prims` commit `bf8d0bf2`
+ (merged to architecture as `94f6ab9f`) diagnosed this as
+ Erlang-substrate scope rather than an OCaml mutex bug, and
+ sketched a Pattern B fix entirely in `er-bif-http-listen`:
+ wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
+ and read the process's `:exit-result`. m2 tried this patch on
+ `lib/erlang/runtime.sx` and **it did not work**: the listener
+ binds, the connection thread enters `sx-handler`, but the
+ spawned process's response never reaches the wire — even the
+ non-kernel welcome route returns `HTTP 000` (empty reply).
+ Reproducer: spin up `http_server:start(P, [])` with the
+ Pattern B `sx-handler`; `curl http://127.0.0.1:P/` returns 000.

- Step 12's two-instance smoke test gates on this — without
- it, the only request shapes that survive over real HTTP are
- the static / capabilities / static-stub paths.
+ Why it fails (working hypothesis, m2 worktree): the
+ `http_server:start` spawn itself ran inside the outer
+ `erlang-eval-ast` scheduler pump and is **parked inside the
+ native `Unix.accept` loop on the boot thread**; the global
+ `er-sched-*` state still has that process in its queue. When
+ the connection thread calls `er-sched-run-all!` from inside
+ `sx-handler`, it re-enters the SAME global scheduler that
+ the boot thread is already pumping (the boot thread's
+ `er-sched-step!` of the http:listen process is blocked
+ forever inside the native primitive). The connection thread
+ spawns its handler process fine but `er-sched-run-all!`
+ either races against the boot thread's parked pump or
+ otherwise fails to drive the handler to completion before
+ the native handler returns. Reverted on m2 — `lib/erlang/
+ runtime.sx` stays at the Blockers #1 marshaller-bridge fix,
+ which is correct.

- In-flight `smoke_federate.sh` test was withdrawn during this
- tick after the deadlock surfaced (it boots both instances
+ The real fix likely needs ONE of:
+ - Native http-listen registers the listener and returns
+ immediately (non-blocking BIF), with the accept loop
+ running on a separate native thread and the connection
+ handler entering a **fresh** `er-sched-init!`-d
+ scheduler context (substrate change in OCaml + a redesign
+ of how er-sched-* state is partitioned by thread).
+ - OR: the connection handler runs `erlang-eval-ast`-style
+ (its own `er-sched-init!` + private scheduler), with the
+ gen_server hosted in a way that's accessible across
+ scheduler instances (substantial substrate redesign).
+ - OR: skip the per-process scheduler entirely for HTTP
+ handlers and use a synchronous "reply channel" pattern
+ that doesn't go through `receive` (changes every
+ kernel-aware Erlang module's call shape — large blast
+ radius).
+
+ Belongs on `loops/erlang` or a follow-on `loops/fed-prims`
+ tick. Step 12's two-instance smoke test gates on this —
+ without it, the only request shapes that survive over real
+ HTTP are the static / capabilities / static-stub paths.
+
+ In-flight `smoke_federate.sh` test was withdrawn during the
+ initial Blockers #4 surfacing (it boots both instances
 successfully but every kernel-touching request hangs); the
 plan's Step 12 acceptance criterion stays open pending
 Blockers #4 resolution. m2's other 11 steps are fully
@@ -1129,6 +1170,37 @@ proceed.

 Newest first.

+- **2026-06-07** — Tried `loops/fed-prims` `bf8d0bf2`'s Pattern B
+ patch sketch on `lib/erlang/runtime.sx`'s `er-bif-http-listen`:
+ wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
+ and read the spawned process's `:exit-result`. **It did not
+ work** — listener binds, but even the non-kernel welcome route
+ now returns HTTP 000 (the spawned handler's response never
+ reaches the wire). The simple `sx-handler` (direct
+ `er-apply-fun handler`) is preserved on m2 because it at least
+ serves welcome / capabilities / 404 / 401 correctly when no
+ kernel routes are touched. Reverted; runtime.sx stays at the
+ Blockers #1 marshaller-bridge fix.
+
+ Working hypothesis for why Pattern B fails on m2's
+ reproducer: the `http_server:start` spawn is itself parked
+ inside the native `Unix.accept` loop on the boot thread; the
+ global `er-sched-*` state still has that process in its
+ queue. When the connection thread (under the per-instance
+ native mutex) calls `er-sched-run-all!`, it re-enters the
+ SAME global scheduler — the boot thread's `er-sched-step!`
+ of the http:listen process is blocked forever inside the
+ native primitive, so the connection-thread pump either
+ races against that parked frame or otherwise fails to drive
+ the new handler process to completion before the connection
+ thread returns from `sx-handler`. The fed-prims diagnosis
+ was correct that the bug is Erlang-substrate scope and that
+ Pattern A (the mutex) doesn't apply, but the Pattern B
+ sketch assumed a fresh / private scheduler context that
+ doesn't exist in the current substrate. Blockers #4
+ updated to capture this + sketch the three substrate fixes
+ that would actually work; loop pacing dialled back down.
+
 - **2026-06-07** — Step 12 prep discovered Blockers #4
 (http-listen handler holds the SX runtime mutex; any
 `gen_server:call` from inside an HTTP route deadlocks