fed-sx-m2: Pattern B from fed-prims diagnosis fails on reproducer

loops/fed-prims commit bf8d0bf2 (merged as 94f6ab9f) diagnosed Blockers #4 as Erlang-substrate scope and sketched a Pattern B fix purely in er-bif-http-listen: wrap the handler call in er-spawn-fun + er-sched-run-all! and read the spawned process's :exit-result.

Tried it on lib/erlang/runtime.sx; it does not work. The listener binds and the connection thread enters sx-handler, but the spawned handler's response never reaches the wire; even the non-kernel welcome route returns HTTP 000 (empty reply). Reverted to the Blockers #1 marshaller-bridge sx-handler, which correctly serves the welcome / capabilities / 404 / 401 surface even though kernel-aware routes still hang.

Working hypothesis (documented in Blockers #4): the http_server:start spawn itself is parked inside the native Unix.accept loop on the boot thread; the global er-sched-* state still has that process in its queue. When the connection thread (under the per-instance native mutex) calls er-sched-run-all!, it re-enters the SAME global scheduler. The boot thread's er-sched-step! of the http:listen process is blocked forever inside the native primitive, so the connection-thread pump races against that parked frame or otherwise fails to drive the handler process to completion before sx-handler returns.

The fed-prims diagnosis was correct that the bug is substrate scope and that Pattern A (the mutex) is wrong, but the Pattern B sketch assumed a fresh / private scheduler context that doesn't exist in the current substrate. Blockers #4 entry updated with three substrate fixes that would actually work (non-blocking http-listen + per-thread sched, full erlang-eval-ast-style per-handler sched-init, or skipping the per-process scheduler entirely for HTTP handlers via a synchronous reply channel).

m2 stays at 11/12 steps done; Step 12 remains gated. Loop pacing dialled back down; substrate work belongs on loops/erlang or a follow-on fed-prims tick with a more careful design pass.
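
For orientation only, the third candidate fix (a synchronous reply channel that bypasses the per-process scheduler) can be modelled as a toy in Python. This is not the SX runtime or the Erlang substrate: `kernel_loop`, `call_kernel`, `handle_request`, `start_kernel`, the actor data, and the route parsing are all hypothetical names and values invented for illustration. The only point it shows is that the handler blocks on a private reply queue instead of pumping a shared scheduler.

```python
import queue
import threading

# Toy stand-in for a kernel-side server (hypothetical; not the SX runtime).
# It owns its state and answers requests posted to its inbox.
def kernel_loop(inbox):
    outbox_by_actor = {"alice": ["note-1", "note-2"]}  # made-up sample data
    while True:
        msg = inbox.get()
        if msg is None:  # shutdown sentinel
            break
        actor_id, reply_to = msg
        reply_to.put(outbox_by_actor.get(actor_id))

def call_kernel(inbox, actor_id, timeout=1.0):
    """Synchronous call: post a request with a private reply queue, then block on it."""
    reply_to = queue.Queue(maxsize=1)
    inbox.put((actor_id, reply_to))
    try:
        return reply_to.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(f"no reply for {actor_id!r} within {timeout}s")

def handle_request(inbox, path):
    """Hypothetical HTTP handler: returns (status, body) without pumping any scheduler."""
    prefix, suffix = "/actors/", "/outbox"
    if not (path.startswith(prefix) and path.endswith(suffix)):
        return 404, "not found"
    actor_id = path[len(prefix):-len(suffix)]
    items = call_kernel(inbox, actor_id)
    if items is None:
        return 404, "unknown actor"
    return 200, ",".join(items)

def start_kernel():
    """Start the toy kernel on its own thread and return its inbox and thread."""
    inbox = queue.Queue()
    thread = threading.Thread(target=kernel_loop, args=(inbox,), daemon=True)
    thread.start()
    return inbox, thread
```

In this model a connection thread would call `handle_request(inbox, "/actors/alice/outbox")` and get its reply directly from the kernel thread. The timeout in `call_kernel` turns a stuck kernel into an error rather than a hang, which is the failure mode the kernel-aware routes currently show.
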
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 15:21:18 +00:00
parent 136deb1daf
commit 1d771aedea
1 changed files with 83 additions and 11 deletions
--- a/plans/fed-sx-milestone-2.md
+++ b/plans/fed-sx-milestone-2.md
@@ -1105,19 +1105,60 @@ proceed.
   the first GET /actors/<id>/outbox (or any /actors/<id> with
   `Accept: application/vnd.fed-sx.actor-doc`).

-   Belongs on `loops/erlang` or `loops/fed-prims`. Two fix
-   patterns:
-   - Release the mutex around the `gen_server:call` reply wait
-     (substrate change in http-listen's handler-call code).
-   - Run the handler in a fresh er-spawn'd process so the
-     gen_server runs on a different scheduler frame.
+   **2026-06-07 update:** `loops/fed-prims` commit `bf8d0bf2`
+   (merged to architecture as `94f6ab9f`) diagnosed this as
+   Erlang-substrate scope rather than an OCaml mutex bug, and
+   sketched a Pattern B fix entirely in `er-bif-http-listen`:
+   wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
+   and read the process's `:exit-result`. m2 tried this patch on
+   `lib/erlang/runtime.sx` and **it did not work**: the listener
+   binds, the connection thread enters `sx-handler`, but the
+   spawned process's response never reaches the wire — even the
+   non-kernel welcome route returns `HTTP 000` (empty reply).
+   Reproducer: spin up `http_server:start(P, [])` with the
+   Pattern B `sx-handler`; `curl http://127.0.0.1:P/` returns 000.

-   Step 12's two-instance smoke test gates on this — without
-   it, the only request shapes that survive over real HTTP are
-   the static / capabilities / static-stub paths.
+   Why it fails (working hypothesis, m2 worktree): the
+   `http_server:start` spawn itself ran inside the outer
+   `erlang-eval-ast` scheduler pump and is **parked inside the
+   native `Unix.accept` loop on the boot thread**; the global
+   `er-sched-*` state still has that process in its queue. When
+   the connection thread calls `er-sched-run-all!` from inside
+   `sx-handler`, it re-enters the SAME global scheduler that
+   the boot thread is already pumping (the boot thread's
+   `er-sched-step!` of the http:listen process is blocked
+   forever inside the native primitive). The connection thread
+   spawns its handler process fine but `er-sched-run-all!`
+   either races against the boot thread's parked pump or
+   otherwise fails to drive the handler to completion before
+   the native handler returns. Reverted on m2:
+   `lib/erlang/runtime.sx` stays at the Blockers #1
+   marshaller-bridge fix, which is correct.

-   In-flight `smoke_federate.sh` test was withdrawn during this
-   tick after the deadlock surfaced (it boots both instances
+   The real fix likely needs ONE of:
+   - Native http-listen registers the listener and returns
+     immediately (non-blocking BIF), with the accept loop
+     running on a separate native thread and the connection
+     handler entering a **fresh** `er-sched-init!`-d
+     scheduler context (substrate change in OCaml + a redesign
+     of how er-sched-* state is partitioned by thread).
+   - OR: the connection handler runs `erlang-eval-ast`-style
+     (its own `er-sched-init!` + private scheduler), with the
+     gen_server hosted in a way that's accessible across
+     scheduler instances (substantial substrate redesign).
+   - OR: skip the per-process scheduler entirely for HTTP
+     handlers and use a synchronous "reply channel" pattern
+     that doesn't go through `receive` (changes every
+     kernel-aware Erlang module's call shape — large blast
+     radius).
+
+   Belongs on `loops/erlang` or a follow-on `loops/fed-prims`
+   tick. Step 12's two-instance smoke test gates on this —
+   without it, the only request shapes that survive over real
+   HTTP are the static / capabilities / static-stub paths.
+
+   In-flight `smoke_federate.sh` test was withdrawn during the
+   initial Blockers #4 surfacing (it boots both instances
   successfully but every kernel-touching request hangs); the
   plan's Step 12 acceptance criterion stays open pending
   Blockers #4 resolution. m2's other 11 steps are fully
@@ -1129,6 +1170,37 @@ proceed.

 Newest first.

+- **2026-06-07** — Tried `loops/fed-prims` `bf8d0bf2`'s Pattern B
+  patch sketch on `lib/erlang/runtime.sx`'s `er-bif-http-listen`:
+  wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
+  and read the spawned process's `:exit-result`. **It did not
+  work** — listener binds, but even the non-kernel welcome route
+  now returns HTTP 000 (the spawned handler's response never
+  reaches the wire). The simple `sx-handler` (direct
+  `er-apply-fun handler`) is preserved on m2 because it at least
+  serves welcome / capabilities / 404 / 401 correctly when no
+  kernel routes are touched. Reverted; runtime.sx stays at the
+  Blockers #1 marshaller-bridge fix.
+
+  Working hypothesis for why Pattern B fails on m2's
+  reproducer: the `http_server:start` spawn is itself parked
+  inside the native `Unix.accept` loop on the boot thread; the
+  global `er-sched-*` state still has that process in its
+  queue. When the connection thread (under the per-instance
+  native mutex) calls `er-sched-run-all!`, it re-enters the
+  SAME global scheduler — the boot thread's `er-sched-step!`
+  of the http:listen process is blocked forever inside the
+  native primitive, so the connection-thread pump either
+  races against that parked frame or otherwise fails to drive
+  the new handler process to completion before the connection
+  thread returns from `sx-handler`. The fed-prims diagnosis
+  was correct that the bug is Erlang-substrate scope and that
+  Pattern A (the mutex) doesn't apply, but the Pattern B
+  sketch assumed a fresh / private scheduler context that
+  doesn't exist in the current substrate. Blockers #4
+  updated to capture this + sketch the three substrate fixes
+  that would actually work; loop pacing dialled back down.
+
 - **2026-06-07** — Step 12 prep discovered Blockers #4
  (http-listen handler holds the SX runtime mutex; any
  `gen_server:call` from inside an HTTP route deadlocks