fed-sx-m2: Pattern B from fed-prims diagnosis fails on reproducer

loops/fed-prims commit bf8d0bf2 (merged as 94f6ab9f) diagnosed
Blockers #4 as Erlang-substrate scope and sketched a Pattern B fix
purely in er-bif-http-listen: wrap the handler call in er-spawn-fun
+ er-sched-run-all! and read the spawned process's :exit-result.

Tried it on lib/erlang/runtime.sx — does not work. Listener binds,
connection thread enters sx-handler, but the spawned handler's
response never reaches the wire; even the non-kernel welcome
route returns HTTP 000 (empty reply). Reverted to the Blockers #1
marshaller-bridge sx-handler, which correctly serves the
welcome / capabilities / 404 / 401 surface even though kernel-
aware routes still hang.

Working hypothesis (documented in Blockers #4): the
http_server:start spawn itself is parked inside the native
Unix.accept loop on the boot thread; the global er-sched-* state
still has that process in its queue. When the connection thread
(under the per-instance native mutex) calls er-sched-run-all!, it re-enters
the SAME global scheduler — the boot thread's er-sched-step! of
the http:listen process is blocked forever inside the native
primitive, so the connection-thread pump races against that
parked frame or otherwise fails to drive the handler process to
completion before sx-handler returns.

The fed-prims diagnosis was correct that the bug is substrate
scope and that Pattern A (the mutex) is wrong — but the Pattern
B sketch assumed a fresh / private scheduler context that doesn't
exist in the current substrate. Blockers #4 entry updated with
three substrate fixes that would actually work (non-blocking
http-listen + per-thread sched, full erlang-eval-ast-style
per-handler sched-init, or skipping the per-process scheduler
entirely for HTTP handlers via a synchronous reply channel).

m2 stays at 11/12 steps done; Step 12 remains gated. Loop pacing
dialled back down; the substrate work belongs on loops/erlang or a
follow-on fed-prims tick with a more careful design pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 15:21:18 +00:00
parent 136deb1daf
commit 1d771aedea


@@ -1105,19 +1105,60 @@ proceed.
the first GET /actors/<id>/outbox (or any /actors/<id> with
`Accept: application/vnd.fed-sx.actor-doc`).
Belongs on `loops/erlang` or `loops/fed-prims`. Two fix
patterns:
- Release the mutex around the `gen_server:call` reply wait
(substrate change in http-listen's handler-call code).
- Run the handler in a fresh er-spawn'd process so the
gen_server runs on a different scheduler frame.
**2026-06-07 update:** `loops/fed-prims` commit `bf8d0bf2`
(merged to architecture as `94f6ab9f`) diagnosed this as
Erlang-substrate scope rather than an OCaml mutex bug, and
sketched a Pattern B fix entirely in `er-bif-http-listen`:
wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
and read the process's `:exit-result`. m2 tried this patch on
`lib/erlang/runtime.sx` and **it did not work**: the listener
binds, the connection thread enters `sx-handler`, but the
spawned process's response never reaches the wire — even the
non-kernel welcome route returns `HTTP 000` (empty reply).
Reproducer: spin up `http_server:start(P, [])` with the
Pattern B `sx-handler`; `curl http://127.0.0.1:P/` returns 000.
Why it fails (working hypothesis, m2 worktree): the
`http_server:start` spawn itself ran inside the outer
`erlang-eval-ast` scheduler pump and is **parked inside the
native `Unix.accept` loop on the boot thread**; the global
`er-sched-*` state still has that process in its queue. When
the connection thread calls `er-sched-run-all!` from inside
`sx-handler`, it re-enters the SAME global scheduler that
the boot thread is already pumping (the boot thread's
`er-sched-step!` of the http:listen process is blocked
forever inside the native primitive). The connection thread
spawns its handler process fine but `er-sched-run-all!`
either races against the boot thread's parked pump or
otherwise fails to drive the handler to completion before
the native handler returns. Reverted on m2:
`lib/erlang/runtime.sx` stays at the Blockers #1
marshaller-bridge fix, which serves the welcome /
capabilities / 404 / 401 surface correctly.
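A toy Python model of the hypothesis above may help picture it. It is only a sketch: `Scheduler`, `listener`, and `handler` are hypothetical stand-ins, not the real `er-sched-*` state, and the actual m2 symptom (an empty reply) may come from a different mechanism than the one modelled here. The model shows one thing: a run-all pump over a shared queue cannot drain while a never-exiting listener process sits in that queue, whereas the same pump over a private queue drains as soon as the handler exits.

```python
from collections import deque


class Scheduler:
    """Hypothetical round-robin pump over generator-based 'processes'."""

    def __init__(self):
        self.queue = deque()
        self.results = {}

    def spawn(self, name, gen):
        self.queue.append((name, gen))

    def run_all(self, max_steps=100):
        """Step processes until the queue is empty or the budget runs out.

        Returns True only if the queue drained.
        """
        steps = 0
        while self.queue and steps < max_steps:
            name, gen = self.queue.popleft()
            try:
                next(gen)                       # one scheduler step
                self.queue.append((name, gen))  # still alive: requeue
            except StopIteration as stop:
                self.results[name] = stop.value  # process exited
            steps += 1
        return not self.queue


def listener():
    # Stand-in for the http:listen process: it never exits.
    while True:
        yield


def handler():
    # Stand-in for a request handler: one step, then an exit result.
    yield
    return "200 OK"


def run_shared():
    """Handler pumped on the same scheduler that holds the listener."""
    shared = Scheduler()
    shared.spawn("listener", listener())
    shared.spawn("handler", handler())
    return shared.run_all(max_steps=50)


def run_private():
    """Handler pumped on a fresh scheduler of its own."""
    private = Scheduler()
    private.spawn("handler", handler())
    drained = private.run_all(max_steps=50)
    return drained, private.results.get("handler")
```

In this model `run_shared()` exhausts its step budget without draining, while `run_private()` drains and yields the handler's exit result. That is the property the Pattern B sketch implicitly relied on.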
The real fix likely needs ONE of:
- Native http-listen registers the listener and returns
immediately (non-blocking BIF), with the accept loop
running on a separate native thread and the connection
handler entering a **fresh** `er-sched-init!`-d
scheduler context (substrate change in OCaml + a redesign
of how er-sched-* state is partitioned by thread).
- OR: the connection handler runs `erlang-eval-ast`-style
(its own `er-sched-init!` + private scheduler), with the
gen_server hosted in a way that's accessible across
scheduler instances (substantial substrate redesign).
- OR: skip the per-process scheduler entirely for HTTP
handlers and use a synchronous "reply channel" pattern
that doesn't go through `receive` (changes every
kernel-aware Erlang module's call shape — large blast
radius).
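The third option (a synchronous reply channel) can be sketched in Python under stated assumptions. `kernel_service` and `call_sync` are hypothetical names, and the service thread merely stands in for a gen_server; none of this reflects the actual substrate API. The caller hands the service a private one-shot queue and blocks on it directly, so the reply never passes through a scheduler-driven `receive`.

```python
import queue
import threading


def kernel_service(requests):
    """Hypothetical stand-in for a gen_server running on its own thread."""
    while True:
        msg = requests.get()
        if msg is None:  # shutdown sentinel
            return
        payload, reply_to = msg
        reply_to.put(("ok", payload.upper()))


def call_sync(requests, payload, timeout=1.0):
    """Send a request and block on a private one-shot reply queue."""
    reply_to = queue.Queue(maxsize=1)
    requests.put((payload, reply_to))
    # Raises queue.Empty if no reply arrives within the timeout.
    return reply_to.get(timeout=timeout)


def demo():
    requests = queue.Queue()
    worker = threading.Thread(
        target=kernel_service, args=(requests,), daemon=True
    )
    worker.start()
    try:
        return call_sync(requests, "outbox")
    finally:
        requests.put(None)  # stop the service thread
        worker.join(timeout=1.0)
```

The timeout on the reply queue means a stuck service surfaces as an error rather than a hung connection thread. The cost, as noted above, is that every kernel-aware call site would have to adopt this call shape.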
Belongs on `loops/erlang` or a follow-on `loops/fed-prims`
tick. Step 12's two-instance smoke test gates on this —
without it, the only request shapes that survive over real
HTTP are the static / capabilities / static-stub paths.
In-flight `smoke_federate.sh` test was withdrawn during the
initial Blockers #4 surfacing (it boots both instances
successfully but every kernel-touching request hangs); the
plan's Step 12 acceptance criterion stays open pending
Blockers #4 resolution. m2's other 11 steps are fully
@@ -1129,6 +1170,37 @@ proceed.
Newest first.
- **2026-06-07** — Tried `loops/fed-prims` `bf8d0bf2`'s Pattern B
patch sketch on `lib/erlang/runtime.sx`'s `er-bif-http-listen`:
wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
and read the spawned process's `:exit-result`. **It did not
work** — listener binds, but even the non-kernel welcome route
now returns HTTP 000 (the spawned handler's response never
reaches the wire). The simple `sx-handler` (direct
`er-apply-fun handler`) is preserved on m2 because it at least
serves welcome / capabilities / 404 / 401 correctly when no
kernel routes are touched. Reverted; runtime.sx stays at the
Blockers #1 marshaller-bridge fix.
Working hypothesis for why Pattern B fails on m2's
reproducer: the `http_server:start` spawn is itself parked
inside the native `Unix.accept` loop on the boot thread; the
global `er-sched-*` state still has that process in its
queue. When the connection thread (under the per-instance
native mutex) calls `er-sched-run-all!`, it re-enters the
SAME global scheduler — the boot thread's `er-sched-step!`
of the http:listen process is blocked forever inside the
native primitive, so the connection-thread pump either
races against that parked frame or otherwise fails to drive
the new handler process to completion before the connection
thread returns from `sx-handler`. The fed-prims diagnosis
was correct that the bug is Erlang-substrate scope and that
Pattern A (the mutex) doesn't apply, but the Pattern B
sketch assumed a fresh / private scheduler context that
doesn't exist in the current substrate. Blockers #4
updated to capture this + sketch the three substrate fixes
that would actually work; loop pacing dialled back down.
- **2026-06-07** — Step 12 prep discovered Blockers #4
(http-listen handler holds the SX runtime mutex; any
`gen_server:call` from inside an HTTP route deadlocks