fed-sx-m2: Pattern B from fed-prims diagnosis fails on reproducer
Some checks failed
Test, Build, and Deploy / test-build-deploy (push) Failing after 40s
loops/fed-prims commit bf8d0bf2 (merged as 94f6ab9f) diagnosed Blockers #4 as Erlang-substrate scope and sketched a Pattern B fix purely in er-bif-http-listen: wrap the handler call in er-spawn-fun + er-sched-run-all! and read the spawned process's :exit-result.

Tried it on lib/erlang/runtime.sx: does not work. Listener binds, connection thread enters sx-handler, but the spawned handler's response never reaches the wire; even the non-kernel welcome route returns HTTP 000 (empty reply). Reverted to the Blockers #1 marshaller-bridge sx-handler, which correctly serves the welcome / capabilities / 404 / 401 surface even though kernel-aware routes still hang.

Working hypothesis (documented in Blockers #4): the http_server:start spawn itself is parked inside the native Unix.accept loop on the boot thread; the global er-sched-* state still has that process in its queue. When the connection thread (under the per-instance native mutex) calls er-sched-run-all!, it re-enters the SAME global scheduler: the boot thread's er-sched-step! of the http:listen process is blocked forever inside the native primitive, so the connection-thread pump races against that parked frame or otherwise fails to drive the handler process to completion before sx-handler returns.

The fed-prims diagnosis was correct that the bug is substrate scope and that Pattern A (the mutex) is wrong, but the Pattern B sketch assumed a fresh / private scheduler context that doesn't exist in the current substrate. Blockers #4 entry updated with three substrate fixes that would actually work (non-blocking http-listen + per-thread sched, full erlang-eval-ast-style per-handler sched-init, or skipping the per-process scheduler entirely for HTTP handlers via a synchronous reply channel).

m2 stays at 11/12 steps done; Step 12 remains gated. Loop pacing dialled back down; substrate work belongs on loops/erlang or a follow-on fed-prims tick with a more careful design pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1105,19 +1105,60 @@ proceed.
the first GET /actors/<id>/outbox (or any /actors/<id> with
`Accept: application/vnd.fed-sx.actor-doc`).

Belongs on `loops/erlang` or `loops/fed-prims`. Two fix
patterns:
- Release the mutex around the `gen_server:call` reply wait
  (substrate change in http-listen's handler-call code).
- Run the handler in a fresh er-spawn'd process so the
  gen_server runs on a different scheduler frame.
**2026-06-07 update:** `loops/fed-prims` commit `bf8d0bf2`
(merged to architecture as `94f6ab9f`) diagnosed this as
Erlang-substrate scope rather than an OCaml mutex bug, and
sketched a Pattern B fix entirely in `er-bif-http-listen`:
wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
and read the process's `:exit-result`. m2 tried this patch on
`lib/erlang/runtime.sx` and **it did not work**: the listener
binds, the connection thread enters `sx-handler`, but the
spawned process's response never reaches the wire; even the
non-kernel welcome route returns `HTTP 000` (empty reply).
Reproducer: spin up `http_server:start(P, [])` with the
Pattern B `sx-handler`; `curl http://127.0.0.1:P/` returns 000.
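
For scripted runs, the reproducer's curl check can be approximated with a small probe. This is a minimal sketch, not part of the m2 tree: `probe` is a hypothetical helper, and it assumes that any empty reply, reset, refusal, or timeout should be reported as `000`, the value curl prints when it receives no HTTP status at all.

```python
import http.client


def probe(port, host="127.0.0.1", path="/", timeout=5.0):
    """Return the HTTP status as a string, or "000" if no response arrives."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return str(conn.getresponse().status)
    except (http.client.HTTPException, OSError):
        # Empty reply, connection reset, refusal, or timeout:
        # no status line was received, which curl reports as 000.
        return "000"
    finally:
        conn.close()
```

Calling `probe(P)` against an instance started with the Pattern B `sx-handler` would be expected to return `"000"` if the behaviour described above holds.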

Step 12's two-instance smoke test gates on this; without
it, the only request shapes that survive over real HTTP are
the static / capabilities / static-stub paths.
Why it fails (working hypothesis, m2 worktree): the
`http_server:start` spawn itself ran inside the outer
`erlang-eval-ast` scheduler pump and is **parked inside the
native `Unix.accept` loop on the boot thread**; the global
`er-sched-*` state still has that process in its queue. When
the connection thread calls `er-sched-run-all!` from inside
`sx-handler`, it re-enters the SAME global scheduler that
the boot thread is already pumping (the boot thread's
`er-sched-step!` of the http:listen process is blocked
forever inside the native primitive). The connection thread
spawns its handler process fine but `er-sched-run-all!`
either races against the boot thread's parked pump or
otherwise fails to drive the handler to completion before
the native handler returns. Reverted on m2:
`lib/erlang/runtime.sx` stays at the Blockers #1
marshaller-bridge fix, which is correct.
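
One way the hypothesis could play out is shown in the toy model below. It is a sketch under loud assumptions, not the real `er-sched-*` code: `Proc`, `Sched`, `parked_in_native`, and `handle_request` are hypothetical names, the run queue is assumed to be stepped in FIFO order, and a step that would block forever in a native primitive is modelled as the pump giving up.

```python
from collections import deque


class Proc:
    """Hypothetical process record; not the substrate's real representation."""

    def __init__(self, name, body, parked_in_native=False):
        self.name = name
        self.body = body                      # callable producing the exit result
        self.parked_in_native = parked_in_native
        self.exit_result = None


class Sched:
    """Toy run queue standing in for the global er-sched-* state."""

    def __init__(self):
        self.queue = deque()

    def spawn(self, proc):
        self.queue.append(proc)
        return proc

    def run_all(self):
        """Step queued processes in order; return False if the pump stalls."""
        while self.queue:
            proc = self.queue[0]
            if proc.parked_in_native:
                # Assumption: the real step would block inside the native
                # primitive; the model stops pumping instead of hanging.
                return False
            self.queue.popleft()
            proc.exit_result = proc.body()
        return True


def handle_request(sched):
    """Pattern B shape: spawn the handler, pump, then read its exit result."""
    handler = sched.spawn(Proc("handler", lambda: (200, "welcome")))
    sched.run_all()
    return handler.exit_result                # None models the empty reply


# Shared scheduler: the listener is still queued ahead of the handler.
shared = Sched()
shared.spawn(Proc("http_listen", lambda: None, parked_in_native=True))
shared_reply = handle_request(shared)

# Private scheduler: what the Pattern B sketch implicitly assumed.
private_reply = handle_request(Sched())
```

In this model the shared scheduler never reaches the handler, so `shared_reply` stays `None`, while the private scheduler yields the welcome response. The real failure may instead be a race; the model only illustrates why a fresh scheduler context matters.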

In-flight `smoke_federate.sh` test was withdrawn during this
tick after the deadlock surfaced (it boots both instances
The real fix likely needs ONE of:
- Native http-listen registers the listener and returns
  immediately (non-blocking BIF), with the accept loop
  running on a separate native thread and the connection
  handler entering a **fresh** `er-sched-init!`-d
  scheduler context (substrate change in OCaml + a redesign
  of how er-sched-* state is partitioned by thread).
- OR: the connection handler runs `erlang-eval-ast`-style
  (its own `er-sched-init!` + private scheduler), with the
  gen_server hosted in a way that's accessible across
  scheduler instances (substantial substrate redesign).
- OR: skip the per-process scheduler entirely for HTTP
  handlers and use a synchronous "reply channel" pattern
  that doesn't go through `receive` (changes every
  kernel-aware Erlang module's call shape; large blast
  radius).
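
The third option, a synchronous reply channel, can be illustrated outside the substrate. The sketch below is only an analogy in Python threads and queues: `Kernel`, its `call` method, and the request names are hypothetical, and nothing here reflects how the Erlang modules would actually be reshaped.

```python
import queue
import threading


class Kernel:
    """Hypothetical stand-in for a gen_server-like kernel process."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.actors = 0
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        # The kernel owns its state and answers each request on the
        # reply channel that arrived with it.
        while True:
            request, reply_to = self.inbox.get()
            if request == "stop":
                reply_to.put("stopped")
                return
            if request == "create_actor":
                self.actors += 1
            reply_to.put(("ok", self.actors))

    def call(self, request, timeout=5.0):
        """Send a request and block on a private one-shot reply channel."""
        reply_to = queue.Queue(maxsize=1)
        self.inbox.put((request, reply_to))
        # Raises queue.Empty if the kernel does not answer in time,
        # so a stuck kernel surfaces as an error rather than a hang.
        return reply_to.get(timeout=timeout)
```

An HTTP handler thread would call `kernel.call("create_actor")` and wait on its own channel, so no scheduler pump is needed on the handler side. The cost noted above still applies: every kernel-aware caller has to adopt the new call shape.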

Belongs on `loops/erlang` or a follow-on `loops/fed-prims`
tick. Step 12's two-instance smoke test gates on this;
without it, the only request shapes that survive over real
HTTP are the static / capabilities / static-stub paths.

In-flight `smoke_federate.sh` test was withdrawn during the
initial Blockers #4 surfacing (it boots both instances
successfully but every kernel-touching request hangs); the
plan's Step 12 acceptance criterion stays open pending
Blockers #4 resolution. m2's other 11 steps are fully
@@ -1129,6 +1170,37 @@ proceed.

Newest first.

- **2026-06-07**: Tried `loops/fed-prims` `bf8d0bf2`'s Pattern B
  patch sketch on `lib/erlang/runtime.sx`'s `er-bif-http-listen`:
  wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
  and read the spawned process's `:exit-result`. **It did not
  work**: listener binds, but even the non-kernel welcome route
  now returns HTTP 000 (the spawned handler's response never
  reaches the wire). The simple `sx-handler` (direct
  `er-apply-fun handler`) is preserved on m2 because it at least
  serves welcome / capabilities / 404 / 401 correctly when no
  kernel routes are touched. Reverted; runtime.sx stays at the
  Blockers #1 marshaller-bridge fix.

  Working hypothesis for why Pattern B fails on m2's
  reproducer: the `http_server:start` spawn is itself parked
  inside the native `Unix.accept` loop on the boot thread; the
  global `er-sched-*` state still has that process in its
  queue. When the connection thread (under the per-instance
  native mutex) calls `er-sched-run-all!`, it re-enters the
  SAME global scheduler: the boot thread's `er-sched-step!`
  of the http:listen process is blocked forever inside the
  native primitive, so the connection-thread pump either
  races against that parked frame or otherwise fails to drive
  the new handler process to completion before the connection
  thread returns from `sx-handler`. The fed-prims diagnosis
  was correct that the bug is Erlang-substrate scope and that
  Pattern A (the mutex) doesn't apply, but the Pattern B
  sketch assumed a fresh / private scheduler context that
  doesn't exist in the current substrate. Blockers #4
  updated to capture this + sketch the three substrate fixes
  that would actually work; loop pacing dialled back down.

- **2026-06-07**: Step 12 prep discovered Blockers #4
  (http-listen handler holds the SX runtime mutex; any
  `gen_server:call` from inside an HTTP route deadlocks