fed-prims: diagnose fed-sx-m2 Blockers #4 — not a mutex bug, hand back to m2

Investigated the http-listen "handler-mutex deadlock" per
plans/agent-briefings/fed-prims-mutex-fix.md. Reproduced deterministically
(single kernel-route request returns empty reply while a non-kernel route
returns 200; also reproduced with a 3-line minimal echo gen_server).

Root cause is in the Erlang substrate, not the OCaml mutex: native
http-listen runs each handler on a fresh Thread.create outside any Erlang
scheduler step, so gen_server:call -> receive (which raises er-suspend-marker
expecting an enclosing er-sched-step-alive! guard + er-sched-run-all! pump)
can never complete.

Pattern A is inapplicable: the failure reproduces on a single request with
zero contention, so it is not a mutex-contention deadlock; the mutex is in
fact required and must stay. Sx_runtime.sx_call is fully synchronous and no
OCaml symbol reaches the SX-level scheduler, so there is no OCaml-only fix.
The correct fix is Pattern B done entirely in er-bif-http-listen
(lib/erlang/runtime.sx) — spawn the handler as an er-process and
er-sched-run-all! to completion — which is m2 / loops/erlang scope.

Doc-only: full diagnosis + concrete patch sketch added to the Blockers and
Progress log of plans/fed-sx-host-primitives.md. No bin/sx_server.ml change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:43:54 +00:00
parent 46e0653911
commit bf8d0bf245


@@ -264,6 +264,25 @@ should leave `httpc`/`sqlite` BIFs blocked with that note.
_Newest first._
- 2026-06-07 — Investigated fed-sx-m2 Blockers #4 ("handler-mutex
deadlock") per `plans/agent-briefings/fed-prims-mutex-fix.md`.
**Outcome: not a mutex bug; no OCaml change — handed back to m2.**
Reproduced deterministically (a single kernel-route request fails with
an empty reply while `/` returns 200; a 3-line minimal echo
gen_server also reproduces it). Root cause: native `http-listen` runs the
handler on a fresh `Thread.create` outside the Erlang scheduler, so
`gen_server:call` → `receive` (which `raise`s `er-suspend-marker`
expecting an enclosing `er-sched-step-alive!` guard + `er-sched-run-all!`
pump) can never complete. Pattern A is inapplicable (single-request
failure ⇒ no contention; the mutex is required and must stay) and
`Sx_runtime.sx_call` is fully synchronous; no OCaml symbol can reach
the SX-level scheduler. Correct fix is Pattern B done purely in
`er-bif-http-listen` (`lib/erlang/runtime.sx`): spawn the handler as an
er-process and `er-sched-run-all!` to completion, returning the
process's `:exit-result`. That file is m2 / `loops/erlang` scope, so
this loop made no code change. Full diagnosis + a concrete patch
sketch recorded under Blockers below. `bin/sx_server.ml` unchanged;
builds untouched.
- 2026-05-26 — Phase J: `http-request` primitive in `bin/sx_server.ml`
(NATIVE ONLY — `Unix.gethostbyname` + `Unix.connect`; HTTP/1.1 with
inline `http://` URL parser; sends Connection: close + Host +
@@ -339,4 +358,73 @@ _Newest first._
## Blockers
- _(none yet)_
- 2026-06-07 — **fed-sx-m2 Blockers #4 (handler-mutex deadlock) is NOT a
mutex bug — root cause is in the Erlang substrate, so the fix is m2
scope, not OCaml.** Investigated per
`plans/agent-briefings/fed-prims-mutex-fix.md`. Reproduced
deterministically (m2 worktree
binary + `next/kernel/*.erl`, port 51920): a **single** request — no
concurrency, no prior request — to `/actors/alice/outbox` returns an
empty reply (curl exit 52) while the non-kernel control route `/`
returns 200 `fed-sx kernel m1`. Also reproduced with a 3-line minimal
echo gen_server + a handler that does `gen_server:call(echo, ping)`
(no kernel needed; boots in ~20s vs ~7min for the full kernel here).
Diagnosis: native `http-listen` (`bin/sx_server.ml:743-840`) runs each
connection's handler on a fresh `Thread.create` **outside any Erlang
scheduler step**. The handler closure (`er-bif-http-listen`'s
`sx-handler`, `lib/erlang/runtime.sx`) calls `er-apply-fun handler`
directly, so when the route reaches `gen_server:call` →
`receive` (`lib/erlang/transpile.sx:1132`), the `receive` captures a
`call/cc` and `raise`s `er-suspend-marker` expecting an enclosing
`er-sched-step-alive!` guard **and** a scheduler pump
(`er-sched-run-all!`). On the native handler thread neither is on the
stack: with no guard the suspend either propagates out (→ empty reply,
minimal case) or is caught by an Erlang `try`/guard in the route, and
the request stalls (→ the "hang" the m2 loop observed). The kernel
gen_server can never be stepped because the only scheduler driver
(the boot thread that ran `erlang-eval-ast`) is parked forever in the
native `Unix.accept` loop.
Why Pattern A (release/rescope the runtime mutex) does NOT apply: the
failure reproduces on a **single request with zero contention**, so it
is not a mutex-contention deadlock. Releasing the mutex cannot help and
would be actively harmful — the mutex is *required* to serialise the
shared single-threaded SX runtime / scheduler across handler threads.
`Sx_runtime.sx_call` (`lib/sx_runtime.ml:102`) is fully synchronous
(it just dispatches into the CEK evaluator), which is exactly the
briefing's stated condition for falling back from Pattern A to
Pattern B. There is also no OCaml-only fix: `grep` confirms nothing in
`hosts/ocaml/{lib,bin}` references `er-sched*`/the Erlang scheduler —
`er-sched-run-all!` is a pure-SX symbol in `lib/erlang/runtime.sx`, so
OCaml cannot pump it. Running the handler synchronously on the accept
thread (no `Thread.create`) does not help either: the `er-suspend-marker`
`raise` would unwind the native `handle` frame that writes the HTTP
response, losing the response across the suspension.
Recommended fix (Pattern B, **m2 / `loops/erlang` scope — entirely in
`er-bif-http-listen`, no OCaml change**): have `sx-handler` run the
handler as a scheduled er-process and pump the scheduler to completion,
e.g.
```
(sx-handler
  (fn (req-dict)
    (let ((req-pl (er-request-dict-to-proplist req-dict)))
      (let ((pid (er-spawn-fun
                   (fn () (er-apply-fun handler (list req-pl))))))
        (er-sched-run-all!)                      ; drains: handler → kernel reply → handler
        (er-proplist-to-dict
          (er-proc-field pid :exit-result))))))  ; handler's return value
```
This keeps every suspend/resume inside the SX scheduler; the native
side only ever sees the final response dict. The existing native
per-connection `Thread.create` + `Mutex` stay as-is and remain correct
(they serialise the single pump across concurrent connections — the
mutex must NOT be removed). Verified only by reasoning through the
full step trace (handler suspends on `receive` → kernel `handle_call`
replies → handler resumes → dies with `:exit-result`). The m2 loop
should implement the change and run `next/tests/http_server_tcp.sh`
plus a kernel-route smoke test. No OCaml or `bin/sx_server.ml` change
was made or is needed.
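
The diagnosis and the Pattern B fix can be illustrated with a small toy model. This is a sketch in Python, not SX, and nothing in it is the real runtime: generators stand in for the `call/cc` + `er-suspend-marker` mechanism, and `Scheduler`, `receive`, `echo_server`, `handler`, `run_unscheduled` and `run_scheduled` are hypothetical names chosen to mirror the roles described above. It only shows why a handler that blocks in `receive` cannot finish unless something pumps the run queue.

```python
from collections import deque


class Scheduler:
    """Hypothetical cooperative scheduler; loosely plays the role of er-sched-*."""

    def __init__(self):
        self.run_queue = deque()
        self.mailboxes = {}
        self.procs = {}          # pid -> suspended generator
        self.exit_results = {}   # pid -> return value of a finished process
        self.next_pid = 0

    def new_mailbox(self):
        pid = self.next_pid
        self.next_pid += 1
        self.mailboxes[pid] = deque()
        return pid

    def spawn(self, fn, *args):
        pid = self.new_mailbox()
        self.procs[pid] = fn(self, pid, *args)  # creates the generator, does not run it
        self.run_queue.append(pid)
        return pid

    def send(self, pid, msg):
        self.mailboxes[pid].append(msg)
        # Wake the receiver only if it is a scheduled process.
        if pid in self.procs and pid not in self.run_queue:
            self.run_queue.append(pid)

    def run_all(self):
        """Pump until no process is runnable (the role of er-sched-run-all!)."""
        while self.run_queue:
            pid = self.run_queue.popleft()
            gen = self.procs.get(pid)
            if gen is None:
                continue
            try:
                next(gen)  # run until the next suspend inside receive
            except StopIteration as done:
                self.exit_results[pid] = done.value
                del self.procs[pid]


def receive(sched, pid):
    # Suspend (yield) until a message is available, then return it.
    while not sched.mailboxes[pid]:
        yield
    return sched.mailboxes[pid].popleft()


def echo_server(sched, pid):
    # Stand-in for the minimal echo gen_server.
    while True:
        sender, payload = yield from receive(sched, pid)
        sched.send(sender, ("reply", payload))


def handler(sched, pid, server_pid, request):
    # Stand-in for a route that does gen_server:call(echo, ping).
    sched.send(server_pid, (pid, request))
    _tag, payload = yield from receive(sched, pid)
    return {"status": 200, "body": payload}


def run_unscheduled(sched, server_pid, request):
    """Failure mode: the handler is stepped directly and nobody pumps the scheduler."""
    pid = sched.new_mailbox()
    gen = handler(sched, pid, server_pid, request)
    try:
        next(gen)
    except StopIteration as done:
        return done.value
    return None  # suspended in receive with nothing to resume it: empty reply


def run_scheduled(sched, server_pid, request):
    """Pattern B: spawn the handler as a process, pump to completion, read its result."""
    pid = sched.spawn(handler, server_pid, request)
    sched.run_all()
    return sched.exit_results.get(pid)
```

In this model a single request with no contention already fails on the unscheduled path, which matches the observation that the problem is not a lock. The scheduled path finishes because `run_all` alternates between the handler and the server until the handler returns, which is the step trace described above.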