From bf8d0bf245882015379628c3879eb8c5bffffc51 Mon Sep 17 00:00:00 2001
From: giles
Date: Sun, 7 Jun 2026 14:43:54 +0000
Subject: [PATCH] =?UTF-8?q?fed-prims:=20diagnose=20fed-sx-m2=20Blockers=20?=
 =?UTF-8?q?#4=20=E2=80=94=20not=20a=20mutex=20bug,=20hand=20back=20to=20m2?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Investigated the http-listen "handler-mutex deadlock" per
plans/agent-briefings/fed-prims-mutex-fix.md. Reproduced
deterministically (single kernel-route request returns empty reply while
a non-kernel route returns 200; also reproduced with a 3-line minimal
echo gen_server).

Root cause is in the Erlang substrate, not the OCaml mutex: native
http-listen runs each handler on a fresh Thread.create outside any
Erlang scheduler step, so gen_server:call -> receive (which raises
er-suspend-marker expecting an enclosing er-sched-step-alive! guard +
er-sched-run-all! pump) can never complete.

Pattern A is inapplicable: the failure reproduces on a single request
with zero contention, so it is not a mutex-contention deadlock; the
mutex is in fact required and must stay. Sx_runtime.sx_call is fully
synchronous and no OCaml symbol reaches the SX-level scheduler, so there
is no OCaml-only fix. The correct fix is Pattern B done entirely in
er-bif-http-listen (lib/erlang/runtime.sx) — spawn the handler as an
er-process and er-sched-run-all! to completion — which is m2 /
loops/erlang scope.

Doc-only: full diagnosis + concrete patch sketch added to the Blockers
and Progress log of plans/fed-sx-host-primitives.md. No bin/sx_server.ml
change.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 plans/fed-sx-host-primitives.md | 90 ++++++++++++++++++++++++++++++++-
 1 file changed, 89 insertions(+), 1 deletion(-)

diff --git a/plans/fed-sx-host-primitives.md b/plans/fed-sx-host-primitives.md
index 211f8ace..0400e1c3 100644
--- a/plans/fed-sx-host-primitives.md
+++ b/plans/fed-sx-host-primitives.md
@@ -264,6 +264,25 @@ should leave `httpc`/`sqlite` BIFs blocked with that note.
 
 _Newest first._
 
+- 2026-06-07 — Investigated fed-sx-m2 Blockers #4 ("handler-mutex
+  deadlock") per `plans/agent-briefings/fed-prims-mutex-fix.md`.
+  **Outcome: not a mutex bug; no OCaml change — handed back to m2.**
+  Reproduced deterministically (single kernel-route request fails with
+  empty reply while `/` returns 200; also a 3-line minimal echo
+  gen_server reproduces it). Root cause: native `http-listen` runs the
+  handler on a fresh `Thread.create` outside the Erlang scheduler, so
+  `gen_server:call` → `receive` (which `raise`s `er-suspend-marker`
+  expecting an enclosing `er-sched-step-alive!` guard + `er-sched-run-all!`
+  pump) can never complete. Pattern A is inapplicable (single-request
+  failure ⇒ no contention; the mutex is required and must stay) and
+  `Sx_runtime.sx_call` is fully synchronous; no OCaml symbol can reach
+  the SX-level scheduler. Correct fix is Pattern B done purely in
+  `er-bif-http-listen` (`lib/erlang/runtime.sx`): spawn the handler as an
+  er-process and `er-sched-run-all!` to completion, returning the
+  process's `:exit-result`. That file is m2 / `loops/erlang` scope, so
+  this loop made no code change. Full diagnosis + a concrete patch
+  sketch recorded under Blockers below. `bin/sx_server.ml` unchanged;
+  builds untouched.
 - 2026-05-26 — Phase J: `http-request` primitive in `bin/sx_server.ml`
   (NATIVE ONLY — `Unix.gethostbyname` + `Unix.connect`; HTTP/1.1 with
   inline `http://` URL parser; sends Connection: close + Host +
@@ -339,4 +358,73 @@ _Newest first._
 
 ## Blockers
 
-- _(none yet)_
+- 2026-06-07 — **fed-sx-m2 Blockers #4 (handler-mutex deadlock) is NOT a
+  mutex bug — root cause is in the Erlang substrate, so the fix is m2
+  scope, not OCaml.** Investigated per `plans/agent-briefings/
+  fed-prims-mutex-fix.md`. Reproduced deterministically (m2 worktree
+  binary + `next/kernel/*.erl`, port 51920): a **single** request — no
+  concurrency, no prior request — to `/actors/alice/outbox` returns an
+  empty reply (curl exit 52) while the non-kernel control route `/`
+  returns 200 `fed-sx kernel m1`. Also reproduced with a 3-line minimal
+  echo gen_server + a handler that does `gen_server:call(echo, ping)`
+  (no kernel needed; boots in ~20s vs ~7min for the full kernel here).
+
+  Diagnosis: native `http-listen` (`bin/sx_server.ml:743-840`) runs each
+  connection's handler on a fresh `Thread.create` **outside any Erlang
+  scheduler step**. The handler closure (`er-bif-http-listen`'s
+  `sx-handler`, `lib/erlang/runtime.sx`) calls `er-apply-fun handler`
+  directly, so when the route reaches `gen_server:call` →
+  `receive` (`lib/erlang/transpile.sx:1132`), the `receive` captures a
+  `call/cc` and `raise`s `er-suspend-marker` expecting an enclosing
+  `er-sched-step-alive!` guard **and** a scheduler pump
+  (`er-sched-run-all!`). On the native handler thread neither is on the
+  stack: with no guard the suspend either propagates out (→ empty reply,
+  minimal case) or is caught by an Erlang `try`/guard in the route and
+  the request stalls (→ "hang" the m2 loop observed). The kernel
+  gen_server can never be stepped because the only scheduler driver
+  (the boot thread that ran `erlang-eval-ast`) is parked forever in the
+  native `Unix.accept` loop.
+
+  Why Pattern A (release/rescope the runtime mutex) does NOT apply: the
+  failure reproduces on a **single request with zero contention**, so it
+  is not a mutex-contention deadlock. Releasing the mutex cannot help and
+  would be actively harmful — the mutex is *required* to serialise the
+  shared single-threaded SX runtime / scheduler across handler threads.
+  `Sx_runtime.sx_call` (`lib/sx_runtime.ml:102`) is fully synchronous
+  (it just dispatches into the CEK evaluator), which is exactly the
+  briefing's stated condition for falling back from Pattern A to
+  Pattern B. There is also no OCaml-only fix: `grep` confirms nothing in
+  `hosts/ocaml/{lib,bin}` references `er-sched*`/the Erlang scheduler —
+  `er-sched-run-all!` is a pure-SX symbol in `lib/erlang/runtime.sx`, so
+  OCaml cannot pump it. Running the handler synchronously on the accept
+  thread (no `Thread.create`) does not help either: the `er-suspend-marker`
+  `raise` would unwind the native `handle` frame that writes the HTTP
+  response, losing the response across the suspension.
+
+  Recommended fix (Pattern B, **m2 / `loops/erlang` scope — entirely in
+  `er-bif-http-listen`, no OCaml change**): have `sx-handler` run the
+  handler as a scheduled er-process and pump the scheduler to completion,
+  e.g.
+
+  ```
+  (sx-handler
+    (fn (req-dict)
+      (let ((req-pl (er-request-dict-to-proplist req-dict)))
+        (let ((pid (er-spawn-fun
+                     (fn () (er-apply-fun handler (list req-pl))))))
+          (er-sched-run-all!)                    ; drains: handler →
+                                                 ; kernel reply → handler
+          (er-proplist-to-dict
+            (er-proc-field pid :exit-result))))))  ; handler's return value
+  ```
+
+  This keeps every suspend/resume inside the SX scheduler; the native
+  side only ever sees the final response dict. The existing native
+  per-connection `Thread.create` + `Mutex` stay as-is and remain correct
+  (they serialise the single pump across concurrent connections — the
+  mutex must NOT be removed). Verified by reasoning through the full
+  step trace (handler suspends on `receive` → kernel `handle_call`
+  replies → handler resumes → dies with `:exit-result`); the m2 loop
+  should implement + run `next/tests/http_server_tcp.sh` plus a
+  kernel-route smoke. No OCaml or `bin/sx_server.ml` change was made or
+  is needed.
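
The step trace described above (handler suspends on `receive`, the server replies, the handler resumes and exits with a result) can be modelled with a toy cooperative scheduler. This is only an illustrative Python sketch, not the SX runtime: generators stand in for the `call/cc` + `er-suspend-marker` suspension, and every name here (`Scheduler`, `run_all`, `echo_server`, `serve_directly`, `serve_via_scheduler`) is hypothetical. It assumes a single-threaded scheduler in which a suspended process can only make progress when something pumps the run queue.

```python
from collections import deque

RECEIVE = object()  # stand-in for the suspend marker raised by `receive`


class Scheduler:
    """Toy single-threaded scheduler; processes are generators."""

    def __init__(self):
        self.procs = {}        # pid -> generator
        self.mailboxes = {}    # pid -> deque of pending messages
        self.exit_result = {}  # pid -> return value of a finished process
        self.started = set()   # pids that have run at least once
        self.blocked = set()   # pids suspended in receive with no message
        self.ready = deque()   # pids that can be stepped

    def spawn(self, body, *args):
        pid = len(self.procs)
        self.procs[pid] = body(self, pid, *args)
        self.mailboxes[pid] = deque()
        self.ready.append(pid)
        return pid

    def send(self, pid, msg):
        self.mailboxes[pid].append(msg)
        if pid in self.blocked:  # wake a process waiting in receive
            self.blocked.remove(pid)
            self.ready.append(pid)

    def run_all(self):
        """Pump until no process can make progress."""
        while self.ready:
            pid = self.ready.popleft()
            gen, box = self.procs[pid], self.mailboxes[pid]
            if pid in self.started:
                value = box.popleft()  # resume receive with a message
            else:
                self.started.add(pid)
                value = None           # first step of a fresh process
            try:
                gen.send(value)        # runs until the next receive
            except StopIteration as stop:
                self.exit_result[pid] = stop.value
                continue
            (self.ready.append if box else self.blocked.add)(pid)


def echo_server(sched, pid):
    """Minimal server: reply to each caller with its own payload."""
    while True:
        sender, payload = yield RECEIVE
        sched.send(sender, ("reply", payload))


def handler(sched, pid, server_pid, request):
    """Request handler that makes a synchronous call to the server."""
    sched.send(server_pid, (pid, request))
    _tag, answer = yield RECEIVE       # suspends until the reply arrives
    return {"status": 200, "body": answer}


def serve_directly(sched, server_pid, request):
    """Failure mode: run the handler outside the scheduler."""
    gen = handler(sched, None, server_pid, request)
    try:
        next(gen)                      # the suspension escapes to the caller
    except StopIteration as stop:
        return stop.value
    return None                        # nothing will ever resume the handler


def serve_via_scheduler(sched, server_pid, request):
    """Pattern B analogue: spawn the handler, pump, return its exit result."""
    pid = sched.spawn(handler, server_pid, request)
    sched.run_all()
    return sched.exit_result[pid]
```

In this model `serve_directly` never produces a response even with a single request and no contention, while `serve_via_scheduler` does, which mirrors the argument that the problem is the missing scheduler pump rather than a lock.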