# fed-prims handler-mutex deadlock fix (one-shot)

Role: fix the SX runtime mutex deadlock in `bin/sx_server.ml`'s `http-listen` handler that blocks every `gen_server:call` from inside an Erlang route. Documented as **Blockers #4** in `/root/rose-ash-loops/fed-sx-m1/plans/fed-sx-milestone-2.md`.

```
description: fed-prims handler-mutex deadlock fix
subagent_type: general-purpose
run_in_background: true
isolation: worktree
```

## Worktree + branch

Already provisioned at `/root/rose-ash-loops/fed-prims` on branch `loops/fed-prims` (the fed-prims phases A–J are landed; this is a follow-up fix). Start there. Never push to `main` or `architecture`.

If `.mcp.json` shows a non-absolute `mcp_tree` path or `.claude/scheduled_tasks.lock` is dirty, just leave them alone — they're harness state. Stash if you must, but don't commit them.

## The problem (verified by fed-sx-m2 loop, 2026-06-07)

Native `http-listen` in `hosts/ocaml/bin/sx_server.ml:735+` serialises handler calls with `Mutex.lock mtx` / `Mutex.unlock mtx` so the SX runtime isn't re-entered concurrently:

```ocaml
Mutex.lock mtx;
let resp = (try Sx_runtime.sx_call handler [Dict req]
            with e -> Mutex.unlock mtx; raise e) in
Mutex.unlock mtx;
```

When the Erlang handler does `gen_server:call(nx_kernel, ...)` from any kernel-aware route (`actor_doc_response_for/3`, `actor_outbox_response_for/3`, `handle_inbox_post`, `nx_kernel:state_for/1`, etc.), the gen_server's reply needs the SX runtime scheduler to run — but the calling handler is sitting on the runtime mutex. Deadlock; curl hangs until `--max-time` fires.

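
When reproducing, it helps to tell the deadlock apart from a server that simply hasn't finished booting. curl reports an expired `--max-time` as exit status 28 and a refused connection as exit status 7. The helper below is a minimal sketch of that distinction; `classify_curl_exit` and `probe` are hypothetical names, not part of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not in the repo): turn a curl exit status into a verdict.

classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;             # request completed (any HTTP status, since -f is not used)
    7)  echo "not-listening" ;;  # could not connect: server still booting or down
    28) echo "hang" ;;           # --max-time expired: consistent with the deadlock
    *)  echo "other($1)" ;;      # anything else: inspect manually
  esac
}

# Probe a URL with a 5-second ceiling and print the verdict.
probe() {
  local rc=0
  curl -s -o /dev/null --max-time 5 "$1" || rc=$?
  classify_curl_exit "$rc"
}
```

For example, `probe "http://127.0.0.1:$PORT/actors/alice/outbox"` should print `hang` while the deadlock is present. A `not-listening` verdict means the boot has not completed yet, so the probe says nothing about the mutex.
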
**Verification recipe (reproduces deterministically):**

```bash
PORT=51920
cat > /tmp/boot.sx <<'SX'
(epoch 1)
(load "lib/erlang/tokenizer.sx")
(load "lib/erlang/parser.sx")
(load "lib/erlang/parser-core.sx")
(load "lib/erlang/parser-expr.sx")
(load "lib/erlang/parser-module.sx")
(load "lib/erlang/transpile.sx")
(load "lib/erlang/runtime.sx")
(load "lib/erlang/vm/dispatcher.sx")
(epoch 2)
(eval "(er-load-gen-server!)")
(eval "(get (erlang-load-module (file-read \"next/kernel/envelope.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/log.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/pipeline.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/term_codec.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/outbox.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/nx_kernel.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/http_server.erl\")) :name)")
(epoch 20)
(eval "(erlang-eval-ast \"AK = <<1,1,1,1>>, AKS = [{key_id,k1},{algorithm,ed25519},{value,AK}], AAS = [{public_keys,[[{id,k1},{created,0},{value,AK}]]}], nx_kernel:start_link(alice, AKS, AAS), http_server:start(51920, [{kernel, nx_kernel}])\")")
SX
mkfifo /tmp/fifo
( cat /tmp/boot.sx; sleep 120 ) > /tmp/fifo &
hosts/ocaml/_build/default/bin/sx_server.exe < /tmp/fifo > /tmp/log 2>&1 &
sleep 60  # boot takes ~30-45s cold
curl -sv --max-time 5 "http://127.0.0.1:$PORT/" >/dev/null    # OK: 200
curl -sv --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"  # HANGS
```

The `next/kernel/*.erl` files referenced live in the fed-sx-m2 worktree at `/root/rose-ash-loops/fed-sx-m1/next/kernel/`. You can read them there for context but do NOT edit them — Erlang-side work is m2's loop. This loop only touches `hosts/ocaml/bin/sx_server.ml`.

## Two fix patterns

Pick **one**. Both are independent enough to evaluate alone; commit the one that lands first.

### Pattern A — release the mutex around the SX call

The mutex exists to serialise SX runtime mutation. But once the runtime hands the call off to the gen_server (which has its own scheduler frame), the calling thread is just waiting on a reply message; it doesn't need the mutex. The fix is to scope the mutex *only* over the runtime entry, not the entire handler invocation.

This may require restructuring `Sx_runtime.sx_call handler [Dict req]` so the call yields to the scheduler instead of blocking — verify by reading `hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` lives). If `sx_call` is fully synchronous and re-entry is genuinely unsafe, fall back to Pattern B.

### Pattern B — spawn handler in a fresh er-process

Erlang processes already have their own scheduler frame. Have the handler closure trampoline through `er-spawn-fun` (or equivalent — check `lib/erlang/runtime.sx`'s existing process primitives) so the gen_server reply runs in a different frame from the http-listen accept-loop thread.

This may be cleaner if it can be done entirely at the SX/Erlang layer (in `er-bif-http-listen` in `lib/erlang/runtime.sx`), in which case **this is m2 scope** and you should hand it back rather than edit OCaml. Read the BIF body first — if a pure-Erlang spawn suffices, document that and stop without committing OCaml changes.

The BIF body is at `lib/erlang/runtime.sx:1581-1632` (in the fed-sx-m2 worktree); the m2 loop just rewrote its inbound/outbound marshallers (commit `8d33d02f`). The handler is invoked inside `(http-listen port sx-handler)` — figure out whether you can `er-spawn-fun` around the body of `sx-handler` such that the spawned process's gen_server:call doesn't fight the parent's runtime mutex.

## Acceptance — the unblock target

`next/tests/http_server_tcp.sh` 5/5 stays green (the existing simple GET / + capabilities + 404 + 401 surface).

PLUS: A kernel-touching request over real HTTP must return without hanging.
The minimal smoke for this is:

```bash
# In the verification recipe above, after boot:
curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"
# Expected: "outbox: alice\ntip: 0\n" or similar (200 with body),
# NOT a timeout.
```

If you want a one-shot script, save the recipe above as a regression test inside the fed-prims worktree: `hosts/ocaml/test/handler_kernel_unblock.sh` (new file). Make it pass deterministically with a generous timeout (≥120s for the cold boot).

## Ground rules (hard)

- **Scope:** `hosts/ocaml/bin/sx_server.ml` and adjacent `hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` is defined). Do NOT touch `next/**` or `plans/fed-sx-milestone-2.md` (m2's loop owns those). Do NOT touch `lib/erlang/**` (Erlang substrate / loops/erlang owns that).
- **No-regression gate:**
  - `dune build bin/sx_server.exe` (native) green
  - `bash hosts/ocaml/browser/test_boot.sh` (WASM kernel) green
  - `bash lib/erlang/conformance.sh` 761/761
  - `bash next/tests/http_server_tcp.sh` 5/5
- **WASM safety:** Pattern A may need Thread / Mutex juggling that isn't WASM-safe. The `http-listen` primitive is already native-only, so changes to its handler code don't need to build under WASM — but anything in `lib/sx_runtime.ml` does. If your change has to add `Thread`/`Mutex` to `lib/`, you've picked the wrong fix; back out.
- **Builds are slow.** `dune build` ≥600s timeout. `conformance.sh` ≥400s. `test_boot.sh` ≥60s.
- **Commit granularity:** one fix, one commit. Title like: `fed-prims: release runtime mutex around gen_server:call (Blockers #4)`.
- **No `.sx` edits.** All work is `.ml` (or `.sh` for the regression test). sx-tree MCP is not needed.
- **Worktree:** commit, push `origin/loops/fed-prims`. Never `main`, never `architecture`. The user merges to architecture separately.

## What to write back

Append one dated line to `plans/fed-sx-host-primitives.md`'s Progress log (newest first):

```
- 2026-06-07 — Resolved fed-sx-m2 Blockers #4 (handler mutex deadlock). .
  Verified via hosts/ocaml/test/handler_kernel_unblock.sh + http_server_tcp.sh 5/5 + conformance 761/761 + WASM boot.
```

Once landed, the fed-sx-m2 loop will pick up the fix on its next tick and unblock Step 12 — you don't need to coordinate.

## If it's not Pattern A or Pattern B

If you discover the deadlock is something else entirely (e.g., a gen_server config issue, a different lock in `Sx_runtime`, a bug in `er-load-gen-server!`'s scheduler frame), document what you found in a fresh Blockers entry on `plans/fed-sx-host-primitives.md` and stop. The m2 loop will re-check on its next tick. **Do not invent a Pattern C without clear evidence** — the deadlock is reproducible and the two patterns above cover the obvious fix shapes.

Go. Reproduce the deadlock first. Pick a pattern. Land it. Push.