diff --git a/plans/agent-briefings/fed-prims-mutex-fix.md b/plans/agent-briefings/fed-prims-mutex-fix.md
new file mode 100644
index 00000000..eb07a756
--- /dev/null
+++ b/plans/agent-briefings/fed-prims-mutex-fix.md
@@ -0,0 +1,197 @@
+# fed-prims handler-mutex deadlock fix (one-shot)
+
+Role: fix the SX runtime mutex deadlock in `bin/sx_server.ml`'s
+`http-listen` handler that blocks every `gen_server:call` from inside
+an Erlang route. Documented as **Blockers #4** in
+`/root/rose-ash-loops/fed-sx-m1/plans/fed-sx-milestone-2.md`.
+
+```
+description: fed-prims handler-mutex deadlock fix
+subagent_type: general-purpose
+run_in_background: true
+isolation: worktree
+```
+
+## Worktree + branch
+
+Already provisioned at `/root/rose-ash-loops/fed-prims` on branch
+`loops/fed-prims` (the fed-prims phases A–J are landed; this is a
+follow-up fix). Start there. Never push to `main` or `architecture`.
+
+If `.mcp.json` shows a non-absolute `mcp_tree` path or
+`.claude/scheduled_tasks.lock` is dirty, just leave them alone — they're
+harness state. Stash if you must, but don't commit them.
+
+## The problem (verified by fed-sx-m2 loop, 2026-06-07)
+
+Native `http-listen` in `hosts/ocaml/bin/sx_server.ml:735+`
+serialises handler calls with `Mutex.lock mtx` / `Mutex.unlock mtx`
+so the SX runtime isn't re-entered concurrently:
+
+```ocaml
+Mutex.lock mtx;
+let resp =
+  (try Sx_runtime.sx_call handler [Dict req]
+   with e -> Mutex.unlock mtx; raise e) in
+Mutex.unlock mtx;
+```
+
+When the Erlang handler does `gen_server:call(nx_kernel, ...)` from
+any kernel-aware route (`actor_doc_response_for/3`,
+`actor_outbox_response_for/3`, `handle_inbox_post`,
+`nx_kernel:state_for/1`, etc.), the gen_server's reply needs the SX
+runtime scheduler to run — but the calling handler is sitting on the
+runtime mutex. Deadlock; curl hangs until `--max-time` fires.
+
+**Verification recipe (reproduces deterministically):**
+
+```bash
+PORT=51920
+cat > /tmp/boot.sx <<'SX'
+(epoch 1)
+(load "lib/erlang/tokenizer.sx") (load "lib/erlang/parser.sx")
+(load "lib/erlang/parser-core.sx") (load "lib/erlang/parser-expr.sx")
+(load "lib/erlang/parser-module.sx") (load "lib/erlang/transpile.sx")
+(load "lib/erlang/runtime.sx") (load "lib/erlang/vm/dispatcher.sx")
+(epoch 2)
+(eval "(er-load-gen-server!)")
+(eval "(get (erlang-load-module (file-read \"next/kernel/envelope.erl\")) :name)")
+(eval "(get (erlang-load-module (file-read \"next/kernel/log.erl\")) :name)")
+(eval "(get (erlang-load-module (file-read \"next/kernel/pipeline.erl\")) :name)")
+(eval "(get (erlang-load-module (file-read \"next/kernel/term_codec.erl\")) :name)")
+(eval "(get (erlang-load-module (file-read \"next/kernel/outbox.erl\")) :name)")
+(eval "(get (erlang-load-module (file-read \"next/kernel/nx_kernel.erl\")) :name)")
+(eval "(get (erlang-load-module (file-read \"next/kernel/http_server.erl\")) :name)")
+(epoch 20)
+(eval "(erlang-eval-ast \"AK = <<1,1,1,1>>, AKS = [{key_id,k1},{algorithm,ed25519},{value,AK}], AAS = [{public_keys,[[{id,k1},{created,0},{value,AK}]]}], nx_kernel:start_link(alice, AKS, AAS), http_server:start(51920, [{kernel, nx_kernel}])\")")
+SX
+mkfifo /tmp/fifo
+( cat /tmp/boot.sx; sleep 120 ) > /tmp/fifo &
+hosts/ocaml/_build/default/bin/sx_server.exe < /tmp/fifo > /tmp/log 2>&1 &
+sleep 60 # boot takes ~30-45s cold
+curl -sv --max-time 5 "http://127.0.0.1:$PORT/" >/dev/null # OK: 200
+curl -sv --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox" # HANGS
+```
+
+The `next/kernel/*.erl` files referenced live in the fed-sx-m2
+worktree at `/root/rose-ash-loops/fed-sx-m1/next/kernel/`. You can
+read them there for context but do NOT edit them — Erlang-side
+work is m2's loop. This loop only touches `hosts/ocaml/bin/sx_server.ml`.
+
+## Two fix patterns
+
+Pick **one**. Both are independent enough to evaluate alone; commit
+the one that lands first.
+
+### Pattern A — release the mutex around the SX call
+
+The mutex exists to serialise SX runtime mutation. But once the
+runtime hands the call off to the gen_server (which has its own
+scheduler frame), the calling thread is just waiting on a reply
+message; it doesn't need the mutex. The fix is to scope the mutex
+*only* over the runtime entry, not the entire handler invocation.
+
+This may require restructuring `Sx_runtime.sx_call handler [Dict req]`
+so the call yields to the scheduler instead of blocking — verify by
+reading `hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` lives).
+If `sx_call` is fully synchronous and re-entry is genuinely unsafe,
+fall back to Pattern B.
+
+### Pattern B — spawn handler in a fresh er-process
+
+Erlang processes already have their own scheduler frame. Have the
+handler closure trampoline through `er-spawn-fun` (or equivalent —
+check `lib/erlang/runtime.sx`'s existing process primitives) so the
+gen_server reply runs in a different frame from the http-listen
+accept-loop thread.
+
+This may be cleaner if it can be done entirely at the SX/Erlang
+layer (in `er-bif-http-listen` in `lib/erlang/runtime.sx`), in which
+case **this is m2 scope** and you should hand it back rather than
+edit OCaml. Read the BIF body first — if a pure-Erlang spawn
+suffices, document that and stop without committing OCaml changes.
+
+The BIF body is at `lib/erlang/runtime.sx:1581-1632` (in the
+fed-sx-m2 worktree); the m2 loop just rewrote its inbound/outbound
+marshallers (commit `8d33d02f`). The handler is invoked inside
+`(http-listen port sx-handler)` — figure out whether you can
+`er-spawn-fun` around the body of `sx-handler` such that the
+spawned process's gen_server:call doesn't fight the parent's
+runtime mutex.
+
+## Acceptance — the unblock target
+
+`next/tests/http_server_tcp.sh` 5/5 stays green (the existing simple
+GET / + capabilities + 404 + 401 surface). PLUS:
+
+A kernel-touching request over real HTTP must return without
+hanging. The minimal smoke for this is:
+
+```bash
+# In the verification recipe above, after boot:
+curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"
+# Expected: "outbox: alice\ntip: 0\n" or similar (200 with body),
+# NOT a timeout.
+```
+
+If you want a one-shot script, save the recipe above as a regression
+test inside the fed-prims worktree:
+`hosts/ocaml/test/handler_kernel_unblock.sh` (new file). Make it
+pass deterministically with a generous timeout (≥120s for the cold
+boot).
+
+## Ground rules (hard)
+
+- **Scope:** `hosts/ocaml/bin/sx_server.ml` and adjacent
+  `hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` is
+  defined). Do NOT touch `next/**` or `plans/fed-sx-milestone-2.md`
+  (m2's loop owns those). Do NOT touch `lib/erlang/**` (Erlang
+  substrate / loops/erlang owns that).
+- **No-regression gate:**
+  - `dune build bin/sx_server.exe` (native) green
+  - `bash hosts/ocaml/browser/test_boot.sh` (WASM kernel) green
+  - `bash lib/erlang/conformance.sh` 761/761
+  - `bash next/tests/http_server_tcp.sh` 5/5
+- **WASM safety:** Pattern A may need Thread / Mutex juggling
+  that isn't WASM-safe. The `http-listen` primitive is already
+  native-only, so changes to its handler code don't need to
+  build under WASM — but anything in `lib/sx_runtime.ml` does.
+  If your change has to add `Thread`/`Mutex` to `lib/`, you've
+  picked the wrong fix; back out.
+- **Builds are slow.** `dune build` ≥600s timeout. `conformance.sh`
+  ≥400s. `test_boot.sh` ≥60s.
+- **Commit granularity:** one fix, one commit. Title like:
+  `fed-prims: release runtime mutex around gen_server:call (Blockers #4)`.
+- **No `.sx` edits.** All work is `.ml` (or `.sh` for the
+  regression test). sx-tree MCP is not needed.
+- **Worktree:** commit, push `origin/loops/fed-prims`. Never
+  `main`, never `architecture`. The user merges to architecture
+  separately.
+
+## What to write back
+
+Append one dated line to `plans/fed-sx-host-primitives.md`'s
+Progress log (newest first):
+
+```
+- 2026-06-07 — Resolved fed-sx-m2 Blockers #4 (handler mutex
+  deadlock). . Verified via
+  hosts/ocaml/test/handler_kernel_unblock.sh + http_server_tcp.sh
+  5/5 + conformance 761/761 + WASM boot.
+```
+
+Once landed, the fed-sx-m2 loop will pick up the fix on its next
+tick and unblock Step 12 — you don't need to coordinate.
+
+## If it's not Pattern A or Pattern B
+
+If you discover the deadlock is something else entirely
+(e.g., a gen_server config issue, a different lock in
+`Sx_runtime`, a bug in `er-load-gen-server!`'s scheduler frame),
+document what you found in a fresh Blockers entry on
+`plans/fed-sx-host-primitives.md` and stop. The m2 loop will
+re-check on its next tick. **Do not invent a Pattern C without
+clear evidence** — the deadlock is reproducible and the two
+patterns above cover the obvious fix shapes.
+
+Go. Reproduce the deadlock first. Pick a pattern. Land it. Push.
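
Note on the regression script proposed under Acceptance (`hosts/ocaml/test/handler_kernel_unblock.sh`): it could be assembled from helpers along these lines. This is a minimal sketch, not the landed test. The function names `wait_until`, `classify_curl`, and `check_route` are hypothetical. The only external behaviour assumed is that curl exits with status 28 when `--max-time` fires and with a non-zero status other than 28 for other failures (for example 22 when `-f` sees an HTTP error).

```bash
# Hypothetical helpers for handler_kernel_unblock.sh (names are illustrative).

# wait_until DEADLINE_SECS CMD...
# Retry CMD about once per second until it succeeds or DEADLINE_SECS
# have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  local deadline start now
  deadline=$1
  shift
  start=$(date +%s)
  until "$@" >/dev/null 2>&1; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# classify_curl EXIT_CODE
# Map a curl exit status to a verdict. Status 28 means --max-time fired,
# which is the symptom of the handler-mutex deadlock.
classify_curl() {
  case "$1" in
    0)  echo ok ;;
    28) echo hang ;;
    *)  echo error ;;
  esac
}

# check_route URL
# Fetch URL with a 5s cap (-f turns HTTP errors into a non-zero exit)
# and print the verdict: ok, hang, or error.
check_route() {
  local rc
  rc=0
  curl -sf --max-time 5 "$1" >/dev/null || rc=$?
  classify_curl "$rc"
}

# Intended use after sx_server.exe has been started as in the recipe:
#   wait_until 120 curl -sf --max-time 5 "http://127.0.0.1:$PORT/" || exit 1
#   [ "$(check_route "http://127.0.0.1:$PORT/actors/alice/outbox")" = ok ]
```

Polling `/` with `wait_until` in place of the fixed `sleep 60` keeps the script deterministic on slow cold boots while still honouring the ≥120s budget. Separating `hang` from `error` makes it clear in the test output whether a failure is the deadlock itself or an unrelated boot problem.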