Pairs with Blockers #4 in plans/fed-sx-milestone-2.md. The http-listen handler holds the SX runtime mutex; any gen_server:call from inside a route deadlocks because the gen_server reply scheduler needs the runtime the caller is sitting on. m2's Step 12 two-instance smoke test gates on this.

Briefing pre-loads the fix-loop agent with:
- Verified reproducer (deterministic curl-hang against http_server:start(P, [{kernel, nx_kernel}]))
- Two fix-pattern candidates (release mutex around sx_call vs spawn handler in fresh er-process)
- Acceptance criteria: http_server_tcp.sh 5/5 + a NEW kernel-aware request passes without hanging
- Scope guardrails: only hosts/ocaml/bin/sx_server.ml + adjacent lib/sx_runtime.ml; m2's next/** and lib/erlang/** are OFF LIMITS

Worktree at /root/rose-ash-loops/fed-prims, branch loops/fed-prims already exists (Phases A-J landed). This is a follow-up fix loop, not a continuation of the original phase plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fed-prims handler-mutex deadlock fix (one-shot)
Role: fix the SX runtime mutex deadlock in bin/sx_server.ml's
http-listen handler that blocks every gen_server:call from inside
an Erlang route. Documented as Blockers #4 in
/root/rose-ash-loops/fed-sx-m1/plans/fed-sx-milestone-2.md.
description: fed-prims handler-mutex deadlock fix
subagent_type: general-purpose
run_in_background: true
isolation: worktree
Worktree + branch
Already provisioned at /root/rose-ash-loops/fed-prims on branch
loops/fed-prims (the fed-prims phases A–J are landed; this is a
follow-up fix). Start there. Never push to main or architecture.
If .mcp.json shows a non-absolute mcp_tree path or .claude/scheduled_tasks.lock is dirty, just leave them alone; they're
harness state. Stash if you must, but don't commit them.
The problem (verified by fed-sx-m2 loop, 2026-06-07)
Native http-listen in hosts/ocaml/bin/sx_server.ml:735+
serialises handler calls with Mutex.lock mtx / Mutex.unlock mtx
so the SX runtime isn't re-entered concurrently:
    Mutex.lock mtx;
    let resp =
      (try Sx_runtime.sx_call handler [Dict req]
       with e -> Mutex.unlock mtx; raise e) in
    Mutex.unlock mtx;
When the Erlang handler does gen_server:call(nx_kernel, ...) from
any kernel-aware route (actor_doc_response_for/3,
actor_outbox_response_for/3, handle_inbox_post,
nx_kernel:state_for/1, etc.), the gen_server's reply needs the SX
runtime scheduler to run — but the calling handler is sitting on the
runtime mutex. Deadlock; curl hangs until --max-time fires.
Verification recipe (reproduces deterministically):
PORT=51920
cat > /tmp/boot.sx <<'SX'
(epoch 1)
(load "lib/erlang/tokenizer.sx") (load "lib/erlang/parser.sx")
(load "lib/erlang/parser-core.sx") (load "lib/erlang/parser-expr.sx")
(load "lib/erlang/parser-module.sx") (load "lib/erlang/transpile.sx")
(load "lib/erlang/runtime.sx") (load "lib/erlang/vm/dispatcher.sx")
(epoch 2)
(eval "(er-load-gen-server!)")
(eval "(get (erlang-load-module (file-read \"next/kernel/envelope.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/log.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/pipeline.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/term_codec.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/outbox.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/nx_kernel.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/http_server.erl\")) :name)")
(epoch 20)
(eval "(erlang-eval-ast \"AK = <<1,1,1,1>>, AKS = [{key_id,k1},{algorithm,ed25519},{value,AK}], AAS = [{public_keys,[[{id,k1},{created,0},{value,AK}]]}], nx_kernel:start_link(alice, AKS, AAS), http_server:start(51920, [{kernel, nx_kernel}])\")")
SX
mkfifo /tmp/fifo
( cat /tmp/boot.sx; sleep 120 ) > /tmp/fifo &
hosts/ocaml/_build/default/bin/sx_server.exe < /tmp/fifo > /tmp/log 2>&1 &
sleep 60 # boot takes ~30-45s cold
curl -sv --max-time 5 "http://127.0.0.1:$PORT/" >/dev/null # OK: 200
curl -sv --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox" # HANGS
The next/kernel/*.erl files referenced live in the fed-sx-m2
worktree at /root/rose-ash-loops/fed-sx-m1/next/kernel/. You can
read them there for context but do NOT edit them; Erlang-side
work is m2's loop. This loop only touches hosts/ocaml/bin/sx_server.ml
and, if needed, the adjacent hosts/ocaml/lib/sx_runtime.ml.
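The fixed sleep 60 in the recipe can be swapped for a readiness poll, so the script proceeds as soon as the server answers. This is a minimal sketch, assuming plain GET / keeps returning 200 even while the deadlock is present (as the recipe shows). wait_for_http and its attempt-count argument are hypothetical names, not part of the repo.

```shell
# Hypothetical helper: poll a URL until curl gets any HTTP reply,
# giving up after a fixed number of attempts (roughly one per second,
# plus up to 2s per attempt if the connection stalls).
wait_for_http() {
  wait_url=$1
  wait_max_attempts=${2:-120}
  wait_attempt=0
  while [ "$wait_attempt" -lt "$wait_max_attempts" ]; do
    # -s: quiet, -o /dev/null: discard body, --max-time: bound each probe
    if curl -s -o /dev/null --max-time 2 "$wait_url"; then
      return 0
    fi
    sleep 1
    wait_attempt=$((wait_attempt + 1))
  done
  return 1
}

# Usage in place of `sleep 60`:
#   wait_for_http "http://127.0.0.1:$PORT/" 120 || { echo "server never came up"; exit 1; }
```

A 120-attempt budget leaves headroom over the 30-45s cold boot noted in the recipe.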
Two fix patterns
Pick one. Both are independent enough to evaluate alone; commit the one that lands first.
Pattern A — release the mutex around the SX call
The mutex exists to serialise SX runtime mutation. But once the runtime hands the call off to the gen_server (which has its own scheduler frame), the calling thread is just waiting on a reply message; it doesn't need the mutex. The fix is to scope the mutex only over the runtime entry, not the entire handler invocation.
This may require restructuring Sx_runtime.sx_call handler [Dict req]
so the call yields to the scheduler instead of blocking — verify by
reading hosts/ocaml/lib/sx_runtime.ml (or wherever sx_call lives).
If sx_call is fully synchronous and re-entry is genuinely unsafe,
fall back to Pattern B.
Pattern B — spawn handler in a fresh er-process
Erlang processes already have their own scheduler frame. Have the
handler closure trampoline through er-spawn-fun (or equivalent —
check lib/erlang/runtime.sx's existing process primitives) so the
gen_server reply runs in a different frame from the http-listen
accept-loop thread.
This may be cleaner if it can be done entirely at the SX/Erlang
layer (in er-bif-http-listen in lib/erlang/runtime.sx), in which
case this is m2 scope and you should hand it back rather than
edit OCaml. Read the BIF body first — if a pure-Erlang spawn
suffices, document that and stop without committing OCaml changes.
The BIF body is at lib/erlang/runtime.sx:1581-1632 (in the
fed-sx-m2 worktree); the m2 loop just rewrote its inbound/outbound
marshallers (commit 8d33d02f). The handler is invoked inside
(http-listen port sx-handler). Figure out whether wrapping the body
of sx-handler in er-spawn-fun would keep the spawned process's
gen_server:call from fighting the parent's runtime mutex.
Acceptance — the unblock target
next/tests/http_server_tcp.sh 5/5 stays green (the existing simple
GET / + capabilities + 404 + 401 surface). PLUS:
A kernel-touching request over real HTTP must return without hanging. The minimal smoke for this is:
# In the verification recipe above, after boot:
curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"
# Expected: "outbox: alice\ntip: 0\n" or similar (200 with body),
# NOT a timeout.
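With curl -s, a hang shows up only in the exit status, so a script has to tell a timeout apart from a real reply. In this small sketch, exit code 28 is curl's documented "operation timed out" status, which is what --max-time produces. classify_probe and the /tmp/outbox.body path are hypothetical names.

```shell
# Hypothetical helper: map a curl exit status and HTTP status code to a verdict.
classify_probe() {
  probe_rc=$1
  probe_http=$2
  if [ "$probe_rc" -eq 28 ]; then
    echo "hang"               # curl exit 28: --max-time fired (the deadlock symptom)
  elif [ "$probe_rc" -ne 0 ]; then
    echo "error"              # e.g. connection refused: server not up at all
  elif [ "$probe_http" = "200" ]; then
    echo "ok"
  else
    echo "unexpected-status"  # replied, but not with the expected 200
  fi
}

# Usage against the outbox route:
#   http=$(curl -s -o /tmp/outbox.body -w '%{http_code}' --max-time 5 \
#            "http://127.0.0.1:$PORT/actors/alice/outbox")
#   classify_probe "$?" "$http"
```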
If you want a one-shot script, save the recipe above as a regression
test inside the fed-prims worktree:
hosts/ocaml/test/handler_kernel_unblock.sh (new file). Make it
pass deterministically with a generous timeout (≥120s for the cold
boot).
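For the regression script to be re-runnable, the fixed /tmp/fifo, /tmp/boot.sx and /tmp/log paths and the two background jobs need cleaning up between runs. This is one possible scaffold, assuming mktemp and mkfifo are available. setup_scratch, cleanup, SCRATCH, SERVER_PID and FEEDER_PID are hypothetical names.

```shell
# Hypothetical scaffold for hosts/ocaml/test/handler_kernel_unblock.sh.

# Create a private scratch directory holding the fifo for this run.
setup_scratch() {
  SCRATCH=$(mktemp -d) || return 1
  mkfifo "$SCRATCH/fifo" || return 1
}

# Stop background jobs (if any were started) and remove the scratch dir.
# Each step tolerates failure so cleanup is safe to call more than once.
cleanup() {
  if [ -n "${SERVER_PID:-}" ]; then kill "$SERVER_PID" 2>/dev/null || true; fi
  if [ -n "${FEEDER_PID:-}" ]; then kill "$FEEDER_PID" 2>/dev/null || true; fi
  if [ -n "${SCRATCH:-}" ]; then rm -rf "$SCRATCH"; fi
  return 0
}

# Intended use, mirroring the recipe:
#   setup_scratch && trap cleanup EXIT
#   ( cat "$SCRATCH/boot.sx"; sleep 120 ) > "$SCRATCH/fifo" &
#   FEEDER_PID=$!
#   hosts/ocaml/_build/default/bin/sx_server.exe < "$SCRATCH/fifo" > "$SCRATCH/log" 2>&1 &
#   SERVER_PID=$!
```

Because the trap runs on every exit path, a failed assertion still tears down the server instead of leaving the port occupied for the next run.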
Ground rules (hard)
- Scope: hosts/ocaml/bin/sx_server.ml and adjacent hosts/ocaml/lib/sx_runtime.ml (or wherever sx_call is defined). Do NOT touch next/** or plans/fed-sx-milestone-2.md (m2's loop owns those). Do NOT touch lib/erlang/** (Erlang substrate / loops/erlang owns that).
- No-regression gate:
  - dune build bin/sx_server.exe (native) green
  - bash hosts/ocaml/browser/test_boot.sh (WASM kernel) green
  - bash lib/erlang/conformance.sh 761/761
  - bash next/tests/http_server_tcp.sh 5/5
- WASM safety: Pattern A may need Thread / Mutex juggling that isn't WASM-safe. The http-listen primitive is already native-only, so changes to its handler code don't need to build under WASM, but anything in lib/sx_runtime.ml does. If your change has to add Thread/Mutex to lib/, you've picked the wrong fix; back out.
- Builds are slow. dune build ≥600s timeout. conformance.sh ≥400s. test_boot.sh ≥60s.
- Commit granularity: one fix, one commit. Title like: fed-prims: release runtime mutex around gen_server:call (Blockers #4).
- No .sx edits. All work is .ml (or .sh for the regression test). sx-tree MCP is not needed.
- Worktree: commit, push origin/loops/fed-prims. Never main, never architecture. The user merges to architecture separately.
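The no-regression gate and its timeouts could be driven by a small wrapper. This is a sketch assuming GNU coreutils timeout, which exits with status 124 when the limit is hit. run_gate is a hypothetical name, and the working directory for each command is left to the caller.

```shell
# Hypothetical helper: run one gate command under a time limit (seconds)
# and report PASS, TIMEOUT, or FAIL. Returns non-zero unless the command passed.
run_gate() {
  gate_limit=$1
  shift
  if timeout "$gate_limit" "$@"; then
    echo "PASS: $*"
  else
    gate_rc=$?
    if [ "$gate_rc" -eq 124 ]; then
      echo "TIMEOUT after ${gate_limit}s: $*"
    else
      echo "FAIL (exit $gate_rc): $*"
    fi
    return 1
  fi
}

# Usage with the minimum timeouts listed above:
#   run_gate 600 dune build bin/sx_server.exe
#   run_gate 60  bash hosts/ocaml/browser/test_boot.sh
#   run_gate 400 bash lib/erlang/conformance.sh
```

Exit status alone does not confirm the 761/761 or 5/5 counts. Whether those scripts exit non-zero on a partial pass is not stated here, so the printed totals still need checking.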
What to write back
Append one dated line to plans/fed-sx-host-primitives.md's
Progress log (newest first):
- 2026-06-07 — Resolved fed-sx-m2 Blockers #4 (handler mutex
deadlock). <one-sentence description of the fix>. Verified via
hosts/ocaml/test/handler_kernel_unblock.sh + http_server_tcp.sh
5/5 + conformance 761/761 + WASM boot.
Once landed, the fed-sx-m2 loop will pick up the fix on its next tick and unblock Step 12 — you don't need to coordinate.
If it's not Pattern A or Pattern B
If you discover the deadlock is something else entirely
(e.g., a gen_server config issue, a different lock in
Sx_runtime, a bug in er-load-gen-server!'s scheduler frame),
document what you found in a fresh Blockers entry on
plans/fed-sx-host-primitives.md and stop. The m2 loop will
re-check on its next tick. Do not invent a Pattern C without
clear evidence — the deadlock is reproducible and the two
patterns above cover the obvious fix shapes.
Go. Reproduce the deadlock first. Pick a pattern. Land it. Push.