rose-ash/plans/agent-briefings/fed-prims-mutex-fix.md
giles 136deb1daf
fed-sx-m2: briefing for fed-prims mutex-deadlock fix loop
Pairs with Blockers #4 in plans/fed-sx-milestone-2.md. The
http-listen handler holds the SX runtime mutex; any gen_server:call
from inside a route deadlocks because the gen_server reply
scheduler needs the runtime the caller is sitting on. m2's Step 12
two-instance smoke test gates on this.

Briefing pre-loads the fix-loop agent with:
  - Verified reproducer (deterministic curl-hang against
    http_server:start(P, [{kernel, nx_kernel}]))
  - Two fix-pattern candidates (release mutex around sx_call vs
    spawn handler in fresh er-process)
  - Acceptance criteria: http_server_tcp.sh 5/5 + a NEW kernel-
    aware request passes without hanging
  - Scope guardrails: only hosts/ocaml/bin/sx_server.ml +
    adjacent lib/sx_runtime.ml; m2's next/** and lib/erlang/** are
    OFF LIMITS

Worktree at /root/rose-ash-loops/fed-prims, branch loops/fed-prims
already exists (Phases A-J landed). This is a follow-up fix loop,
not a continuation of the original phase plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 14:06:15 +00:00


fed-prims handler-mutex deadlock fix (one-shot)

Role: fix the SX runtime mutex deadlock in bin/sx_server.ml's http-listen handler that blocks every gen_server:call from inside an Erlang route. Documented as Blockers #4 in /root/rose-ash-loops/fed-sx-m1/plans/fed-sx-milestone-2.md.

description: fed-prims handler-mutex deadlock fix
subagent_type: general-purpose
run_in_background: true
isolation: worktree

Worktree + branch

Already provisioned at /root/rose-ash-loops/fed-prims on branch loops/fed-prims (fed-prims Phases A-J have landed; this is a follow-up fix). Start there. Never push to main or architecture.

If .mcp.json shows a non-absolute mcp_tree path or .claude/scheduled_tasks.lock is dirty, leave them alone: they are harness state. Stash them if you must, but do not commit them.

The problem (verified by fed-sx-m2 loop, 2026-06-07)

Native http-listen in hosts/ocaml/bin/sx_server.ml:735+ serialises handler calls with Mutex.lock mtx / Mutex.unlock mtx so the SX runtime isn't re-entered concurrently:

Mutex.lock mtx;
let resp =
  (try Sx_runtime.sx_call handler [Dict req]
   with e -> Mutex.unlock mtx; raise e) in
Mutex.unlock mtx;

When the Erlang handler does gen_server:call(nx_kernel, ...) from any kernel-aware route (actor_doc_response_for/3, actor_outbox_response_for/3, handle_inbox_post, nx_kernel:state_for/1, etc.), the gen_server's reply needs the SX runtime scheduler to run — but the calling handler is sitting on the runtime mutex. Deadlock; curl hangs until --max-time fires.

Verification recipe (reproduces deterministically):

PORT=51920
cat > /tmp/boot.sx <<'SX'
(epoch 1)
(load "lib/erlang/tokenizer.sx") (load "lib/erlang/parser.sx")
(load "lib/erlang/parser-core.sx") (load "lib/erlang/parser-expr.sx")
(load "lib/erlang/parser-module.sx") (load "lib/erlang/transpile.sx")
(load "lib/erlang/runtime.sx") (load "lib/erlang/vm/dispatcher.sx")
(epoch 2)
(eval "(er-load-gen-server!)")
(eval "(get (erlang-load-module (file-read \"next/kernel/envelope.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/log.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/pipeline.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/term_codec.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/outbox.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/nx_kernel.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/http_server.erl\")) :name)")
(epoch 20)
(eval "(erlang-eval-ast \"AK = <<1,1,1,1>>, AKS = [{key_id,k1},{algorithm,ed25519},{value,AK}], AAS = [{public_keys,[[{id,k1},{created,0},{value,AK}]]}], nx_kernel:start_link(alice, AKS, AAS), http_server:start(51920, [{kernel, nx_kernel}])\")")
SX
rm -f /tmp/fifo; mkfifo /tmp/fifo  # recreate so a rerun doesn't fail on an existing FIFO
( cat /tmp/boot.sx; sleep 120 ) > /tmp/fifo &
hosts/ocaml/_build/default/bin/sx_server.exe < /tmp/fifo > /tmp/log 2>&1 &
sleep 60  # boot takes ~30-45s cold
curl -sv --max-time 5 "http://127.0.0.1:$PORT/" >/dev/null              # OK: 200
curl -sv --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"      # HANGS

The next/kernel/*.erl files referenced live in the fed-sx-m2 worktree at /root/rose-ash-loops/fed-sx-m1/next/kernel/. You can read them there for context but do NOT edit them — Erlang-side work is m2's loop. This loop only touches hosts/ocaml/bin/sx_server.ml.
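
When scripting the recipe, it helps to tell a genuine hang apart from a server that simply is not up yet. The sketch below maps curl's exit status to a verdict; classify_curl_rc is a hypothetical helper name, not something in the repo. It relies only on curl's documented exit codes (28 for a timeout, 7 for a failed connection).

```shell
# Sketch: turn a curl exit status into a verdict for the recipe above.
# classify_curl_rc is a hypothetical helper, not part of the repo.
classify_curl_rc() {
  case "$1" in
    0)  echo "ok" ;;      # request completed
    28) echo "hang" ;;    # curl timed out: --max-time fired (the deadlock symptom)
    7)  echo "not-up" ;;  # could not connect: server is not listening yet
    *)  echo "error" ;;   # anything else: inspect /tmp/log
  esac
}

# Usage (commented out so sourcing this file has no side effects):
#   curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox" >/dev/null
#   classify_curl_rc "$?"
```

Before the fix the outbox request should classify as "hang"; after it, as "ok".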

Two fix patterns

Pick one. Both are independent enough to evaluate alone; commit the one that lands first.

Pattern A — release the mutex around the SX call

The mutex exists to serialise SX runtime mutation. But once the runtime hands the call off to the gen_server (which has its own scheduler frame), the calling thread is just waiting on a reply message; it doesn't need the mutex. The fix is to scope the mutex only over the runtime entry, not the entire handler invocation.

This may require restructuring Sx_runtime.sx_call handler [Dict req] so the call yields to the scheduler instead of blocking — verify by reading hosts/ocaml/lib/sx_runtime.ml (or wherever sx_call lives). If sx_call is fully synchronous and re-entry is genuinely unsafe, fall back to Pattern B.

Pattern B — spawn handler in a fresh er-process

Erlang processes already have their own scheduler frame. Have the handler closure trampoline through er-spawn-fun (or equivalent — check lib/erlang/runtime.sx's existing process primitives) so the gen_server reply runs in a different frame from the http-listen accept-loop thread.

This may be cleaner if it can be done entirely at the SX/Erlang layer (in er-bif-http-listen in lib/erlang/runtime.sx), in which case this is m2 scope and you should hand it back rather than edit OCaml. Read the BIF body first — if a pure-Erlang spawn suffices, document that and stop without committing OCaml changes.

The BIF body is at lib/erlang/runtime.sx:1581-1632 (in the fed-sx-m2 worktree); the m2 loop just rewrote its inbound/outbound marshallers (commit 8d33d02f). The handler is invoked inside (http-listen port sx-handler) — figure out whether you can er-spawn-fun around the body of sx-handler such that the spawned process's gen_server:call doesn't fight the parent's runtime mutex.

Acceptance — the unblock target

next/tests/http_server_tcp.sh must stay green at 5/5 (the existing simple GET / + capabilities + 404 + 401 surface). In addition:

A kernel-touching request over real HTTP must return without hanging. The minimal smoke for this is:

# In the verification recipe above, after boot:
curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"
# Expected: "outbox: alice\ntip: 0\n" or similar (200 with body),
# NOT a timeout.

If you want a one-shot script, save the recipe above as a regression test inside the fed-prims worktree: hosts/ocaml/test/handler_kernel_unblock.sh (new file). Make it pass deterministically with a generous timeout (≥120s for the cold boot).
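
One way to make the regression script deterministic is to poll for readiness instead of relying on a fixed sleep. The following is a minimal sketch under that assumption; wait_until and server_up are hypothetical helper names, and the 120-attempt bound simply mirrors the cold-boot allowance mentioned above.

```shell
# wait_until is a hypothetical helper: run a command up to N times, one
# second apart, and succeed as soon as the command succeeds.
wait_until() {
  wu_max=$1; shift
  wu_try=0
  while [ "$wu_try" -lt "$wu_max" ]; do
    if "$@"; then return 0; fi
    sleep 1
    wu_try=$((wu_try + 1))
  done
  return 1
}

# Readiness probe: GET / is the route the recipe shows returning 200.
server_up() {
  curl -s -o /dev/null --max-time 2 "http://127.0.0.1:$1/"
}

# Usage inside the script (commented out so sourcing has no side effects):
#   wait_until 120 server_up "$PORT" || { echo "boot timed out"; exit 1; }
#   curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox" || exit 1
```

Because each attempt also spends time in the probe itself, the wall-clock bound is somewhat larger than the attempt count in seconds.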

Ground rules (hard)

  • Scope: hosts/ocaml/bin/sx_server.ml and adjacent hosts/ocaml/lib/sx_runtime.ml (or wherever sx_call is defined). Do NOT touch next/** or plans/fed-sx-milestone-2.md (m2's loop owns those). Do NOT touch lib/erlang/** (Erlang substrate / loops/erlang owns that).
  • No-regression gate:
    • dune build bin/sx_server.exe (native) green
    • bash hosts/ocaml/browser/test_boot.sh (WASM kernel) green
    • bash lib/erlang/conformance.sh 761/761
    • bash next/tests/http_server_tcp.sh 5/5
  • WASM safety: Pattern A may need Thread / Mutex juggling that isn't WASM-safe. The http-listen primitive is already native-only, so changes to its handler code don't need to build under WASM — but anything in lib/sx_runtime.ml does. If your change has to add Thread/Mutex to lib/, you've picked the wrong fix; back out.
  • Builds are slow. Allow timeouts of at least 600s for dune build, 400s for conformance.sh, and 60s for test_boot.sh.
  • Commit granularity: one fix, one commit. Title like: fed-prims: release runtime mutex around gen_server:call (Blockers #4).
  • No .sx edits. All work is .ml (or .sh for the regression test). sx-tree MCP is not needed.
  • Worktree: commit, push origin/loops/fed-prims. Never main, never architecture. The user merges to architecture separately.
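
The no-regression gate and its time limits can be run from one small wrapper. This is a sketch only: run_gate is a hypothetical helper, it assumes GNU coreutils `timeout` is available (it exits 124 when the limit is hit), and the 60s limit for http_server_tcp.sh is an assumption because the briefing gives no figure for it. The briefing also does not state the working directory for each command, so adjust paths as needed.

```shell
# run_gate is a hypothetical wrapper: run one gate under a time limit and
# print a one-line verdict. Requires GNU coreutils `timeout`.
run_gate() {
  gate_name=$1; gate_limit=$2; shift 2
  timeout "$gate_limit" "$@"
  gate_rc=$?
  if [ "$gate_rc" -eq 0 ]; then
    echo "PASS $gate_name"
  elif [ "$gate_rc" -eq 124 ]; then
    echo "TIMEOUT $gate_name (limit ${gate_limit}s)"
  else
    echo "FAIL $gate_name (exit $gate_rc)"
  fi
  return "$gate_rc"
}

# Usage (commented out so sourcing has no side effects):
#   run_gate native-build 600 dune build bin/sx_server.exe &&
#   run_gate wasm-boot     60 bash hosts/ocaml/browser/test_boot.sh &&
#   run_gate conformance  400 bash lib/erlang/conformance.sh &&
#   run_gate http-tcp      60 bash next/tests/http_server_tcp.sh  # 60s is assumed
```

Chaining with && stops at the first failing gate, which keeps the slow later gates from running against a broken build.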

What to write back

Add one dated entry at the top of the Progress log in plans/fed-sx-host-primitives.md (the log is kept newest first):

- 2026-06-07 — Resolved fed-sx-m2 Blockers #4 (handler mutex
  deadlock). <one-sentence description of the fix>. Verified via
  hosts/ocaml/test/handler_kernel_unblock.sh + http_server_tcp.sh
  5/5 + conformance 761/761 + WASM boot.

Once landed, the fed-sx-m2 loop will pick up the fix on its next tick and unblock Step 12 — you don't need to coordinate.

If it's not Pattern A or Pattern B

If you discover the deadlock is something else entirely (e.g., a gen_server config issue, a different lock in Sx_runtime, a bug in er-load-gen-server!'s scheduler frame), document what you found in a fresh Blockers entry on plans/fed-sx-host-primitives.md and stop. The m2 loop will re-check on its next tick. Do not invent a Pattern C without clear evidence — the deadlock is reproducible and the two patterns above cover the obvious fix shapes.

Go. Reproduce the deadlock first. Pick a pattern. Land it. Push.