fed-sx-m2: briefing for fed-prims mutex-deadlock fix loop

Pairs with Blockers #4 in plans/fed-sx-milestone-2.md. The
http-listen handler holds the SX runtime mutex; any gen_server:call
from inside a route deadlocks because the gen_server reply
scheduler needs the runtime the caller is sitting on. m2's Step 12
two-instance smoke test gates on this.

Briefing pre-loads the fix-loop agent with:
  - Verified reproducer (deterministic curl-hang against
    http_server:start(P, [{kernel, nx_kernel}]))
  - Two fix-pattern candidates (release mutex around sx_call vs
    spawn handler in fresh er-process)
  - Acceptance criteria: http_server_tcp.sh 5/5 + a NEW kernel-
    aware request passes without hanging
  - Scope guardrails: only hosts/ocaml/bin/sx_server.ml +
    adjacent lib/sx_runtime.ml; m2's next/** and lib/erlang/** are
    OFF LIMITS

Worktree at /root/rose-ash-loops/fed-prims, branch loops/fed-prims
already exists (Phases A-J landed). This is a follow-up fix loop,
not a continuation of the original phase plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 14:06:15 +00:00
parent eafb687b53
commit 136deb1daf

@@ -0,0 +1,197 @@
# fed-prims handler-mutex deadlock fix (one-shot)
Role: fix the SX runtime mutex deadlock in `bin/sx_server.ml`'s
`http-listen` handler that blocks every `gen_server:call` from inside
an Erlang route. Documented as **Blockers #4** in
`/root/rose-ash-loops/fed-sx-m1/plans/fed-sx-milestone-2.md`.
```
description: fed-prims handler-mutex deadlock fix
subagent_type: general-purpose
run_in_background: true
isolation: worktree
```
## Worktree + branch
Already provisioned at `/root/rose-ash-loops/fed-prims` on branch
`loops/fed-prims` (the fed-prims Phases A-J are landed; this is a
follow-up fix). Start there. Never push to `main` or `architecture`.
If `.mcp.json` shows a non-absolute `mcp_tree` path or
`.claude/scheduled_tasks.lock` is dirty, just leave them alone; they're
harness state. Stash if you must, but don't commit them.
## The problem (verified by fed-sx-m2 loop, 2026-06-07)
Native `http-listen` in `hosts/ocaml/bin/sx_server.ml:735+`
serialises handler calls with `Mutex.lock mtx` / `Mutex.unlock mtx`
so the SX runtime isn't re-entered concurrently:
```ocaml
Mutex.lock mtx;
let resp =
  (* Annotation: the handler runs to completion while mtx is held. *)
  (try Sx_runtime.sx_call handler [Dict req]
   with e -> Mutex.unlock mtx; raise e) in
Mutex.unlock mtx;
```
When the Erlang handler does `gen_server:call(nx_kernel, ...)` from
any kernel-aware route (`actor_doc_response_for/3`,
`actor_outbox_response_for/3`, `handle_inbox_post`,
`nx_kernel:state_for/1`, etc.), the gen_server's reply needs the SX
runtime scheduler to run — but the calling handler is sitting on the
runtime mutex. Deadlock; curl hangs until `--max-time` fires.
**Verification recipe (reproduces deterministically):**
```bash
PORT=51920  # must match the port hard-coded in http_server:start(...) below
cat > /tmp/boot.sx <<'SX'
(epoch 1)
(load "lib/erlang/tokenizer.sx") (load "lib/erlang/parser.sx")
(load "lib/erlang/parser-core.sx") (load "lib/erlang/parser-expr.sx")
(load "lib/erlang/parser-module.sx") (load "lib/erlang/transpile.sx")
(load "lib/erlang/runtime.sx") (load "lib/erlang/vm/dispatcher.sx")
(epoch 2)
(eval "(er-load-gen-server!)")
(eval "(get (erlang-load-module (file-read \"next/kernel/envelope.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/log.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/pipeline.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/term_codec.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/outbox.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/nx_kernel.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/http_server.erl\")) :name)")
(epoch 20)
(eval "(erlang-eval-ast \"AK = <<1,1,1,1>>, AKS = [{key_id,k1},{algorithm,ed25519},{value,AK}], AAS = [{public_keys,[[{id,k1},{created,0},{value,AK}]]}], nx_kernel:start_link(alice, AKS, AAS), http_server:start(51920, [{kernel, nx_kernel}])\")")
SX
rm -f /tmp/fifo && mkfifo /tmp/fifo  # recreate so reruns don't fail on a stale fifo
( cat /tmp/boot.sx; sleep 120 ) > /tmp/fifo &
hosts/ocaml/_build/default/bin/sx_server.exe < /tmp/fifo > /tmp/log 2>&1 &
sleep 60 # boot takes ~30-45s cold
curl -sv --max-time 5 "http://127.0.0.1:$PORT/" >/dev/null # OK: 200
curl -sv --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox" # HANGS
```
The `next/kernel/*.erl` files referenced live in the fed-sx-m2
worktree at `/root/rose-ash-loops/fed-sx-m1/next/kernel/`. You can
read them there for context but do NOT edit them — Erlang-side
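work is m2's loop. This loop only touches `hosts/ocaml/bin/sx_server.ml`
(plus the adjacent `hosts/ocaml/lib/sx_runtime.ml`, or wherever `sx_call`
is defined, per the ground rules below).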
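When scripting the recipe, it helps to tell a genuine hang apart from a server that simply has not finished booting. A minimal sketch, assuming curl's documented exit statuses (7 for a failed connection, 28 for an operation timeout); `classify_curl_exit` is a hypothetical helper name, not something in the repo:

```shell
# Hypothetical helper: map a curl exit status onto the outcomes the
# recipe cares about.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;             # request completed
    7)  echo "not-listening" ;;  # could not connect: server not up (yet)
    28) echo "hang" ;;           # --max-time fired: the deadlock symptom
    *)  echo "error" ;;          # anything else needs a closer look
  esac
}

# Usage against the recipe's server (not executed here):
#   curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox" >/dev/null
#   classify_curl_exit "$?"
```

Only the `hang` outcome counts as a reproduction; `not-listening` usually means the cold boot has not finished.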
## Two fix patterns
Pick **one**. Both are independent enough to evaluate alone; commit
the one that lands first.
### Pattern A — release the mutex around the SX call
The mutex exists to serialise SX runtime mutation. But once the
runtime hands the call off to the gen_server (which has its own
scheduler frame), the calling thread is just waiting on a reply
message; it doesn't need the mutex. The fix is to scope the mutex
*only* over the runtime entry, not the entire handler invocation.
This may require restructuring `Sx_runtime.sx_call handler [Dict req]`
so the call yields to the scheduler instead of blocking — verify by
reading `hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` lives).
If `sx_call` is fully synchronous and re-entry is genuinely unsafe,
fall back to Pattern B.
### Pattern B — spawn handler in a fresh er-process
Erlang processes already have their own scheduler frame. Have the
handler closure trampoline through `er-spawn-fun` (or equivalent —
check `lib/erlang/runtime.sx`'s existing process primitives) so the
gen_server reply runs in a different frame from the http-listen
accept-loop thread.
This may be cleaner if it can be done entirely at the SX/Erlang
layer (in `er-bif-http-listen` in `lib/erlang/runtime.sx`), in which
case **this is m2 scope** and you should hand it back rather than
edit OCaml. Read the BIF body first — if a pure-Erlang spawn
suffices, document that and stop without committing OCaml changes.
The BIF body is at `lib/erlang/runtime.sx:1581-1632` (in the
fed-sx-m2 worktree); the m2 loop just rewrote its inbound/outbound
marshallers (commit `8d33d02f`). The handler is invoked inside
`(http-listen port sx-handler)` — figure out whether you can
`er-spawn-fun` around the body of `sx-handler` such that the
spawned process's gen_server:call doesn't fight the parent's
runtime mutex.
## Acceptance — the unblock target
`next/tests/http_server_tcp.sh` must stay green at 5/5 (the existing
simple GET / + capabilities + 404 + 401 surface). In addition, a
kernel-touching request over real HTTP must return without hanging.
The minimal smoke for this is:
```bash
# In the verification recipe above, after boot:
curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"
# Expected: "outbox: alice\ntip: 0\n" or similar (200 with body),
# NOT a timeout.
```
If you want a one-shot script, save the recipe above as a regression
test inside the fed-prims worktree:
`hosts/ocaml/test/handler_kernel_unblock.sh` (new file). Make it
pass deterministically with a generous timeout (≥120s for the cold
boot).
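One way to make such a script deterministic is to poll for readiness under a deadline rather than sleeping a fixed 60 seconds. The sketch below is an assumption about how the script could be shaped, not its required form; `wait_until` is a hypothetical helper, and the commented usage reuses `$PORT` and the URLs from the recipe above.

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the deadline (in seconds) passes.
# Usage: wait_until TIMEOUT_SECS CMD [ARGS...]
wait_until() {
  wu_deadline=$(( $(date +%s) + $1 ))
  shift
  until "$@"; do
    # Give up once the deadline has passed.
    [ "$(date +%s)" -ge "$wu_deadline" ] && return 1
    sleep 1
  done
  return 0
}

# Usage in the regression script (not executed here):
#   wait_until 120 curl -s -o /dev/null --max-time 2 "http://127.0.0.1:$PORT/" \
#     || { echo "server never came up"; exit 1; }
#   curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox" \
#     || { echo "kernel-aware route hung or failed"; exit 1; }
```

Polling the plain `/` route first keeps a slow boot from being misread as the deadlock, since that route already returns 200 before the fix.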
## Ground rules (hard)
- **Scope:** `hosts/ocaml/bin/sx_server.ml` and adjacent
`hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` is
defined). Do NOT touch `next/**` or `plans/fed-sx-milestone-2.md`
(m2's loop owns those). Do NOT touch `lib/erlang/**` (Erlang
substrate / loops/erlang owns that).
- **No-regression gate:**
- `dune build bin/sx_server.exe` (native) green
- `bash hosts/ocaml/browser/test_boot.sh` (WASM kernel) green
- `bash lib/erlang/conformance.sh` 761/761
- `bash next/tests/http_server_tcp.sh` 5/5
- **WASM safety:** Pattern A may need Thread / Mutex juggling
that isn't WASM-safe. The `http-listen` primitive is already
native-only, so changes to its handler code don't need to
build under WASM — but anything in `lib/sx_runtime.ml` does.
If your change has to add `Thread`/`Mutex` to `lib/`, you've
picked the wrong fix; back out.
- **Builds are slow.** `dune build` ≥600s timeout. `conformance.sh`
≥400s. `test_boot.sh` ≥60s.
- **Commit granularity:** one fix, one commit. Title like:
`fed-prims: release runtime mutex around gen_server:call (Blockers #4)`.
- **No `.sx` edits.** All work is `.ml` (or `.sh` for the
regression test). sx-tree MCP is not needed.
- **Worktree:** commit, push `origin/loops/fed-prims`. Never
`main`, never `architecture`. The user merges to architecture
separately.
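The no-regression gate can be wrapped so that a stalled step is reported as a timeout rather than blocking the loop. This is a sketch only: `run_gate` is a hypothetical helper, it assumes GNU coreutils `timeout` (which exits 124 when the deadline fires), and it treats the figures listed above as minimum budgets.

```shell
# Hypothetical helper: run one gate under a time budget and report the result.
# Usage: run_gate LABEL TIMEOUT_SECS CMD [ARGS...]
run_gate() {
  gate_label=$1
  gate_secs=$2
  shift 2
  if timeout "$gate_secs" "$@"; then
    echo "PASS $gate_label"
  else
    gate_rc=$?
    # GNU timeout exits 124 when it had to kill the command.
    if [ "$gate_rc" -eq 124 ]; then
      echo "TIMEOUT $gate_label"
    else
      echo "FAIL $gate_label (exit $gate_rc)"
    fi
    return 1
  fi
}

# Usage with the gates listed above (not executed here):
#   run_gate native      600 dune build bin/sx_server.exe
#   run_gate wasm-boot    60 bash hosts/ocaml/browser/test_boot.sh
#   run_gate conformance 400 bash lib/erlang/conformance.sh
```

A `PASS` line only means the command exited zero within its budget; the 761/761 and 5/5 counts still need to be read from each script's own output.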
## What to write back
Add one dated entry at the top of `plans/fed-sx-host-primitives.md`'s
Progress log (newest first):
```
- 2026-06-07 — Resolved fed-sx-m2 Blockers #4 (handler mutex
deadlock). <one-sentence description of the fix>. Verified via
hosts/ocaml/test/handler_kernel_unblock.sh + http_server_tcp.sh
5/5 + conformance 761/761 + WASM boot.
```
Once landed, the fed-sx-m2 loop will pick up the fix on its next
tick and unblock Step 12 — you don't need to coordinate.
## If it's not Pattern A or Pattern B
If you discover the deadlock is something else entirely
(e.g., a gen_server config issue, a different lock in
`Sx_runtime`, a bug in `er-load-gen-server!`'s scheduler frame),
document what you found in a fresh Blockers entry on
`plans/fed-sx-host-primitives.md` and stop. The m2 loop will
re-check on its next tick. **Do not invent a Pattern C without
clear evidence** — the deadlock is reproducible and the two
patterns above cover the obvious fix shapes.
Go. Reproduce the deadlock first. Pick a pattern. Land it. Push.