fed-sx-m2: briefing for fed-prims mutex-deadlock fix loop
Some checks failed
Test, Build, and Deploy / test-build-deploy (push) Failing after 26s
Pairs with Blockers #4 in plans/fed-sx-milestone-2.md. The http-listen
handler holds the SX runtime mutex; any gen_server:call from inside a
route deadlocks because the gen_server reply scheduler needs the runtime
the caller is sitting on. m2's Step 12 two-instance smoke test gates on
this.

Briefing pre-loads the fix-loop agent with:
- Verified reproducer (deterministic curl-hang against
  http_server:start(P, [{kernel, nx_kernel}]))
- Two fix-pattern candidates (release mutex around sx_call vs spawn
  handler in fresh er-process)
- Acceptance criteria: http_server_tcp.sh 5/5 + a NEW kernel-aware
  request passes without hanging
- Scope guardrails: only hosts/ocaml/bin/sx_server.ml + adjacent
  lib/sx_runtime.ml; m2's next/** and lib/erlang/** are OFF LIMITS

Worktree at /root/rose-ash-loops/fed-prims, branch loops/fed-prims
already exists (Phases A-J landed). This is a follow-up fix loop, not a
continuation of the original phase plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
plans/agent-briefings/fed-prims-mutex-fix.md
Normal file, 197 lines
@@ -0,0 +1,197 @@
# fed-prims handler-mutex deadlock fix (one-shot)

Role: fix the SX runtime mutex deadlock in `bin/sx_server.ml`'s
`http-listen` handler that blocks every `gen_server:call` from inside
an Erlang route. Documented as **Blockers #4** in
`/root/rose-ash-loops/fed-sx-m1/plans/fed-sx-milestone-2.md`.

```
description: fed-prims handler-mutex deadlock fix
subagent_type: general-purpose
run_in_background: true
isolation: worktree
```

## Worktree + branch

Already provisioned at `/root/rose-ash-loops/fed-prims` on branch
`loops/fed-prims` (the fed-prims phases A–J are landed; this is a
follow-up fix). Start there. Never push to `main` or `architecture`.

If `.mcp.json` shows a non-absolute `mcp_tree` path or
`.claude/scheduled_tasks.lock` is dirty, leave them alone; they're
harness state. Stash if you must, but don't commit them.

## The problem (verified by fed-sx-m2 loop, 2026-06-07)

Native `http-listen` in `hosts/ocaml/bin/sx_server.ml:735+`
serialises handler calls with `Mutex.lock mtx` / `Mutex.unlock mtx`
so the SX runtime isn't re-entered concurrently:

```ocaml
Mutex.lock mtx;
let resp =
  (try Sx_runtime.sx_call handler [Dict req]
   with e -> Mutex.unlock mtx; raise e) in
Mutex.unlock mtx;
```

When the Erlang handler does `gen_server:call(nx_kernel, ...)` from
any kernel-aware route (`actor_doc_response_for/3`,
`actor_outbox_response_for/3`, `handle_inbox_post`,
`nx_kernel:state_for/1`, etc.), the gen_server's reply needs the SX
runtime scheduler to run, but the calling handler is still holding the
runtime mutex. The result is a deadlock: curl hangs until `--max-time`
fires.

**Verification recipe (reproduces deterministically):**

```bash
PORT=51920
# Quoted heredoc ('SX'): no shell expansion inside, so the port passed to
# http_server:start below is the literal 51920 and must match PORT.
cat > /tmp/boot.sx <<'SX'
(epoch 1)
(load "lib/erlang/tokenizer.sx") (load "lib/erlang/parser.sx")
(load "lib/erlang/parser-core.sx") (load "lib/erlang/parser-expr.sx")
(load "lib/erlang/parser-module.sx") (load "lib/erlang/transpile.sx")
(load "lib/erlang/runtime.sx") (load "lib/erlang/vm/dispatcher.sx")
(epoch 2)
(eval "(er-load-gen-server!)")
(eval "(get (erlang-load-module (file-read \"next/kernel/envelope.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/log.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/pipeline.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/term_codec.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/outbox.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/nx_kernel.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/http_server.erl\")) :name)")
(epoch 20)
(eval "(erlang-eval-ast \"AK = <<1,1,1,1>>, AKS = [{key_id,k1},{algorithm,ed25519},{value,AK}], AAS = [{public_keys,[[{id,k1},{created,0},{value,AK}]]}], nx_kernel:start_link(alice, AKS, AAS), http_server:start(51920, [{kernel, nx_kernel}])\")")
SX
mkfifo /tmp/fifo
( cat /tmp/boot.sx; sleep 120 ) > /tmp/fifo &
hosts/ocaml/_build/default/bin/sx_server.exe < /tmp/fifo > /tmp/log 2>&1 &
sleep 60   # boot takes ~30-45s cold
curl -sv --max-time 5 "http://127.0.0.1:$PORT/" >/dev/null   # OK: 200
curl -sv --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"   # HANGS
```

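The fixed `sleep 60` in the recipe is a guess at cold-boot time. A minimal sketch of a polling alternative follows, assuming `GET /` answers 200 once boot has finished (as the recipe's first curl suggests). `wait_until` is a hypothetical helper name, not something that exists in the repo.

```bash
#!/usr/bin/env bash
# wait_until TIMEOUT_SECS CMD [ARGS...]
# Hypothetical helper: re-run CMD about once per second until it succeeds
# or TIMEOUT_SECS have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  local deadline=$(( $(date +%s) + $1 ))
  shift
  while ! "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Possible use in place of `sleep 60` (assumes PORT is set as in the recipe):
#   wait_until 120 curl -sf --max-time 2 "http://127.0.0.1:$PORT/" \
#     || { echo "boot did not finish within 120s"; exit 1; }
```
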
The `next/kernel/*.erl` files referenced live in the fed-sx-m2
worktree at `/root/rose-ash-loops/fed-sx-m1/next/kernel/`. You can
read them there for context but do NOT edit them: Erlang-side
work belongs to m2's loop. This loop's primary target is
`hosts/ocaml/bin/sx_server.ml` (see Ground rules for the exact scope).

## Two fix patterns

Pick **one**. Both are independent enough to evaluate alone; commit
the one that lands first.

### Pattern A: release the mutex around the SX call

The mutex exists to serialise SX runtime mutation. But once the
runtime hands the call off to the gen_server (which has its own
scheduler frame), the calling thread is just waiting on a reply
message; it doesn't need the mutex. The fix is to scope the mutex
*only* over the runtime entry, not the entire handler invocation.

This may require restructuring `Sx_runtime.sx_call handler [Dict req]`
so the call yields to the scheduler instead of blocking; verify by
reading `hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` lives).
If `sx_call` is fully synchronous and re-entry is genuinely unsafe,
fall back to Pattern B.

### Pattern B: spawn handler in a fresh er-process

Erlang processes already have their own scheduler frame. Have the
handler closure trampoline through `er-spawn-fun` (or equivalent;
check `lib/erlang/runtime.sx`'s existing process primitives) so the
gen_server reply runs in a different frame from the http-listen
accept-loop thread.

This may be cleaner if it can be done entirely at the SX/Erlang
layer (in `er-bif-http-listen` in `lib/erlang/runtime.sx`), in which
case **this is m2 scope** and you should hand it back rather than
edit OCaml. Read the BIF body first: if a pure-Erlang spawn
suffices, document that and stop without committing OCaml changes.

The BIF body is at `lib/erlang/runtime.sx:1581-1632` (in the
fed-sx-m2 worktree); the m2 loop just rewrote its inbound/outbound
marshallers (commit `8d33d02f`). The handler is invoked inside
`(http-listen port sx-handler)`. Figure out whether you can
`er-spawn-fun` around the body of `sx-handler` such that the
spawned process's gen_server:call doesn't fight the parent's
runtime mutex.

## Acceptance: the unblock target

`next/tests/http_server_tcp.sh` 5/5 stays green (the existing simple
GET / + capabilities + 404 + 401 surface). PLUS:

A kernel-touching request over real HTTP must return without
hanging. The minimal smoke for this is:

```bash
# In the verification recipe above, after boot:
curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"
# Expected: "outbox: alice\ntip: 0\n" or similar (200 with body),
# NOT a timeout.
```

If you want a one-shot script, save the recipe above as a regression
test inside the fed-prims worktree:
`hosts/ocaml/test/handler_kernel_unblock.sh` (new file). Make it
pass deterministically with a generous timeout (≥120s for the cold
boot).

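For the regression script, the pass/fail decision can hinge on curl's exit status rather than on eyeballing output. The sketch below is one possible shape, not the required implementation: `classify_curl_exit`, `check_route`, and the `CURL` override variable are hypothetical names introduced here. The exit codes are curl's documented ones (28 is what `--max-time` produces when the request hangs; 22 appears only with `-f`).

```bash
#!/usr/bin/env bash
# classify_curl_exit CODE
# Map a curl exit status to a verdict for the kernel-route smoke check.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;             # request completed
    7)  echo "not-listening" ;;  # could not connect (server not up yet)
    22) echo "http-error" ;;     # HTTP status >= 400 (only with -f)
    28) echo "hang" ;;           # --max-time fired: the deadlock symptom
    *)  echo "error" ;;
  esac
}

# check_route URL
# Prints the verdict and succeeds only on "ok". CURL is overridable
# (an assumption of this sketch) so the logic can be exercised offline.
check_route() {
  local rc=0 verdict
  "${CURL:-curl}" -sf -o /dev/null --max-time 5 "$1" || rc=$?
  verdict=$(classify_curl_exit "$rc")
  echo "$verdict"
  [ "$verdict" = "ok" ]
}

# Possible use after boot (PORT as in the verification recipe):
#   check_route "http://127.0.0.1:$PORT/actors/alice/outbox" || exit 1
```
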
## Ground rules (hard)

- **Scope:** `hosts/ocaml/bin/sx_server.ml` and adjacent
  `hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` is
  defined). Do NOT touch `next/**` or `plans/fed-sx-milestone-2.md`
  (m2's loop owns those). Do NOT touch `lib/erlang/**` (Erlang
  substrate / loops/erlang owns that).
- **No-regression gate:**
  - `dune build bin/sx_server.exe` (native) green
  - `bash hosts/ocaml/browser/test_boot.sh` (WASM kernel) green
  - `bash lib/erlang/conformance.sh` 761/761
  - `bash next/tests/http_server_tcp.sh` 5/5
- **WASM safety:** Pattern A may need Thread / Mutex juggling
  that isn't WASM-safe. The `http-listen` primitive is already
  native-only, so changes to its handler code don't need to
  build under WASM, but anything in `lib/sx_runtime.ml` does.
  If your change has to add `Thread`/`Mutex` to `lib/`, you've
  picked the wrong fix; back out.
- **Builds are slow.** `dune build` ≥600s timeout. `conformance.sh`
  ≥400s. `test_boot.sh` ≥60s.
- **Commit granularity:** one fix, one commit. Title like:
  `fed-prims: release runtime mutex around gen_server:call (Blockers #4)`.
- **No `.sx` edits.** All work is `.ml` (or `.sh` for the
  regression test). sx-tree MCP is not needed.
- **Worktree:** commit, push `origin/loops/fed-prims`. Never
  `main`, never `architecture`. The user merges to architecture
  separately.

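The no-regression gate and its timeouts can be run as one sequence. This is a sketch only: `run_step` and `run_gate` are hypothetical names, the 120s limit for `http_server_tcp.sh` is an assumption (the rules above give no figure for it), and the working directory each command expects is not stated in this briefing. It relies on coreutils `timeout`, which exits 124 when the limit is hit.

```bash
#!/usr/bin/env bash
# run_step NAME TIMEOUT_SECS CMD [ARGS...]
# Run one gate step under coreutils `timeout`, print a verdict line,
# and return the step's exit status (124 means the limit was hit).
run_step() {
  local name=$1 limit=$2 rc=0
  shift 2
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "TIMEOUT $name (${limit}s)"
  elif [ "$rc" -ne 0 ]; then
    echo "FAIL $name (exit $rc)"
  else
    echo "PASS $name"
  fi
  return "$rc"
}

# run_gate: the four gate commands with the minimum timeouts listed above.
# Stops at the first failing step. Not invoked here; call it explicitly.
# Assumption: each command is run from the directory it expects.
run_gate() {
  run_step "native build" 600 dune build bin/sx_server.exe &&
  run_step "wasm boot"    60  bash hosts/ocaml/browser/test_boot.sh &&
  run_step "conformance"  400 bash lib/erlang/conformance.sh &&
  run_step "http tcp"     120 bash next/tests/http_server_tcp.sh
}
```
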
## What to write back

Append one dated line to `plans/fed-sx-host-primitives.md`'s
Progress log (newest first):

```
- 2026-06-07 — Resolved fed-sx-m2 Blockers #4 (handler mutex
  deadlock). <one-sentence description of the fix>. Verified via
  hosts/ocaml/test/handler_kernel_unblock.sh + http_server_tcp.sh
  5/5 + conformance 761/761 + WASM boot.
```

Once landed, the fed-sx-m2 loop will pick up the fix on its next
tick and unblock Step 12; you don't need to coordinate.

## If it's not Pattern A or Pattern B

If you discover the deadlock is something else entirely
(e.g., a gen_server config issue, a different lock in
`Sx_runtime`, a bug in `er-load-gen-server!`'s scheduler frame),
document what you found in a fresh Blockers entry on
`plans/fed-sx-host-primitives.md` and stop. The m2 loop will
re-check on its next tick. **Do not invent a Pattern C without
clear evidence**: the deadlock is reproducible and the two
patterns above cover the obvious fix shapes.

Go. Reproduce the deadlock first. Pick a pattern. Land it. Push.