rose-ash/plans/agent-briefings/fed-prims-mutex-fix.md
giles 136deb1daf
fed-sx-m2: briefing for fed-prims mutex-deadlock fix loop
Pairs with Blockers #4 in plans/fed-sx-milestone-2.md. The
http-listen handler holds the SX runtime mutex; any gen_server:call
from inside a route deadlocks because the gen_server reply
scheduler needs the runtime the caller is sitting on. m2's Step 12
two-instance smoke test gates on this.

Briefing pre-loads the fix-loop agent with:
  - Verified reproducer (deterministic curl-hang against
    http_server:start(P, [{kernel, nx_kernel}]))
  - Two fix-pattern candidates (release mutex around sx_call vs
    spawn handler in fresh er-process)
  - Acceptance criteria: http_server_tcp.sh 5/5 + a NEW kernel-
    aware request passes without hanging
  - Scope guardrails: only hosts/ocaml/bin/sx_server.ml +
    adjacent lib/sx_runtime.ml; m2's next/** and lib/erlang/** are
    OFF LIMITS

Worktree at /root/rose-ash-loops/fed-prims, branch loops/fed-prims
already exists (Phases A-J landed). This is a follow-up fix loop,
not a continuation of the original phase plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 14:06:15 +00:00

# fed-prims handler-mutex deadlock fix (one-shot)
Role: fix the SX runtime mutex deadlock in `bin/sx_server.ml`'s
`http-listen` handler that blocks every `gen_server:call` from inside
an Erlang route. Documented as **Blockers #4** in
`/root/rose-ash-loops/fed-sx-m1/plans/fed-sx-milestone-2.md`.
```
description: fed-prims handler-mutex deadlock fix
subagent_type: general-purpose
run_in_background: true
isolation: worktree
```
## Worktree + branch
Already provisioned at `/root/rose-ash-loops/fed-prims` on branch
`loops/fed-prims` (the fed-prims phases A-J have landed; this is a
follow-up fix). Start there. Never push to `main` or `architecture`.
If `.mcp.json` shows a non-absolute `mcp_tree` path or
`.claude/scheduled_tasks.lock` is dirty, leave them alone; they're
harness state. Stash if you must, but don't commit them.
## The problem (verified by fed-sx-m2 loop, 2026-06-07)
Native `http-listen` in `hosts/ocaml/bin/sx_server.ml:735+`
serialises handler calls with `Mutex.lock mtx` / `Mutex.unlock mtx`
so the SX runtime isn't re-entered concurrently:
```ocaml
Mutex.lock mtx;
let resp =
  (try Sx_runtime.sx_call handler [Dict req]
   with e -> Mutex.unlock mtx; raise e) in
Mutex.unlock mtx;
```
When the Erlang handler does `gen_server:call(nx_kernel, ...)` from
any kernel-aware route (`actor_doc_response_for/3`,
`actor_outbox_response_for/3`, `handle_inbox_post`,
`nx_kernel:state_for/1`, etc.), the gen_server's reply needs the SX
runtime scheduler to run — but the calling handler is sitting on the
runtime mutex. Deadlock; curl hangs until `--max-time` fires.
**Verification recipe (reproduces deterministically):**
```bash
PORT=51920 # must match the port hard-coded in the boot script below
cat > /tmp/boot.sx <<'SX'
(epoch 1)
(load "lib/erlang/tokenizer.sx") (load "lib/erlang/parser.sx")
(load "lib/erlang/parser-core.sx") (load "lib/erlang/parser-expr.sx")
(load "lib/erlang/parser-module.sx") (load "lib/erlang/transpile.sx")
(load "lib/erlang/runtime.sx") (load "lib/erlang/vm/dispatcher.sx")
(epoch 2)
(eval "(er-load-gen-server!)")
(eval "(get (erlang-load-module (file-read \"next/kernel/envelope.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/log.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/pipeline.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/term_codec.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/outbox.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/nx_kernel.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/http_server.erl\")) :name)")
(epoch 20)
(eval "(erlang-eval-ast \"AK = <<1,1,1,1>>, AKS = [{key_id,k1},{algorithm,ed25519},{value,AK}], AAS = [{public_keys,[[{id,k1},{created,0},{value,AK}]]}], nx_kernel:start_link(alice, AKS, AAS), http_server:start(51920, [{kernel, nx_kernel}])\")")
SX
rm -f /tmp/fifo && mkfifo /tmp/fifo # recreate the FIFO so reruns don't fail
# Feed the boot script, then hold stdin open so the server keeps running.
( cat /tmp/boot.sx; sleep 120 ) > /tmp/fifo &
hosts/ocaml/_build/default/bin/sx_server.exe < /tmp/fifo > /tmp/log 2>&1 &
sleep 60 # boot takes ~30-45s cold
curl -sv --max-time 5 "http://127.0.0.1:$PORT/" >/dev/null # OK: 200
curl -sv --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox" # HANGS until --max-time fires
```
The `next/kernel/*.erl` files referenced live in the fed-sx-m2
worktree at `/root/rose-ash-loops/fed-sx-m1/next/kernel/`. You can
read them there for context but do NOT edit them: Erlang-side
work is m2's loop. This loop only touches `hosts/ocaml/bin/sx_server.ml`
and, if needed, the adjacent `hosts/ocaml/lib/sx_runtime.ml` (see Ground rules).
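
The fixed `sleep 60` in the verification recipe could be replaced by a readiness poll, which makes the reproducer less sensitive to cold-boot time. The helper below is a minimal sketch: the name `wait_for_http` and its arguments are hypothetical, and it assumes that `/` keeps answering while the deadlock is present, as the recipe's first `curl` indicates.

```bash
# Hypothetical helper (not in the repo): poll "/" until the server answers.
# Usage: wait_for_http PORT MAX_ATTEMPTS
# Each attempt takes up to 2s for curl plus a 1s pause, so MAX_ATTEMPTS is
# an attempt count, not an exact number of seconds.
wait_for_http() {
  local port="$1" max_attempts="$2" attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # "/" does not go through the kernel, so it is usable as a liveness probe.
    if curl -s -o /dev/null --max-time 2 "http://127.0.0.1:${port}/"; then
      return 0
    fi
    sleep 1
    attempt=$((attempt + 1))
  done
  return 1
}
```

In the recipe this would stand in for the sleep, for example `wait_for_http "$PORT" 90 || { echo "server never came up"; exit 1; }`, before issuing the request that hangs.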
## Two fix patterns
Pick **one**. Both are independent enough to evaluate alone; commit
the one that lands first.
### Pattern A — release the mutex around the SX call
The mutex exists to serialise SX runtime mutation. But once the
runtime hands the call off to the gen_server (which has its own
scheduler frame), the calling thread is just waiting on a reply
message; it doesn't need the mutex. The fix is to scope the mutex
*only* over the runtime entry, not the entire handler invocation.
This may require restructuring `Sx_runtime.sx_call handler [Dict req]`
so the call yields to the scheduler instead of blocking — verify by
reading `hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` lives).
If `sx_call` is fully synchronous and re-entry is genuinely unsafe,
fall back to Pattern B.
### Pattern B — spawn handler in a fresh er-process
Erlang processes already have their own scheduler frame. Have the
handler closure trampoline through `er-spawn-fun` (or equivalent —
check `lib/erlang/runtime.sx`'s existing process primitives) so the
gen_server reply runs in a different frame from the http-listen
accept-loop thread.
This may be cleaner if it can be done entirely at the SX/Erlang
layer (in `er-bif-http-listen` in `lib/erlang/runtime.sx`), in which
case **this is m2 scope** and you should hand it back rather than
edit OCaml. Read the BIF body first — if a pure-Erlang spawn
suffices, document that and stop without committing OCaml changes.
The BIF body is at `lib/erlang/runtime.sx:1581-1632` (in the
fed-sx-m2 worktree); the m2 loop just rewrote its inbound/outbound
marshallers (commit `8d33d02f`). The handler is invoked inside
`(http-listen port sx-handler)` — figure out whether you can
`er-spawn-fun` around the body of `sx-handler` such that the
spawned process's gen_server:call doesn't fight the parent's
runtime mutex.
## Acceptance — the unblock target
`next/tests/http_server_tcp.sh` 5/5 stays green (the existing simple
GET / + capabilities + 404 + 401 surface). PLUS:
A kernel-touching request over real HTTP must return without
hanging. The minimal smoke for this is:
```bash
# In the verification recipe above, after boot:
curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"
# Expected: "outbox: alice\ntip: 0\n" or similar (200 with body),
# NOT a timeout.
```
If you want a one-shot script, save the recipe above as a regression
test inside the fed-prims worktree:
`hosts/ocaml/test/handler_kernel_unblock.sh` (new file). Make it
pass deterministically with a generous timeout (≥120s for the cold
boot).
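
A regression script needs to tell a hang apart from other failures. One possible shape is sketched below; the helper names are hypothetical, and the only facts relied on are curl's documented exit codes (7 for a failed connection, 22 for an HTTP error under `--fail`, 28 for a timeout). Adding `--fail` is an assumption on top of the smoke command above, so that a 4xx/5xx response does not count as a pass.

```bash
# Sketch only: helper names are hypothetical, not part of the repo.

# Map a curl exit status to a verdict for the regression test.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    7)  echo "not-listening" ;;  # curl: failed to connect
    22) echo "http-error" ;;     # curl --fail: server returned HTTP >= 400
    28) echo "hang" ;;           # curl: operation timed out (--max-time fired)
    *)  echo "error" ;;
  esac
}

# Request the kernel-touching route; pass only on success with a non-empty body.
# Usage: check_kernel_route PORT
check_kernel_route() {
  local port="$1" body verdict status=0
  body=$(curl -s --fail --max-time 5 \
    "http://127.0.0.1:${port}/actors/alice/outbox") || status=$?
  verdict=$(classify_curl_exit "$status")
  if [ "$verdict" = "ok" ] && [ -n "$body" ]; then
    echo "PASS: ${body}"
    return 0
  fi
  echo "FAIL: verdict=${verdict}, body bytes=${#body}" >&2
  return 1
}
```

Before the fix, `check_kernel_route "$PORT"` should report `verdict=hang`. After it, the function should print the outbox body and return 0.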
## Ground rules (hard)
- **Scope:** `hosts/ocaml/bin/sx_server.ml` and adjacent
`hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` is
defined). Do NOT touch `next/**` or `plans/fed-sx-milestone-2.md`
(m2's loop owns those). Do NOT touch `lib/erlang/**` (Erlang
substrate / loops/erlang owns that).
- **No-regression gate:**
  - `dune build bin/sx_server.exe` (native) green
  - `bash hosts/ocaml/browser/test_boot.sh` (WASM kernel) green
  - `bash lib/erlang/conformance.sh` 761/761
  - `bash next/tests/http_server_tcp.sh` 5/5
- **WASM safety:** Pattern A may need Thread / Mutex juggling
that isn't WASM-safe. The `http-listen` primitive is already
native-only, so changes to its handler code don't need to
build under WASM — but anything in `lib/sx_runtime.ml` does.
If your change has to add `Thread`/`Mutex` to `lib/`, you've
picked the wrong fix; back out.
- **Builds are slow.** `dune build` ≥600s timeout. `conformance.sh`
≥400s. `test_boot.sh` ≥60s.
- **Commit granularity:** one fix, one commit. Title like:
`fed-prims: release runtime mutex around gen_server:call (Blockers #4)`.
- **No `.sx` edits.** All work is `.ml` (or `.sh` for the
regression test). sx-tree MCP is not needed.
- **Worktree:** commit, push `origin/loops/fed-prims`. Never
`main`, never `architecture`. The user merges to architecture
separately.
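
The no-regression gate and its timeout budgets could be run through one small wrapper so that a slow step is reported as a timeout rather than a silent stall. This is a sketch under assumptions: `run_gate` is a hypothetical name, it relies on GNU coreutils `timeout` (which exits 124 when the limit is hit), and the working directory for each command is not stated in this briefing.

```bash
# Hypothetical helper (not in the repo): run one gate step under a time limit.
# Usage: run_gate SECONDS LABEL COMMAND [ARGS...]
run_gate() {
  local limit="$1" label="$2" status=0
  shift 2
  timeout "$limit" "$@" || status=$?
  if [ "$status" -eq 0 ]; then
    echo "PASS ${label}"
  elif [ "$status" -eq 124 ]; then
    echo "TIMEOUT ${label}"   # GNU timeout exits 124 when the limit is reached
  else
    echo "FAIL ${label} (exit ${status})"
  fi
  return "$status"
}

# Example invocations using the commands and budgets from the list above.
# The directory to run each from is an assumption; adjust to the repo layout.
# run_gate 600 "native build" dune build bin/sx_server.exe
# run_gate 60  "wasm boot"    bash hosts/ocaml/browser/test_boot.sh
# run_gate 400 "conformance"  bash lib/erlang/conformance.sh
# run_gate 120 "http tcp"     bash next/tests/http_server_tcp.sh
```

The 120-second budget for `http_server_tcp.sh` is a guess, since no limit is given for it here. The wrapper reports only pass, fail, or timeout, so the 761/761 and 5/5 counts still need to be read from each script's own output.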
## What to write back
Add one dated entry at the top of the Progress log in
`plans/fed-sx-host-primitives.md` (newest first):
```
- 2026-06-07 — Resolved fed-sx-m2 Blockers #4 (handler mutex
deadlock). <one-sentence description of the fix>. Verified via
hosts/ocaml/test/handler_kernel_unblock.sh + http_server_tcp.sh
5/5 + conformance 761/761 + WASM boot.
```
Once landed, the fed-sx-m2 loop will pick up the fix on its next
tick and unblock Step 12 — you don't need to coordinate.
## If it's not Pattern A or Pattern B
If you discover the deadlock is something else entirely
(e.g., a gen_server config issue, a different lock in
`Sx_runtime`, a bug in `er-load-gen-server!`'s scheduler frame),
document what you found in a fresh Blockers entry in
`plans/fed-sx-host-primitives.md` and stop. The m2 loop will
re-check on its next tick. **Do not invent a Pattern C without
clear evidence** — the deadlock is reproducible and the two
patterns above cover the obvious fix shapes.
Go. Reproduce the deadlock first. Pick a pattern. Land it. Push.