Pairs with Blockers #4 in plans/fed-sx-milestone-2.md. The http-listen handler holds the SX runtime mutex; any gen_server:call from inside a route deadlocks because the gen_server reply scheduler needs the runtime the caller is sitting on. m2's Step 12 two-instance smoke test gates on this.

Briefing pre-loads the fix-loop agent with:
- Verified reproducer (deterministic curl-hang against http_server:start(P, [{kernel, nx_kernel}]))
- Two fix-pattern candidates (release mutex around sx_call vs spawn handler in fresh er-process)
- Acceptance criteria: http_server_tcp.sh 5/5 + a NEW kernel-aware request passes without hanging
- Scope guardrails: only hosts/ocaml/bin/sx_server.ml + adjacent lib/sx_runtime.ml; m2's next/** and lib/erlang/** are OFF LIMITS

Worktree at /root/rose-ash-loops/fed-prims, branch loops/fed-prims already exists (Phases A-J landed). This is a follow-up fix loop, not a continuation of the original phase plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# fed-prims handler-mutex deadlock fix (one-shot)

Role: fix the SX runtime mutex deadlock in the `http-listen` handler
of `bin/sx_server.ml`, which blocks every `gen_server:call` made from
inside an Erlang route. Documented as **Blockers #4** in
`/root/rose-ash-loops/fed-sx-m1/plans/fed-sx-milestone-2.md`.

```
description: fed-prims handler-mutex deadlock fix
subagent_type: general-purpose
run_in_background: true
isolation: worktree
```

## Worktree + branch

Already provisioned at `/root/rose-ash-loops/fed-prims` on branch
`loops/fed-prims` (the fed-prims phases A–J are landed; this is a
follow-up fix). Start there. Never push to `main` or `architecture`.

If `.mcp.json` shows a non-absolute `mcp_tree` path or
`.claude/scheduled_tasks.lock` is dirty, just leave them alone: they're
harness state. Stash if you must, but don't commit them.

## The problem (verified by fed-sx-m2 loop, 2026-06-07)

Native `http-listen` in `hosts/ocaml/bin/sx_server.ml:735+`
serialises handler calls with `Mutex.lock mtx` / `Mutex.unlock mtx`
so the SX runtime isn't re-entered concurrently:

```ocaml
(* mtx stays locked for the whole handler invocation, including any
   gen_server:call the handler makes. *)
Mutex.lock mtx;
let resp =
  (try Sx_runtime.sx_call handler [Dict req]
   with e -> Mutex.unlock mtx; raise e) in
Mutex.unlock mtx;
```

When the Erlang handler does `gen_server:call(nx_kernel, ...)` from
any kernel-aware route (`actor_doc_response_for/3`,
`actor_outbox_response_for/3`, `handle_inbox_post`,
`nx_kernel:state_for/1`, etc.), the gen_server's reply needs the SX
runtime scheduler to run, but the calling handler still holds the
runtime mutex until `sx_call` returns. Neither side can proceed:
deadlock, and curl hangs until `--max-time` fires.

**Verification recipe (reproduces deterministically):**

```bash
# Must match the literal port in http_server:start(...) below: the heredoc
# delimiter is quoted, so $PORT is not expanded inside boot.sx.
PORT=51920
cat > /tmp/boot.sx <<'SX'
(epoch 1)
(load "lib/erlang/tokenizer.sx") (load "lib/erlang/parser.sx")
(load "lib/erlang/parser-core.sx") (load "lib/erlang/parser-expr.sx")
(load "lib/erlang/parser-module.sx") (load "lib/erlang/transpile.sx")
(load "lib/erlang/runtime.sx") (load "lib/erlang/vm/dispatcher.sx")
(epoch 2)
(eval "(er-load-gen-server!)")
(eval "(get (erlang-load-module (file-read \"next/kernel/envelope.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/log.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/pipeline.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/term_codec.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/outbox.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/nx_kernel.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/http_server.erl\")) :name)")
(epoch 20)
(eval "(erlang-eval-ast \"AK = <<1,1,1,1>>, AKS = [{key_id,k1},{algorithm,ed25519},{value,AK}], AAS = [{public_keys,[[{id,k1},{created,0},{value,AK}]]}], nx_kernel:start_link(alice, AKS, AAS), http_server:start(51920, [{kernel, nx_kernel}])\")")
SX
mkfifo /tmp/fifo
# Keep the server's stdin open after the boot script has been fed in.
( cat /tmp/boot.sx; sleep 120 ) > /tmp/fifo &
hosts/ocaml/_build/default/bin/sx_server.exe < /tmp/fifo > /tmp/log 2>&1 &
sleep 60  # boot takes ~30-45s cold
curl -sv --max-time 5 "http://127.0.0.1:$PORT/" >/dev/null     # OK: 200
curl -sv --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"  # HANGS
```

The `next/kernel/*.erl` files referenced above live in the fed-sx-m2
worktree at `/root/rose-ash-loops/fed-sx-m1/next/kernel/`. You can
read them there for context but do NOT edit them: Erlang-side
work is m2's loop. This loop only touches `hosts/ocaml/bin/sx_server.ml`
(and, where the Ground rules permit, the adjacent
`hosts/ocaml/lib/sx_runtime.ml`).

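When running the recipe by hand it helps to distinguish "the handler never replied" from "the server is not up yet". The sketch below leans on curl's documented exit codes: 28 means the operation timed out (which is what `--max-time` produces on a hang), and any other non-zero code, such as 7 for a refused connection, is a different failure. The helper names `classify_curl_exit` and `probe` are hypothetical and not part of the repo. Note that exit 0 only means a reply arrived, not that the status was 200.

```bash
#!/usr/bin/env bash
# classify_curl_exit RC
# Maps a curl exit status to a verdict. 28 is curl's "operation timed out"
# code, i.e. what --max-time yields when the handler never answers.
classify_curl_exit() {
  case "$1" in
    0)  echo "replied" ;;
    28) echo "hang" ;;
    *)  echo "error" ;;
  esac
}

# probe URL  (hypothetical helper)
# Issues one request with the same 5s budget as the recipe and prints the verdict.
probe() {
  local rc=0
  curl -s --max-time 5 -o /dev/null "$1" || rc=$?
  classify_curl_exit "$rc"
}
```

With the server from the recipe running, `probe "http://127.0.0.1:$PORT/actors/alice/outbox"` would be expected to print `hang` before the fix and `replied` after it.
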
## Two fix patterns

Pick **one**. Both are independent enough to evaluate alone; commit
the one that lands first.

### Pattern A: release the mutex around the SX call

The mutex exists to serialise SX runtime mutation. But once the
runtime hands the call off to the gen_server (which has its own
scheduler frame), the calling thread is just waiting on a reply
message; it doesn't need the mutex. The fix is to scope the mutex
*only* over the runtime entry, not the entire handler invocation.

This may require restructuring `Sx_runtime.sx_call handler [Dict req]`
so the call yields to the scheduler instead of blocking; verify by
reading `hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` lives).
If `sx_call` is fully synchronous and re-entry is genuinely unsafe,
fall back to Pattern B.

### Pattern B: spawn handler in a fresh er-process

Erlang processes already have their own scheduler frame. Have the
handler closure trampoline through `er-spawn-fun` (or an equivalent;
check the existing process primitives in `lib/erlang/runtime.sx`) so
the gen_server reply runs in a different frame from the http-listen
accept-loop thread.

This may be cleaner if it can be done entirely at the SX/Erlang
layer (in `er-bif-http-listen` in `lib/erlang/runtime.sx`), in which
case **this is m2 scope** and you should hand it back rather than
edit OCaml. Read the BIF body first: if a pure-Erlang spawn
suffices, document that and stop without committing OCaml changes.

The BIF body is at `lib/erlang/runtime.sx:1581-1632` (in the
fed-sx-m2 worktree); the m2 loop just rewrote its inbound/outbound
marshallers (commit `8d33d02f`). The handler is invoked inside
`(http-listen port sx-handler)`. Figure out whether you can wrap the
body of `sx-handler` in `er-spawn-fun` such that the spawned
process's `gen_server:call` doesn't fight the parent's runtime mutex.

## Acceptance: the unblock target

`next/tests/http_server_tcp.sh` must stay green at 5/5 (the existing
simple GET / + capabilities + 404 + 401 surface). In addition:

A kernel-touching request over real HTTP must return without
hanging. The minimal smoke for this is:

```bash
# In the verification recipe above, after boot:
curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"
# Expected: "outbox: alice\ntip: 0\n" or similar (200 with body),
# NOT a timeout.
```

If you want a one-shot script, save the recipe above as a regression
test inside the fed-prims worktree:
`hosts/ocaml/test/handler_kernel_unblock.sh` (new file). Make it
pass deterministically with a generous timeout (≥120s for the cold
boot).

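For the regression script, a fixed `sleep 60` is the main source of non-determinism, so one option is to poll for readiness up to a deadline instead. The following is only a sketch of that idea: `wait_until` and `server_ready` are hypothetical names, and the choice of `/` as the readiness route is an assumption based on the recipe, where `/` answers without touching the kernel.

```bash
#!/usr/bin/env bash
# wait_until DEADLINE_SECONDS CMD [ARGS...]
# Re-runs CMD about once per second until it succeeds or the deadline passes.
# Returns 0 as soon as CMD succeeds, 1 if the deadline is reached first.
wait_until() {
  local deadline=$(( SECONDS + $1 ))   # SECONDS is bash's built-in elapsed-time counter
  shift
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep 1
  done
  return 0
}

# server_ready  (hypothetical readiness probe)
# Succeeds once the plain "/" route answers; that route does not call the kernel.
server_ready() {
  curl -s --max-time 2 -o /dev/null "http://127.0.0.1:${PORT:-51920}/"
}
```

In the script this could be used as `wait_until 120 server_ready || { echo "boot timed out" >&2; exit 1; }`, followed by the outbox request from the smoke check above.
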
## Ground rules (hard)

- **Scope:** `hosts/ocaml/bin/sx_server.ml` and adjacent
  `hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` is
  defined). Do NOT touch `next/**` or `plans/fed-sx-milestone-2.md`
  (m2's loop owns those). Do NOT touch `lib/erlang/**` (Erlang
  substrate / loops/erlang owns that).
- **No-regression gate:**
  - `dune build bin/sx_server.exe` (native) green
  - `bash hosts/ocaml/browser/test_boot.sh` (WASM kernel) green
  - `bash lib/erlang/conformance.sh` 761/761
  - `bash next/tests/http_server_tcp.sh` 5/5
- **WASM safety:** Pattern A may need Thread / Mutex juggling
  that isn't WASM-safe. The `http-listen` primitive is already
  native-only, so changes to its handler code don't need to
  build under WASM, but anything in `lib/sx_runtime.ml` does.
  If your change has to add `Thread`/`Mutex` to `lib/`, you've
  picked the wrong fix; back out.
- **Builds are slow.** Allow a timeout of ≥600s for `dune build`,
  ≥400s for `conformance.sh`, and ≥60s for `test_boot.sh`.
- **Commit granularity:** one fix, one commit. Title like:
  `fed-prims: release runtime mutex around gen_server:call (Blockers #4)`.
- **No `.sx` edits.** All work is `.ml` (or `.sh` for the
  regression test). sx-tree MCP is not needed.
- **Worktree:** commit and push to `origin/loops/fed-prims`. Never
  `main`, never `architecture`. The user merges to architecture
  separately.

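The no-regression gate and its timeouts can be driven from one small wrapper. This is a sketch under assumptions: `run_gate` and `run_all_gates` are hypothetical names, it relies on GNU coreutils `timeout` (which exits 124 when the limit is hit), the limits are the minimums stated above, and the 300s limit for `http_server_tcp.sh` is a guess because none is given. It also only checks exit status, so it assumes each script exits non-zero when its count falls short of 761/761 or 5/5, and that the commands are run from whichever directory the repo expects.

```bash
#!/usr/bin/env bash
# run_gate NAME TIMEOUT_SECONDS CMD [ARGS...]
# Runs one gate under `timeout` and prints PASS, TIMEOUT, or FAIL.
# Returns non-zero unless the gate passed.
run_gate() {
  local name=$1 limit=$2 rc=0
  shift 2
  timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0)   echo "PASS $name" ;;
    124) echo "TIMEOUT $name"; return 1 ;;   # 124 = limit reached (GNU timeout)
    *)   echo "FAIL $name (exit $rc)"; return 1 ;;
  esac
}

# run_all_gates: the four gates from the list above, stopping at the first failure.
# The 300s limit on the last gate is an assumption; the others are the stated minimums.
run_all_gates() {
  run_gate "native build" 600 dune build bin/sx_server.exe &&
  run_gate "wasm boot"     60 bash hosts/ocaml/browser/test_boot.sh &&
  run_gate "conformance"  400 bash lib/erlang/conformance.sh &&
  run_gate "http tcp"     300 bash next/tests/http_server_tcp.sh
}
```

Calling `run_all_gates` before committing gives a single pass/fail answer; output from each gate is discarded here, so rerun a failing gate directly to see its log.
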
## What to write back

Append one dated line to `plans/fed-sx-host-primitives.md`'s
Progress log (newest first):

```
- 2026-06-07 — Resolved fed-sx-m2 Blockers #4 (handler mutex
  deadlock). <one-sentence description of the fix>. Verified via
  hosts/ocaml/test/handler_kernel_unblock.sh + http_server_tcp.sh
  5/5 + conformance 761/761 + WASM boot.
```

Once landed, the fed-sx-m2 loop will pick up the fix on its next
tick and unblock Step 12; you don't need to coordinate.

## If it's not Pattern A or Pattern B

If you discover the deadlock is something else entirely
(e.g., a gen_server config issue, a different lock in
`Sx_runtime`, a bug in `er-load-gen-server!`'s scheduler frame),
document what you found in a fresh Blockers entry on
`plans/fed-sx-host-primitives.md` and stop. The m2 loop will
re-check on its next tick. **Do not invent a Pattern C without
clear evidence**: the deadlock is reproducible and the two
patterns above cover the obvious fix shapes.

Go. Reproduce the deadlock first. Pick a pattern. Land it. Push.