fed-sx-m2: Step 12 gated on new Blockers #4 (handler mutex deadlock)

Step 12 prep tried to build the two-instance smoke test on top of
the now-resolved Blockers #1 fix (http-listen marshaller bridge).
Both sx_server instances boot and bind, GET / returns the welcome
body, but every request that touches the kernel hangs past curl's
--max-time.

Root cause (verified): the native `http-listen` primitive in
bin/sx_server.ml serialises handler calls with Mutex.lock /
Mutex.unlock so the SX runtime isn't re-entered concurrently. The
wrapped Erlang handler eventually does gen_server:call(nx_kernel,
...) for any kernel-aware route (actor_doc_response_for/3,
actor_outbox_response_for/3, handle_inbox_post, etc.); the
gen_server reply needs the scheduler to run, which needs the SX
runtime, which is locked by the calling handler. Deadlock.

Verification: an sx_server with
  http_server:start(P, [])
serves GET / and welcome routes fine; the same instance with
  http_server:start(P, [{kernel, nx_kernel}])
hangs on the first GET /actors/<id>/outbox.

Blockers #4 entry added. Two fix patterns documented (release the
mutex around gen_server:call's reply wait; OR run the handler in a
fresh er-spawn'd process). Belongs on loops/erlang or
loops/fed-prims — substrate-level, not m2.

Step 12 header updated to flag the gate. Withdrew the in-flight
smoke_federate.sh — its framework was correct (two instances
boot, sequential GET / proves the listener survives more than one
request) but Step 12's actual proof point — Follow → Accept → Note
fan-out — requires kernel-touching routes on every request.

m2's other 11 steps stay individually proven by their per-step
suites; this loop has reached its substrate ceiling and the
autonomous pace is dialled down accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 14:03:37 +00:00
parent 8d33d02f92
commit eafb687b53

@@ -851,6 +851,15 @@ re-broadcast another actor's content to their own followers.
## Step 12 — Two-instance smoke test
**GATED on Blockers #4** (http-listen handler holds the SX runtime
mutex, deadlocking any `gen_server:call` from inside a route — see
Blockers section for verification + fix patterns). Without this,
the only request shapes that survive over real HTTP are the static /
capabilities / static-stub paths; every kernel-aware route hangs
indefinitely. The smoke test framework is sketched out (see the
withdrawn `smoke_federate.sh` in this loop's history at commit
`8d33d02f`'s tree state) but cannot exit 0 until Blockers #4 lifts.
**The proof point.** `next/tests/smoke_federate.sh` spins up two kernel
instances on distinct ports, walks them through the full federation
flow, and exits 0.
@@ -1076,12 +1085,72 @@ proceed.
retry semantics pure-functionally in 8b-pure so 8b-timer
becomes a 1-shot wiring when the primitive lands.
4. **`http-listen` handler holds the SX runtime mutex →
`gen_server:call` from inside an HTTP route deadlocks.** —
discovered during Step 12 prep. The native `http-listen`
primitive in `bin/sx_server.ml:735+` serialises handler calls
with `Mutex.lock mtx` / `Mutex.unlock mtx` so the SX runtime
isn't re-entered concurrently. The wrapped Erlang handler
eventually does `gen_server:call(nx_kernel, ...)` (for
kernel-aware routes like `actor_doc_response_for/3`,
`actor_outbox_response_for/3`, `handle_inbox_post`,
`nx_kernel:state_for/1`, etc.); the gen_server reply needs the
scheduler to run, which needs the SX runtime, which is locked
by the calling handler. Deadlock — curl hangs until the test
`--max-time` fires.
Verification: an sx_server with `http_server:start(P, [])` (no
Cfg, no kernel routes) serves GET / and welcome paths fine;
the same instance with `Cfg = [{kernel, nx_kernel}]` hangs on
the first GET /actors/<id>/outbox (or any /actors/<id> with
`Accept: application/vnd.fed-sx.actor-doc`).
Belongs on `loops/erlang` or `loops/fed-prims`. Two fix
patterns:
- Release the mutex around the `gen_server:call` reply wait
(substrate change in http-listen's handler-call code).
- Run the handler in a fresh er-spawn'd process so the
gen_server runs on a different scheduler frame.
Step 12's two-instance smoke test gates on this — without
it, the only request shapes that survive over real HTTP are
the static / capabilities / static-stub paths.
In-flight `smoke_federate.sh` test was withdrawn during this
tick after the deadlock surfaced (it boots both instances
successfully but every kernel-touching request hangs); the
plan's Step 12 acceptance criterion stays open pending
Blockers #4 resolution. m2's other 11 steps are fully
landed and individually proven by their per-step suites.
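
The lock cycle described in Blockers #4, and the first fix pattern, can be illustrated with a toy model. This is a minimal sketch in Python, not the `sx_server` code: `simulate`, `kernel_loop`, and the `"outbox-page"` reply are hypothetical names, a `threading.Lock` stands in for the SX runtime mutex, and a queue stands in for the `gen_server` mailbox. It assumes the scheduler can only produce the reply while holding the runtime lock, as described above. The second fix pattern (a fresh er-spawn'd process) is not modelled.

```python
import queue
import threading


def simulate(release_lock_while_waiting: bool, timeout: float = 0.5) -> str:
    """Toy model of an HTTP handler calling a kernel behind one runtime lock."""
    runtime = threading.Lock()             # stand-in for the SX runtime mutex
    requests: queue.Queue = queue.Queue()  # stand-in for the gen_server mailbox
    stop = threading.Event()

    def kernel_loop() -> None:
        # Stand-in for the scheduler running the kernel: it can only
        # produce a reply while it holds the runtime lock.
        while not stop.is_set():
            try:
                reply_box = requests.get(timeout=0.05)
            except queue.Empty:
                continue
            with runtime:
                reply_box.put("outbox-page")

    worker = threading.Thread(target=kernel_loop, daemon=True)
    worker.start()

    def handler() -> str:
        reply_box: queue.Queue = queue.Queue(maxsize=1)
        runtime.acquire()                  # the listener locks around the handler
        try:
            requests.put(reply_box)        # the gen_server:call request
            if release_lock_while_waiting:
                runtime.release()          # fix pattern 1: let the scheduler run
            try:
                # The timeout plays the role of curl's --max-time.
                return reply_box.get(timeout=timeout)
            except queue.Empty:
                return "hang"
            finally:
                if release_lock_while_waiting:
                    runtime.acquire()      # re-take before resuming runtime code
        finally:
            runtime.release()

    result = handler()
    stop.set()
    worker.join()
    return result


if __name__ == "__main__":
    print(simulate(release_lock_while_waiting=False))  # hang
    print(simulate(release_lock_while_waiting=True))   # outbox-page
```

In the model, holding the lock across the reply wait always times out, while releasing it for the duration of the wait lets the reply through. Whether the real substrate can safely drop the mutex at that point depends on `http-listen` internals that this sketch does not capture.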
---
## Progress log
Newest first.
- **2026-06-07** — Step 12 prep discovered Blockers #4
(http-listen handler holds the SX runtime mutex; any
`gen_server:call` from inside an HTTP route deadlocks
because delivering the gen_server reply needs the scheduler,
which needs the SX runtime that the calling handler holds
locked). Verified by spinning
up a single `http_server:start(P, [{kernel, nx_kernel}])`
instance: GET / works, GET /actors/alice (text) works
(no gen_server touch), but GET /actors/alice/outbox and
GET /actors/alice with
`Accept: application/vnd.fed-sx.actor-doc` both hang past
curl's --max-time. m2's Step 12
acceptance gates on this — its proof-point is the
two-instance smoke test which walks the full Follow →
Accept → Note fan-out path, and every step touches the
kernel via gen_server. The in-flight `smoke_federate.sh`
was withdrawn (boots both instances + serves welcome
routes successfully, but every kernel-aware request hangs);
Blockers #4 entry documents the substrate-level fix
patterns. m2's other 11 steps remain individually proven
by their per-step suites. Pivot: pacing the autonomous
loop down — substrate work is owed to `loops/erlang` or
`loops/fed-prims`, not m2.
- **2026-06-07** — Blockers #1 RESOLVED. The
`er-bif-http-listen` sx-handler in `lib/erlang/runtime.sx`
referenced `er-http-resp-to-sx` / `er-http-req-of-sx` —