fed-sx-m2: Step 12 gated on new Blockers #4 (handler mutex deadlock)

Step 12 prep tried to build the two-instance smoke test on top of
the Blockers #1 fix (http-listen marshaller bridge), now resolved.
Both sx_server instances boot and bind, and GET / returns the
welcome body, but every request that touches the kernel hangs past
curl's --max-time.

Root cause (verified): the native `http-listen` primitive in
bin/sx_server.ml serialises handler calls with Mutex.lock /
Mutex.unlock so the SX runtime isn't re-entered concurrently. The
wrapped Erlang handler eventually does gen_server:call(nx_kernel,
...) for any kernel-aware route (actor_doc_response_for/3,
actor_outbox_response_for/3, handle_inbox_post, etc.); the
gen_server reply needs the scheduler to run, which needs the SX
runtime, which is locked by the calling handler. Deadlock.

Verification: an sx_server with
  http_server:start(P, [])
serves GET / and welcome routes fine; the same instance with
  http_server:start(P, [{kernel, nx_kernel}])
hangs on the first GET /actors/<id>/outbox.

Blockers #4 entry added. Two fix patterns documented (release the
mutex around gen_server:call's reply wait; OR run the handler in a
fresh er-spawn'd process). Belongs on loops/erlang or
loops/fed-prims — substrate-level, not m2.

Step 12 header updated to flag the gate. Withdrew the in-flight
smoke_federate.sh — its framework was correct (two instances
boot, sequential GET / proves the listener survives more than one
request) but Step 12's actual proof point — Follow → Accept → Note
fan-out — requires kernel-touching routes on every request.

m2's other 11 steps stay individually proven by their per-step
suites; this loop has reached its substrate ceiling and the
autonomous pace is dialled down accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 14:03:37 +00:00
parent 8d33d02f92
commit eafb687b53


@@ -851,6 +851,15 @@ re-broadcast another actor's content to their own followers.
## Step 12 — Two-instance smoke test
**GATED on Blockers #4** (http-listen handler holds the SX runtime
mutex, deadlocking any `gen_server:call` from inside a route — see
Blockers section for verification + fix patterns). Without this,
the only request shapes that survive over real HTTP are the static /
capabilities / static-stub paths; every kernel-aware route hangs
indefinitely. The smoke test framework is sketched out (see the
withdrawn `smoke_federate.sh` in this loop's history at commit
`8d33d02f`'s tree state) but cannot exit 0 until Blockers #4 lifts.
**The proof point.** `next/tests/smoke_federate.sh` spins up two kernel
instances on distinct ports, walks them through the full federation
flow, and exits 0.
@@ -1076,12 +1085,72 @@ proceed.
retry semantics pure-functionally in 8b-pure so 8b-timer
becomes a 1-shot wiring when the primitive lands.
4. **`http-listen` handler holds the SX runtime mutex →
`gen_server:call` from inside an HTTP route deadlocks.**
Discovered during Step 12 prep. The native `http-listen`
primitive in `bin/sx_server.ml:735+` serialises handler calls
with `Mutex.lock mtx` / `Mutex.unlock mtx` so the SX runtime
isn't re-entered concurrently. The wrapped Erlang handler
eventually does `gen_server:call(nx_kernel, ...)` (for kernel-
aware routes like `actor_doc_response_for/3`,
`actor_outbox_response_for/3`, `handle_inbox_post`,
`nx_kernel:state_for/1`, etc.); the gen_server reply needs the
scheduler to run, which needs the SX runtime, which is locked
by the calling handler. Deadlock — curl hangs until the test's
`--max-time` fires.
Verification: an sx_server with `http_server:start(P, [])` (no
Cfg, no kernel routes) serves GET / and welcome paths fine;
the same instance with `Cfg = [{kernel, nx_kernel}]` hangs on
the first GET /actors/<id>/outbox (or any /actors/<id> with
`Accept: application/vnd.fed-sx.actor-doc`).
Belongs on `loops/erlang` or `loops/fed-prims`. Two fix
patterns:
- Release the mutex around the `gen_server:call` reply wait
(substrate change in http-listen's handler-call code).
- Run the handler in a fresh er-spawn'd process so the
gen_server runs on a different scheduler frame.
Step 12's two-instance smoke test gates on this — without
it, the only request shapes that survive over real HTTP are
the static / capabilities / static-stub paths.
In-flight `smoke_federate.sh` test was withdrawn during this
tick after the deadlock surfaced (it boots both instances
successfully but every kernel-touching request hangs); the
plan's Step 12 acceptance criterion stays open pending
Blockers #4 resolution. m2's other 11 steps are fully
landed and individually proven by their per-step suites.
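
The verification described in Blockers #4 can be sketched as a small curl probe. This is a minimal sketch, not the withdrawn `smoke_federate.sh`: the helper names `classify_curl_exit` and `probe`, the `PROBE_BASE` and `MAX_TIME` variables, and the example port are all hypothetical. The only external fact it relies on is that curl exits with status 28 when `--max-time` expires.

```shell
#!/bin/sh
# Hypothetical probe helpers. PROBE_BASE, MAX_TIME and the function names
# are illustrative assumptions, not part of the repository.

# Map a curl exit status to a verdict.
# 0 = a response arrived; 28 = curl's "operation timed out" (--max-time hit).
classify_curl_exit() {
  case "$1" in
    0)  echo ok ;;
    28) echo hang ;;
    *)  echo "error($1)" ;;
  esac
}

# probe BASE PATH [ACCEPT]: issue one GET and print the path with its verdict.
probe() {
  base=$1; path=$2; accept=${3:-}
  if [ -n "$accept" ]; then
    curl -s -o /dev/null --max-time "${MAX_TIME:-5}" \
      -H "Accept: $accept" "$base$path"
  else
    curl -s -o /dev/null --max-time "${MAX_TIME:-5}" "$base$path"
  fi
  rc=$?
  printf '%-28s %s\n' "$path" "$(classify_curl_exit "$rc")"
}

# Only touch the network when a base URL is supplied,
# e.g. PROBE_BASE=http://127.0.0.1:8080 sh probe.sh
if [ -n "${PROBE_BASE:-}" ]; then
  probe "$PROBE_BASE" /
  probe "$PROBE_BASE" /actors/alice
  probe "$PROBE_BASE" /actors/alice/outbox
  probe "$PROBE_BASE" /actors/alice application/vnd.fed-sx.actor-doc
fi
```

Against an instance started with `[{kernel, nx_kernel}]`, the behaviour reported in this entry would correspond to `ok` for the first two probes and `hang` for the last two; once Blockers #4 lifts, all four would be expected to report `ok`.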
---
## Progress log
Newest first.
- **2026-06-07** — Step 12 prep discovered Blockers #4
(http-listen handler holds the SX runtime mutex; any
`gen_server:call` from inside an HTTP route deadlocks
because delivering the gen_server reply needs the scheduler,
which needs the SX runtime that the calling handler has
locked). Verified by spinning up a single
`http_server:start(P, [{kernel, nx_kernel}])` instance:
GET / works and GET /actors/alice (text) works (no
gen_server touch), but GET /actors/alice/outbox and
GET /actors/alice with
`Accept: application/vnd.fed-sx.actor-doc` both hang past
curl's --max-time. m2's Step 12
acceptance gates on this — its proof-point is the
two-instance smoke test which walks the full Follow →
Accept → Note fan-out path, and every step touches the
kernel via gen_server. The in-flight `smoke_federate.sh`
was withdrawn (boots both instances + serves welcome
routes successfully, but every kernel-aware request hangs);
Blockers #4 entry documents the substrate-level fix
patterns. m2's other 11 steps remain individually proven
by their per-step suites. Pivot: pacing the autonomous
loop down — substrate work is owed to `loops/erlang` or
`loops/fed-prims`, not m2.
- **2026-06-07** — Blockers #1 RESOLVED. The
`er-bif-http-listen` sx-handler in `lib/erlang/runtime.sx`
referenced `er-http-resp-to-sx` / `er-http-req-of-sx` —