From 03c32cda5f4673be3116f706dcd168b612f4b190 Mon Sep 17 00:00:00 2001
From: giles
Date: Sun, 7 Jun 2026 20:04:19 +0000
Subject: [PATCH] =?UTF-8?q?fed-sx-m2:=20resolve=20Blockers=20#4=20?=
 =?UTF-8?q?=E2=80=94=20kernel=20routes=20now=20work=20over=20real=20HTTP?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Substrate fix: two-line change to lib/erlang/runtime.sx that lets
http-listen handler routes call gen_server:call without deadlocking.

1. er-sched-step-alive!: pass :pending-args (when set) to the
   initial-fun call instead of always passing an empty list.
   Default behavior (no field) stays (list), so it is drop-in safe.

2. er-bif-http-listen sx-handler: instead of er-apply-fun handler
   inline (which blows up on receive's er-suspend-marker because
   the connection thread has no scheduler step on its stack),
   create a real er-process with :initial-fun = handler and
   :pending-args = (list req-pl), then er-sched-run-all! to drain.
   Any receive (e.g. gen_server:call) suspends + resumes inside
   the SX scheduler frame the process owns. Read :exit-result for
   the response proplist; marshal back to SX dict.

Investigation arc (see plans/fed-sx-milestone-2.md Blockers #4 +
Progress log):
- loops/fed-prims bf8d0bf2 diagnosed it as Erlang-substrate, not
  OCaml mutex (Pattern A wrong, Pattern B right but sketchy).
- First Pattern B attempt failed: tried er-spawn-fun on a raw SX
  lambda, hit (er-fun? fv) gate. Connection-thread bisect
  pinpointed the exact line.
- Real fix: use the existing er-fun (user's handler) directly,
  but feed it via :pending-args so step-alive's hardcoded (list)
  doesn't drop the request arg.

Acceptance:
- new next/tests/smoke_kernel_route.sh: 6/6 over real HTTP
  (welcome /, /actors/alice, /actors/alice/outbox with
  gen_server-backed tip, /actors/alice/inbox, unknown-actor,
  via http_server:start(P, [{kernel, nx_kernel}])).
- next/tests/http_server_tcp.sh: 5/5 (bumped wait_bound from 30s
  to 180s: cold boot is slow under sibling-loop CPU load and the
  per-handler scheduler ramp adds a small margin).
- Erlang conformance: 761/761.

Step 12's two-instance smoke test is now unblocked: its full
Follow / Accept / Note flow can layer on top of this kernel-route
surface. m2 plan updated.

Pre-existing httpc_request.sh flakiness ("Undefined symbol:
http-request" on the live-call epochs) reproduces WITHOUT this
change (see git stash A/B in the investigation). Unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 lib/erlang/runtime.sx            |  29 +++++++-
 next/tests/http_server_tcp.sh    |   6 +-
 next/tests/smoke_kernel_route.sh | 121 +++++++++++++++++++++++++++++++
 plans/fed-sx-milestone-2.md      |  41 ++++++++---
 4 files changed, 183 insertions(+), 14 deletions(-)
 create mode 100755 next/tests/smoke_kernel_route.sh

diff --git a/lib/erlang/runtime.sx b/lib/erlang/runtime.sx
index 484af858..d8fe6eca 100644
--- a/lib/erlang/runtime.sx
+++ b/lib/erlang/runtime.sx
@@ -731,7 +731,10 @@
             0
             (if
               (= prev-k nil)
-              (er-apply-fun (er-proc-field pid :initial-fun) (list))
+              (er-apply-fun
+                (er-proc-field pid :initial-fun)
+                (let ((args (er-proc-field pid :pending-args)))
+                  (cond (= args nil) (list) :else args)))
               (do (er-proc-set! pid :continuation nil) (prev-k nil)))))
       (let ((r (nth result-ref 0)))
@@ -1612,11 +1615,31 @@
               ;; 78eae9ef deleted them as dead because the BIF body
               ;; still referenced them — Blockers #1. This rewrite
               ;; threads through the live marshallers instead.)
+              ;; Run the handler as a SCHEDULED er-process so any
+              ;; `receive` (e.g. gen_server:call inside a kernel-aware
+              ;; route) suspends and resumes inside the SX scheduler.
+              ;; Without this, native http-listen invokes the handler
+              ;; closure on a fresh OCaml thread that has no scheduler
+              ;; frame, so the receive's er-suspend-marker propagates
+              ;; out and the connection writes nothing — the Blockers
+              ;; #4 deadlock the m2 loop observed.
+              ;;
+              ;; er-spawn-fun requires an er-fun (Erlang-AST-shaped
+              ;; dict); handler IS one (created by user `fun (Req) ->
+              ;; route(Req, Cfg) end`). To feed req-pl as the call
+              ;; argument we stash it on the process record's
+              ;; :pending-args field — er-sched-step-alive! reads it
+              ;; on first step (the alternative was a host-closure-to-
+              ;; er-fun wrapper, which needs AST construction).
               ((sx-handler
                 (fn (req-dict)
                   (let ((req-pl (er-request-dict-to-proplist req-dict)))
-                    (let ((resp-pl (er-apply-fun handler (list req-pl))))
-                      (er-proplist-to-dict resp-pl))))))
+                    (let ((proc (er-proc-new! (er-env-new))))
+                      (dict-set! proc :initial-fun handler)
+                      (dict-set! proc :pending-args (list req-pl))
+                      (er-sched-run-all!)
+                      (let ((resp-pl (er-proc-field (get proc :pid) :exit-result)))
+                        (er-proplist-to-dict resp-pl)))))))
           (http-listen port sx-handler))))))
 
 ;; httpc:request/4(Url, Method, Headers, Body) - BRIEFING-EXCEPTION:
diff --git a/next/tests/http_server_tcp.sh b/next/tests/http_server_tcp.sh
index 24ac72a0..0013f4fa 100755
--- a/next/tests/http_server_tcp.sh
+++ b/next/tests/http_server_tcp.sh
@@ -72,9 +72,11 @@ HOLDPID=$!
 SXPID=$!
 rm -f "$FIFO" # both ends still hold open via the running procs
 
-# Wait for the listener to bind (up to ~30s — boot takes ~10s).
+# Wait for the listener to bind (up to ~180s — cold boot can be slow
+# under load from sibling loops, and the Blockers #4 :pending-args
+# fix adds a small per-handler scheduler ramp).
 BOUND=""
-for i in $(seq 1 60); do
+for i in $(seq 1 360); do
   if (exec 3<>/dev/tcp/127.0.0.1/$PORT) 2>/dev/null; then
     exec 3<&-; exec 3>&-
     BOUND="yes"
diff --git a/next/tests/smoke_kernel_route.sh b/next/tests/smoke_kernel_route.sh
new file mode 100755
index 00000000..233481d9
--- /dev/null
+++ b/next/tests/smoke_kernel_route.sh
@@ -0,0 +1,121 @@
+#!/usr/bin/env bash
+# next/tests/smoke_kernel_route.sh — m2 Blockers #4 unblock test.
+#
+# Proves a real HTTP listener over http:listen + http_server:start
+# CAN now serve kernel-aware routes (the surface Blockers #4 made
+# unreachable). Spins up a single sx_server instance, bootstraps an
+# actor, starts http_server with {kernel, nx_kernel} in Cfg, and
+# curls a route that fans through nx_kernel via gen_server:call.
+#
+# This is the kernel-route portion of Step 12's two-instance smoke
+# test. The full two-instance flow (Follow + auto-accept + Note
+# delivery) layers on top of this surface; this test is the
+# load-bearing proof point that the underlying wiring works.
+
+set -uo pipefail
+cd "$(git rev-parse --show-toplevel)"
+
+SX_SERVER="${SX_SERVER:-hosts/ocaml/_build/default/bin/sx_server.exe}"
+if [ ! -x "$SX_SERVER" ]; then
+  SX_SERVER="/root/rose-ash/hosts/ocaml/_build/default/bin/sx_server.exe"
+fi
+if [ ! -x "$SX_SERVER" ]; then
+  echo "ERROR: sx_server.exe not found." >&2
+  exit 1
+fi
+
+VERBOSE="${1:-}"
+PASS=0; FAIL=0; ERRORS=""
+
+PORT=$(python3 -c 'import socket;s=socket.socket();s.bind(("127.0.0.1",0));print(s.getsockname()[1]);s.close()')
+EF=$(mktemp); LOG=$(mktemp); FIFO=$(mktemp -u); mkfifo "$FIFO"
+cleanup() {
+  for pid in ${SXP:-} ${HOLDP:-}; do
+    kill -KILL "$pid" 2>/dev/null || true
+    wait "$pid" 2>/dev/null || true
+  done
+  rm -f "$EF" "$LOG" "$FIFO"
+}
+trap cleanup EXIT
+
+cat > "$EF" <>, AKS = [{key_id,k1},{algorithm,ed25519},{value,AK}], AAS = [{public_keys,[[{id,k1},{created,0},{value,AK}]]}], nx_kernel:start_link(alice, AKS, AAS), http_server:start(${PORT}, [{kernel, nx_kernel}])\")")
+EPOCHS
+
+( cat "$EF"; sleep 900 ) > "$FIFO" &
+HOLDP=$!
+"$SX_SERVER" < "$FIFO" > "$LOG" 2>&1 &
+SXP=$!
+rm -f "$FIFO"
+
+START=$(date +%s)
+BOUND=
+while [ $(($(date +%s) - START)) -lt 300 ]; do
+  if (exec 3<>/dev/tcp/127.0.0.1/$PORT) 2>/dev/null; then
+    exec 3<&-; exec 3>&-
+    BOUND="yes after $(($(date +%s) - START))s"
+    break
+  fi
+  sleep 1
+done
+
+if [ -z "$BOUND" ]; then
+  echo "FAIL: listener never bound on port $PORT"
+  echo "--- log tail ---"
+  tail -20 "$LOG"
+  exit 1
+fi
+
+[ "$VERBOSE" = "-v" ] && echo " ok listener up ($BOUND)"
+
+check() {
+  local desc="$1" path="$2" needle="$3"
+  local resp
+  resp=$(curl -s --max-time 10 "http://127.0.0.1:$PORT$path" 2>/dev/null || echo "")
+  if echo "$resp" | grep -qF -- "$needle"; then
+    PASS=$((PASS+1))
+    [ "$VERBOSE" = "-v" ] && echo " ok $desc"
+  else
+    FAIL=$((FAIL+1))
+    ERRORS+=" FAIL [$desc] expected '$needle' in resp: $(echo "$resp" | head -c 100)
+"
+  fi
+}
+
+check "non-kernel welcome /" "/" "fed-sx kernel m1"
+check "kernel-aware /actors/alice" "/actors/alice" "actor: alice"
+check "kernel-aware /actors/alice/outbox" "/actors/alice/outbox" "outbox: alice"
+check "kernel-aware /actors/alice/outbox tip" "/actors/alice/outbox" "tip: 0"
+check "kernel-aware /actors/alice/inbox" "/actors/alice/inbox" "inbox: alice"
+check "unknown actor /actors/zzz/outbox" "/actors/zzz/outbox" "outbox: zzz"
+
+TOTAL=$((PASS+FAIL))
+if [ $FAIL -eq 0 ]; then
+  echo "ok $PASS/$TOTAL next/tests/smoke_kernel_route.sh passed (port $PORT)"
+else
+  echo "FAIL $PASS/$TOTAL passed, $FAIL failed:"
+  echo "$ERRORS"
+  if [ "$VERBOSE" = "-v" ]; then
+    echo "--- log tail ---"; tail -20 "$LOG"
+  fi
+fi
+[ $FAIL -eq 0 ]
diff --git a/plans/fed-sx-milestone-2.md b/plans/fed-sx-milestone-2.md
index 065f9f03..4de81f12 100644
--- a/plans/fed-sx-milestone-2.md
+++ b/plans/fed-sx-milestone-2.md
@@ -851,14 +851,23 @@ re-broadcast another actor's content to their own followers.
 
 ## Step 12 — Two-instance smoke test
 
-**GATED on Blockers #4** (http-listen handler holds the SX runtime
-mutex, deadlocking any `gen_server:call` from inside a route — see
-Blockers section for verification + fix patterns). Without this,
-the only request shapes that survive over real HTTP are the static /
-capabilities / static-stub paths; every kernel-aware route hangs
-indefinitely. The smoke test framework is sketched out (see the
-withdrawn `smoke_federate.sh` in this loop's history at commit
-`8d33d02f`'s tree state) but cannot exit 0 until Blockers #4 lifts.
+**Blockers #4 RESOLVED 2026-06-07.** The substrate fix turned out
+to be a two-line change in `lib/erlang/runtime.sx`: extend
+`er-sched-step-alive!` to read `:pending-args` when present (was
+hardcoded to `(list)`), and have `er-bif-http-listen`'s sx-handler
+spawn the user handler as a real er-process with `:pending-args
+(list req-pl)` instead of calling it inline. With this in place
+any `receive` inside a kernel-aware route (e.g. `gen_server:call`)
+suspends and resumes correctly inside the SX scheduler instead of
+propagating out of the connection thread.
+
+Verified by `next/tests/smoke_kernel_route.sh` (6/6, single-instance):
+welcome `/`, `/actors/alice`, `/actors/alice/outbox` (gen_server-
+backed, with `tip:` from kernel state), `/actors/alice/inbox`,
+unknown-actor outbox — all serve over real HTTP through
+`http_server:start` with `Cfg = [{kernel, nx_kernel}]`. The
+full two-instance Follow / Accept / Note flow can layer on top
+of this surface.
 
 **The proof point.** `next/tests/smoke_federate.sh` spins up two
 kernel instances on distinct ports, walks them through the full federation
@@ -1087,7 +1096,21 @@ proceed.
 
 4. **`http-listen` handler holds the SX runtime mutex →
    `gen_server:call` from inside an HTTP route deadlocks.** —
-   discovered during Step 12 prep. The native `http-listen`
+   ~~discovered during Step 12 prep~~ **RESOLVED 2026-06-07**
+   by a two-line `lib/erlang/runtime.sx` change: extend
+   `er-sched-step-alive!` to read `:pending-args` when present
+   (was hardcoded to `(list)`), and rewrite
+   `er-bif-http-listen`'s sx-handler to spawn the user handler
+   as a real er-process with `:pending-args (list req-pl)`
+   instead of `er-apply-fun handler` inline. Any `receive`
+   inside a kernel-aware route now suspends + resumes inside
+   the SX scheduler. Verified via the new
+   `next/tests/smoke_kernel_route.sh` (6/6, single-instance
+   `http_server:start(P, [{kernel, nx_kernel}])` serves
+   welcome + `/actors/alice/outbox` with kernel-backed `tip:`
+   etc.). The full Pattern A vs Pattern B analysis below is
+   preserved for the audit trail. The original native
+   `http-listen`
    primitive in `bin/sx_server.ml:735+` serialises handler calls with
    `Mutex.lock mtx` / `Mutex.unlock mtx` so the SX runtime isn't
    re-entered concurrently. The wrapped Erlang handler