From 600d292ba2e3db7bb38aec7d4a37ea5b0626d498 Mon Sep 17 00:00:00 2001
From: giles
Date: Sun, 7 Jun 2026 19:42:14 +0000
Subject: [PATCH] fed-sx-m2: narrow Blockers #4 root cause via connection-thread bisect
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Walked Pattern B's failure step-by-step from the connection thread
under a live http-listen instance, instrumenting each piece as its
own minimal sx-handler with a hardcoded reply dict:

  hardcoded {:status 200 :headers {} :body "..."}  -> HTTP 200 ✓
  read er-sched-process-count                      -> "procs=2" ✓
  er-pid-new!                                      -> 204 ✓
  er-proc-new! (er-env-new)                        -> 205 ✓
  er-spawn-fun (fn () 42)                          -> HTTP 000

The break is er-spawn-fun's (not (er-fun? fv)) gate raising
"Erlang: spawn/1: not a fun" because the raw SX lambda isn't an
Erlang-fun-shaped {:tag "fun"} dict. The `error` raise propagates
through Sx_runtime.sx_call and is swallowed by the native
http-listen (try ... with _ -> ()) at sx_server.ml:852; connection
writes nothing and closes -> curl reports HTTP 000.

This invalidates the previous "scheduler-re-entry race" hypothesis:
the global er-sched-* state IS shared with the connection thread and
reads correctly (process count of 2 = boot main + http:listen). The
breakage is the strict er-fun? shape check, not concurrency.

Path forward (still substrate scope, one helper):
- Add an er-mk-host-fun helper in lib/erlang/runtime.sx (or a small
  AST-constructor in transpile.sx) that produces a real er-fun dict
  from a host SX closure.
- sx-handler can then build a 0-arity wrapper-with-captured-req-pl
  and feed it to er-spawn-fun.
- er-sched-run-all! drains, exit-result is read, response goes back
  to the wire.

Reverted runtime.sx to the Blockers #1 marshaller-bridge fix (the
in-flight Pattern B attempts are not committed). Blockers #4 entry
in plans/fed-sx-milestone-2.md updated with the verified diagnosis
and the one-helper path. Progress log entry added.
m2 stays at 11/12 steps; the substrate helper is loops/erlang scope.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 plans/fed-sx-milestone-2.md | 130 ++++++++++++++++++++++++------------
 1 file changed, 87 insertions(+), 43 deletions(-)

diff --git a/plans/fed-sx-milestone-2.md b/plans/fed-sx-milestone-2.md
index c8f79d6e..065f9f03 100644
--- a/plans/fed-sx-milestone-2.md
+++ b/plans/fed-sx-milestone-2.md
@@ -1112,28 +1112,54 @@ proceed.
  wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
  and read the process's `:exit-result`. m2 tried this patch on
  `lib/erlang/runtime.sx` and **it did not work**: the listener
- binds, the connection thread enters `sx-handler`, but the
- spawned process's response never reaches the wire — even the
- non-kernel welcome route returns `HTTP 000` (empty reply).
+ binds, but every kernel-aware request returns HTTP 000.
  Reproducer: spin up `http_server:start(P, [])` with the
  Pattern B `sx-handler`; `curl http://127.0.0.1:P/` returns 000.
- Why it fails (working hypothesis, m2 worktree): the
- `http_server:start` spawn itself ran inside the outer
- `erlang-eval-ast` scheduler pump and is **parked inside the
- native `Unix.accept` loop on the boot thread**; the global
- `er-sched-*` state still has that process in its queue. When
- the connection thread calls `er-sched-run-all!` from inside
- `sx-handler`, it re-enters the SAME global scheduler that
- the boot thread is already pumping (the boot thread's
- `er-sched-step!` of the http:listen process is blocked
- forever inside the native primitive). The connection thread
- spawns its handler process fine but `er-sched-run-all!`
- either races against the boot thread's parked pump or
- otherwise fails to drive the handler to completion before
- the native handler returns. Reverted on m2 — `lib/erlang/
- runtime.sx` stays at the Blockers #1 marshaller-bridge fix,
- which is correct.
+ **Concrete reason (verified by isolated tests in the
+ connection thread, m2 worktree):** `er-spawn-fun` raises
+ `"Erlang: spawn/1: not a fun"` when called with the
+ raw SX lambda `(fn () (er-apply-fun handler (list req-pl)))`
+ because it gates on `(not (er-fun? fv))` and `er-fun?`
+ checks for the `{:tag "fun"}` Erlang-AST shape, not a host
+ Lambda. The user-supplied `handler` IS an `er-fun` (built
+ by the user's `fun (Req) -> route(Req, Cfg) end` form), but
+ we need a 0-arity wrapper to feed it `req-pl` — and
+ `er-sched-step-alive!` hardcodes `(er-apply-fun
+ (er-proc-field pid :initial-fun) (list))`, so the
+ wrapper must be 0-arity.
+ Verified piece-by-piece from the connection thread:
+ `er-pid-new!` → ok, `er-proc-new!` → ok, but
+ `er-spawn-fun (fn () 42)` → empty reply (the `error` raise
+ propagates through `Sx_runtime.sx_call` and gets caught by
+ the native http-listen `(try ... with _ -> ())` at
+ `sx_server.ml:852` so the connection writes nothing and
+ closes).
+
+ To make Pattern B actually work in pure SX you need a way
+ to construct an `er-fun` programmatically from a raw SX
+ closure (so the wrapper-with-captured-req-pl can be
+ spawned). The existing `er-mk-fun` takes Erlang AST
+ clauses, not host closures — building one inline either
+ needs an AST-constructor helper or a small parser call.
+ This is a one-helper substrate addition, not a redesign,
+ but it does need to live in `lib/erlang/transpile.sx` or
+ `runtime.sx` and probably wants an additive test.
+
+ Also: even with that helper, the original "race against
+ the parked boot-thread pump" concern is unverified.
+ Solo-piece tests inside the connection thread showed the
+ global `er-sched-*` state IS accessible there
+ (`er-sched-process-count` returned 2 — the boot main +
+ the spawned http:listen process). Once an `er-fun`
+ wrapper exists, the spawn + drain should at least
+ smoke-execute; what happens next under live load is the
+ next unknown.
+
+ Reverted on m2 — `lib/erlang/runtime.sx` stays at the
+ Blockers #1 marshaller-bridge fix, which is correct for
+ the non-kernel surface (welcome / capabilities / 404 /
+ 401 over real HTTP).
  The real fix likely needs ONE of:
  - Native http-listen registers the listener and returns
@@ -1170,36 +1196,54 @@ proceed.
 Newest first.
+- **2026-06-07** — Re-investigated Pattern B with proper
+ instrumentation; **concrete failure root cause identified**.
+ Built each step of the spawn pipeline as its own minimal
+ `sx-handler` (hardcoded reply dict) and curled it:
+ hardcoded dict → 200 ✓, `er-sched-process-count` →
+ `procs=2` ✓ (boot main + http:listen process; global
+ scheduler IS accessible from the connection thread),
+ `er-pid-new!` → 204 ✓, `er-proc-new!` → 205 ✓ — all the
+ way up to `er-spawn-fun (fn () 42)` → HTTP 000. The break
+ is `er-spawn-fun`'s `(not (er-fun? fv))` gate raising
+ `"Erlang: spawn/1: not a fun"` because the raw SX lambda
+ isn't an Erlang-fun-shaped `{:tag "fun"}` dict. The
+ `error` raise propagates through `Sx_runtime.sx_call` and
+ is swallowed by the native http-listen
+ `(try ... with _ -> ())` at `sx_server.ml:852`; connection
+ writes nothing and closes.
+
+ Was previously waving at "race against parked boot-thread
+ pump" as the hypothesis — that part wasn't reproduced.
+ The global scheduler IS shared and the connection thread
+ reads it fine; the breakage is the strict `er-fun?` shape
+ check, not concurrency.
+
+ Path forward for Pattern B (still substrate scope): need a
+ way to construct an `er-fun` from a host SX closure so the
+ 0-arity wrapper-with-captured-req-pl can be fed to
+ `er-spawn-fun`. Either a new `er-mk-host-fun` helper in
+ `lib/erlang/runtime.sx`, or a small AST-constructor in
+ `transpile.sx`. One-helper substrate addition, not a
+ redesign. Blockers #4 updated; once that helper lands the
+ spawn + drain should at least smoke-execute (whatever
+ concurrency issue surfaces next is the next unknown).
+ Reverted runtime.sx to the Blockers #1 marshaller-bridge
+ fix.
+
 - **2026-06-07** — Tried `loops/fed-prims` `bf8d0bf2`'s Pattern B
  patch sketch on `lib/erlang/runtime.sx`'s `er-bif-http-listen`:
  wrap the handler call in `er-spawn-fun` + `er-sched-run-all!` and
  read the spawned process's `:exit-result`. **It did not work** —
  listener binds, but even the non-kernel welcome route now returns
  HTTP 000 (the spawned handler's response never
- reaches the wire). The simple `sx-handler` (direct
- `er-apply-fun handler`) is preserved on m2 because it at least
- serves welcome / capabilities / 404 / 401 correctly when no
- kernel routes are touched. Reverted; runtime.sx stays at the
- Blockers #1 marshaller-bridge fix.
-
- Working hypothesis for why Pattern B fails on m2's
- reproducer: the `http_server:start` spawn is itself parked
- inside the native `Unix.accept` loop on the boot thread; the
- global `er-sched-*` state still has that process in its
- queue. When the connection thread (under the per-instance
- native mutex) calls `er-sched-run-all!`, it re-enters the
- SAME global scheduler — the boot thread's `er-sched-step!`
- of the http:listen process is blocked forever inside the
- native primitive, so the connection-thread pump either
- races against that parked frame or otherwise fails to drive
- the new handler process to completion before the connection
- thread returns from `sx-handler`. The fed-prims diagnosis
- was correct that the bug is Erlang-substrate scope and that
- Pattern A (the mutex) doesn't apply, but the Pattern B
- sketch assumed a fresh / private scheduler context that
- doesn't exist in the current substrate. Blockers #4
- updated to capture this + sketch the three substrate fixes
- that would actually work; loop pacing dialled back down.
+ reaches the wire). Reverted; runtime.sx stays at the
+ Blockers #1 marshaller-bridge fix. Initially hypothesised the
+ failure was a scheduler-re-entry race (parked Unix.accept
+ pump on the boot thread vs. connection-thread pump); the
+ follow-up tick above narrowed the root cause to the
+ `er-fun?` shape gate — see that entry for the verified
+ diagnosis.
 - **2026-06-07** — Step 12 prep discovered Blockers #4
  (http-listen handler holds the SX runtime mutex; any