fed-sx-m2: narrow Blockers #4 root cause via connection-thread bisect
Some checks failed
Test, Build, and Deploy / test-build-deploy (push) Failing after 52s
Walked Pattern B's failure step-by-step from the connection thread
under a live http-listen instance, instrumenting each piece as its
own minimal sx-handler with a hardcoded reply dict:

  hardcoded {:status 200 :headers {} :body "..."} -> HTTP 200 ✓
  read er-sched-process-count -> "procs=2" ✓
  er-pid-new! -> 204 ✓
  er-proc-new! (er-env-new) -> 205 ✓
  er-spawn-fun (fn () 42) -> HTTP 000

The break is er-spawn-fun's (not (er-fun? fv)) gate raising
"Erlang: spawn/1: not a fun" because the raw SX lambda isn't an
Erlang-fun-shaped {:tag "fun"} dict. The `error` raise propagates
through Sx_runtime.sx_call and is swallowed by the native http-listen
(try ... with _ -> ()) at sx_server.ml:852; connection writes
nothing and closes -> curl reports HTTP 000.
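
A toy Python model of the failure chain described above, not the SX or OCaml source. The function names mirror the identifiers in this message, but their bodies are assumptions made only to show how a raise at the shape gate, swallowed by a catch-all in the listener, ends up as an empty reply.

```python
# Toy model only: names mirror the commit message, bodies are assumptions.

def er_fun(value):
    """Models the er-fun? shape check: a dict tagged "fun"."""
    return isinstance(value, dict) and value.get("tag") == "fun"

def er_spawn_fun(fv):
    """Models er-spawn-fun's gate: raises for anything not fun-shaped."""
    if not er_fun(fv):
        raise RuntimeError("Erlang: spawn/1: not a fun")
    return "pid"

def serve_connection(handler):
    """Models the native listener: any handler error is swallowed."""
    written = []
    try:
        written.append(handler())
    except Exception:
        pass  # like (try ... with _ -> ()): nothing is written
    return written

def hardcoded_handler():
    return {"status": 200, "headers": {}, "body": "ok"}

def spawning_handler():
    er_spawn_fun(lambda: 42)  # raw host closure, not a tagged dict
    return {"status": 200, "headers": {}, "body": "ok"}

ok_reply = serve_connection(hardcoded_handler)
empty_reply = serve_connection(spawning_handler)
```

In this model the hardcoded handler produces one written reply, while the spawning handler produces none, which is the situation a client would see as an empty response.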

This invalidates the previous "scheduler-re-entry race" hypothesis:
the global er-sched-* state IS shared with the connection thread
and reads correctly (process count of 2 = boot main + http:listen).
The breakage is the strict er-fun? shape check, not concurrency.

Path forward (still substrate scope, one helper):
- Add an er-mk-host-fun helper in lib/erlang/runtime.sx (or a
  small AST-constructor in transpile.sx) that produces a real
  er-fun dict from a host SX closure.
- sx-handler can then build a 0-arity wrapper-with-captured-req-pl
  and feed it to er-spawn-fun.
- er-sched-run-all! drains, exit-result is read, response goes
  back to the wire.
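
The path forward above can be modelled in Python as well. This is a hypothetical sketch, not the planned SX helper: the dict layout produced by er_mk_host_fun, the private run queue (the real er-sched-* state is global), and the process record fields are all assumptions.

```python
# Hypothetical model of the one-helper path; not the SX implementation.

def er_mk_host_fun(closure, arity=0):
    """Assumed helper: wrap a host closure in a fun-shaped dict."""
    return {"tag": "fun", "arity": arity, "host": closure}

def er_fun(value):
    return isinstance(value, dict) and value.get("tag") == "fun"

def er_apply_fun(fv, args):
    if not er_fun(fv):
        raise RuntimeError("not a fun")
    return fv["host"](*args)

def er_spawn_fun(fv, run_queue):
    """Gate on the fun shape, then enqueue a process record."""
    if not er_fun(fv):
        raise RuntimeError("Erlang: spawn/1: not a fun")
    proc = {"initial_fun": fv, "exit_result": None}
    run_queue.append(proc)
    return proc

def er_sched_run_all(run_queue):
    """Drain the queue; each initial fun is applied with no arguments."""
    while run_queue:
        proc = run_queue.pop(0)
        proc["exit_result"] = er_apply_fun(proc["initial_fun"], [])

def sx_handler(handler, req_pl):
    run_queue = []  # assumption: private queue, unlike the global scheduler
    # 0-arity wrapper that captures req_pl and calls the 1-arity handler
    wrapper = er_mk_host_fun(lambda: er_apply_fun(handler, [req_pl]))
    proc = er_spawn_fun(wrapper, run_queue)
    er_sched_run_all(run_queue)
    return proc["exit_result"]

user_handler = er_mk_host_fun(
    lambda req: {"status": 200, "headers": {}, "body": "hello " + req["path"]},
    arity=1,
)
reply = sx_handler(user_handler, {"path": "/"})
```

The wrapper has to be 0-arity because the drain step applies each initial fun to an empty argument list; the request is carried in by closure capture rather than as an argument.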

Reverted runtime.sx to the Blockers #1 marshaller-bridge fix (the
in-flight Pattern B attempts are not committed). Blockers #4 entry
in plans/fed-sx-milestone-2.md updated with the verified diagnosis
and the one-helper path. Progress log entry added.

m2 stays at 11/12 steps; the substrate helper is loops/erlang scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1112,28 +1112,54 @@ proceed.
wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
and read the process's `:exit-result`. m2 tried this patch on
`lib/erlang/runtime.sx` and **it did not work**: the listener
binds, the connection thread enters `sx-handler`, but the
spawned process's response never reaches the wire — even the
non-kernel welcome route returns `HTTP 000` (empty reply).
binds, but every kernel-aware request returns HTTP 000.
Reproducer: spin up `http_server:start(P, [])` with the
Pattern B `sx-handler`; `curl http://127.0.0.1:P/` returns 000.

Why it fails (working hypothesis, m2 worktree): the
`http_server:start` spawn itself ran inside the outer
`erlang-eval-ast` scheduler pump and is **parked inside the
native `Unix.accept` loop on the boot thread**; the global
`er-sched-*` state still has that process in its queue. When
the connection thread calls `er-sched-run-all!` from inside
`sx-handler`, it re-enters the SAME global scheduler that
the boot thread is already pumping (the boot thread's
`er-sched-step!` of the http:listen process is blocked
forever inside the native primitive). The connection thread
spawns its handler process fine but `er-sched-run-all!`
either races against the boot thread's parked pump or
otherwise fails to drive the handler to completion before
the native handler returns. Reverted on m2 — `lib/erlang/
runtime.sx` stays at the Blockers #1 marshaller-bridge fix,
which is correct.
**Concrete reason (verified by isolated tests in the
connection thread, m2 worktree):** `er-spawn-fun` raises
`"Erlang: spawn/1: not a fun"` when called with the
raw SX lambda `(fn () (er-apply-fun handler (list req-pl)))`
because it gates on `(not (er-fun? fv))` and `er-fun?`
checks for the `{:tag "fun"}` Erlang-AST shape, not a host
Lambda. The user-supplied `handler` IS an `er-fun` (built
by the user's `fun (Req) -> route(Req, Cfg) end` form), but
we need a 0-arity wrapper to feed it `req-pl` — and
`er-sched-step-alive!` hardcodes `(er-apply-fun
(er-proc-field pid :initial-fun) (list))`, so the
wrapper must be 0-arity.
Verified piece-by-piece from the connection thread:
`er-pid-new!` → ok, `er-proc-new!` → ok, but
`er-spawn-fun (fn () 42)` → empty reply (the `error` raise
propagates through `Sx_runtime.sx_call` and gets caught by
the native http-listen `(try ... with _ -> ())` at
`sx_server.ml:852` so the connection writes nothing and
closes).

To make Pattern B actually work in pure SX you need a way
to construct an `er-fun` programmatically from a raw SX
closure (so the wrapper-with-captured-req-pl can be
spawned). The existing `er-mk-fun` takes Erlang AST
clauses, not host closures — building one inline either
needs an AST-constructor helper or a small parser call.
This is a one-helper substrate addition, not a redesign,
but it does need to live in `lib/erlang/transpile.sx` or
`runtime.sx` and probably wants an additive test.

Also: even with that helper, the original "race against
the parked boot-thread pump" concern is unverified.
Solo-piece tests inside the connection thread showed the
global `er-sched-*` state IS accessible there
(`er-sched-process-count` returned 2 — the boot main +
the spawned http:listen process). Once an `er-fun`
wrapper exists, the spawn + drain should at least
smoke-execute; what happens next under live load is the
next unknown.

Reverted on m2 — `lib/erlang/runtime.sx` stays at the
Blockers #1 marshaller-bridge fix, which is correct for
the non-kernel surface (welcome / capabilities / 404 /
401 over real HTTP).

The real fix likely needs ONE of:
- Native http-listen registers the listener and returns
@@ -1170,36 +1196,54 @@ proceed.

Newest first.

- **2026-06-07** — Re-investigated Pattern B with proper
instrumentation; **concrete failure root cause identified**.
Built each step of the spawn pipeline as its own minimal
`sx-handler` (hardcoded reply dict) and curled it:
hardcoded dict → 200 ✓, `er-sched-process-count` →
`procs=2` ✓ (boot main + http:listen process; global
scheduler IS accessible from the connection thread),
`er-pid-new!` → 204 ✓, `er-proc-new!` → 205 ✓ — all the
way up to `er-spawn-fun (fn () 42)` → HTTP 000. The break
is `er-spawn-fun`'s `(not (er-fun? fv))` gate raising
`"Erlang: spawn/1: not a fun"` because the raw SX lambda
isn't an Erlang-fun-shaped `{:tag "fun"}` dict. The
`error` raise propagates through `Sx_runtime.sx_call` and
is swallowed by the native http-listen
`(try ... with _ -> ())` at `sx_server.ml:852`; connection
writes nothing and closes.

Was previously waving at "race against parked boot-thread
pump" as the hypothesis — that part wasn't reproduced.
The global scheduler IS shared and the connection thread
reads it fine; the breakage is the strict `er-fun?` shape
check, not concurrency.

Path forward for Pattern B (still substrate scope): need a
way to construct an `er-fun` from a host SX closure so the
0-arity wrapper-with-captured-req-pl can be fed to
`er-spawn-fun`. Either a new `er-mk-host-fun` helper in
`lib/erlang/runtime.sx`, or a small AST-constructor in
`transpile.sx`. One-helper substrate addition, not a
redesign. Blockers #4 updated; once that helper lands the
spawn + drain should at least smoke-execute (whatever
concurrency issue surfaces next is the next unknown).
Reverted runtime.sx to the Blockers #1 marshaller-bridge
fix.

- **2026-06-07** — Tried `loops/fed-prims` `bf8d0bf2`'s Pattern B
patch sketch on `lib/erlang/runtime.sx`'s `er-bif-http-listen`:
wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
and read the spawned process's `:exit-result`. **It did not
work** — listener binds, but even the non-kernel welcome route
now returns HTTP 000 (the spawned handler's response never
reaches the wire). The simple `sx-handler` (direct
`er-apply-fun handler`) is preserved on m2 because it at least
serves welcome / capabilities / 404 / 401 correctly when no
kernel routes are touched. Reverted; runtime.sx stays at the
Blockers #1 marshaller-bridge fix.

Working hypothesis for why Pattern B fails on m2's
reproducer: the `http_server:start` spawn is itself parked
inside the native `Unix.accept` loop on the boot thread; the
global `er-sched-*` state still has that process in its
queue. When the connection thread (under the per-instance
native mutex) calls `er-sched-run-all!`, it re-enters the
SAME global scheduler — the boot thread's `er-sched-step!`
of the http:listen process is blocked forever inside the
native primitive, so the connection-thread pump either
races against that parked frame or otherwise fails to drive
the new handler process to completion before the connection
thread returns from `sx-handler`. The fed-prims diagnosis
was correct that the bug is Erlang-substrate scope and that
Pattern A (the mutex) doesn't apply, but the Pattern B
sketch assumed a fresh / private scheduler context that
doesn't exist in the current substrate. Blockers #4
updated to capture this + sketch the three substrate fixes
that would actually work; loop pacing dialled back down.
reaches the wire). Reverted; runtime.sx stays at the
Blockers #1 marshaller-bridge fix. Initially hypothesised the
failure was a scheduler-re-entry race (parked Unix.accept
pump on the boot thread vs. connection-thread pump); the
follow-up tick above narrowed the root cause to the
`er-fun?` shape gate — see that entry for the verified
diagnosis.

- **2026-06-07** — Step 12 prep discovered Blockers #4
(http-listen handler holds the SX runtime mutex; any