fed-sx-m2: narrow Blockers #4 root cause via connection-thread bisect

Walked Pattern B's failure step-by-step from the connection thread
under a live http-listen instance, instrumenting each piece as its
own minimal sx-handler with a hardcoded reply dict:

  hardcoded {:status 200 :headers {} :body "..."}  -> HTTP 200 ✓
  read er-sched-process-count                      -> "procs=2" ✓
  er-pid-new!                                      -> 204 ✓
  er-proc-new! (er-env-new)                        -> 205 ✓
  er-spawn-fun (fn () 42)                          -> HTTP 000

The break is er-spawn-fun's (not (er-fun? fv)) gate raising
"Erlang: spawn/1: not a fun" because the raw SX lambda isn't an
Erlang-fun-shaped {:tag "fun"} dict. The `error` raise propagates
through Sx_runtime.sx_call and is swallowed by the native http-listen
(try ... with _ -> ()) at sx_server.ml:852; the connection writes
nothing and closes, so curl reports HTTP 000.
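
A toy Python model of the failure mode described above may help. It is
a sketch, not the SX or OCaml sources: `is_er_fun`, `spawn_fun`, and
`serve_connection` are hypothetical stand-ins for er-fun?, er-spawn-fun,
and the native connection thread with its catch-all.

```python
# Toy model only: every name here is a hypothetical stand-in.

def is_er_fun(value):
    """Mimic the er-fun? shape check: only dicts tagged "fun" pass."""
    return isinstance(value, dict) and value.get("tag") == "fun"


def spawn_fun(fv):
    """Mimic the er-spawn-fun gate: reject anything not fun-shaped."""
    if not is_er_fun(fv):
        raise RuntimeError("Erlang: spawn/1: not a fun")
    return {"pid": 1, "initial_fun": fv}


def serve_connection(handler):
    """Mimic the connection thread: any exception is swallowed, so
    nothing is written and the client sees an empty reply."""
    written = b""
    try:
        reply = handler()
        text = "HTTP/1.1 %d OK\r\n\r\n%s" % (reply["status"], reply["body"])
        written = text.encode()
    except Exception:
        pass  # analogous to (try ... with _ -> ())
    return written


def hardcoded_handler():
    return {"status": 200, "headers": {}, "body": "ok"}


def spawning_handler():
    spawn_fun(lambda: 42)  # raw host closure, not a {"tag": "fun"} dict
    return {"status": 200, "headers": {}, "body": "spawned"}


print(serve_connection(hardcoded_handler))
print(serve_connection(spawning_handler))
```

In this model the second handler never produces bytes, which is the
shape of the HTTP 000 symptom: the error is raised, but nothing on the
wire or in the reply says so.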

This invalidates the previous "scheduler-re-entry race" hypothesis:
the global er-sched-* state IS shared with the connection thread
and reads correctly (process count of 2 = boot main + http:listen).
The breakage is the strict er-fun? shape check, not concurrency.

Path forward (still substrate scope, one helper):
  - Add an er-mk-host-fun helper in lib/erlang/runtime.sx (or a
    small AST-constructor in transpile.sx) that produces a real
    er-fun dict from a host SX closure.
  - sx-handler can then build a 0-arity wrapper-with-captured-req-pl
    and feed it to er-spawn-fun.
  - er-sched-run-all! drains, exit-result is read, response goes
    back to the wire.
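
The three steps above can be sketched as a toy Python model. This is
an illustration under assumptions, not the planned SX code:
`mk_host_fun`, the `arity` and `host` fields, and the `Scheduler` class
are hypothetical, and the user handler is a plain Python callable here
rather than an er-fun.

```python
# Toy model only: dict fields and helper names are assumptions.

def mk_host_fun(closure):
    """Hypothetical analogue of er-mk-host-fun: wrap a 0-arity host
    closure in a fun-shaped dict so the spawn gate accepts it."""
    return {"tag": "fun", "arity": 0, "host": closure}


def apply_fun(fv, args):
    """Call the wrapped closure, checking arity first."""
    if len(args) != fv["arity"]:
        raise RuntimeError("badarity")
    return fv["host"](*args)


class Scheduler:
    """Minimal run-to-completion scheduler with a FIFO run queue."""

    def __init__(self):
        self.procs = {}
        self.queue = []
        self.next_pid = 0

    def spawn_fun(self, fv):
        if not (isinstance(fv, dict) and fv.get("tag") == "fun"):
            raise RuntimeError("Erlang: spawn/1: not a fun")
        pid = self.next_pid
        self.next_pid += 1
        self.procs[pid] = {"initial_fun": fv, "exit_result": None}
        self.queue.append(pid)
        return pid

    def run_all(self):
        # Each process is started with an empty argument list, which is
        # why the wrapper has to be 0-arity.
        while self.queue:
            pid = self.queue.pop(0)
            proc = self.procs[pid]
            proc["exit_result"] = apply_fun(proc["initial_fun"], [])


def sx_handler(sched, handler, req_pl):
    wrapper = mk_host_fun(lambda: handler(req_pl))  # captures req_pl
    pid = sched.spawn_fun(wrapper)
    sched.run_all()
    return sched.procs[pid]["exit_result"]


def route(req):
    return {"status": 200, "headers": {}, "body": "hello " + req["path"]}


print(sx_handler(Scheduler(), route, {"path": "/"}))
```

The model uses a private scheduler per call, so it says nothing about
behaviour when the scheduler state is shared with a parked listener
process.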

Reverted runtime.sx to the Blockers #1 marshaller-bridge fix (the
in-flight Pattern B attempts are not committed). Blockers #4 entry
in plans/fed-sx-milestone-2.md updated with the verified diagnosis
and the one-helper path. Progress log entry added.

m2 stays at 11/12 steps; the substrate helper is loops/erlang scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 19:42:14 +00:00
parent 1d771aedea
commit 600d292ba2

@@ -1112,28 +1112,54 @@ proceed.
wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
and read the process's `:exit-result`. m2 tried this patch on
`lib/erlang/runtime.sx` and **it did not work**: the listener
binds, the connection thread enters `sx-handler`, but the
spawned process's response never reaches the wire — even the
non-kernel welcome route returns `HTTP 000` (empty reply).
binds, but every kernel-aware request returns HTTP 000.
Reproducer: spin up `http_server:start(P, [])` with the
Pattern B `sx-handler`; `curl http://127.0.0.1:P/` returns 000.
Why it fails (working hypothesis, m2 worktree): the
`http_server:start` spawn itself ran inside the outer
`erlang-eval-ast` scheduler pump and is **parked inside the
native `Unix.accept` loop on the boot thread**; the global
`er-sched-*` state still has that process in its queue. When
the connection thread calls `er-sched-run-all!` from inside
`sx-handler`, it re-enters the SAME global scheduler that
the boot thread is already pumping (the boot thread's
`er-sched-step!` of the http:listen process is blocked
forever inside the native primitive). The connection thread
spawns its handler process fine but `er-sched-run-all!`
either races against the boot thread's parked pump or
otherwise fails to drive the handler to completion before
the native handler returns. Reverted on m2 — `lib/erlang/
runtime.sx` stays at the Blockers #1 marshaller-bridge fix,
which is correct.
**Concrete reason (verified by isolated tests in the
connection thread, m2 worktree):** `er-spawn-fun` raises
`"Erlang: spawn/1: not a fun"` when called with the
raw SX lambda `(fn () (er-apply-fun handler (list req-pl)))`
because it gates on `(not (er-fun? fv))` and `er-fun?`
checks for the `{:tag "fun"}` Erlang-AST shape, not a host
Lambda. The user-supplied `handler` IS an `er-fun` (built
by the user's `fun (Req) -> route(Req, Cfg) end` form), but
we need a 0-arity wrapper to feed it `req-pl` — and
`er-sched-step-alive!` hardcodes `(er-apply-fun
(er-proc-field pid :initial-fun) (list))`, so the
wrapper must be 0-arity.
Verified piece-by-piece from the connection thread:
`er-pid-new!` → ok, `er-proc-new!` → ok, but
`er-spawn-fun (fn () 42)` → empty reply (the `error` raise
propagates through `Sx_runtime.sx_call` and gets caught by
the native http-listen `(try ... with _ -> ())` at
`sx_server.ml:852` so the connection writes nothing and
closes).
To make Pattern B actually work in pure SX you need a way
to construct an `er-fun` programmatically from a raw SX
closure (so the wrapper-with-captured-req-pl can be
spawned). The existing `er-mk-fun` takes Erlang AST
clauses, not host closures — building one inline either
needs an AST-constructor helper or a small parser call.
This is a one-helper substrate addition, not a redesign,
but it does need to live in `lib/erlang/transpile.sx` or
`runtime.sx` and probably wants an additive test.
Also: even with that helper, the original "race against
the parked boot-thread pump" concern is unverified.
Solo-piece tests inside the connection thread showed the
global `er-sched-*` state IS accessible there
(`er-sched-process-count` returned 2 — the boot main +
the spawned http:listen process). Once an `er-fun`
wrapper exists, the spawn + drain should at least
smoke-execute; what happens next under live load is the
next unknown.
Reverted on m2 — `lib/erlang/runtime.sx` stays at the
Blockers #1 marshaller-bridge fix, which is correct for
the non-kernel surface (welcome / capabilities / 404 /
401 over real HTTP).
The real fix likely needs ONE of:
- Native http-listen registers the listener and returns
@@ -1170,36 +1196,54 @@ proceed.
Newest first.
- **2026-06-07** — Re-investigated Pattern B with proper
instrumentation; **concrete failure root cause identified**.
Built each step of the spawn pipeline as its own minimal
`sx-handler` (hardcoded reply dict) and curled it:
hardcoded dict → 200 ✓, `er-sched-process-count` →
`procs=2` ✓ (boot main + http:listen process; global
scheduler IS accessible from the connection thread),
`er-pid-new!` → 204 ✓, `er-proc-new!` → 205 ✓ — all the
way up to `er-spawn-fun (fn () 42)` → HTTP 000. The break
is `er-spawn-fun`'s `(not (er-fun? fv))` gate raising
`"Erlang: spawn/1: not a fun"` because the raw SX lambda
isn't an Erlang-fun-shaped `{:tag "fun"}` dict. The
`error` raise propagates through `Sx_runtime.sx_call` and
is swallowed by the native http-listen
`(try ... with _ -> ())` at `sx_server.ml:852`; connection
writes nothing and closes.
The earlier "race against parked boot-thread pump"
hypothesis was not reproduced.
The global scheduler IS shared and the connection thread
reads it fine; the breakage is the strict `er-fun?` shape
check, not concurrency.
Path forward for Pattern B (still substrate scope): need a
way to construct an `er-fun` from a host SX closure so the
0-arity wrapper-with-captured-req-pl can be fed to
`er-spawn-fun`. Either a new `er-mk-host-fun` helper in
`lib/erlang/runtime.sx`, or a small AST-constructor in
`transpile.sx`. One-helper substrate addition, not a
redesign. Blockers #4 updated; once that helper lands the
spawn + drain should at least smoke-execute (whatever
concurrency issue surfaces next is the next unknown).
Reverted runtime.sx to the Blockers #1 marshaller-bridge
fix.
- **2026-06-07** — Tried `loops/fed-prims` `bf8d0bf2`'s Pattern B
patch sketch on `lib/erlang/runtime.sx`'s `er-bif-http-listen`:
wrap the handler call in `er-spawn-fun` + `er-sched-run-all!`
and read the spawned process's `:exit-result`. **It did not
work** — listener binds, but even the non-kernel welcome route
now returns HTTP 000 (the spawned handler's response never
reaches the wire). The simple `sx-handler` (direct
`er-apply-fun handler`) is preserved on m2 because it at least
serves welcome / capabilities / 404 / 401 correctly when no
kernel routes are touched. Reverted; runtime.sx stays at the
Blockers #1 marshaller-bridge fix.
Working hypothesis for why Pattern B fails on m2's
reproducer: the `http_server:start` spawn is itself parked
inside the native `Unix.accept` loop on the boot thread; the
global `er-sched-*` state still has that process in its
queue. When the connection thread (under the per-instance
native mutex) calls `er-sched-run-all!`, it re-enters the
SAME global scheduler — the boot thread's `er-sched-step!`
of the http:listen process is blocked forever inside the
native primitive, so the connection-thread pump either
races against that parked frame or otherwise fails to drive
the new handler process to completion before the connection
thread returns from `sx-handler`. The fed-prims diagnosis
was correct that the bug is Erlang-substrate scope and that
Pattern A (the mutex) doesn't apply, but the Pattern B
sketch assumed a fresh / private scheduler context that
doesn't exist in the current substrate. Blockers #4
updated to capture this + sketch the three substrate fixes
that would actually work; loop pacing dialled back down.
reaches the wire). Reverted; runtime.sx stays at the
Blockers #1 marshaller-bridge fix. Initially hypothesised the
failure was a scheduler-re-entry race (parked Unix.accept
pump on the boot thread vs. connection-thread pump); the
follow-up tick above narrowed the root cause to the
`er-fun?` shape gate — see that entry for the verified
diagnosis.
- **2026-06-07** — Step 12 prep discovered Blockers #4
(http-listen handler holds the SX runtime mutex; any