The serving-JIT perform-in-HO-callback miscompile (map/rest/drop wrong
CALL_PRIM args → blank pages, empty picker) is now fully fixed, so the host
runs 100% serving JIT with NO jit-exclude.
sx-vm-extensions 81177d0e resolves a suspended HO-callback's IO inline
(instead of unwinding the native map/filter loop and corrupting the stack),
but ONLY when a synchronous resolver is installed (!_cek_io_resolver = Some).
The host serves via the http-listen primitive, whose handler drove durable IO
through cek_run_with_io with the resolver = None — so it hit the unwinding
path the fix doesn't cover. (The vm-ext repro installed a resolver, so it
never exercised the host's real no-resolver path.)
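Here is a minimal OCaml sketch of why the resolver matters. It uses assumed toy types: step, Suspend, Unwind, native_map and the int-valued resolver are hypothetical stand-ins for sx_vm.ml's machinery, not its real API. With a resolver installed, a suspended callback is resumed in place and the loop completes. Without one, the only way out is to escape the native loop, and its partial results go with it.

```ocaml
(* Toy model of an HO primitive (map) driving a callback that may perform IO.
   All names are hypothetical stand-ins for sx_vm.ml's machinery. *)
type request = Read of string

(* A callback either finishes or suspends on a request, carrying a
   continuation that resumes it with the IO result. *)
type step = Done of int | Suspend of request * (int -> step)

(* Raised when the loop must be abandoned; iteration state is lost. *)
exception Unwind of request

(* Synchronous resolver; None models !_cek_io_resolver = None. *)
let resolver : (request -> int) option ref = ref None

(* Drive one callback to completion: resolve suspensions inline when a
   resolver is installed, otherwise unwind out of the loop. *)
let rec run_callback (s : step) : int =
  match s with
  | Done v -> v
  | Suspend (req, k) ->
    (match !resolver with
     | Some resolve -> run_callback (k (resolve req))
     | None -> raise (Unwind req))

(* Native map: results built so far live only in this OCaml frame,
   so an Unwind discards them. *)
let native_map (f : int -> step) (xs : int list) : int list =
  List.map (fun x -> run_callback (f x)) xs

(* Example callback: reads a value (like a durable kv read) and adds it. *)
let add_offset (x : int) : step =
  Suspend (Read "offset", fun off -> Done (x + off))
```

With no resolver, native_map add_offset [1; 2; 3] raises Unwind (Read "offset"). After resolver := Some (fun _ -> 10), it returns [11; 12; 13].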
Fix: extract cek_run_with_io's IO resolution into resolve_io_request, and have
http-listen install _cek_io_resolver := Some (fun req _ -> resolve_io_request
req) — byte-identical resolution, so the inline path resolves durable reads
exactly as the CEK loop would.
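The host-side change can be sketched the same way. This sketch assumes simplified request and value types, with a Hashtbl standing in for the durable backend. Only resolve_io_request and _cek_io_resolver come from the text. Their signatures here are guesses, including modelling the ignored second argument as unit, and cek_loop_resolve and inline_resolve are hypothetical. The point is that both paths call one function, so they cannot disagree.

```ocaml
(* Simplified stand-ins; the real request/value types in sx_server.ml differ. *)
type io_request = Kv_get of string
type value = Str of string | Nil

(* Hashtbl standing in for the durable kv backend (assumption). *)
let store : (string, string) Hashtbl.t = Hashtbl.create 8

(* The extracted resolver: the single place a request becomes a value. *)
let resolve_io_request (req : io_request) : value =
  match req with
  | Kv_get key ->
    (match Hashtbl.find_opt store key with Some v -> Str v | None -> Nil)

(* Hypothetical model of cek_run_with_io's resolution step after the
   extraction: it now simply delegates. *)
let cek_loop_resolve (req : io_request) : value = resolve_io_request req

(* The global hook; the second argument is modelled as unit (a guess). *)
let _cek_io_resolver : (io_request -> unit -> value) option ref = ref None

(* What http-listen now does before driving a handler. *)
let install_http_listen_resolver () =
  _cek_io_resolver := Some (fun req _ -> resolve_io_request req)

(* Inline path: Some v if resolved in place, None if the caller must unwind. *)
let inline_resolve (req : io_request) : value option =
  match !_cek_io_resolver with
  | Some f -> Some (f req ())
  | None -> None
```

Before the install, inline_resolve returns None, which is the unwinding case the host used to hit. After it, inline_resolve (Kv_get k) equals Some (cek_loop_resolve (Kv_get k)) for every key. That is the byte-identical property in miniature.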
Verified: host conformance 271/271; ephemeral durable server at 100% JIT (no
exclude): zero fallbacks, real content, related posts shown, picker lists 12
candidates; live blog.rose-ash.com home/post/tags 200 with related posts, zero
error-log lines; relate-picker Playwright 4/4 (infinite-scroll + filter + relate).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Hand-off: serving-mode JIT miscompiles host handlers (to sx-vm-extensions)
✅ RESOLVED 2026-06-28 — host now runs 100% serving JIT, no exclude.
Two composing pieces fixed it:
- sx-vm-extensions 81177d0e (sx_vm.ml, call_closure_reuse): when an HO-primitive callback (map/filter/reduce/…) suspends on a perform AND a synchronous resolver is installed, resolve its IO inline and run it to completion instead of unwinding the native loop (which dropped iteration state and misaligned the stack → the next CALL_PRIM got wrong args).
- host side (sx_server.ml): that fix only engages when !_cek_io_resolver = Some. The host serves via the http-listen primitive, whose handler drove durable IO through cek_run_with_io with the resolver = None, so it hit the unwinding path the fix doesn't cover (the vm-extensions repro repro_jit_resume.ml installed a resolver, so it never exercised the host's real path). Fix: extracted cek_run_with_io's IO resolution into resolve_io_request, and http-listen now installs _cek_io_resolver := Some (fun req _ -> resolve_io_request req). The resolution is byte-identical, so the inline-resolve path resolves durable reads exactly as the CEK loop would.
Verified: host conformance 271/271; ephemeral durable server at 100% JIT (no exclude): zero fallbacks, real content, related posts shown, picker lists 12 candidates; live blog.rose-ash.com home/post/tags 200 with related posts and zero error-log lines; relate-picker Playwright 4/4 (infinite-scroll + filter + relate, the drop path). serve.sh exclude dropped.
Everything below is the original hand-off, kept for the record.
From the host-on-sx loop, 2026-06-28. We enabled
SX_SERVING_JIT=1 on the live host (blog.rose-ash.com): the Datalog/relations saturation JITs cleanly and is the real win (host conformance 271/271 under JIT, 5.4× faster; live /tags 2.5s → 0.76s). BUT host app handlers MISCOMPILE in the serving path, so we had to (jit-exclude! "host/*" "dream-*" "dr/*") in serve.sh as a band-aid. Please fix the underlying bug so the exclude can be dropped.
Symptom
Under SX_SERVING_JIT=1, the FIRST request to most pages 500s, then self-heals
(retries 200). stderr shows, paired:
[jit] host/blog--edges-block first-call fallback to CEK: Sx_types.Eval_error("map: expected (fn list) (in CALL_PRIM \"map\" with 2 args)")
[http-listen] handler error: Sx_types.Eval_error("map: expected (fn list) (in CALL_PRIM \"map\" with 2 args)")
Also seen: Sx_types.Eval_error("rest: 1 list arg").
Two distinct bugs
(A) codegen / VM-state. A JIT'd function's bytecode runs CALL_PRIM "map"
(and rest) with args the primitive rejects (the error says expected (fn list):
2 args were pushed, but the wrong ones). KEY CLUE: host conformance under SX_SERVING_JIT=1 is
271/271 — the SAME functions (host/blog--edges-block etc.) JIT fine when driven
via the epoch (eval ...) path. It ONLY miscompiles in the http-listen +
cek_run_with_io serving path. So it is not pure codegen — it's triggered by the
serving/IO context. Strong hypothesis: a perform/VmSuspended earlier in the
request (the handler does durable kv reads) resumes the VM with a misaligned
stack, so the NEXT CALL_PRIM (often a map) gets wrong args. The failing map/rest
calls are just the first prim calls after a resume. Worth a vm-trace of a handler that
suspends then maps.
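The hypothesis can be made concrete with a toy operand stack in OCaml. The types and op names are hypothetical, not the sx_vm.ml representation. The resume is modelled as restoring a saved stack depth. If that depth is wrong, the resume either drops a live slot or exposes a stale one. In both cases the next CALL_PRIM "map" pops something other than (fn list), which matches the reported error.

```ocaml
(* Toy operand stack; hypothetical, not the sx_vm.ml representation. *)
type v = VFn of (int -> int) | VList of int list | VInt of int

exception Eval_error of string

type op =
  | Push of v
  | Resume_at of int  (* resume sets the stack depth from saved state *)
  | Call_map          (* CALL_PRIM "map" with 2 args *)

(* Drop the top k slots. *)
let rec drop k stack =
  if k <= 0 then stack
  else match stack with [] -> [] | _ :: rest -> drop (k - 1) rest

(* CALL_PRIM "map": top of stack is the last arg (list), below it the fn. *)
let call_prim_map (stack : v list) : v list =
  match stack with
  | VList xs :: VFn f :: rest -> VList (List.map f xs) :: rest
  | _ -> raise (Eval_error "map: expected (fn list)")

let step (stack : v list) (o : op) : v list =
  match o with
  | Push x -> x :: stack
  | Resume_at depth ->
    let n = List.length stack in
    (* Too shallow: live slots vanish. Too deep: a stale slot appears. *)
    if depth <= n then drop (n - depth) stack
    else List.init (depth - n) (fun _ -> VInt 0) @ stack
  | Call_map -> call_prim_map stack

let run (prog : op list) : v list = List.fold_left step [] prog

(* Push fn, perform + resume, push list, map: the map is the first
   CALL_PRIM after the resume. Depth 1 is the correct saved depth. *)
let prog saved_depth =
  [ Push (VFn succ); Resume_at saved_depth; Push (VList [1; 2]); Call_map ]
```

run (prog 1) yields [VList [2; 3]]. Both run (prog 0) and run (prog 2) raise Eval_error "map: expected (fn list)", even though the map call itself is unchanged.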
(B) fallback doesn't recover the failed call. register_jit_hook
(hosts/ocaml/bin/sx_server.ml ~L1607-1623): on first-call error it warns, sets
l.l_compiled <- jit_failed_sentinel, and returns None — intended to fall
through to CEK. But the error still escapes to the http-listen handler (→ 500)
instead of the call being re-run on CEK and returning a value. So even granting
(A), the request shouldn't 500: the fallback should recover THIS call, not just
mark the fn for next time. (Your own notes flagged this as the deferred
"propagate-don't-rerun" shared-CEK change — this is the same thing biting live.)
Fixing EITHER (A) or (B) unblocks the host: (A) removes the miscompile; (B) makes any miscompile self-heal on the first hit instead of 500ing.
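The recovery that (B) asks for can be sketched in simplified OCaml. The field and sentinel names (l_compiled, jit_failed_sentinel) echo the hand-off, but these types are hypothetical. call_current mirrors the described behaviour: it warns, marks the function, and lets the error escape, so a retry self-heals. call_with_fallback also re-runs this call on CEK. A plain re-run is only safe when the failed JIT attempt had no observable side effects. A repeated durable read is harmless, but a repeated write is not. The propagate-don't-rerun approach mentioned above presumably avoids that restriction.

```ocaml
(* Hypothetical sketch; names echo the hand-off but types are simplified. *)
exception Eval_error of string

type compiled =
  | Jit of (int -> int)
  | Failed  (* plays the role of jit_failed_sentinel *)

type lambda = {
  name : string;
  mutable l_compiled : compiled option;
  cek : int -> int;  (* CEK interpretation of the same function *)
}

(* Behaviour described in (B): warn and mark, but the error still escapes
   to the caller (the http-listen handler, hence the 500). *)
let call_current (l : lambda) (arg : int) : int =
  match l.l_compiled with
  | Some (Jit code) ->
    (try code arg with
     | Eval_error msg as e ->
       Printf.eprintf "[jit] %s first-call fallback to CEK: %s\n" l.name msg;
       l.l_compiled <- Some Failed;
       raise e)
  | Some Failed | None -> l.cek arg

(* Proposed: mark the lambda AND recover this call by re-running it on CEK. *)
let call_with_fallback (l : lambda) (arg : int) : int =
  match l.l_compiled with
  | Some (Jit code) ->
    (try code arg with
     | Eval_error msg ->
       Printf.eprintf "[jit] %s first-call fallback to CEK: %s\n" l.name msg;
       l.l_compiled <- Some Failed;
       l.cek arg)
  | Some Failed | None -> l.cek arg

(* A lambda whose JIT code fails the way host/blog--edges-block did. *)
let make_broken () = {
  name = "host/blog--edges-block";
  l_compiled = Some (Jit (fun _ -> raise (Eval_error "map: expected (fn list)")));
  cek = (fun x -> x * 2);
}
```

With call_current, the first call on make_broken () raises and only the retry returns 42. With call_with_fallback, the first call already returns 42.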
Repro
- Build the merged binary (loops/host now carries sx-vm-extensions; the gate + render-page coexist in sx_server.ml's persistent serving branch).
- SX_SERVING_JIT=1 bash lib/host/serve.sh on a port (durable backend), but FIRST remove the (jit-exclude! "host/*" ...) line from serve.sh so host code JITs.
- curl http://127.0.0.1:PORT/welcome/ → first hit 500 (map: expected (fn list)), retry 200.
- curl / (home, uses map+rest) likewise.
Tooling: (vm-trace "<sx>"), (bytecode-inspect "host/blog--edges-block"),
(prim-check "host/blog--edges-block") (CLAUDE.md "VM/Bytecode Debugging").
Current mitigation (host side, to remove once fixed)
lib/host/serve.sh: when SX_SERVING_JIT=1, (jit-exclude! "host/*" "dream-*" "dr/*"). Host app + Dream framework run on CEK (they're IO-bound — no perf loss);
Datalog (dl-*/relations-*) keeps JITting (the win). Drop this once (A)/(B) land.
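Here is a hedged sketch of how exclude patterns like these could be matched. It assumes that a trailing * means "any name with this prefix" and that anything else must match exactly. The real matcher in sx_server.ml may behave differently, and pattern_matches and is_excluded are hypothetical names.

```ocaml
(* Hypothetical matcher for jit-exclude! patterns: a trailing '*' means
   "any name with this prefix"; anything else must match exactly.
   The real matcher in sx_server.ml may differ. *)
let pattern_matches (pat : string) (name : string) : bool =
  let n = String.length pat in
  if n > 0 && pat.[n - 1] = '*' then begin
    let prefix = String.sub pat 0 (n - 1) in
    let p = String.length prefix in
    String.length name >= p && String.sub name 0 p = prefix
  end else pat = name

(* A function is excluded from JIT if any pattern matches its name. *)
let is_excluded (patterns : string list) (name : string) : bool =
  List.exists (fun pat -> pattern_matches pat name) patterns

(* The band-aid list from serve.sh. *)
let serve_sh_excludes = [ "host/*"; "dream-*"; "dr/*" ]
```

Under this assumption, is_excluded serve_sh_excludes "host/blog--edges-block" is true. A Datalog name such as the made-up "dl-saturate" matches none of the three prefixes, which is how the Datalog path can keep JITting.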
Refined data (100% JIT, no exclude, 2026-06-28)
Host now runs at 100% serving JIT (no jit-exclude). Out of 255 successful JIT
compiles, only ~3 functions miscompile, all on a multi-arg LIST PRIMITIVE with
wrong CALL_PRIM args, all in the durable-read request path, all failing on the
FIRST list-prim call after a perform (kv read):
- host/blog--edges-block → map: expected (fn list) (CALL_PRIM "map" 2 args)
- a fn using rest → rest: 1 list arg
- host/blog-relate-options → drop: list and number (CALL_PRIM "drop" 2 args)
Conformance (epoch eval, no http-listen/perform) is 271/271 under JIT — so it's NOT the data-first swap alone; the serving/perform path is the trigger. Strongly supports the OP_PERFORM-resume stack-misalignment theory: the prim that fails is just the first CALL_PRIM after the resume. 252+ other fns JIT clean.