rose-ash/plans/HANDOFF-jit-miscompile.md
giles a697904c7c
docs: refined serving-JIT miscompile data (3 fns, list-prim-after-perform)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 19:16:32 +00:00


Hand-off: serving-mode JIT miscompiles host handlers (to sx-vm-extensions)

From the host-on-sx loop, 2026-06-28. We enabled SX_SERVING_JIT=1 on the live host (blog.rose-ash.com). The Datalog/relations saturation JITs cleanly and is the real win (host conformance 271/271 under JIT, 5.4× faster; live /tags 2.5s → 0.76s). BUT host app handlers MISCOMPILE in the serving path, so we had to add (jit-exclude! "host/*" "dream-*" "dr/*") to serve.sh as a band-aid. Please fix the underlying bug so the exclude can be dropped.

Symptom

Under SX_SERVING_JIT=1, the FIRST request to most pages 500s, then self-heals (retries 200). stderr shows, paired:

[jit] host/blog--edges-block first-call fallback to CEK: Sx_types.Eval_error("map: expected (fn list) (in CALL_PRIM \"map\" with 2 args)")
[http-listen] handler error: Sx_types.Eval_error("map: expected (fn list) (in CALL_PRIM \"map\" with 2 args)")

Also seen: Sx_types.Eval_error("rest: 1 list arg").
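To tally which functions and primitives are hitting the fallback, the stderr lines above can be scraped with something like the following. This is a minimal sketch, not existing tooling: `parse_fallbacks` and `failing_prims` are hypothetical helper names, and the regex assumes the log format is exactly as quoted above.

```python
import re
from collections import Counter

# Matches the "[jit] <fn> first-call fallback to CEK: ..." lines quoted above.
# Assumption: the function name has no whitespace and the format is stable.
FALLBACK_RE = re.compile(
    r'^\[jit\] (?P<fn>\S+) first-call fallback to CEK: '
    r'Sx_types\.Eval_error\("(?P<msg>.*)"\)\s*$'
)

def parse_fallbacks(lines):
    """Return (function_name, error_message) for each JIT fallback line."""
    found = []
    for line in lines:
        match = FALLBACK_RE.match(line.strip())
        if match:
            found.append((match.group("fn"), match.group("msg")))
    return found

def failing_prims(lines):
    """Count fallbacks per primitive; the prim name is the text before the first ':'."""
    return Counter(msg.split(":", 1)[0] for _, msg in parse_fallbacks(lines))
```

Feeding it the server's stderr (for example `failing_prims(open("server.err"))`, where `server.err` is a placeholder path) gives a per-primitive count; the `[http-listen] handler error` lines are ignored because they repeat the same error.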

Two distinct bugs

(A) codegen / VM-state. A JIT'd function's bytecode runs CALL_PRIM "map" (and rest) with args the primitive rejects: 2 args are pushed, but they are not the (fn list) that map expects. KEY CLUE: host conformance under SX_SERVING_JIT=1 is 271/271, i.e. the SAME functions (host/blog--edges-block etc.) JIT fine when driven via the epoch (eval ...) path. They ONLY miscompile in the http-listen + cek_run_with_io serving path. So it is not pure codegen; it's triggered by the serving/IO context. Strong hypothesis: a perform/VmSuspended earlier in the request (the handler does durable kv reads) resumes the VM with a misaligned stack, so the NEXT CALL_PRIM (often a map) gets wrong args. The map/rest are just the first prim call after a resume. Worth a vm-trace of a handler that suspends then maps.
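To make the hypothesis in (A) concrete, here is a toy stack machine in Python. It is NOT the SX VM: the opcodes, the `ToyVM` class, and the specific bug (the perform leaves its request on the stack across the suspend) are all invented for illustration. It only shows how one stray slot after a resume makes an otherwise correct CALL_PRIM "map" see the wrong 2 args, while the same bytecode is fine when nothing suspends incorrectly.

```python
def prim_map(args):
    # Same shape as the error quoted above: map wants (fn list).
    if len(args) != 2 or not callable(args[0]) or not isinstance(args[1], list):
        raise TypeError("map: expected (fn list)")
    fn, xs = args
    return [fn(x) for x in xs]

PRIMS = {"map": prim_map}

class ToyVM:
    def __init__(self, code, buggy_resume=False):
        self.code = code
        self.pc = 0
        self.stack = []
        self.buggy_resume = buggy_resume  # hypothetical bug switch

    def run(self):
        while self.pc < len(self.code):
            op, *operands = self.code[self.pc]
            self.pc += 1
            if op == "PUSH":
                self.stack.append(operands[0])
            elif op == "PERFORM":
                # Correct: pop the request before suspending.
                # Buggy: peek only, so the request stays on the stack.
                request = self.stack[-1] if self.buggy_resume else self.stack.pop()
                return ("suspended", request)
            elif op == "CALL_PRIM":
                name, argc = operands
                base = len(self.stack) - argc
                args = self.stack[base:]
                del self.stack[base:]
                self.stack.append(PRIMS[name](args))
        return ("done", self.stack.pop())

    def resume(self, value):
        self.stack.append(value)  # the IO result lands on top of the stack
        return self.run()

def double(x):
    return x * 2

# push fn, push kv key, perform the read, then map fn over the result
PROGRAM = [("PUSH", double), ("PUSH", "kv:edges"), ("PERFORM",), ("CALL_PRIM", "map", 2)]

def drive(buggy_resume):
    vm = ToyVM(PROGRAM, buggy_resume)
    state, _request = vm.run()
    assert state == "suspended"
    return vm.resume([1, 2, 3])  # pretend the kv read returned this list
```

`drive(False)` completes with the mapped list; `drive(True)` raises `map: expected (fn list)` because the stack is `[double, "kv:edges", [1, 2, 3]]` and the top two slots are `("kv:edges", list)`. Whether the real misalignment is a stale slot, a dropped slot, or a wrong frame base is exactly what the vm-trace should settle.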

(B) fallback doesn't recover the failed call. register_jit_hook (hosts/ocaml/bin/sx_server.ml ~L1607-1623): on first-call error it warns, sets l.l_compiled <- jit_failed_sentinel, and returns None — intended to fall through to CEK. But the error still escapes to the http-listen handler (→ 500) instead of the call being re-run on CEK and returning a value. So even granting (A), the request shouldn't 500: the fallback should recover THIS call, not just mark the fn for next time. (Your own notes flagged this as the deferred "propagate-don't-rerun" shared-CEK change — this is the same thing biting live.)
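The difference between "mark for next time" and "recover this call" can be sketched in a few lines of Python. This is a toy model only: `Lambda`, `call_mark_only`, and `call_with_rerun` are hypothetical names, `JIT_FAILED` merely stands in for jit_failed_sentinel, and nothing here reflects the actual code in sx_server.ml.

```python
JIT_FAILED = object()  # stand-in for jit_failed_sentinel

class Lambda:
    def __init__(self, name, interp, compiled=None):
        self.name = name
        self.interp = interp      # slow path (plays the role of CEK)
        self.compiled = compiled  # fast path, None, or JIT_FAILED

def _has_usable_jit(fn):
    return fn.compiled is not None and fn.compiled is not JIT_FAILED

def call_mark_only(fn, *args):
    """Behaviour as observed: the fn is marked failed, but the error still escapes."""
    if _has_usable_jit(fn):
        try:
            return fn.compiled(*args)
        except Exception:
            fn.compiled = JIT_FAILED  # only helps the NEXT call
            raise
    return fn.interp(*args)

def call_with_rerun(fn, *args):
    """Requested behaviour: mark the fn failed AND re-run this call on the slow path."""
    if _has_usable_jit(fn):
        try:
            return fn.compiled(*args)
        except Exception:
            fn.compiled = JIT_FAILED  # fall through to the interpreter below
    return fn.interp(*args)
```

With `call_mark_only` the first request errors and the retry succeeds, which matches the 500-then-200 pattern; with `call_with_rerun` the first request already returns a value. The sketch ignores one real concern: if the compiled attempt performed IO before failing, a blind re-run repeats it, so the real fix needs to decide how far back to restart.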

Fixing EITHER (A) or (B) unblocks the host: (A) removes the miscompile; (B) makes any miscompile self-heal on the first hit instead of 500ing.

Repro

  1. Build the merged binary (loops/host now carries sx-vm-extensions; the gate + render-page coexist in sx_server.ml's persistent serving branch).
  2. SX_SERVING_JIT=1 bash lib/host/serve.sh on a port (durable backend), but FIRST remove the (jit-exclude! "host/*" ...) line from serve.sh so host code JITs.
  3. curl http://127.0.0.1:PORT/welcome/ → first hit 500 (map: expected (fn list)), retry 200. curl / (home, uses map+rest) likewise.
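Step 3 can be scripted so the first-hit failure is checked mechanically rather than by eye. A minimal sketch using only the Python standard library; `http_status`, `first_hit_statuses`, and `shows_first_hit_failure` are hypothetical helper names, and the URL must be filled in with whatever port serve.sh was started on.

```python
import urllib.error
import urllib.request

def http_status(url, timeout=10.0):
    """GET the URL and return its status code, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urlopen raises on error statuses; keep the code

def first_hit_statuses(url, attempts=2, fetch=http_status):
    """Request the same URL several times in a row and collect the status codes."""
    return [fetch(url) for _ in range(attempts)]

def shows_first_hit_failure(statuses):
    """True when the first request 500s and a later retry returns 200."""
    return len(statuses) >= 2 and statuses[0] == 500 and 200 in statuses[1:]
```

Against a freshly started server, `shows_first_hit_failure(first_hit_statuses("http://127.0.0.1:PORT/welcome/"))` (with PORT substituted) should be true while the bug is present and false once (A) or (B) lands. The server has to be restarted between runs, since the failure only shows on the first call of each function.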

Tooling: (vm-trace "<sx>"), (bytecode-inspect "host/blog--edges-block"), (prim-check "host/blog--edges-block") (CLAUDE.md "VM/Bytecode Debugging").

Current mitigation (host side, to remove once fixed)

lib/host/serve.sh: when SX_SERVING_JIT=1, (jit-exclude! "host/*" "dream-*" "dr/*"). Host app + Dream framework run on CEK (they're IO-bound — no perf loss); Datalog (dl-*/relations-*) keeps JITting (the win). Drop this once (A)/(B) land.

Refined data (100% JIT, no exclude, 2026-06-28)

Host now runs at 100% serving JIT (no jit-exclude). Out of 255 successful JIT compiles, only ~3 functions miscompile, all on a multi-arg LIST PRIMITIVE with wrong CALL_PRIM args, all in the durable-read request path, all failing on the FIRST list-prim call after a perform (kv read):

  • host/blog--edges-block → map: expected (fn list) (CALL_PRIM "map" 2 args)
  • a fn using rest → rest: 1 list arg
  • host/blog-relate-options → drop: list and number (CALL_PRIM "drop" 2 args)

Conformance (epoch eval, no http-listen/perform) is 271/271 under JIT — so it's NOT the data-first swap alone; the serving/perform path is the trigger. Strongly supports the OP_PERFORM-resume stack-misalignment theory: the prim that fails is just the first CALL_PRIM after the resume. 252+ other fns JIT clean.