rose-ash/plans/HANDOFF-jit-miscompile.md
2026-06-28 19:09:26 +00:00

# Hand-off: serving-mode JIT miscompiles host handlers (to sx-vm-extensions)
> From the **host-on-sx** loop, 2026-06-28. We enabled `SX_SERVING_JIT=1` on the
> live host (blog.rose-ash.com) — the Datalog/relations saturation JITs cleanly
> and is the real win (host conformance 271/271 under JIT, 5.4× faster; live
> `/tags` 2.5s → 0.76s). BUT host app handlers MISCOMPILE in the serving path, so
> we had to `(jit-exclude! "host/*" "dream-*" "dr/*")` in serve.sh as a band-aid.
> Please fix the underlying bug so the exclude can be dropped.
## Symptom
Under `SX_SERVING_JIT=1`, the FIRST request to most pages 500s, then self-heals
(the retry returns 200). stderr shows these two lines together:
```
[jit] host/blog--edges-block first-call fallback to CEK: Sx_types.Eval_error("map: expected (fn list) (in CALL_PRIM \"map\" with 2 args)")
[http-listen] handler error: Sx_types.Eval_error("map: expected (fn list) (in CALL_PRIM \"map\" with 2 args)")
```
Also seen: `Sx_types.Eval_error("rest: 1 list arg")`.
## Two distinct bugs
**(A) codegen / VM-state.** A JIT'd function's bytecode runs `CALL_PRIM "map"`
(and `rest`) with args the primitive rejects: the count is right (2 pushed for
`map`) but the values are not `(fn list)`. KEY CLUE: **host conformance under `SX_SERVING_JIT=1` is
271/271** — the SAME functions (host/blog--edges-block etc.) JIT fine when driven
via the epoch `(eval ...)` path. It ONLY miscompiles in the **http-listen +
cek_run_with_io** serving path. So it is not pure codegen — it's triggered by the
serving/IO context. Strong hypothesis: a `perform`/`VmSuspended` earlier in the
request (the handler does durable kv reads) resumes the VM with a misaligned
stack, so the NEXT `CALL_PRIM` (often a `map`) gets wrong args. On this reading,
`map` and `rest` are not the culprits, just the first prim calls after a resume.
Worth a `vm-trace` of a handler that suspends then maps.
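
The stack-misalignment hypothesis can be illustrated with a toy model. The Python sketch below is not the SX VM: the `ToyVM` class, the `PUSH` and `PERFORM` opcodes, and the specific bug (a stale request slot left on the stack at resume) are all assumptions invented for illustration. It only shows how one extra slot after a suspend/resume makes the next `CALL_PRIM "map"` pop the wrong two values and fail with the message seen in stderr.

```python
class Suspended(Exception):
    """Stand-in for VmSuspended: the VM hands an IO request to the host."""
    def __init__(self, request):
        super().__init__(request)
        self.request = request


def prim_map(args):
    fn, lst = args
    if not callable(fn) or not isinstance(lst, list):
        raise TypeError("map: expected (fn list)")
    return [fn(x) for x in lst]


PRIMS = {"map": prim_map}


class ToyVM:
    def __init__(self, code):
        self.code, self.stack, self.pc = code, [], 0

    def run(self):
        while self.pc < len(self.code):
            op, *operands = self.code[self.pc]
            self.pc += 1
            if op == "PUSH":
                self.stack.append(operands[0])
            elif op == "PERFORM":
                # Pop the request and suspend; pc already points past PERFORM.
                raise Suspended(self.stack.pop())
            elif op == "CALL_PRIM":
                name, argc = operands
                args = self.stack[-argc:]
                del self.stack[-argc:]
                self.stack.append(PRIMS[name](args))
        return self.stack.pop()

    def resume(self, value, buggy=False):
        if buggy:
            # HYPOTHETICAL bug: a stale slot survives the suspend.
            self.stack.append("stale-request")
        self.stack.append(value)
        return self.run()


def inc(x):
    return x + 1


KV = {"edges": [1, 2, 3]}  # pretend durable kv store

# Roughly (map inc (kv-get "edges")) in the toy instruction set.
PROGRAM = [("PUSH", inc), ("PUSH", "edges"), ("PERFORM",), ("CALL_PRIM", "map", 2)]


def run_handler(buggy):
    vm = ToyVM(PROGRAM)
    try:
        return vm.run()
    except Suspended as s:
        return vm.resume(KV[s.request], buggy=buggy)
```

With `buggy=False` the handler returns the mapped list; with `buggy=True` the `map` primitive receives `("stale-request", [1, 2, 3])` and raises. If a real `vm-trace` shows the stack depth differing before and after a resume, that would support this reading; if the depth matches, the hypothesis needs revisiting.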

**(B) fallback doesn't recover the failed call.** `register_jit_hook`
(`hosts/ocaml/bin/sx_server.ml` ~L1607-1623): on first-call error it warns, sets
`l.l_compiled <- jit_failed_sentinel`, and returns `None` — intended to fall
through to CEK. But the error still escapes to the http-listen handler (→ 500)
instead of the call being re-run on CEK and returning a value. So even granting
(A), the request shouldn't 500: the fallback should recover THIS call, not just
mark the fn for next time. (Your own notes flagged this as the deferred
"propagate-don't-rerun" shared-CEK change — this is the same thing biting live.)

Fixing EITHER (A) or (B) unblocks the host: (A) removes the miscompile; (B) makes
any miscompile self-heal on the first hit instead of 500ing.
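
The difference between "mark for next time" and "recover this call" in (B) can be sketched as follows. This is a Python model of the observed behaviour, not the code in `sx_server.ml`: `Lambda`, `call_mark_only`, `call_with_recovery`, and `JIT_FAILED` (a stand-in for `jit_failed_sentinel`) are hypothetical names, and the interpreter is reduced to a plain function.

```python
JIT_FAILED = object()  # stand-in for jit_failed_sentinel


class Lambda:
    def __init__(self, name, interpret, compiled=None):
        self.name = name
        self.interpret = interpret  # slow but correct path (CEK stand-in)
        self.compiled = compiled    # None, JIT_FAILED, or a callable


def _has_jit_code(fn):
    return fn.compiled is not None and fn.compiled is not JIT_FAILED


def call_mark_only(fn, *args):
    """Observed behaviour: the fn is marked failed, but the error escapes."""
    if _has_jit_code(fn):
        try:
            return fn.compiled(*args)
        except Exception:
            fn.compiled = JIT_FAILED
            raise  # reaches the request handler, which answers 500
    return fn.interpret(*args)


def call_with_recovery(fn, *args):
    """Desired behaviour: mark the fn failed AND re-run this call interpreted."""
    if _has_jit_code(fn):
        try:
            return fn.compiled(*args)
        except Exception:
            fn.compiled = JIT_FAILED  # never try the JIT code again
    return fn.interpret(*args)


def miscompiled(xs):
    raise ValueError("map: expected (fn list)")


def make_fn():
    return Lambda("host/blog--edges-block",
                  interpret=lambda xs: [x + 1 for x in xs],
                  compiled=miscompiled)
```

Under `call_mark_only` the first call raises and only the second succeeds, which matches the 500-then-200 pattern; under `call_with_recovery` the first call already returns a value. Re-running assumes the aborted JIT attempt left no observable side effects, which may not hold for handlers that perform IO before failing; that is presumably why the change was deferred.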
## Repro
1. Build the merged binary (loops/host now carries sx-vm-extensions; the gate +
render-page coexist in sx_server.ml's persistent serving branch).
2. Remove the `(jit-exclude! "host/*" ...)` line from serve.sh so host code
JITs, then run `SX_SERVING_JIT=1 bash lib/host/serve.sh` on a port (durable
backend).
3. `curl http://127.0.0.1:PORT/welcome/` → first hit 500 (`map: expected (fn list)`),
retry 200. `curl http://127.0.0.1:PORT/` (home, uses `map` + `rest`) behaves likewise.

Tooling: `(vm-trace "<sx>")`, `(bytecode-inspect "host/blog--edges-block")`,
`(prim-check "host/blog--edges-block")` (CLAUDE.md "VM/Bytecode Debugging").
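
Step 3 of the repro can be automated with a small probe that requests each page twice and labels the result. This is a sketch using only the Python standard library; `status_of`, `classify`, and `probe` are hypothetical helper names, and the base URL (including the port) must be supplied by whoever runs it. Since the failure is described as a first-request effect, the server presumably needs a fresh start before each probe run.

```python
import sys
import urllib.error
import urllib.request


def status_of(url):
    """Return the HTTP status code for url, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def classify(first, retry):
    if first == 500 and retry == 200:
        return "miscompile-then-self-heal"
    if first == 200 and retry == 200:
        return "ok"
    return "other"


def probe(base_url, paths=("/welcome/", "/"), fetch=status_of):
    """Hit each path twice; fetch is injectable so the logic can be tested offline."""
    results = {}
    for path in paths:
        first = fetch(base_url + path)
        retry = fetch(base_url + path)
        results[path] = (first, retry, classify(first, retry))
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python probe.py http://127.0.0.1:PORT
    for path, (first, retry, label) in probe(sys.argv[1]).items():
        print(f"{path}: first={first} retry={retry} -> {label}")
```

A run that reports `ok` for both paths with the `jit-exclude!` line removed would be a reasonable signal that (A) or (B) has landed.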
## Current mitigation (host side, to remove once fixed)
`lib/host/serve.sh`: when `SX_SERVING_JIT=1`, `(jit-exclude! "host/*" "dream-*"
"dr/*")`. Host app + Dream framework run on CEK (they're IO-bound — no perf loss);
Datalog (`dl-*`/`relations-*`) keeps JITting (the win). Drop this once (A)/(B) land.