rose-ash/plans/HANDOFF-jit-miscompile.md
giles d8d7663565
host: fix serving-JIT host miscompile — install IO resolver for http-listen
The serving-JIT perform-in-HO-callback miscompile (map/rest/drop wrong
CALL_PRIM args → blank pages, empty picker) is now fully fixed, so the host
runs 100% serving JIT with NO jit-exclude.

sx-vm-extensions 81177d0e resolves a suspended HO-callback's IO inline
(instead of unwinding the native map/filter loop and corrupting the stack),
but ONLY when a synchronous resolver is installed (!_cek_io_resolver = Some).
The host serves via the http-listen primitive, whose handler drove durable IO
through cek_run_with_io with the resolver = None — so it hit the unwinding
path the fix doesn't cover. (The vm-ext repro installed a resolver, so it
never exercised the host's real no-resolver path.)

Fix: extract cek_run_with_io's IO resolution into resolve_io_request, and have
http-listen install _cek_io_resolver := Some (fun req _ -> resolve_io_request
req) — byte-identical resolution, so the inline path resolves durable reads
exactly as the CEK loop would.

Verified: host conformance 271/271; ephemeral durable server at 100% JIT (no
exclude) zero fallbacks + real content + related shown + picker 12 candidates;
live blog.rose-ash.com home/post/tags 200 with related posts, zero error-log
lines; relate-picker Playwright 4/4 (infinite-scroll + filter + relate).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 20:13:24 +00:00

# Hand-off: serving-mode JIT miscompiles host handlers (to sx-vm-extensions)
> ## ✅ RESOLVED 2026-06-28 — host now runs 100% serving JIT, no exclude.
>
> Two composing pieces fixed it:
> 1. **sx-vm-extensions `81177d0e`** (`sx_vm.ml` `call_closure_reuse`): when an
> HO-primitive callback (map/filter/reduce/…) suspends on a `perform` AND a
> synchronous resolver is installed, resolve its IO inline and run it to
> completion instead of unwinding the native loop (which dropped iteration
> state and misaligned the stack → the next `CALL_PRIM` got wrong args).
> 2. **host side (`sx_server.ml`)**: that fix only engages when
> `!_cek_io_resolver = Some`. The host serves via the `http-listen` primitive,
> whose handler drove durable IO through `cek_run_with_io` with the resolver
> **= None**, so it hit the unwinding path the fix doesn't cover (the
> vm-extensions repro `repro_jit_resume.ml` *installed* a resolver, so it never
> exercised the host's real path). Fix: extracted `cek_run_with_io`'s IO
> resolution into `resolve_io_request`, and `http-listen` now installs
> `_cek_io_resolver := Some (fun req _ -> resolve_io_request req)` — byte-
> identical resolution, so the inline-resolve path resolves durable reads
> exactly as the CEK loop would.
>
> Verified: host conformance **271/271**; ephemeral durable server at 100% JIT
> (no exclude) — zero fallbacks, real content, related posts shown, picker lists
> 12 candidates; live blog.rose-ash.com home/post/tags 200 with related posts and
> zero error-log lines; relate-picker Playwright **4/4** (infinite-scroll +
> filter + relate, the `drop` path). `serve.sh` exclude dropped.
>
> Everything below is the original hand-off, kept for the record.
---
> From the **host-on-sx** loop, 2026-06-28. We enabled `SX_SERVING_JIT=1` on the
> live host (blog.rose-ash.com) — the Datalog/relations saturation JITs cleanly
> and is the real win (host conformance 271/271 under JIT, 5.4× faster; live
> `/tags` 2.5s → 0.76s). BUT host app handlers MISCOMPILE in the serving path, so
> we had to `(jit-exclude! "host/*" "dream-*" "dr/*")` in serve.sh as a band-aid.
> Please fix the underlying bug so the exclude can be dropped.
## Symptom
Under `SX_SERVING_JIT=1`, the FIRST request to most pages 500s, then self-heals
(retries 200). stderr shows, paired:
```
[jit] host/blog--edges-block first-call fallback to CEK: Sx_types.Eval_error("map: expected (fn list) (in CALL_PRIM \"map\" with 2 args)")
[http-listen] handler error: Sx_types.Eval_error("map: expected (fn list) (in CALL_PRIM \"map\" with 2 args)")
```
Also seen: `Sx_types.Eval_error("rest: 1 list arg")`.
## Two distinct bugs
**(A) codegen / VM-state.** A JIT'd function's bytecode runs `CALL_PRIM "map"`
(and `rest`) with args the primitive rejects (`expected (fn list)`, 2 args
pushed but wrong). KEY CLUE: **host conformance under `SX_SERVING_JIT=1` is
271/271** — the SAME functions (host/blog--edges-block etc.) JIT fine when driven
via the epoch `(eval ...)` path. It ONLY miscompiles in the **http-listen +
cek_run_with_io** serving path. So it is not pure codegen — it's triggered by the
serving/IO context. Strong hypothesis: a `perform`/`VmSuspended` earlier in the
request (the handler does durable kv reads) resumes the VM with a misaligned
stack, so the NEXT `CALL_PRIM` (often a `map`) gets wrong args. The map/rest are
just the first prim call after a resume. Worth a `vm-trace` of a handler that
suspends then maps.
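
A toy model may make the suspected mechanism easier to picture. The sketch below is not taken from `sx_vm.ml` or `sx_server.ml`: every type and name in it (`step`, `native_map`, `io_resolver`, `read_then_tag`) is hypothetical, except `resolve_io_request`, which borrows the name used in the resolution note at the top. It shows only why a callback that suspends inside a native `map` loop must either have its IO resolved inline or abandon the loop. It does not reproduce the stack misalignment itself.

```ocaml
(* Toy model only: illustrative types, not the real VM definitions. *)
type value = Int of int | Str of string

(* What a callback hands back: a finished value, or a suspension carrying
   the IO request plus a continuation to resume with the result. *)
type step =
  | Done of value
  | Suspended of string * (value -> step)

exception Unwound of string  (* the native loop was abandoned mid-iteration *)

(* Single place where IO requests are answered (stand-in for the extracted
   resolve_io_request; the body here is invented). *)
let resolve_io_request (req : string) : value = Str ("value-of:" ^ req)

(* Optional synchronous resolver, None by default. *)
let io_resolver : (string -> value) option ref = ref None

(* Drive one callback result to completion. With a resolver installed the IO
   is resolved inline; without one the suspension escapes the loop. *)
let rec run_to_completion (s : step) : value =
  match s with
  | Done v -> v
  | Suspended (req, k) ->
    (match !io_resolver with
     | Some resolve -> run_to_completion (k (resolve req))
     | None -> raise (Unwound req))

(* Native "map" loop: its iteration state lives in this OCaml frame, so an
   exception that unwinds through it discards that state. *)
let native_map (f : value -> step) (xs : value list) : value list =
  List.map (fun x -> run_to_completion (f x)) xs

(* A callback that performs a durable read before returning. *)
let read_then_tag (v : value) : step =
  match v with
  | Str key -> Suspended (key, fun got -> Done got)
  | Int _ -> Done v

let () =
  (* No resolver: the suspension unwinds the loop. *)
  (match native_map read_then_tag [Str "a"; Str "b"] with
   | _ -> print_endline "unexpected: completed without a resolver"
   | exception Unwound req -> Printf.printf "unwound on %s\n" req);
  (* Resolver installed, delegating to the shared resolution function. *)
  io_resolver := Some (fun req -> resolve_io_request req);
  match native_map read_then_tag [Str "a"; Str "b"] with
  | [Str a; Str b] -> Printf.printf "%s %s\n" a b
  | _ -> print_endline "unexpected shape"
```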
**(B) fallback doesn't recover the failed call.** `register_jit_hook`
(`hosts/ocaml/bin/sx_server.ml` ~L1607-1623): on first-call error it warns, sets
`l.l_compiled <- jit_failed_sentinel`, and returns `None` — intended to fall
through to CEK. But the error still escapes to the http-listen handler (→ 500)
instead of the call being re-run on CEK and returning a value. So even granting
(A), the request shouldn't 500: the fallback should recover THIS call, not just
mark the fn for next time. (Your own notes flagged this as the deferred
"propagate-don't-rerun" shared-CEK change; this is the same thing biting live.)

Fixing EITHER (A) or (B) unblocks the host: (A) removes the miscompile; (B) makes
any miscompile self-heal on the first hit instead of 500ing.
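
The recovery that (B) asks for could look roughly like the sketch below. It is not the real `register_jit_hook`: the `lambda` record, the `compiled` variant and the `call` function are hypothetical stand-ins, with `Jit_failed` playing the role of `jit_failed_sentinel` and `interpret` standing in for the CEK path. It also assumes the failed JIT attempt had no observable side effects before it raised, which a real fix would have to check.

```ocaml
(* Illustrative sketch with hypothetical types. *)
type value = Int of int

exception Eval_error of string

type compiled =
  | Not_compiled
  | Compiled of (value list -> value)
  | Jit_failed                      (* role of jit_failed_sentinel *)

type lambda = {
  name : string;
  interpret : value list -> value;  (* stand-in for the CEK path *)
  mutable compiled : compiled;
}

(* Call through the JIT when possible. On a JIT error, mark the function as
   failed AND re-run this same call on the interpreter, so the caller gets a
   value instead of the error. *)
let call (l : lambda) (args : value list) : value =
  match l.compiled with
  | Compiled code ->
    (try code args
     with Eval_error msg ->
       Printf.eprintf "[jit] %s fallback to interpreter: %s\n" l.name msg;
       l.compiled <- Jit_failed;
       l.interpret args)
  | Not_compiled | Jit_failed -> l.interpret args

let () =
  let sum args = Int (List.fold_left (fun acc (Int n) -> acc + n) 0 args) in
  let broken _ = raise (Eval_error "map: expected (fn list)") in
  let l = { name = "demo/sum"; interpret = sum; compiled = Compiled broken } in
  let (Int first) = call l [Int 1; Int 2] in   (* recovered on the interpreter *)
  let (Int second) = call l [Int 3; Int 4] in  (* JIT skipped from now on *)
  Printf.printf "%d %d\n" first second
```

The point of the design is that the first failing call and every later call take the same interpreter path, so the caller never sees the JIT error.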
## Repro
1. Build the merged binary (loops/host now carries sx-vm-extensions; the gate +
render-page coexist in sx_server.ml's persistent serving branch).
2. `SX_SERVING_JIT=1 bash lib/host/serve.sh` on a port (durable backend), but
FIRST remove the `(jit-exclude! "host/*" ...)` line from serve.sh so host code
JITs.
3. `curl http://127.0.0.1:PORT/welcome/` → first hit 500 (`map: expected (fn list)`),
retry 200. `curl /` (home, uses map+rest) likewise.

Tooling: `(vm-trace "<sx>")`, `(bytecode-inspect "host/blog--edges-block")`,
`(prim-check "host/blog--edges-block")` (CLAUDE.md "VM/Bytecode Debugging").
## Current mitigation (host side, to remove once fixed)
`lib/host/serve.sh`: when `SX_SERVING_JIT=1`, `(jit-exclude! "host/*" "dream-*"
"dr/*")`. Host app + Dream framework run on CEK (they're IO-bound — no perf loss);
Datalog (`dl-*`/`relations-*`) keeps JITting (the win). Drop this once (A)/(B) land.
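
To illustrate which functions the exclude list would cover, here is a minimal sketch that assumes each pattern is either an exact name or a prefix ending in a single trailing `*`. That matching rule is an assumption; the real `jit-exclude!` matcher may behave differently. The helper names `matches` and `excluded` are hypothetical, as are the sample function names other than `host/blog--edges-block`.

```ocaml
(* Assumption: a pattern is an exact name, or a prefix followed by one
   trailing '*'. The real jit-exclude! semantics may differ. *)
let matches (pattern : string) (name : string) : bool =
  let n = String.length pattern in
  if n > 0 && pattern.[n - 1] = '*' then
    let prefix = String.sub pattern 0 (n - 1) in
    String.length name >= n - 1
    && String.sub name 0 (n - 1) = prefix
  else pattern = name

(* A function is excluded from JIT if any pattern matches its name. *)
let excluded (patterns : string list) (name : string) : bool =
  List.exists (fun p -> matches p name) patterns

let () =
  let patterns = ["host/*"; "dream-*"; "dr/*"] in
  List.iter
    (fun name ->
       Printf.printf "%-26s %s\n" name
         (if excluded patterns name then "CEK (excluded)" else "JIT"))
    (* Sample names; only the first appears in the hand-off itself. *)
    ["host/blog--edges-block"; "dream-router"; "dl-saturate"; "relations-closure"]
```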
## Refined data (100% JIT, no exclude, 2026-06-28)
Host now runs at 100% serving JIT (no jit-exclude). Out of **255 successful JIT
compiles, only ~3 functions miscompile**, all on a multi-arg LIST PRIMITIVE with
wrong CALL_PRIM args, all in the durable-read request path, all failing on the
FIRST list-prim call after a `perform` (kv read):
- `host/blog--edges-block` → `map: expected (fn list) (CALL_PRIM "map" 2 args)`
- a fn using `rest` → `rest: 1 list arg`
- `host/blog-relate-options` → `drop: list and number (CALL_PRIM "drop" 2 args)`

Conformance (epoch eval, no http-listen/perform) is 271/271 under JIT, so it's
NOT the data-first swap alone; the **serving/perform path** is the trigger.
Strongly supports the OP_PERFORM-resume stack-misalignment theory: the prim that
fails is just the first CALL_PRIM after the resume. 252+ other fns JIT clean.