The serving-JIT perform-in-HO-callback miscompile (map/rest/drop wrong
CALL_PRIM args → blank pages, empty picker) is now fully fixed, so the host
runs 100% serving JIT with NO jit-exclude.
sx-vm-extensions 81177d0e resolves a suspended HO-callback's IO inline
(instead of unwinding the native map/filter loop and corrupting the stack),
but ONLY when a synchronous resolver is installed (!_cek_io_resolver = Some).
The host serves via the http-listen primitive, whose handler drove durable IO
through cek_run_with_io with the resolver = None — so it hit the unwinding
path the fix doesn't cover. (The vm-ext repro installed a resolver, so it
never exercised the host's real no-resolver path.)
Fix: extract cek_run_with_io's IO resolution into resolve_io_request, and have
http-listen install _cek_io_resolver := Some (fun req _ -> resolve_io_request
req) — byte-identical resolution, so the inline path resolves durable reads
exactly as the CEK loop would.
Verified: host conformance 271/271; ephemeral durable server at 100% JIT (no
exclude) zero fallbacks + real content + related shown + picker 12 candidates;
live blog.rose-ash.com home/post/tags 200 with related posts, zero error-log
lines; relate-picker Playwright 4/4 (infinite-scroll + filter + relate).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
109 lines
5.9 KiB
Markdown
# Hand-off: serving-mode JIT miscompiles host handlers (to sx-vm-extensions)

> ## ✅ RESOLVED 2026-06-28 — host now runs 100% serving JIT, no exclude.
>
> Two composing pieces fixed it:
> 1. **sx-vm-extensions `81177d0e`** (`sx_vm.ml` `call_closure_reuse`): when an
>    HO-primitive callback (map/filter/reduce/…) suspends on a `perform` AND a
>    synchronous resolver is installed, resolve its IO inline and run it to
>    completion instead of unwinding the native loop (which dropped iteration
>    state and misaligned the stack → the next `CALL_PRIM` got wrong args).
> 2. **host side (`sx_server.ml`)**: that fix only engages when
>    `!_cek_io_resolver = Some`. The host serves via the `http-listen` primitive,
>    whose handler drove durable IO through `cek_run_with_io` with the resolver
>    **= None**, so it hit the unwinding path the fix doesn't cover (the
>    vm-extensions repro `repro_jit_resume.ml` *installed* a resolver, so it never
>    exercised the host's real path). Fix: extracted `cek_run_with_io`'s IO
>    resolution into `resolve_io_request`, and `http-listen` now installs
>    `_cek_io_resolver := Some (fun req _ -> resolve_io_request req)` —
>    byte-identical resolution, so the inline-resolve path resolves durable reads
>    exactly as the CEK loop would.
>
> Verified: host conformance **271/271**; ephemeral durable server at 100% JIT
> (no exclude) — zero fallbacks, real content, related posts shown, picker lists
> 12 candidates; live blog.rose-ash.com home/post/tags 200 with related posts and
> zero error-log lines; relate-picker Playwright **4/4** (infinite-scroll +
> filter + relate, the `drop` path). `serve.sh` exclude dropped.
>
> Everything below is the original hand-off, kept for the record.

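
The two pieces described in the resolution note can be illustrated with a toy model. This is a minimal sketch, not the real `sx_vm.ml` / `sx_server.ml` code: the `value` and `io_request` types, the in-memory `store`, the `perform` and `read_all` helpers, the `Suspended` exception, and the `unit` type of the resolver's ignored second argument are all assumptions made for illustration. Only the names `resolve_io_request` and `_cek_io_resolver`, and the `fun req _ -> resolve_io_request req` shape, come from the note. It shows why a callback that performs IO inside a native loop unwinds that loop when no resolver is installed, and completes when one is.

```ocaml
(* Toy model only. All types and the in-memory store are hypothetical
   stand-ins for the real VM values and the durable kv backend. *)

type value = Str of string | Nil

(* A durable-IO request, e.g. a kv read performed by a handler. *)
type io_request = Kv_get of string

(* Raised when a callback performs IO and nothing can resolve it inline. *)
exception Suspended of io_request

(* Hypothetical backing store standing in for the durable kv. *)
let store : (string, string) Hashtbl.t = Hashtbl.create 8

(* Single place where IO is resolved, so the driver loop and the inline
   resolver cannot drift apart. *)
let resolve_io_request (req : io_request) : value =
  match req with
  | Kv_get key ->
    (match Hashtbl.find_opt store key with
     | Some v -> Str v
     | None -> Nil)

(* Optional synchronous resolver. The second argument is ignored here;
   its real type is not given in the note, so `unit` is an assumption. *)
let _cek_io_resolver : (io_request -> unit -> value) option ref = ref None

(* A `perform`: resolve inline when a resolver is installed, otherwise
   suspend by raising, which abandons any native loop in progress. *)
let perform (req : io_request) : value =
  match !_cek_io_resolver with
  | Some resolve -> resolve req ()
  | None -> raise (Suspended req)

(* A native higher-order loop whose callback performs a durable read. *)
let read_all (keys : string list) : value list =
  List.map (fun k -> perform (Kv_get k)) keys

let () =
  Hashtbl.replace store "post:1" "hello";
  (* No resolver installed: the first callback unwinds the whole List.map. *)
  let unwound =
    match read_all ["post:1"; "post:2"] with
    | _ -> false
    | exception Suspended _ -> true
  in
  assert unwound;
  (* Resolver installed, as the note says http-listen now does. *)
  _cek_io_resolver := Some (fun req _ -> resolve_io_request req);
  assert (read_all ["post:1"; "post:2"] = [Str "hello"; Nil])
```

Routing both paths through one `resolve_io_request` function is what makes the inline path resolve reads the same way the driver loop does. In this toy the unwinding path simply loses the loop; the real VM's stack misalignment is not modelled.
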
---

> From the **host-on-sx** loop, 2026-06-28. We enabled `SX_SERVING_JIT=1` on the
> live host (blog.rose-ash.com) — the Datalog/relations saturation JITs cleanly
> and is the real win (host conformance 271/271 under JIT, 5.4× faster; live
> `/tags` 2.5s → 0.76s). BUT host app handlers MISCOMPILE in the serving path, so
> we had to `(jit-exclude! "host/*" "dream-*" "dr/*")` in serve.sh as a band-aid.
> Please fix the underlying bug so the exclude can be dropped.

## Symptom

Under `SX_SERVING_JIT=1`, the FIRST request to most pages 500s, then self-heals
(retries 200). stderr shows, paired:

```
[jit] host/blog--edges-block first-call fallback to CEK: Sx_types.Eval_error("map: expected (fn list) (in CALL_PRIM \"map\" with 2 args)")
[http-listen] handler error: Sx_types.Eval_error("map: expected (fn list) (in CALL_PRIM \"map\" with 2 args)")
```
Also seen: `Sx_types.Eval_error("rest: 1 list arg")`.

## Two distinct bugs

**(A) codegen / VM-state.** A JIT'd function's bytecode runs `CALL_PRIM "map"`
(and `rest`) with args the primitive rejects (`expected (fn list)`, 2 args
pushed but wrong). KEY CLUE: **host conformance under `SX_SERVING_JIT=1` is
271/271** — the SAME functions (host/blog--edges-block etc.) JIT fine when driven
via the epoch `(eval ...)` path. It ONLY miscompiles in the **http-listen +
cek_run_with_io** serving path. So it is not pure codegen — it's triggered by the
serving/IO context. Strong hypothesis: a `perform`/`VmSuspended` earlier in the
request (the handler does durable kv reads) resumes the VM with a misaligned
stack, so the NEXT `CALL_PRIM` (often a `map`) gets wrong args. The map/rest are
just the first prim call after a resume. Worth a `vm-trace` of a handler that
suspends then maps.

**(B) fallback doesn't recover the failed call.** `register_jit_hook`
(`hosts/ocaml/bin/sx_server.ml` ~L1607-1623): on first-call error it warns, sets
`l.l_compiled <- jit_failed_sentinel`, and returns `None` — intended to fall
through to CEK. But the error still escapes to the http-listen handler (→ 500)
instead of the call being re-run on CEK and returning a value. So even granting
(A), the request shouldn't 500: the fallback should recover THIS call, not just
mark the fn for next time. (Your own notes flagged this as the deferred
"propagate-don't-rerun" shared-CEK change — this is the same thing biting live.)

Fixing EITHER (A) or (B) unblocks the host: (A) removes the miscompile; (B) makes
any miscompile self-heal on the first hit instead of 500ing.

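
The behaviour asked for in (B) can be sketched as a toy model. This is not the real `register_jit_hook`: the `compiled` and `lambda` types, the `call` function, and the use of `Failure` as the JIT error are hypothetical stand-ins, and only the field name `l_compiled` and the log wording are borrowed from the text above. The point it illustrates is that the fallback both marks the function as failed and re-runs the current call on the interpreter, so the caller receives a value.

```ocaml
(* Toy model of a JIT hook with first-call fallback. Hypothetical names
   throughout; integers stand in for real VM values. *)

type compiled =
  | Not_compiled
  | Compiled of (int list -> int)   (* stand-in for JIT'd bytecode *)
  | Jit_failed                      (* sentinel: do not try the JIT again *)

type lambda = {
  name : string;
  interpret : int list -> int;      (* stand-in for the CEK interpreter *)
  mutable l_compiled : compiled;
}

(* Call through the JIT when possible. On a JIT failure, mark the function
   so later calls skip the JIT, then re-run THIS call on the interpreter
   so the error does not escape to the caller. *)
let call (l : lambda) (args : int list) : int =
  match l.l_compiled with
  | Compiled f ->
    (try f args
     with Failure msg ->
       Printf.eprintf "[jit] %s first-call fallback to CEK: %s\n" l.name msg;
       l.l_compiled <- Jit_failed;
       l.interpret args)
  | Not_compiled | Jit_failed -> l.interpret args

let () =
  let sum = List.fold_left ( + ) 0 in
  let l = {
    name = "demo/sum";
    interpret = sum;
    (* A deliberately broken compiled body standing in for the miscompile. *)
    l_compiled = Compiled (fun _ -> failwith "map: expected (fn list)");
  } in
  (* The first call recovers on the interpreter instead of raising. *)
  assert (call l [1; 2; 3] = 6);
  assert (match l.l_compiled with Jit_failed -> true | _ -> false);
  (* Later calls go straight to the interpreter. *)
  assert (call l [4; 5] = 9)
```

The sketch assumes the failed attempt had no visible side effects. Whether a re-run is safe after the JIT'd code has already performed IO is a separate question that this model does not address.
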
## Repro

1. Build the merged binary (loops/host now carries sx-vm-extensions; the gate +
   render-page coexist in sx_server.ml's persistent serving branch).
2. `SX_SERVING_JIT=1 bash lib/host/serve.sh` on a port (durable backend), but
   FIRST remove the `(jit-exclude! "host/*" ...)` line from serve.sh so host code
   JITs.
3. `curl http://127.0.0.1:PORT/welcome/` → first hit 500 (`map: expected (fn list)`),
   retry 200. `curl /` (home, uses map+rest) likewise.

Tooling: `(vm-trace "<sx>")`, `(bytecode-inspect "host/blog--edges-block")`,
`(prim-check "host/blog--edges-block")` (CLAUDE.md "VM/Bytecode Debugging").

## Current mitigation (host side, to remove once fixed)

`lib/host/serve.sh`: when `SX_SERVING_JIT=1`, `(jit-exclude! "host/*" "dream-*"
"dr/*")`. Host app + Dream framework run on CEK (they're IO-bound — no perf loss);
Datalog (`dl-*`/`relations-*`) keeps JITting (the win). Drop this once (A)/(B) land.

## Refined data (100% JIT, no exclude, 2026-06-28)

Host now runs at 100% serving JIT (no jit-exclude). Out of **255 successful JIT
compiles, only ~3 functions miscompile**, all on a multi-arg LIST PRIMITIVE with
wrong CALL_PRIM args, all in the durable-read request path, all failing on the
FIRST list-prim call after a `perform` (kv read):
- `host/blog--edges-block` → `map: expected (fn list) (CALL_PRIM "map" 2 args)`
- a fn using `rest` → `rest: 1 list arg`
- `host/blog-relate-options` → `drop: list and number (CALL_PRIM "drop" 2 args)`

Conformance (epoch eval, no http-listen/perform) is 271/271 under JIT — so it's
NOT the data-first swap alone; the **serving/perform path** is the trigger.
Strongly supports the OP_PERFORM-resume stack-misalignment theory: the prim that
fails is just the first CALL_PRIM after the resume. 252+ other fns JIT clean.