rose-ash/plans/jit-bytecode-correctness.md
2026-06-28 16:32:17 +00:00

# JIT bytecode correctness — enable the JIT in serving mode
> Kickoff handed over from the **host-on-sx** loop (2026-06-19). This is the
> highest-leverage perf win on the platform.
## Why this matters
Every SX-on-SX subsystem runs **interpreted on the tree-walking CEK**: the
Smalltalk runtime (→ content-on-sx rendering), and the guest languages
(Datalog, Prolog, APL, Scheme, Haskell, Erlang, Maude). The lazy JIT
(`register_jit_hook` → bytecode VM) would speed all of them up ~10-60×. It is
currently **only installed in `--http` page-server mode**, not the epoch /
`http-listen` serving mode — because it **miscompiles** these workloads.
Concrete impact: the host serves a blog post (`content/html`, interpreted
Smalltalk) in **~2 seconds per request**. With a correct JIT it should be tens
of ms. Same slowdown applies to every guest-language-backed service.
## Concrete repro (from the host loop)
In `hosts/ocaml/bin/sx_server.ml`, the persistent server mode (`make_server_env`,
~line 4871) does **not** call `register_jit_hook env` — only the `--http` mode
(~line 4034) does. To reproduce the miscompile:
1. Add `register_jit_hook env;` right after `let env = make_server_env () in` in
the persistent server-mode branch (~4871).
2. Rebuild: `eval $(opam env --switch=5.2.0); dune build bin/sx_server.exe`.
3. Run a Smalltalk/content-heavy suite, e.g. the host-on-sx conformance
(`bash /root/rose-ash-loops/host/lib/host/conformance.sh`, or any
content-on-sx suite). **With the hook ON, tests FAIL** — host-on-sx dropped to
`router 3/6, feed 4/11, relations 9/16, blog 4/11`. With the hook OFF: all green.
So the JIT produces **wrong results** (the known "compiled compiler helpers loop
on complex nested ASTs" bug; see memory `project_jit_bytecode_bug`).
## Goal
Make the JIT compile the Smalltalk-on-SX evaluator + guest-language evaluators
**correctly**, so `register_jit_hook` can be enabled in serving mode with
conformance **fully green**. Then enable it there.
## Suggested approach
- Minimal repro to bisect: render a `lib/content` doc via `content/html` with JIT
ON vs OFF, diff the output, find the first divergence.
- Localize with the VM debugging tools (see CLAUDE.md): `(vm-trace ...)`,
`(bytecode-inspect ...)`, `(prim-check ...)`, `(deps-check ...)`.
- Likely suspects: nested closures / TCO, dict construction, `st-send` dispatch
patterns, recursion through the Smalltalk method interpreter.
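
The first bisect step above (diff JIT-ON against JIT-OFF output and find the first divergence) can be sketched as a small helper. This is a minimal Python sketch, not part of the repo; `first_divergence` is a hypothetical name, and it assumes the two renders have already been captured as strings.

```python
from itertools import zip_longest


def first_divergence(jit_off: str, jit_on: str):
    """Return (line_number, off_line, on_line) for the first differing line.

    Returns None when both outputs are identical. A missing line on one
    side is reported as None for that side.
    """
    off_lines = jit_off.splitlines()
    on_lines = jit_on.splitlines()
    pairs = zip_longest(off_lines, on_lines)  # pads the shorter side with None
    for number, (off_line, on_line) in enumerate(pairs, start=1):
        if off_line != on_line:
            return number, off_line, on_line
    return None


if __name__ == "__main__":
    baseline = "<h1>Title</h1>\n<p>one</p>\n<p>two</p>"
    with_jit = "<h1>Title</h1>\n<p>one</p>\n<p>three</p>"
    print(first_divergence(baseline, with_jit))
```

Reporting only the first mismatch keeps the focus on the earliest point where the compiled path departs from the CEK; later differences are often just knock-on effects.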
## Pointers
- `register_jit_hook`: `sx_server.ml` ~1493; JIT VM-suspend/resolve path ~1497-1514.
- `hosts/ocaml/lib/sx_vm.ml` — the bytecode VM + compiler.
- `plans/jit-cache-architecture.md`, `plans/jit-perf-regression.md`, `restore-jit-perf.sh`.
- Memory: `project_jit_bytecode_bug.md` (plan ref `plans/reflective-rolling-treehouse.md`).
- The shared `sx_server.exe` binary is used by ALL loops — coordinate before
changing VM semantics that could affect sibling conformance runs.
---
## Resolution (2026-06-19, loop loops/sx-vm-extensions)
JIT is now enabled in the persistent (epoch) serving mode (`register_jit_hook`
in `sx_server.ml`'s server-mode branch). Smalltalk conformance is **847/847 —
identical to the no-JIT baseline** (no failures, no double-counted rows).
Datalog conformance (a non-continuation guest) is **356/356** under JIT.
Five distinct root causes were found and fixed (not one "miscompile"):
1. **Serving mode never loaded `lib/compiler.sx`.** The JIT then used the
native `Sx_compiler.compile` stub, which emits arity-0 bytecode with every
parameter compiled as `GLOBAL_GET` → "VM undefined: <param>" on the first
call of essentially every function. `http`/`cli`/`site` modes already load
`compiler.sx`; the epoch serving branch now does too (before the hook).
*Fix: `sx_server.ml` server-mode branch loads `lib/compiler.sx`.*
2. **`compile-cond`/`compile-case-clauses`/`compile-guard-clauses` only treated
the keyword `:else` and `true` as the catch-all** — not the bare symbol
`else` that the CEK's `is-else-clause?` accepts. They emitted
`GLOBAL_GET "else"` → runtime "VM undefined: else".
*Fix: `lib/compiler.sx` — add the symbol-`else` case to all three.*
3. **`OP_DIV` produced a float for non-divisible Integer/Integer** (`1/2` → 0.5)
instead of the exact `Rational` the `/` primitive returns → diverged from CEK
and broke equality vs rational results.
*Fix: `sx_vm.ml` — delegate non-divisible int/int to the `/` primitive.*
4. **`OP_EQ` / `_fast_eq` lacked `Rational`/`ListRef` cases** that the real `=`
primitive's `safe_eq` has → `(= 1/2 1/2)` was false under JIT.
*Fix: `OP_EQ` delegates non-trivial types to the `=` primitive;
`_fast_eq` (also used by `prim_call "="`) gained rational + ListRef cases.*
5. **Continuation-based control flow can't run in the stack VM.** Smalltalk's
non-local return (`^expr`), block escape, and exception unwinding use
`call/cc`; a JIT-compiled frame between a `call/cc` capture and its `(k v)`
invocation cannot transfer control and (via the hook's re-run-on-failure)
double-executes side effects.
*Fix: a general, data-driven exclusion set — `Sx_types.jit_excluded`,
populated from SX via the new `jit-exclude!` primitive, consulted in
`jit_compile_lambda` so it covers BOTH JIT entry points (CEK hook + in-VM
tiered path). `lib/smalltalk/eval.sx` self-declares its continuation-using
dispatch core interpret-only; pure helpers (parsing, lookup, formatting,
arithmetic) still JIT.* One SUnit suite-runner test helper
(`pharo-test-class`) miscompiles under JIT on a specific iteration and is
excluded in the test prelude (`tests/tokenize.sx`).
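
Root causes 3 and 4 both come down to the compiled path disagreeing with the primitive about exact rationals. The Python model below illustrates the divergence using `fractions.Fraction` as a stand-in for SX's `Rational`; `float_div` and `exact_div` are hypothetical names, and this is not the OCaml code in `sx_vm.ml`.

```python
from fractions import Fraction


def float_div(a: int, b: int):
    """Model of the old opcode: non-divisible int/int falls back to a float."""
    return a // b if a % b == 0 else a / b


def exact_div(a: int, b: int):
    """Model of the fixed behaviour: non-divisible int/int stays exact."""
    return a // b if a % b == 0 else Fraction(a, b)


if __name__ == "__main__":
    # Divisible operands agree on both paths.
    print(float_div(6, 3), exact_div(6, 3))
    # Non-divisible operands produce different kinds of value.
    print(type(float_div(1, 2)).__name__, type(exact_div(1, 2)).__name__)
    # Rounding error makes the float path drift away from the exact result.
    print(float_div(1, 10) * 3 == exact_div(1, 10) * 3)
```

In Python a float `0.5` happens to compare equal to `Fraction(1, 2)`, so the model shows the mismatch through the value's type and through accumulated rounding rather than through `1/2` itself.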
### Known residual / follow-up
- The hook still **re-runs a failed VM execution via CEK** (always yields the
correct result, but can duplicate side effects if a JIT'd function fails
mid-run after a side effect). `run_tests`'s hook instead propagates non-IO /
non-"VM undefined" exceptions. Adopting that propagate-don't-rerun semantics
in the serving hook would remove the double-execution class entirely, but it
surfaces genuine mid-run miscompiles as errors — so it must land together
with fixing/excluding any function that miscompiles mid-run (e.g.
`pharo-test-class`). Deferred to avoid changing shared VM/CEK semantics under
this loop.
- Other continuation-heavy guests (e.g. Scheme and Erlang, which use `call/cc`) will need
their own `jit-exclude!` declarations for their dispatch cores; the mechanism
is in place. Non-continuation guests (Datalog/Prolog/Haskell/APL) JIT as-is.
- A debug aid was added to the serving hook: `SX_JIT_DENY=name,...` /
`SX_JIT_ONLY=name,...` env vars to bisect which named lambda the VM
mishandles (hook-path only).
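
The double-execution residual in the first bullet can be illustrated with a toy model. This is a Python sketch under simplifying assumptions: `VmFailure`, `run_with_rerun`, `run_with_propagate` and `make_pair` are hypothetical names, and the real `run_tests` hook is more selective about which exceptions it propagates.

```python
class VmFailure(Exception):
    """Stand-in for a compiled function failing part-way through."""


def run_with_rerun(compiled, interpreted):
    """Rerun-on-failure: a VM failure triggers a full re-run on the interpreter."""
    try:
        return compiled()
    except VmFailure:
        return interpreted()


def run_with_propagate(compiled, interpreted):
    """Propagate-don't-rerun: a VM failure surfaces as an error; nothing runs twice."""
    return compiled()


def make_pair(log):
    """Build a (compiled, interpreted) pair sharing one side-effect log."""

    def compiled():
        log.append("write")  # the side effect happens before the failure
        raise VmFailure("mid-run miscompile")

    def interpreted():
        log.append("write")
        return "ok"

    return compiled, interpreted


if __name__ == "__main__":
    log = []
    print(run_with_rerun(*make_pair(log)), log)  # correct result, effect logged twice
```

The rerun variant returns the right answer but records the effect twice; the propagate variant records it once and raises, which is why it can only land once mid-run miscompiles are fixed or excluded.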
---
## Guest-loop regression sweep + safe-default gate (2026-06-19, follow-up)
Host-loop verification found that enabling serving-mode JIT **globally**
regresses continuation-based guest interpreters (the epoch serving mode is the
shared command channel for every loop's conformance runner). Failure modes:
- **VmClosure not callable** — a JIT'd higher-order function returns its inner
closure as a `VmClosure`; the native `callable?` predicate didn't list
`VmClosure`, so `scheme-apply`'s `(callable? proc)` guard rejected it
("scheme-eval: not a procedure: <vm:anon>"). FIXED generally: `callable?`
(all 4 bindings) now accepts `VmClosure`.
- **Continuation escape** — Scheme `call/cc`, Erlang receive, CL conditions,
JS exceptions: a JIT'd frame can't transfer control through a CEK
continuation.
- **Non-terminating miscompile (HANG)** — Erlang/Prolog/Haskell recursive
evaluators miscompiled into an infinite loop (worse than an error: can't
fall back).
### Mechanism
- `jit-exclude!` now accepts a trailing `*` wildcard → namespace-prefix
exclusion (`Sx_types.jit_excluded_prefixes`, checked in
`jit_compile_lambda` for both JIT entry points). One declaration per guest,
more robust than name-lists (which missed e.g. the erlang `vm/dispatcher`).
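
A minimal Python model of the name-plus-prefix exclusion described above. The class and method names are hypothetical; the real state lives in `Sx_types.jit_excluded` and `Sx_types.jit_excluded_prefixes`, and the exact matching rules in the OCaml code may differ.

```python
class JitExclusions:
    """Toy model: exact names plus namespace prefixes declared with a trailing '*'."""

    def __init__(self):
        self.names = set()
        self.prefixes = set()

    def exclude(self, *patterns):
        """Register patterns; 'er-*' becomes the prefix 'er-', anything else is exact."""
        for pattern in patterns:
            if pattern.endswith("*"):
                self.prefixes.add(pattern[:-1])
            else:
                self.names.add(pattern)

    def is_excluded(self, name):
        """True when the name is listed exactly or starts with a registered prefix."""
        if name in self.names:
            return True
        return any(name.startswith(prefix) for prefix in self.prefixes)


if __name__ == "__main__":
    exclusions = JitExclusions()
    exclusions.exclude("er-*", "erlang-*", "pharo-test-class")
    for candidate in ("er-eval", "erlang-receive", "pharo-test-class", "render-page"):
        print(candidate, exclusions.is_excluded(candidate))
```

In this model a bare `"*"` would register the empty prefix and exclude every function, so a real implementation may want to reject or special-case it.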
### Per-guest exclusions added (in each guest's runtime, loaded with it)
| Guest | Declaration | Status under opt-in JIT |
|-------|-------------|--------------------------|
| smalltalk | name-list (dispatch core) + `pharo-test-class` | 847/847 == CEK |
| scheme | `(jit-exclude! "scheme-*" "scm-*")` | flow 166/166 == CEK |
| erlang | `(jit-exclude! "er-*" "erlang-*")` | 530/530 == CEK, no hang |
| prolog | `(jit-exclude! "pl-*")` | 590/590 == CEK |
| common-lisp | `(jit-exclude! "cl-*" "clos-*")` | residual: 6 fail (advanced suites) |
| js | `(jit-exclude! "js-*")` | (verifying) |
| haskell | `(jit-exclude! "hk-*")` | (verifying) |
Not JIT-related (fail identically on CEK and JIT, pre-existing): lua 0/16,
tcl 3/4. apl/datalog/forth/ocaml: clean under JIT as-is (no continuations).
### Safe-default gate
Serving-mode JIT is now **opt-in via `SX_SERVING_JIT=1` (default OFF)** in
`sx_server.ml`. Default behavior is unchanged (no JIT in epoch serving) ⇒
**zero regression** for every sibling loop's conformance. The content/Smalltalk
page server opts in. This bounds risk: guests are validated and excluded
incrementally; until then the default protects them. Common-Lisp's advanced
suites still need investigation before CL is opt-in-clean.
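
The gate itself is a single environment check. A Python sketch of the intended semantics follows; `serving_jit_enabled` is a hypothetical name, and treating only the exact value `1` as "on" is an assumption, since the source only states `SX_SERVING_JIT=1` and default OFF.

```python
import os


def serving_jit_enabled(environ=None):
    """Return True only when SX_SERVING_JIT is set to exactly "1".

    Unset, empty, or any other value leaves serving-mode JIT off, which
    matches the default-OFF behaviour described above.
    """
    env = os.environ if environ is None else environ
    return env.get("SX_SERVING_JIT") == "1"


if __name__ == "__main__":
    print(serving_jit_enabled({}))
    print(serving_jit_enabled({"SX_SERVING_JIT": "1"}))
```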
---
## guard / handler-bind under JIT — central recursive PUSH_HANDLER scan (2026-06-20)
Combined-binary integration (my JIT + host render-page) surfaced a third
JIT-unsafe class beyond guest dispatch cores: **`guard`-based error handling**.
The VM's `OP_PUSH_HANDLER` (compiled `guard`) only intercepts a VM-level
`RAISE` (opcode 37) — it does NOT catch the OCaml `Eval_error` the `error`
primitive throws from a CALL/CALL_PRIM in a callee frame. So a JIT-compiled
`guard` silently fails to catch; the thrown error escapes across the JIT frame.
- SOLID break: `host/wrap-errors -> dream-catch-with` (curried:
`(fn (on-error) (fn (next) (fn (req) (guard ...))))`) — middleware suite
7/9 under JIT (9/9 CEK), "kaboom" escaped as Unhandled exception, NOT
fallback-saved (the guard is in an outer frame, the throw in an inner one).
- LATENT (turned out harmless): `host/blog--render-node`'s `guard` — it JIT-
failed then the hook RE-RAN it on CEK where the guard caught (pure render, no
duplicated effects). This is the double-execution residual firing live.
Fix: `code_uses_handler` scans a JIT candidate's bytecode **recursively**
(including nested closure code in the constant pool) for `OP_PUSH_HANDLER`;
`jit_compile_lambda` skips JIT for any match. The recursion is essential —
curried `dream-catch-with` has no PUSH_HANDLER in its own body; the guard is in
a nested `OP_CLOSURE`. Verified: direct + curried cross-frame guards catch
under JIT; host "kaboom" escapes 2 -> 0.
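
Why the scan has to recurse can be shown with a small model of nested code objects. This is a Python sketch, not the OCaml `code_uses_handler`: the `Code` shape and the symbolic opcode names are assumptions (the real VM uses numeric opcodes with operands, which a real scanner has to step over).

```python
from dataclasses import dataclass, field

# Symbolic stand-ins for opcodes; the real bytecode is numeric.
OP_PUSH_HANDLER = "PUSH_HANDLER"
OP_CLOSURE = "CLOSURE"
OP_RETURN = "RETURN"


@dataclass
class Code:
    """Toy code object: a list of opcodes plus a constant pool."""

    ops: list
    constants: list = field(default_factory=list)


def code_uses_handler(code: Code) -> bool:
    """True if this code, or any closure code nested in its constants, pushes a handler."""
    if OP_PUSH_HANDLER in code.ops:
        return True
    return any(
        isinstance(constant, Code) and code_uses_handler(constant)
        for constant in code.constants
    )


if __name__ == "__main__":
    # Curried shape: (fn (on-error) (fn (next) (fn (req) (guard ...))))
    req_fn = Code([OP_PUSH_HANDLER, OP_RETURN])
    next_fn = Code([OP_CLOSURE, OP_RETURN], [req_fn])
    on_error_fn = Code([OP_CLOSURE, OP_RETURN], [next_fn])
    print(OP_PUSH_HANDLER in on_error_fn.ops)  # shallow check misses the guard
    print(code_uses_handler(on_error_fn))      # recursive scan finds it
```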
### Remaining (documented, gated): the double-execution residual
The serving hook still re-runs a failed VM execution via CEK (correct result,
duplicated side effects if the function is impure and fails mid-run). The guard
fix removes the common trigger (guard functions no longer JIT). The clean
general fix is propagate-don't-rerun (run_tests' hook semantics) but that
surfaces genuine mid-run miscompiles as errors and must land with
fixing/excluding those, so it is deferred (shared CEK/VM change). The default-OFF gate makes
all of this opt-in, so nothing regresses by default.
---
## common-lisp residual resolved — call/cc-caller exclusion (2026-06-28)
Investigated the 6 CL opt-in-JIT failures. Findings:
- **geometry / mop-trace (0/0) are NOT JIT regressions** — they error "Undefined
symbol: refl-class-chain-depth-with" on BOTH CEK and JIT (the CLOS suites in
conformance.sh don't preload lib/guest/reflective/class-chain.sx). Pre-existing
harness gap; not counted in the 6.
- The **6 real failures** (parse-recover 4, interactive-debugger 2) were all
condition-system continuation escape. cl-restart-case/cl-handler-case/
cl-handler-bind wrap their body in call/cc. When an SX function driving the
condition system (parse-numbers, make-policy-debugger) is JIT-compiled, the
call/cc form runs in a NESTED cek-run where invoking the captured continuation
runs-to-completion-and-returns instead of escaping → restart fails to abort,
body falls through. Seen as accumulation ((1 3 0 3) vs (1 3)) and no-abort
(999 sentinel). Also produced a +3 double-execution over-count (490 vs 487).
Fix: a third interpret-only signal beyond name/prefix and PUSH_HANDLER —
`jit-exclude-callers-of!` registers call/cc-establishing/invoking form names;
`jit_compile_lambda` skips any function whose constant pool (recursively)
references one (`code_refs_escaping_caller`). Guarded so it's a no-op for guests
that don't register. CL registers cl-restart-case/cl-handler-case/cl-handler-bind
(establish) + cl-invoke-restart/cl-invoke-debugger/cl-signal/cl-error-with-debugger
(invoke). Result: **CL under SX_SERVING_JIT=1 = 487/0, exactly matching CEK.**
The three interpret-only signals now: (1) name / "ns-*" prefix [jit-exclude!],
(2) PUSH_HANDLER in bytecode [guard users, structural], (3) references a
registered escaping form [call/cc-establishing callers]. Together they cover the
continuation-unsafe surface without a deep VM continuation rewrite.
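
The three signals can be combined into one decision, sketched here as a Python model. All names are hypothetical, and it assumes references to escaping forms show up as name strings in a code object's constant pool; the real checks are `jit_compile_lambda`, `code_uses_handler` and `code_refs_escaping_caller` in OCaml.

```python
from dataclasses import dataclass, field


@dataclass
class Code:
    """Toy code object: symbolic opcodes plus a constant pool."""

    ops: list
    constants: list = field(default_factory=list)


def any_code(code, predicate):
    """True if predicate holds for this code or any code nested in its constants."""
    if predicate(code):
        return True
    return any(isinstance(c, Code) and any_code(c, predicate) for c in code.constants)


def interpret_only(name, code, names, prefixes, escaping_forms):
    """Return which signal keeps a function interpret-only, or None if it may JIT."""
    # (1) explicit name or namespace prefix
    if name in names or any(name.startswith(p) for p in prefixes):
        return "name/prefix"
    # (2) structural: a handler push anywhere in the (nested) bytecode
    if any_code(code, lambda c: "PUSH_HANDLER" in c.ops):
        return "push-handler"
    # (3) reference to a registered escaping form; skipped when nothing is registered
    if escaping_forms and any_code(
        code,
        lambda c: any(isinstance(k, str) and k in escaping_forms for k in c.constants),
    ):
        return "escaping-caller"
    return None


if __name__ == "__main__":
    parse_numbers = Code(["CALL", "RETURN"], [Code(["CALL"], ["cl-restart-case"])])
    print(interpret_only("parse-numbers", parse_numbers, set(), {"cl-"}, {"cl-restart-case"}))
    print(interpret_only("parse-numbers", parse_numbers, set(), {"cl-"}, set()))
```

The third check is what catches a driver such as `parse-numbers`: its name matches no prefix and it contains no handler push, yet its nested code references a registered escaping form.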