rose-ash/plans/jit-bytecode-correctness.md
giles bf298684fd vm-ext: gate serving-JIT behind SX_SERVING_JIT + fix continuation-guest regressions
Enabling the epoch serving-mode JIT globally regressed continuation-based guest
interpreters (the epoch mode is the shared command channel every loop's
conformance runner uses). Two-part fix:

1. SAFE DEFAULT GATE. register_jit_hook in the persistent server branch is now
   opt-in via SX_SERVING_JIT=1 (default OFF). Default behaviour is unchanged
   (no JIT in epoch serving) → zero regression for sibling loops. The
   content/Smalltalk page server opts in.

2. GENERAL FIXES + per-guest interpret-only declarations:
   - callable? (sx_server/run_tests/integration_tests/mcp_tree) now accepts
     VmClosure. A JIT-compiled higher-order function returns its inner closure
     as a VmClosure; callable? previously rejected it, so scheme-apply's
     (callable? proc) guard failed with "not a procedure: <vm:anon>".
   - jit-exclude! gains a trailing-"*" namespace-prefix form
     (Sx_types.jit_excluded_prefixes), the robust way to mark a whole guest
     interpreter interpret-only (a name-list misses functions in extra files —
     it left erlang's vm/dispatcher JIT'd and 13 tests short).
   - Per-guest exclusions in each guest's runtime.sx:
       scheme  "scheme-*" "scm-*"   erlang "er-*" "erlang-*"
       prolog  "pl-*"               common-lisp "cl-*" "clos-*"
       js      "js-*"               haskell "hk-*"

Verified under opt-in JIT (== CEK, no hang): smalltalk 847/847, scheme/flow
166/166, erlang 530/530, prolog 590/590, apl 152/152, js 147/148. Residual
(documented, protected by the default gate): common-lisp 6 fails in advanced
suites (parser-recovery/debugger/CLOS/MOP). lua (0/16) and tcl (3/4) fail
identically on CEK — pre-existing, not JIT. run_tests --jit/no-jit unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:22:40 +00:00

# JIT bytecode correctness — enable the JIT in serving mode
> Kickoff handed over from the **host-on-sx** loop (2026-06-19). This is the
> highest-leverage perf win on the platform.
## Why this matters
Every SX-on-SX subsystem runs **interpreted on the tree-walking CEK**: the
Smalltalk runtime (→ content-on-sx rendering), and the guest languages
(Datalog, Prolog, APL, Scheme, Haskell, Erlang, Maude). The lazy JIT
(`register_jit_hook` → bytecode VM) would speed all of them up ~10-60×. It is
currently **only installed in `--http` page-server mode**, not the epoch /
`http-listen` serving mode — because it **miscompiles** these workloads.
Concrete impact: the host serves a blog post (`content/html`, interpreted
Smalltalk) in **~2 seconds per request**. With a correct JIT it should be tens
of ms. Same slowdown applies to every guest-language-backed service.
## Concrete repro (from the host loop)
In `hosts/ocaml/bin/sx_server.ml`, the persistent server mode (`make_server_env`,
~line 4871) does **not** call `register_jit_hook env` — only the `--http` mode
(~line 4034) does. To reproduce the miscompile:
1. Add `register_jit_hook env;` right after `let env = make_server_env () in` in
the persistent server-mode branch (~4871).
2. Rebuild: `eval $(opam env --switch=5.2.0); dune build bin/sx_server.exe`.
3. Run a Smalltalk/content-heavy suite, e.g. the host-on-sx conformance
(`bash /root/rose-ash-loops/host/lib/host/conformance.sh`, or any
content-on-sx suite). **With the hook ON, tests FAIL** — host-on-sx dropped to
`router 3/6, feed 4/11, relations 9/16, blog 4/11`. With the hook OFF: all green.

So the JIT produces **wrong results** (the known "compiled compiler helpers loop
on complex nested ASTs" bug; see memory `project_jit_bytecode_bug`).
## Goal
Make the JIT compile the Smalltalk-on-SX evaluator + guest-language evaluators
**correctly**, so `register_jit_hook` can be enabled in serving mode with
conformance **fully green**. Then enable it there.
## Suggested approach
- Minimal repro to bisect: render a `lib/content` doc via `content/html` with JIT
ON vs OFF, diff the output, find the first divergence.
- Localize with the VM debugging tools (see CLAUDE.md): `(vm-trace ...)`,
`(bytecode-inspect ...)`, `(prim-check ...)`, `(deps-check ...)`.
- Likely suspects: nested closures / TCO, dict construction, `st-send` dispatch
patterns, recursion through the Smalltalk method interpreter.
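
The first bisect step (render with JIT ON and OFF, diff, find the first divergence) could look roughly like the sketch below. It is an illustration only, not code from the repo: `first_divergence` and the sample strings are hypothetical, and it assumes both renders have already been captured as strings.

```python
from itertools import zip_longest


def first_divergence(jit_off: str, jit_on: str):
    """Return (line_no, expected, actual) for the first differing line, or None."""
    pairs = zip_longest(jit_off.splitlines(), jit_on.splitlines())
    for line_no, (expected, actual) in enumerate(pairs, start=1):
        if expected != actual:
            return line_no, expected, actual
    return None


# Hypothetical captured renders of the same document.
baseline = "<h1>Post</h1>\n<p>one</p>\n<p>two</p>"
jitted = "<h1>Post</h1>\n<p>one</p>\n<p></p>"
print(first_divergence(baseline, jitted))  # (3, '<p>two</p>', '<p></p>')
```

A missing line on either side shows up as `None` in the tuple, so truncated output is reported as a divergence too.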
## Pointers
- `register_jit_hook`: `sx_server.ml` ~1493; JIT VM-suspend/resolve path ~1497-1514.
- `hosts/ocaml/lib/sx_vm.ml` — the bytecode VM + compiler.
- `plans/jit-cache-architecture.md`, `plans/jit-perf-regression.md`, `restore-jit-perf.sh`.
- Memory: `project_jit_bytecode_bug.md` (plan ref `plans/reflective-rolling-treehouse.md`).
- The shared `sx_server.exe` binary is used by ALL loops — coordinate before
changing VM semantics that could affect sibling conformance runs.
---
## Resolution (2026-06-19, loop loops/sx-vm-extensions)
JIT is now enabled in the persistent (epoch) serving mode (`register_jit_hook`
in `sx_server.ml`'s server-mode branch). Smalltalk conformance is **847/847 —
identical to the no-JIT baseline** (no failures, no double-counted rows).
Datalog conformance (a non-continuation guest) is **356/356** under JIT.
Five distinct root causes were found and fixed (not one "miscompile"):
1. **Serving mode never loaded `lib/compiler.sx`.** The JIT then used the
native `Sx_compiler.compile` stub, which emits arity-0 bytecode with every
parameter compiled as `GLOBAL_GET` → "VM undefined: <param>" on the first
call of essentially every function. `http`/`cli`/`site` modes already load
`compiler.sx`; the epoch serving branch now does too (before the hook).
*Fix: `sx_server.ml` server-mode branch loads `lib/compiler.sx`.*
2. **`compile-cond`/`compile-case-clauses`/`compile-guard-clauses` only treated
the keyword `:else` and `true` as the catch-all** — not the bare symbol
`else` that the CEK's `is-else-clause?` accepts. They emitted
`GLOBAL_GET "else"` → runtime "VM undefined: else".
*Fix: `lib/compiler.sx` — add the symbol-`else` case to all three.*
3. **`OP_DIV` produced a float for non-divisible Integer/Integer** (`1/2` → 0.5)
instead of the exact `Rational` the `/` primitive returns → diverged from CEK
and broke equality vs rational results.
*Fix: `sx_vm.ml` — delegate non-divisible int/int to the `/` primitive.*
4. **`OP_EQ` / `_fast_eq` lacked `Rational`/`ListRef` cases** that the real `=`
primitive's `safe_eq` has → `(= 1/2 1/2)` was false under JIT.
*Fix: `OP_EQ` delegates non-trivial types to the `=` primitive;
`_fast_eq` (also used by `prim_call "="`) gained rational + ListRef cases.*
5. **Continuation-based control flow can't run in the stack VM.** Smalltalk's
non-local return (`^expr`), block escape, and exception unwinding use
`call/cc`; a JIT-compiled frame between a `call/cc` capture and its `(k v)`
invocation cannot transfer control and (via the hook's re-run-on-failure)
double-executes side effects.
*Fix: a general, data-driven exclusion set — `Sx_types.jit_excluded`,
populated from SX via the new `jit-exclude!` primitive, consulted in
`jit_compile_lambda` so it covers BOTH JIT entry points (CEK hook + in-VM
tiered path). `lib/smalltalk/eval.sx` self-declares its continuation-using
dispatch core interpret-only; pure helpers (parsing, lookup, formatting,
arithmetic) still JIT.* One SUnit suite-runner test helper
(`pharo-test-class`) miscompiles under JIT on a specific iteration and is
excluded in the test prelude (`tests/tokenize.sx`).
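
Root cause 3 can be modelled in a few lines. The sketch below is a Python stand-in, not the `sx_vm.ml` code: `prim_div`, `op_div_buggy` and `op_div_fixed` are hypothetical names, and `fractions.Fraction` plays the role of the exact `Rational`. It uses `1/3` because a binary float cannot represent that value exactly, which makes the divergence visible.

```python
from fractions import Fraction


def prim_div(a: int, b: int):
    """Stand-in for the `/` primitive: exact result for integer operands."""
    q = Fraction(a, b)
    return q.numerator if q.denominator == 1 else q


def op_div_buggy(a: int, b: int):
    """Pre-fix opcode shape: falls to a float when b does not divide a."""
    return a // b if a % b == 0 else a / b


def op_div_fixed(a: int, b: int):
    """Post-fix shape: integer fast path, otherwise delegate to the primitive."""
    return a // b if a % b == 0 else prim_div(a, b)


print(type(op_div_buggy(1, 3)).__name__)     # float
print(op_div_fixed(1, 3))                    # 1/3
print(op_div_buggy(1, 3) == prim_div(1, 3))  # False
print(op_div_fixed(1, 3) == prim_div(1, 3))  # True
```

The fast path stays in the opcode; only the non-divisible case pays for the delegation, which matches the shape of the fix described in item 3.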
### Known residual / follow-up
- The hook still **re-runs a failed VM execution via CEK** (always yields the
correct result, but can duplicate side effects if a JIT'd function fails
mid-run after a side effect). `run_tests`'s hook instead propagates non-IO /
non-"VM undefined" exceptions. Adopting that propagate-don't-rerun semantics
in the serving hook would remove the double-execution class entirely, but it
surfaces genuine mid-run miscompiles as errors — so it must land together
with fixing/excluding any function that miscompiles mid-run (e.g.
`pharo-test-class`). Deferred to avoid changing shared VM/CEK semantics under
this loop.
- Other continuation-heavy guests (Scheme and Erlang, which use `call/cc`) will
need their own `jit-exclude!` declarations for their dispatch cores; the
mechanism is in place. Non-continuation guests (Datalog/Prolog/Haskell/APL) JIT as-is.
- A debug aid was added to the serving hook: `SX_JIT_DENY=name,...` /
`SX_JIT_ONLY=name,...` env vars to bisect which named lambda the VM
mishandles (hook-path only).
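
The difference between re-run-on-failure and propagate-don't-rerun can be shown with a toy model. Everything here is hypothetical (`VmError`, the `charge_*` functions, the ledger); it only demonstrates why a fallback after a mid-run failure returns the right value while duplicating the side effect, and why propagating turns the same failure into a visible error.

```python
class VmError(Exception):
    """Stands in for a failure raised part-way through a JIT-compiled call."""


def charge_interpreted(ledger):
    ledger.append("charge")            # side effect
    return "ok"


def charge_miscompiled(ledger):
    ledger.append("charge")            # side effect has already happened...
    raise VmError("mid-run failure")   # ...when the compiled code fails


def call_with_rerun(ledger):
    try:
        return charge_miscompiled(ledger)
    except VmError:
        return charge_interpreted(ledger)  # correct value, duplicated effect


def call_with_propagate(ledger):
    return charge_miscompiled(ledger)      # error surfaces, effect runs once


rerun_ledger = []
result = call_with_rerun(rerun_ledger)
print(result, rerun_ledger)        # ok ['charge', 'charge']

propagate_ledger = []
try:
    call_with_propagate(propagate_ledger)
    propagated = None
except VmError as exc:
    propagated = str(exc)
print(propagated, propagate_ledger)  # mid-run failure ['charge']
```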
---
## Guest-loop regression sweep + safe-default gate (2026-06-19, follow-up)
Host-loop verification found that enabling serving-mode JIT **globally**
regresses continuation-based guest interpreters (the epoch serving mode is the
shared command channel for every loop's conformance runner). Failure modes:
- **VmClosure not callable** — a JIT'd higher-order function returns its inner
closure as a `VmClosure`; the native `callable?` predicate didn't list
`VmClosure`, so `scheme-apply`'s `(callable? proc)` guard rejected it
("scheme-eval: not a procedure: <vm:anon>"). FIXED generally: `callable?`
(all 4 bindings) now accepts `VmClosure`.
- **Continuation escape** — Scheme `call/cc`, Erlang receive, CL conditions,
JS exceptions: a JIT'd frame can't transfer control through a CEK
continuation.
- **Non-terminating miscompile (HANG)** — Erlang/Prolog/Haskell recursive
evaluators miscompiled into an infinite loop (worse than an error: can't
fall back).
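
The first failure mode is a plain predicate gap. A minimal model follows, with hypothetical Python classes standing in for the value variants; the real `callable?` bindings accept more kinds of value than the two shown here.

```python
from dataclasses import dataclass


@dataclass
class Lambda:
    """Interpreted closure, as produced by the CEK."""
    name: str


@dataclass
class VmClosure:
    """Closure produced by JIT-compiled code."""
    name: str


def callable_before(value) -> bool:
    return isinstance(value, Lambda)


def callable_after(value) -> bool:
    return isinstance(value, (Lambda, VmClosure))


def scheme_apply(proc, is_callable):
    """Guarded apply, mirroring the `(callable? proc)` check."""
    if not is_callable(proc):
        raise TypeError(f"scheme-eval: not a procedure: <vm:{proc.name}>")
    return f"applied {proc.name}"


inner = VmClosure("anon")  # what a JIT'd higher-order function hands back
print(callable_before(inner), callable_after(inner))  # False True
print(scheme_apply(inner, callable_after))            # applied anon
```

With `callable_before`, the same `scheme_apply` call raises the "not a procedure" error, even though the value is a perfectly usable closure.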
### Mechanism
- `jit-exclude!` now accepts a trailing `*` wildcard → namespace-prefix
exclusion (`Sx_types.jit_excluded_prefixes`, checked in
`jit_compile_lambda` for both JIT entry points). One declaration per guest,
robust vs name-lists (which missed e.g. the erlang `vm/dispatcher`).
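
A sketch of how exact-name and trailing-`*` exclusions could be stored and checked. This is a Python model under stated assumptions, not the `Sx_types` code: the two containers mirror `jit_excluded` and `jit_excluded_prefixes` by name only, and the function names in the example calls (`er-vm-dispatch`, `dl-query`) are made up.

```python
jit_excluded = set()         # exact function names
jit_excluded_prefixes = []   # prefixes taken from trailing-"*" patterns


def jit_exclude(*patterns: str) -> None:
    """Register exclusions; a trailing "*" marks a namespace prefix."""
    for pattern in patterns:
        if pattern.endswith("*"):
            jit_excluded_prefixes.append(pattern[:-1])
        else:
            jit_excluded.add(pattern)


def is_jit_excluded(name: str) -> bool:
    """True if the named function must stay interpreted."""
    return name in jit_excluded or any(
        name.startswith(prefix) for prefix in jit_excluded_prefixes
    )


jit_exclude("er-*", "erlang-*", "pharo-test-class")
print(is_jit_excluded("er-vm-dispatch"))    # True  (prefix)
print(is_jit_excluded("pharo-test-class"))  # True  (exact name)
print(is_jit_excluded("dl-query"))          # False
```

A prefix covers functions defined in any file of the guest, which is the case a hand-maintained name-list misses. Note that `er-*` does not cover `erlang-*`, so both prefixes are declared.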
### Per-guest exclusions added (in each guest's runtime, loaded with it)
| Guest | Declaration | Status under opt-in JIT |
|-------|-------------|--------------------------|
| smalltalk | name-list (dispatch core) + `pharo-test-class` | 847/847 == CEK |
| scheme | `(jit-exclude! "scheme-*" "scm-*")` | flow 166/166 == CEK |
| erlang | `(jit-exclude! "er-*" "erlang-*")` | 530/530 == CEK, no hang |
| prolog | `(jit-exclude! "pl-*")` | 590/590 == CEK |
| common-lisp | `(jit-exclude! "cl-*" "clos-*")` | residual: 6 fail (advanced suites) |
| js | `(jit-exclude! "js-*")` | (verifying) |
| haskell | `(jit-exclude! "hk-*")` | (verifying) |

Not JIT-related (fail identically on CEK and JIT, pre-existing): lua 0/16,
tcl 3/4. apl/datalog/forth/ocaml: clean under JIT as-is (no continuations).
### Safe-default gate
Serving-mode JIT is now **opt-in via `SX_SERVING_JIT=1` (default OFF)** in
`sx_server.ml`. Default behavior is unchanged (no JIT in epoch serving) ⇒
**zero regression** for every sibling loop's conformance. The content/Smalltalk
page server opts in. This bounds risk: guests are validated and excluded
incrementally; until then the default protects them. Common-Lisp's advanced
suites still need investigation before CL is opt-in-clean.
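
The gate itself is a one-line decision. A Python sketch of the opt-in check follows; `serving_jit_enabled` is a hypothetical name, and treating only the exact value `1` as "on" is an assumption, since the plan states `SX_SERVING_JIT=1` and default OFF but not how other values are handled.

```python
import os


def serving_jit_enabled(environ=os.environ) -> bool:
    """Opt-in gate: JIT in epoch serving only when SX_SERVING_JIT is "1"."""
    return environ.get("SX_SERVING_JIT") == "1"


print(serving_jit_enabled({}))                       # False (default OFF)
print(serving_jit_enabled({"SX_SERVING_JIT": "1"}))  # True  (opted in)
print(serving_jit_enabled({"SX_SERVING_JIT": "0"}))  # False
```

Because an unset variable reads as OFF, a sibling loop that never heard of the flag keeps its pre-change behaviour.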