rose-ash/plans/jit-bytecode-correctness.md
giles bf298684fd vm-ext: gate serving-JIT behind SX_SERVING_JIT + fix continuation-guest regressions
Enabling the epoch serving-mode JIT globally regressed continuation-based guest
interpreters (the epoch mode is the shared command channel every loop's
conformance runner uses). Two-part fix:

1. SAFE DEFAULT GATE. register_jit_hook in the persistent server branch is now
   opt-in via SX_SERVING_JIT=1 (default OFF). Default behaviour is unchanged
   (no JIT in epoch serving) → zero regression for sibling loops. The
   content/Smalltalk page server opts in.

2. GENERAL FIXES + per-guest interpret-only declarations:
   - callable? (sx_server/run_tests/integration_tests/mcp_tree) now accepts
     VmClosure. A JIT-compiled higher-order function returns its inner closure
     as a VmClosure; callable? previously rejected it, so scheme-apply's
     (callable? proc) guard failed with "not a procedure: <vm:anon>".
   - jit-exclude! gains a trailing-"*" namespace-prefix form
     (Sx_types.jit_excluded_prefixes), the robust way to mark a whole guest
     interpreter interpret-only (a name-list misses functions in extra files —
     it left erlang's vm/dispatcher JIT'd and 13 tests short).
   - Per-guest exclusions in each guest's runtime.sx:
       scheme  "scheme-*" "scm-*"   erlang "er-*" "erlang-*"
       prolog  "pl-*"               common-lisp "cl-*" "clos-*"
       js      "js-*"               haskell "hk-*"

Verified under opt-in JIT (== CEK, no hang): smalltalk 847/847, scheme/flow
166/166, erlang 530/530, prolog 590/590, apl 152/152, js 147/148. Residual
(documented, protected by the default gate): common-lisp 6 fails in advanced
suites (parser-recovery/debugger/CLOS/MOP). lua (0/16) and tcl (3/4) fail
identically on CEK — pre-existing, not JIT. run_tests --jit/no-jit unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:22:40 +00:00


JIT bytecode correctness — enable the JIT in serving mode

Kickoff handed over from the host-on-sx loop (2026-06-19). This is the highest-leverage perf win on the platform.

Why this matters

Every SX-on-SX subsystem runs interpreted on the tree-walking CEK: the Smalltalk runtime (→ content-on-sx rendering), and the guest languages (Datalog, Prolog, APL, Scheme, Haskell, Erlang, Maude). The lazy JIT (register_jit_hook → bytecode VM) would speed all of them up ~10–60×. It is currently installed only in --http page-server mode, not in the epoch / http-listen serving mode, because it miscompiles these workloads.

Concrete impact: the host serves a blog post (content/html, interpreted Smalltalk) in ~2 seconds per request. With a correct JIT it should be tens of ms. Same slowdown applies to every guest-language-backed service.

Concrete repro (from the host loop)

In hosts/ocaml/bin/sx_server.ml, the persistent server mode (make_server_env, ~line 4871) does not call register_jit_hook env — only the --http mode (~line 4034) does. To reproduce the miscompile:

  1. In the persistent server-mode branch (~4871), add the statement register_jit_hook env; immediately after the line let env = make_server_env () in.
  2. Rebuild: eval $(opam env --switch=5.2.0); dune build bin/sx_server.exe.
  3. Run a Smalltalk/content-heavy suite, e.g. the host-on-sx conformance (bash /root/rose-ash-loops/host/lib/host/conformance.sh, or any content-on-sx suite). With the hook ON, tests FAIL — host-on-sx dropped to router 3/6, feed 4/11, relations 9/16, blog 4/11. With the hook OFF: all green.

So the JIT produces wrong results: this is the known "compiled compiler helpers loop on complex nested ASTs" issue (see memory project_jit_bytecode_bug).

Goal

Make the JIT compile the Smalltalk-on-SX evaluator + guest-language evaluators correctly, so register_jit_hook can be enabled in serving mode with conformance fully green. Then enable it there.

Suggested approach

  • Minimal repro to bisect: render a lib/content doc via content/html with JIT ON vs OFF, diff the output, find the first divergence.
  • Localize with the VM debugging tools (see CLAUDE.md): (vm-trace ...), (bytecode-inspect ...), (prim-check ...), (deps-check ...).
  • Likely suspects: nested closures / TCO, dict construction, st-send dispatch patterns, recursion through the Smalltalk method interpreter.

Pointers

  • register_jit_hook: sx_server.ml ~1493; JIT VM-suspend/resolve path ~1497–1514.
  • hosts/ocaml/lib/sx_vm.ml — the bytecode VM + compiler.
  • plans/jit-cache-architecture.md, plans/jit-perf-regression.md, restore-jit-perf.sh.
  • Memory: project_jit_bytecode_bug.md (plan ref plans/reflective-rolling-treehouse.md).
  • The shared sx_server.exe binary is used by ALL loops — coordinate before changing VM semantics that could affect sibling conformance runs.

Resolution (2026-06-19, loop loops/sx-vm-extensions)

JIT is now enabled in the persistent (epoch) serving mode (register_jit_hook in sx_server.ml's server-mode branch). Smalltalk conformance is 847/847 — identical to the no-JIT baseline (no failures, no double-counted rows). Datalog conformance (a non-continuation guest) is 356/356 under JIT.

Five distinct root causes were found and fixed (not one "miscompile"):

  1. Serving mode never loaded lib/compiler.sx. The JIT then used the native Sx_compiler.compile stub, which emits arity-0 bytecode with every parameter compiled as GLOBAL_GET → "VM undefined: " on the first call of essentially every function. http/cli/site modes already load compiler.sx; the epoch serving branch now does too (before the hook). Fix: sx_server.ml server-mode branch loads lib/compiler.sx.

  2. compile-cond/compile-case-clauses/compile-guard-clauses only treated the keyword :else and true as the catch-all — not the bare symbol else that the CEK's is-else-clause? accepts. They emitted GLOBAL_GET "else" → runtime "VM undefined: else". Fix: lib/compiler.sx — add the symbol-else case to all three.

  3. OP_DIV produced a float for non-divisible Integer/Integer (1/2 → 0.5) instead of the exact Rational the / primitive returns → diverged from CEK and broke equality vs rational results. Fix: sx_vm.ml — delegate non-divisible int/int to the / primitive.

  4. OP_EQ / _fast_eq lacked Rational/ListRef cases that the real = primitive's safe_eq has → (= 1/2 1/2) was false under JIT. Fix: OP_EQ delegates non-trivial types to the = primitive; _fast_eq (also used by prim_call "=") gained rational + ListRef cases.

  5. Continuation-based control flow can't run in the stack VM. Smalltalk's non-local return (^expr), block escape, and exception unwinding use call/cc; a JIT-compiled frame between a call/cc capture and its (k v) invocation cannot transfer control and (via the hook's re-run-on-failure) double-executes side effects. Fix: a general, data-driven exclusion set — Sx_types.jit_excluded, populated from SX via the new jit-exclude! primitive, consulted in jit_compile_lambda so it covers BOTH JIT entry points (CEK hook + in-VM tiered path). lib/smalltalk/eval.sx self-declares its continuation-using dispatch core interpret-only; pure helpers (parsing, lookup, formatting, arithmetic) still JIT. One SUnit suite-runner test helper (pharo-test-class) miscompiles under JIT on a specific iteration and is excluded in the test prelude (tests/tokenize.sx).
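
Root causes 3 and 4 both come down to the VM's opcode fast paths disagreeing with the primitives about exact rationals. The following is a minimal OCaml sketch of the fixed behaviour, assuming a simplified three-case value type. The constructors and the names make_rational, prim_div, op_div and fast_eq are illustrative stand-ins, not the actual sx_vm.ml code, and ListRef is omitted.

```ocaml
(* Hypothetical, simplified value type; the real runtime has many more cases. *)
type value =
  | Integer of int
  | Rational of int * int   (* numerator, denominator; lowest terms, denominator > 0 *)
  | Number of float

let rec gcd a b = if b = 0 then abs a else gcd b (a mod b)

(* Build a normalised rational, collapsing to Integer when the denominator is 1. *)
let make_rational n d =
  if d = 0 then raise Division_by_zero;
  let g = gcd n d in
  let n, d = (n / g, d / g) in
  let n, d = if d < 0 then (-n, -d) else (n, d) in
  if d = 1 then Integer n else Rational (n, d)

let to_float = function
  | Integer n -> float_of_int n
  | Rational (n, d) -> float_of_int n /. float_of_int d
  | Number f -> f

(* Stand-in for the "/" primitive: exact for int/int, float otherwise. *)
let prim_div a b =
  match a, b with
  | Integer x, Integer y -> make_rational x y
  | _ -> Number (to_float a /. to_float b)

(* Opcode fast path: evenly divisible ints are handled inline; everything
   else is delegated to the primitive instead of producing a float. *)
let op_div a b =
  match a, b with
  | Integer x, Integer y when y <> 0 && x mod y = 0 -> Integer (x / y)
  | _ -> prim_div a b

(* Equality fast path with the Rational case that was previously missing. *)
let fast_eq a b =
  match a, b with
  | Integer x, Integer y -> x = y
  | Rational (n1, d1), Rational (n2, d2) -> n1 = n2 && d1 = d2
  | Number x, Number y -> x = y
  | _ -> false
```

With this shape, op_div (Integer 1) (Integer 2) yields Rational (1, 2) rather than a float, and fast_eq treats two such results as equal, which is the agreement with the CEK that the fixes restore.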

Known residual / follow-up

  • The hook still re-runs a failed VM execution via CEK (always yields the correct result, but can duplicate side effects if a JIT'd function fails mid-run after a side effect). run_tests's hook instead propagates non-IO / non-"VM undefined" exceptions. Adopting that propagate-don't-rerun semantics in the serving hook would remove the double-execution class entirely, but it surfaces genuine mid-run miscompiles as errors — so it must land together with fixing/excluding any function that miscompiles mid-run (e.g. pharo-test-class). Deferred to avoid changing shared VM/CEK semantics under this loop.
  • Other continuation-heavy guests (Scheme, Erlang use call/cc) will need their own jit-exclude! declarations for their dispatch cores; the mechanism is in place. Non-continuation guests (Datalog/Prolog/Haskell/APL) JIT as-is.
  • A debug aid was added to the serving hook: SX_JIT_DENY=name,... / SX_JIT_ONLY=name,... env vars to bisect which named lambda the VM mishandles (hook-path only).
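
The SX_JIT_DENY / SX_JIT_ONLY bisection aid can be pictured as a name filter consulted before compiling a lambda. This is a sketch only: how the real hook combines the two variables is not stated above, so the precedence used here (deny wins, an empty only-list means "allow all") is an assumption, and parse_names, names_from_env and jit_allowed are hypothetical helper names.

```ocaml
(* Split a comma-separated value into trimmed, non-empty names. *)
let parse_names s =
  String.split_on_char ',' s
  |> List.map String.trim
  |> List.filter (fun n -> n <> "")

(* Read a name list from an environment variable; unset means empty. *)
let names_from_env var =
  match Sys.getenv_opt var with
  | Some s -> parse_names s
  | None -> []

(* Assumed semantics: [deny] is checked first; a non-empty [only]
   then acts as an allow-list; an empty [only] allows everything else. *)
let jit_allowed ~deny ~only name =
  if List.mem name deny then false
  else
    match only with
    | [] -> true
    | names -> List.mem name names
```

Bisecting then amounts to halving the SX_JIT_ONLY list until the failing lambda is isolated.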

Guest-loop regression sweep + safe-default gate (2026-06-19, follow-up)

Host-loop verification found that enabling serving-mode JIT globally regresses continuation-based guest interpreters (the epoch serving mode is the shared command channel for every loop's conformance runner). Failure modes:

  • VmClosure not callable — a JIT'd higher-order function returns its inner closure as a VmClosure; the native callable? predicate didn't list VmClosure, so scheme-apply's (callable? proc) guard rejected it ("scheme-eval: not a procedure: <vm:anon>"). FIXED generally: callable? (all 4 bindings) now accepts VmClosure.
  • Continuation escape — Scheme call/cc, Erlang receive, CL conditions, JS exceptions: a JIT'd frame can't transfer control through a CEK continuation.
  • Non-terminating miscompile (HANG) — Erlang/Prolog/Haskell recursive evaluators miscompiled into an infinite loop (worse than an error: can't fall back).
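
The first failure mode is a missing case in a predicate. A minimal sketch, assuming a toy value type: only VmClosure is named in this document, while Lambda, NativeFn, callable_old, callable and apply_guard are illustrative names.

```ocaml
(* Hypothetical value type; closures are identified only by a name here. *)
type value =
  | Nil
  | Integer of int
  | Lambda of string      (* closure created by the CEK interpreter *)
  | NativeFn of string    (* host primitive *)
  | VmClosure of string   (* closure returned by JIT-compiled code *)

(* Before the fix: VmClosure falls through to the catch-all. *)
let callable_old = function
  | Lambda _ | NativeFn _ -> true
  | _ -> false

(* After the fix: VmClosure is accepted like any other procedure. *)
let callable = function
  | Lambda _ | NativeFn _ | VmClosure _ -> true
  | _ -> false

(* A guard in the style of scheme-apply's (callable? proc) check. *)
let apply_guard ~callable proc =
  if callable proc then Ok proc else Error "not a procedure"
```

A guest-level guard built on the old predicate rejects a perfectly valid closure as soon as the function that produced it has been JIT-compiled, which is why the fix belongs in the shared predicate rather than in each guest.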

Mechanism

  • jit-exclude! now accepts a trailing * wildcard → namespace-prefix exclusion (Sx_types.jit_excluded_prefixes, checked in jit_compile_lambda for both JIT entry points). One declaration per guest, robust vs name-lists (which missed e.g. the erlang vm/dispatcher).
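
The exact-name and prefix forms can be sketched as two lookups behind one predicate. This is an illustration under assumptions, not the Sx_types implementation: the container types, the jit_exclude and is_jit_excluded functions, and example names such as er-vm-dispatch are hypothetical.

```ocaml
(* Stand-ins for Sx_types.jit_excluded and Sx_types.jit_excluded_prefixes. *)
let jit_excluded : (string, unit) Hashtbl.t = Hashtbl.create 16
let jit_excluded_prefixes : string list ref = ref []

(* Register one declaration: "name" is an exact match,
   "prefix*" excludes every function whose name starts with prefix. *)
let jit_exclude pattern =
  let len = String.length pattern in
  if len > 0 && pattern.[len - 1] = '*' then
    jit_excluded_prefixes :=
      (String.sub pattern 0 (len - 1)) :: !jit_excluded_prefixes
  else
    Hashtbl.replace jit_excluded pattern ()

(* Consulted before compiling a lambda, whichever entry point asked. *)
let is_jit_excluded name =
  Hashtbl.mem jit_excluded name
  || List.exists
       (fun prefix -> String.starts_with ~prefix name)
       !jit_excluded_prefixes
```

A prefix declaration keeps covering functions added later in other files, which a hand-maintained name list does not.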

Per-guest exclusions added (in each guest's runtime, loaded with it)

| Guest | Declaration | Status under opt-in JIT |
|---|---|---|
| smalltalk | name-list (dispatch core) + pharo-test-class | 847/847 == CEK |
| scheme | (jit-exclude! "scheme-*" "scm-*") | flow 166/166 == CEK |
| erlang | (jit-exclude! "er-*" "erlang-*") | 530/530 == CEK, no hang |
| prolog | (jit-exclude! "pl-*") | 590/590 == CEK |
| common-lisp | (jit-exclude! "cl-*" "clos-*") | residual: 6 fail (advanced suites) |
| js | (jit-exclude! "js-*") | (verifying) |
| haskell | (jit-exclude! "hk-*") | (verifying) |

Not JIT-related (fail identically on CEK and JIT, pre-existing): lua 0/16, tcl 3/4. apl/datalog/forth/ocaml: clean under JIT as-is (no continuations).

Safe-default gate

Serving-mode JIT is now opt-in via SX_SERVING_JIT=1 (default OFF) in sx_server.ml. Default behavior is unchanged (no JIT in epoch serving) ⇒ zero regression for every sibling loop's conformance. The content/Smalltalk page server opts in. This bounds risk: guests are validated and excluded incrementally; until then the default protects them. Common-Lisp's advanced suites still need investigation before CL is opt-in-clean.
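
An opt-in gate of this kind can be sketched as follows. The assumption here is that only the exact value "1" enables the JIT; serving_jit_enabled and maybe_register_jit are hypothetical names, and the register callback stands in for the call to register_jit_hook env.

```ocaml
(* True only when SX_SERVING_JIT is exactly "1" (assumed); unset or any
   other value keeps the default, which is OFF. [getenv] is injectable
   so the decision can be tested without touching the real environment. *)
let serving_jit_enabled ?(getenv = Sys.getenv_opt) () =
  match getenv "SX_SERVING_JIT" with
  | Some "1" -> true
  | _ -> false

(* Install the hook only when opted in; report whether it was installed. *)
let maybe_register_jit ?getenv ~register () =
  if serving_jit_enabled ?getenv () then begin
    register ();
    true
  end else
    false
```

Keeping the default path free of any JIT call is what makes the gate safe for sibling loops: with the variable unset, the serving branch behaves as it did before the hook existed.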