JIT bytecode correctness — enable the JIT in serving mode
Kickoff handed over from the host-on-sx loop (2026-06-19). This is the highest-leverage perf win on the platform.
Why this matters
Every SX-on-SX subsystem runs interpreted on the tree-walking CEK: the
Smalltalk runtime (→ content-on-sx rendering), and the guest languages
(Datalog, Prolog, APL, Scheme, Haskell, Erlang, Maude). The lazy JIT
(register_jit_hook → bytecode VM) would speed all of them up ~10–60×. It is
currently only installed in --http page-server mode, not the epoch /
http-listen serving mode — because it miscompiles these workloads.
Concrete impact: the host serves a blog post (content/html, interpreted
Smalltalk) in ~2 seconds per request. With a correct JIT it should be tens
of ms. Same slowdown applies to every guest-language-backed service.
Concrete repro (from the host loop)
In hosts/ocaml/bin/sx_server.ml, the persistent server mode (make_server_env,
~line 4871) does not call register_jit_hook env — only the --http mode
(~line 4034) does. To reproduce the miscompile:
- Add `register_jit_hook env;` right after `let env = make_server_env () in` in the persistent server-mode branch (~4871).
- Rebuild: `eval $(opam env --switch=5.2.0); dune build bin/sx_server.exe`.
- Run a Smalltalk/content-heavy suite, e.g. the host-on-sx conformance (`bash /root/rose-ash-loops/host/lib/host/conformance.sh`, or any content-on-sx suite).

With the hook ON, tests FAIL: host-on-sx dropped to router 3/6, feed 4/11, relations 9/16, blog 4/11. With the hook OFF: all green.
So the JIT produces wrong results (the known "compiled compiler helpers loop
on complex nested ASTs" — see memory project_jit_bytecode_bug).
Goal
Make the JIT compile the Smalltalk-on-SX evaluator + guest-language evaluators
correctly, so register_jit_hook can be enabled in serving mode with
conformance fully green. Then enable it there.
Suggested approach
- Minimal repro to bisect: render a `lib/content` doc via `content/html` with JIT ON vs OFF, diff the output, find the first divergence.
- Localize with the VM debugging tools (see CLAUDE.md): `(vm-trace ...)`, `(bytecode-inspect ...)`, `(prim-check ...)`, `(deps-check ...)`.
- Likely suspects: nested closures / TCO, dict construction, `st-send` dispatch patterns, recursion through the Smalltalk method interpreter.
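
The "diff the output, find the first divergence" step can be sketched as a small line-by-line comparison. This is an illustrative helper only: `first_divergence` is a hypothetical name, not something in the repo, and it assumes both renders are available as plain strings.

```ocaml
(* Hypothetical helper: find the first line where two renders differ.
   Returns (1-based line number, line from [a], line from [b]). *)
let first_divergence (a : string) (b : string) : (int * string * string) option =
  let la = String.split_on_char '\n' a
  and lb = String.split_on_char '\n' b in
  let rec go i xs ys =
    match xs, ys with
    | [], [] -> None
    | x :: xs', y :: ys' when String.equal x y -> go (i + 1) xs' ys'
    | x :: _, y :: _ -> Some (i, x, y)
    | x :: _, [] -> Some (i, x, "<missing>")
    | [], y :: _ -> Some (i, "<missing>", y)
  in
  go 1 la lb
```

Feeding it the JIT-OFF render as `a` and the JIT-ON render as `b` gives a single line to start tracing from, rather than a full diff.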
Pointers
- `register_jit_hook`: `sx_server.ml` ~1493; JIT VM-suspend/resolve path ~1497–1514.
- `hosts/ocaml/lib/sx_vm.ml`: the bytecode VM + compiler.
- `plans/jit-cache-architecture.md`, `plans/jit-perf-regression.md`, `restore-jit-perf.sh`.
- Memory: `project_jit_bytecode_bug.md` (plan ref `plans/reflective-rolling-treehouse.md`).
- The shared `sx_server.exe` binary is used by ALL loops; coordinate before changing VM semantics that could affect sibling conformance runs.
Resolution (2026-06-19, loop loops/sx-vm-extensions)
JIT is now enabled in the persistent (epoch) serving mode (register_jit_hook
in sx_server.ml's server-mode branch). Smalltalk conformance is 847/847 —
identical to the no-JIT baseline (no failures, no double-counted rows).
Datalog conformance (a non-continuation guest) is 356/356 under JIT.
Five distinct root causes were found and fixed (not one "miscompile"):
- Serving mode never loaded `lib/compiler.sx`. The JIT then used the native `Sx_compiler.compile` stub, which emits arity-0 bytecode with every parameter compiled as `GLOBAL_GET` → "VM undefined: " on the first call of essentially every function. `http`/`cli`/`site` modes already load `compiler.sx`; the epoch serving branch now does too (before the hook). Fix: `sx_server.ml` server-mode branch loads `lib/compiler.sx`.
- `compile-cond` / `compile-case-clauses` / `compile-guard-clauses` only treated the keyword `:else` and `true` as the catch-all, not the bare symbol `else` that the CEK's `is-else-clause?` accepts. They emitted `GLOBAL_GET "else"` → runtime "VM undefined: else". Fix: `lib/compiler.sx`, add the symbol-`else` case to all three.
- `OP_DIV` produced a float for non-divisible Integer/Integer (`1/2` → 0.5) instead of the exact `Rational` the `/` primitive returns → diverged from CEK and broke equality vs rational results. Fix: `sx_vm.ml`, delegate non-divisible int/int to the `/` primitive.
- `OP_EQ` / `_fast_eq` lacked `Rational` / `ListRef` cases that the real `=` primitive's `safe_eq` has → `(= 1/2 1/2)` was false under JIT. Fix: `OP_EQ` delegates non-trivial types to the `=` primitive; `_fast_eq` (also used by `prim_call "="`) gained rational + ListRef cases.
- Continuation-based control flow can't run in the stack VM. Smalltalk's non-local return (`^expr`), block escape, and exception unwinding use `call/cc`; a JIT-compiled frame between a `call/cc` capture and its `(k v)` invocation cannot transfer control and (via the hook's re-run-on-failure) double-executes side effects. Fix: a general, data-driven exclusion set, `Sx_types.jit_excluded`, populated from SX via the new `jit-exclude!` primitive and consulted in `jit_compile_lambda` so it covers BOTH JIT entry points (CEK hook + in-VM tiered path). `lib/smalltalk/eval.sx` self-declares its continuation-using dispatch core interpret-only; pure helpers (parsing, lookup, formatting, arithmetic) still JIT. One SUnit suite-runner test helper (`pharo-test-class`) miscompiles under JIT on a specific iteration and is excluded in the test prelude (`tests/tokenize.sx`).
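
To make the `OP_DIV` / `OP_EQ` causes concrete, here is a minimal sketch of the intended semantics on a toy numeric type. The type `num` and the functions `div_exact` and `num_eq` are illustrative assumptions; the real VM value type and the `/` and `=` primitives in `sx_vm.ml` are richer and are not reproduced here.

```ocaml
(* Toy numeric values; an assumption for illustration, not the VM's type. *)
type num = Int of int | Rational of int * int | Float of float

let rec gcd a b = if b = 0 then abs a else gcd b (a mod b)

(* Exact division: an Int when divisible, otherwise a normalized Rational
   with a positive denominator. Never falls back to a float. *)
let div_exact (a : int) (b : int) : num =
  if b = 0 then raise Division_by_zero
  else if a mod b = 0 then Int (a / b)
  else
    let g = gcd a b in
    let n, d = a / g, b / g in
    if d < 0 then Rational (-n, -d) else Rational (n, d)

(* Equality with a Rational case. A fast path that only knows ints and
   floats would answer false for two equal rationals. Mixed-type
   comparison is simply false in this toy. *)
let num_eq (x : num) (y : num) : bool =
  match x, y with
  | Int a, Int b -> a = b
  | Float a, Float b -> a = b
  | Rational (n1, d1), Rational (n2, d2) -> n1 * d2 = n2 * d1
  | _ -> false
```

With these definitions `div_exact 1 2` is `Rational (1, 2)` rather than `Float 0.5`, and `num_eq (div_exact 1 2) (div_exact 2 4)` holds, which is the behaviour the CEK path already had.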
Known residual / follow-up
- The hook still re-runs a failed VM execution via CEK (always yields the correct result, but can duplicate side effects if a JIT'd function fails mid-run after a side effect). `run_tests`'s hook instead propagates non-IO / non-"VM undefined" exceptions. Adopting that propagate-don't-rerun semantics in the serving hook would remove the double-execution class entirely, but it surfaces genuine mid-run miscompiles as errors, so it must land together with fixing/excluding any function that miscompiles mid-run (e.g. `pharo-test-class`). Deferred to avoid changing shared VM/CEK semantics under this loop.
- Other continuation-heavy guests (Scheme, Erlang use `call/cc`) will need their own `jit-exclude!` declarations for their dispatch cores; the mechanism is in place. Non-continuation guests (Datalog/Prolog/Haskell/APL) JIT as-is.
- A debug aid was added to the serving hook: `SX_JIT_DENY=name,...` / `SX_JIT_ONLY=name,...` env vars to bisect which named lambda the VM mishandles (hook-path only).
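
A rough sketch of how a DENY/ONLY bisect filter like this could work is below. The exact semantics in `sx_server.ml` are not stated in this note, so the sketch assumes `SX_JIT_ONLY`, when set, acts as an allow-list and `SX_JIT_DENY` as a block-list; `parse_names`, `jit_allowed`, and `filters_from_env` are hypothetical names.

```ocaml
(* Parse "a,b,c" into a list of names; blanks and empty entries dropped. *)
let parse_names (s : string) : string list =
  String.split_on_char ',' s
  |> List.map String.trim
  |> List.filter (fun n -> n <> "")

(* Assumed semantics: ONLY (when set) is an allow-list, DENY a block-list. *)
let jit_allowed ~(only : string list option) ~(deny : string list)
    (name : string) : bool =
  let in_only =
    match only with None -> true | Some ns -> List.mem name ns
  in
  in_only && not (List.mem name deny)

(* Read both variables from the environment; unset DENY means no denials. *)
let filters_from_env () : string list option * string list =
  let get v = Option.map parse_names (Sys.getenv_opt v) in
  (get "SX_JIT_ONLY", Option.value (get "SX_JIT_DENY") ~default:[])
```

Bisecting then amounts to halving the name list in one of the two variables until a single lambda flips the suite between green and red.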
Guest-loop regression sweep + safe-default gate (2026-06-19, follow-up)
Host-loop verification found that enabling serving-mode JIT globally regresses continuation-based guest interpreters (the epoch serving mode is the shared command channel for every loop's conformance runner). Failure modes:
- VmClosure not callable: a JIT'd higher-order function returns its inner closure as a `VmClosure`; the native `callable?` predicate didn't list `VmClosure`, so `scheme-apply`'s `(callable? proc)` guard rejected it ("scheme-eval: not a procedure: vm:anon"). FIXED generally: `callable?` (all 4 bindings) now accepts `VmClosure`.
- Continuation escape (Scheme `call/cc`, Erlang receive, CL conditions, JS exceptions): a JIT'd frame can't transfer control through a CEK continuation.
- Non-terminating miscompile (HANG): Erlang/Prolog/Haskell recursive evaluators miscompiled into an infinite loop (worse than an error: can't fall back).
Mechanism
`jit-exclude!` now accepts a trailing `*` wildcard → namespace-prefix exclusion (`Sx_types.jit_excluded_prefixes`, checked in `jit_compile_lambda` for both JIT entry points). One declaration per guest, more robust than name-lists (which missed e.g. the erlang `vm/` dispatcher).
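
The exact-name plus trailing-`*` prefix lookup can be sketched as follows. The two tables stand in for `Sx_types.jit_excluded` and `Sx_types.jit_excluded_prefixes`; their real container types are not given in this note, so the `Hashtbl` and list used here are assumptions, as are the function names `jit_exclude` and `is_jit_excluded`.

```ocaml
(* Stand-ins for Sx_types.jit_excluded / jit_excluded_prefixes; the real
   definitions may use different containers. *)
let jit_excluded : (string, unit) Hashtbl.t = Hashtbl.create 16
let jit_excluded_prefixes : string list ref = ref []

(* Register one pattern: "name" is exact, "ns-*" excludes a whole prefix. *)
let jit_exclude (pattern : string) : unit =
  let n = String.length pattern in
  if n > 0 && pattern.[n - 1] = '*' then
    (jit_excluded_prefixes :=
       String.sub pattern 0 (n - 1) :: !jit_excluded_prefixes)
  else Hashtbl.replace jit_excluded pattern ()

(* The check a compile entry point would make before JIT-compiling [name]. *)
let is_jit_excluded (name : string) : bool =
  Hashtbl.mem jit_excluded name
  || List.exists
       (fun p -> String.starts_with ~prefix:p name)
       !jit_excluded_prefixes
```

After `jit_exclude "scheme-*"`, any `scheme-...` function is interpret-only without being listed individually, which is what makes one declaration per guest sufficient.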
Per-guest exclusions added (in each guest's runtime, loaded with it)
| Guest | Declaration | Status under opt-in JIT |
|---|---|---|
| smalltalk | name-list (dispatch core) + `pharo-test-class` | 847/847 == CEK |
| scheme | `(jit-exclude! "scheme-*" "scm-*")` | flow 166/166 == CEK |
| erlang | `(jit-exclude! "er-*" "erlang-*")` | 530/530 == CEK, no hang |
| prolog | `(jit-exclude! "pl-*")` | 590/590 == CEK |
| common-lisp | `(jit-exclude! "cl-*" "clos-*")` | residual: 6 fail (advanced suites) |
| js | `(jit-exclude! "js-*")` | (verifying) |
| haskell | `(jit-exclude! "hk-*")` | (verifying) |
Not JIT-related (fail identically on CEK and JIT, pre-existing): lua 0/16, tcl 3/4. apl/datalog/forth/ocaml: clean under JIT as-is (no continuations).
Safe-default gate
Serving-mode JIT is now opt-in via SX_SERVING_JIT=1 (default OFF) in
sx_server.ml. Default behavior is unchanged (no JIT in epoch serving) ⇒
zero regression for every sibling loop's conformance. The content/Smalltalk
page server opts in. This bounds risk: guests are validated and excluded
incrementally; until then the default protects them. Common-Lisp's advanced
suites still need investigation before CL is opt-in-clean.
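
A minimal sketch of a default-OFF gate of this kind is shown below. It assumes the check is an exact match on the string `"1"`; the note does not say how other values are treated. `jit_gate`, `serving_jit_enabled`, and `maybe_install_jit` are hypothetical names, and `install` stands in for the call to `register_jit_hook env`.

```ocaml
(* Pure decision function: only the exact value "1" turns the gate on.
   (Assumption: unset or any other value leaves serving-mode JIT off.) *)
let jit_gate (v : string option) : bool = v = Some "1"

let serving_jit_enabled () : bool =
  jit_gate (Sys.getenv_opt "SX_SERVING_JIT")

(* Hypothetical wiring: [install] stands in for register_jit_hook.
   Returns whether the hook was installed. *)
let maybe_install_jit (install : unit -> unit) : bool =
  let on = serving_jit_enabled () in
  if on then install ();
  on
```

Keeping the decision in a pure function makes the default easy to check in isolation: `jit_gate None` is `false`, so a loop that never sets the variable never sees the JIT.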
guard / handler-bind under JIT — central recursive PUSH_HANDLER scan (2026-06-20)
Combined-binary integration (my JIT + host render-page) surfaced a third
JIT-unsafe class beyond guest dispatch cores: guard-based error handling.
The VM's OP_PUSH_HANDLER (compiled guard) only intercepts a VM-level
RAISE (opcode 37) — it does NOT catch the OCaml Eval_error the error
primitive throws from a CALL/CALL_PRIM in a callee frame. So a JIT-compiled
guard silently fails to catch; the thrown error escapes across the JIT frame.
- SOLID break: `host/wrap-errors` -> `dream-catch-with` (curried: `(fn (on-error) (fn (next) (fn (req) (guard ...))))`): middleware suite 7/9 under JIT (9/9 CEK); "kaboom" escaped as Unhandled exception, NOT fallback-saved (the guard is in an outer frame, the throw in an inner one).
- LATENT (turned out harmless): `host/blog--render-node`'s `guard`: it JIT-failed, then the hook RE-RAN it on CEK where the guard caught (pure render, no duplicated effects). This is the double-execution residual firing live.
Fix: code_uses_handler scans a JIT candidate's bytecode recursively
(including nested closure code in the constant pool) for OP_PUSH_HANDLER;
jit_compile_lambda skips JIT for any match. The recursion is essential —
curried dream-catch-with has no PUSH_HANDLER in its own body; the guard is in
a nested OP_CLOSURE. Verified: direct + curried cross-frame guards catch
under JIT; host "kaboom" escapes 2 -> 0.
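
The recursive scan can be sketched on a toy bytecode representation. This is an assumption-heavy illustration: the real `code_uses_handler` works on raw opcodes and must step over operands, whereas here instructions are plain variants and the types `instr`, `code`, and `const` are invented for the example.

```ocaml
(* Toy bytecode: instructions as variants instead of raw opcode bytes. *)
type instr =
  | Push_handler
  | Closure of int (* index of nested code in the constant pool *)
  | Other

type code = { instrs : instr list; constants : const array }
and const = Code of code | Lit of string

(* True if this code, or any closure code nested in its constant pool,
   contains a Push_handler instruction. *)
let rec code_uses_handler (c : code) : bool =
  List.mem Push_handler c.instrs
  || Array.exists
       (function Code inner -> code_uses_handler inner | Lit _ -> false)
       c.constants
```

A curried function shaped like `dream-catch-with` has only `Closure` instructions at the top level, so a non-recursive check of `c.instrs` alone would miss the guard two levels down.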
Remaining (documented, gated): the double-execution residual
The serving hook still re-runs a failed VM execution via CEK (correct result, duplicated side effects if the function is impure and fails mid-run). The guard fix removes the common trigger (guard functions no longer JIT). The clean general fix is propagate-don't-rerun (run_tests' hook semantics), but that surfaces genuine mid-run miscompiles as errors and must land with fixing/excluding those; it is deferred (shared CEK/VM change). The default-OFF gate makes all of this opt-in, so nothing regresses by default.
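
The difference between the two hook policies can be shown with a toy model. Nothing here is the real hook: `run_on_vm` and `run_on_cek` are stand-ins for the two execution paths, `Vm_failure` is an invented exception, and the counter plays the role of an observable side effect.

```ocaml
exception Vm_failure

(* An impure function on the "VM" path: performs its side effect,
   then fails mid-run. *)
let run_on_vm (counter : int ref) : int =
  incr counter;
  raise Vm_failure

(* The interpreter path performs the same side effect and succeeds. *)
let run_on_cek (counter : int ref) : int =
  incr counter;
  42

(* Rerun-on-failure: correct result, but the side effect happens twice. *)
let rerun_hook (counter : int ref) : int =
  try run_on_vm counter with Vm_failure -> run_on_cek counter

(* Propagate-don't-rerun: the failure surfaces; no second execution. *)
let propagate_hook (counter : int ref) : (int, string) result =
  try Ok (run_on_vm counter) with Vm_failure -> Error "vm failure"
```

In this model the rerun policy returns 42 with the counter at 2, while the propagate policy leaves the counter at 1 but returns an error, which is why it can only be adopted once mid-run failures are fixed or excluded.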
common-lisp residual resolved — call/cc-caller exclusion (2026-06-28)
Investigated the 6 CL opt-in-JIT failures. Findings:
- geometry / mop-trace (0/0) are NOT JIT regressions — they error "Undefined symbol: refl-class-chain-depth-with" on BOTH CEK and JIT (the CLOS suites in conformance.sh don't preload lib/guest/reflective/class-chain.sx). Pre-existing harness gap; not counted in the 6.
- The 6 real failures (parse-recover 4, interactive-debugger 2) were all condition-system continuation escape. cl-restart-case/cl-handler-case/cl-handler-bind wrap their body in call/cc. When an SX function driving the condition system (parse-numbers, make-policy-debugger) is JIT-compiled, the call/cc form runs in a NESTED cek-run where invoking the captured continuation runs-to-completion-and-returns instead of escaping → restart fails to abort, body falls through. Seen as accumulation ((1 3 0 3) vs (1 3)) and no-abort (999 sentinel). Also produced a +3 double-execution over-count (490 vs 487).
Fix: a third interpret-only signal beyond name/prefix and PUSH_HANDLER —
jit-exclude-callers-of! registers call/cc-establishing/invoking form names;
jit_compile_lambda skips any function whose constant pool (recursively)
references one (code_refs_escaping_caller). Guarded so it's a no-op for guests
that don't register. CL registers cl-restart-case/cl-handler-case/cl-handler-bind
(establish) + cl-invoke-restart/cl-invoke-debugger/cl-signal/cl-error-with-debugger
(invoke). Result: CL under SX_SERVING_JIT=1 = 487/0, exactly matching CEK.
The three interpret-only signals now: (1) name / "ns-*" prefix [jit-exclude!], (2) PUSH_HANDLER in bytecode [guard users, structural], (3) references a registered escaping form [call/cc-establishing callers]. Together they cover the continuation-unsafe surface without a deep VM continuation rewrite.
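
Signal (3) can be sketched as a recursive constant-pool walk. As with the earlier sketches this is a toy model under stated assumptions: the `const` and `code` types are invented, symbol references are modelled as a `Sym` constant, and the registered forms are passed as a plain list rather than read from wherever `jit-exclude-callers-of!` actually stores them.

```ocaml
(* Toy constant pool: symbol references, nested closure code, literals. *)
type const = Sym of string | Code of code | Lit of string
and code = { constants : const list }

(* True if the constant pool (recursively) names a registered escaping
   form. With nothing registered the check short-circuits to false, so
   guests that never register are unaffected. *)
let rec code_refs_escaping_caller (escaping : string list) (c : code) : bool =
  escaping <> []
  && List.exists
       (function
         | Sym s -> List.mem s escaping
         | Code inner -> code_refs_escaping_caller escaping inner
         | Lit _ -> false)
       c.constants
```

A function such as `parse-numbers` that mentions `cl-restart-case` only inside a nested lambda is still caught, while a pure helper whose pool holds only arithmetic symbols stays eligible for JIT.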