Merge branch 'loops/sx-vm-extensions' into scratch/host-jit
# Conflicts:
#	hosts/ocaml/bin/sx_server.ml
#	lib/erlang/runtime.sx
236
plans/jit-bytecode-correctness.md
Normal file
@@ -0,0 +1,236 @@
# JIT bytecode correctness: enable the JIT in serving mode

> Kickoff handed over from the **host-on-sx** loop (2026-06-19). This is the
> highest-leverage perf win on the platform.

## Why this matters

Every SX-on-SX subsystem runs **interpreted on the tree-walking CEK**: the
Smalltalk runtime (→ content-on-sx rendering), and the guest languages
(Datalog, Prolog, APL, Scheme, Haskell, Erlang, Maude). The lazy JIT
(`register_jit_hook` → bytecode VM) would speed all of them up ~10–60×. It is
currently **only installed in `--http` page-server mode**, not the epoch /
`http-listen` serving mode, because it **miscompiles** these workloads.

Concrete impact: the host serves a blog post (`content/html`, interpreted
Smalltalk) in **~2 seconds per request**. With a correct JIT it should be tens
of ms. The same slowdown applies to every guest-language-backed service.

## Concrete repro (from the host loop)

In `hosts/ocaml/bin/sx_server.ml`, the persistent server mode (`make_server_env`,
~line 4871) does **not** call `register_jit_hook env`; only the `--http` mode
(~line 4034) does. To reproduce the miscompile:

1. Add `register_jit_hook env;` right after `let env = make_server_env () in` in
   the persistent server-mode branch (~4871).
2. Rebuild: `eval $(opam env --switch=5.2.0); dune build bin/sx_server.exe`.
3. Run a Smalltalk/content-heavy suite, e.g. the host-on-sx conformance
   (`bash /root/rose-ash-loops/host/lib/host/conformance.sh`, or any
   content-on-sx suite). **With the hook ON, tests FAIL**: host-on-sx dropped to
   `router 3/6, feed 4/11, relations 9/16, blog 4/11`. With the hook OFF: all green.

So the JIT produces **wrong results** (the known "compiled compiler helpers loop
on complex nested ASTs" issue; see memory `project_jit_bytecode_bug`).

## Goal

Make the JIT compile the Smalltalk-on-SX evaluator + guest-language evaluators
**correctly**, so `register_jit_hook` can be enabled in serving mode with
conformance **fully green**. Then enable it there.

## Suggested approach

- Minimal repro to bisect: render a `lib/content` doc via `content/html` with JIT
  ON vs OFF, diff the output, find the first divergence.
- Localize with the VM debugging tools (see CLAUDE.md): `(vm-trace ...)`,
  `(bytecode-inspect ...)`, `(prim-check ...)`, `(deps-check ...)`.
- Likely suspects: nested closures / TCO, dict construction, `st-send` dispatch
  patterns, recursion through the Smalltalk method interpreter.
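
The first bullet (diff the JIT-ON and JIT-OFF renders and find the first divergence) can be sketched as a small helper. This is only an illustration in Python, assuming both renders have already been captured as strings; `first_divergence` is a hypothetical name, not part of the SX tooling.

```python
from itertools import zip_longest


def first_divergence(expected: str, actual: str):
    """Return (line_no, expected_line, actual_line) for the first differing
    line, or None when both outputs are identical.

    `expected` would be the JIT-OFF (CEK) render, `actual` the JIT-ON render.
    A missing line on either side is reported as None.
    """
    pairs = zip_longest(expected.splitlines(), actual.splitlines())
    for line_no, (exp, act) in enumerate(pairs, start=1):
        if exp != act:
            return line_no, exp, act
    return None


if __name__ == "__main__":
    cek = "<h1>Post</h1>\n<p>1/2</p>\n<footer/>"
    jit = "<h1>Post</h1>\n<p>0.5</p>\n<footer/>"
    print(first_divergence(cek, jit))
```

Reporting only the first differing line keeps the bisect focused: later differences are often knock-on effects of the first one.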

## Pointers

- `register_jit_hook`: `sx_server.ml` ~1493; JIT VM-suspend/resolve path ~1497–1514.
- `hosts/ocaml/lib/sx_vm.ml`: the bytecode VM + compiler.
- `plans/jit-cache-architecture.md`, `plans/jit-perf-regression.md`, `restore-jit-perf.sh`.
- Memory: `project_jit_bytecode_bug.md` (plan ref `plans/reflective-rolling-treehouse.md`).
- The shared `sx_server.exe` binary is used by ALL loops; coordinate before
  changing VM semantics that could affect sibling conformance runs.

---

## Resolution (2026-06-19, loop loops/sx-vm-extensions)

JIT is now enabled in the persistent (epoch) serving mode (`register_jit_hook`
in `sx_server.ml`'s server-mode branch). Smalltalk conformance is **847/847,
identical to the no-JIT baseline** (no failures, no double-counted rows).
Datalog conformance (a non-continuation guest) is **356/356** under JIT.

Five distinct root causes were found and fixed (not one "miscompile"):

1. **Serving mode never loaded `lib/compiler.sx`.** The JIT then used the
   native `Sx_compiler.compile` stub, which emits arity-0 bytecode with every
   parameter compiled as `GLOBAL_GET` → "VM undefined: <param>" on the first
   call of essentially every function. `http`/`cli`/`site` modes already load
   `compiler.sx`; the epoch serving branch now does too (before the hook).
   *Fix: `sx_server.ml` server-mode branch loads `lib/compiler.sx`.*

2. **`compile-cond`/`compile-case-clauses`/`compile-guard-clauses` only treated
   the keyword `:else` and `true` as the catch-all**, not the bare symbol
   `else` that the CEK's `is-else-clause?` accepts. They emitted
   `GLOBAL_GET "else"` → runtime "VM undefined: else".
   *Fix: `lib/compiler.sx`: add the symbol-`else` case to all three.*

3. **`OP_DIV` produced a float for non-divisible Integer/Integer** (`1/2` → 0.5)
   instead of the exact `Rational` the `/` primitive returns → diverged from CEK
   and broke equality vs rational results.
   *Fix: `sx_vm.ml`: delegate non-divisible int/int to the `/` primitive.*

4. **`OP_EQ` / `_fast_eq` lacked `Rational`/`ListRef` cases** that the real `=`
   primitive's `safe_eq` has → `(= 1/2 1/2)` was false under JIT.
   *Fix: `OP_EQ` delegates non-trivial types to the `=` primitive;
   `_fast_eq` (also used by `prim_call "="`) gained rational + ListRef cases.*

5. **Continuation-based control flow can't run in the stack VM.** Smalltalk's
   non-local return (`^expr`), block escape, and exception unwinding use
   `call/cc`; a JIT-compiled frame between a `call/cc` capture and its `(k v)`
   invocation cannot transfer control and (via the hook's re-run-on-failure)
   double-executes side effects.
   *Fix: a general, data-driven exclusion set, `Sx_types.jit_excluded`,
   populated from SX via the new `jit-exclude!` primitive, consulted in
   `jit_compile_lambda` so it covers BOTH JIT entry points (CEK hook + in-VM
   tiered path). `lib/smalltalk/eval.sx` self-declares its continuation-using
   dispatch core interpret-only; pure helpers (parsing, lookup, formatting,
   arithmetic) still JIT.* One SUnit suite-runner test helper
   (`pharo-test-class`) miscompiles under JIT on a specific iteration and is
   excluded in the test prelude (`tests/tokenize.sx`).
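
Root cause 3 (a fast-path opcode disagreeing with the reference primitive) can be illustrated outside the VM. The sketch below is a Python model, not the OCaml code in `sx_vm.ml`: `prim_div`, `op_div_buggy` and `op_div_fixed` are hypothetical names, and `fractions.Fraction` stands in for the SX `Rational` type.

```python
from fractions import Fraction


def prim_div(a, b):
    """Model of the reference `/` primitive: exact for int/int, float otherwise.

    Assumption for this sketch: an exact quotient with denominator 1 is
    returned as a plain int.
    """
    if isinstance(a, int) and isinstance(b, int):
        q = Fraction(a, b)
        return q.numerator if q.denominator == 1 else q
    return a / b


def op_div_buggy(a, b):
    """Model of the old opcode: integer fast path, otherwise float division."""
    if isinstance(a, int) and isinstance(b, int) and a % b == 0:
        return a // b
    return a / b  # 1/3 becomes an inexact float here


def op_div_fixed(a, b):
    """Model of the fix: keep the fast path, delegate the rest to the primitive."""
    if isinstance(a, int) and isinstance(b, int) and a % b == 0:
        return a // b
    return prim_div(a, b)


if __name__ == "__main__":
    print(op_div_buggy(1, 3), op_div_fixed(1, 3))
```

Delegating only the non-divisible case keeps the common integer path cheap while guaranteeing that the opcode and the primitive can never disagree on the result type.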

### Known residual / follow-up
- The hook still **re-runs a failed VM execution via CEK** (always yields the
  correct result, but can duplicate side effects if a JIT'd function fails
  mid-run after a side effect). `run_tests`'s hook instead propagates non-IO /
  non-"VM undefined" exceptions. Adopting that propagate-don't-rerun semantics
  in the serving hook would remove the double-execution class entirely, but it
  surfaces genuine mid-run miscompiles as errors, so it must land together
  with fixing/excluding any function that miscompiles mid-run (e.g.
  `pharo-test-class`). Deferred to avoid changing shared VM/CEK semantics under
  this loop.
- Other continuation-heavy guests (Scheme and Erlang both use `call/cc`) will
  need their own `jit-exclude!` declarations for their dispatch cores; the
  mechanism is in place. Non-continuation guests (Datalog/Prolog/Haskell/APL)
  JIT as-is.
- A debug aid was added to the serving hook: `SX_JIT_DENY=name,...` /
  `SX_JIT_ONLY=name,...` env vars to bisect which named lambda the VM
  mishandles (hook-path only).
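
The difference between re-run-on-failure and propagate-don't-rerun can be shown with a toy model. This is a Python sketch under stated assumptions: `VMError`, `hook_rerun`, `hook_propagate` and `make_pair` are hypothetical names, and the propagate variant models only the "VM undefined" fallback (the IO case is left out).

```python
class VMError(Exception):
    """Stand-in for an error raised while running JIT-compiled bytecode."""


def hook_rerun(vm_fn, cek_fn):
    """Serving-hook model: any VM failure re-runs the function on the CEK."""
    try:
        return vm_fn()
    except VMError:
        return cek_fn()


def hook_propagate(vm_fn, cek_fn):
    """run_tests-style model: fall back only for 'VM undefined', else propagate."""
    try:
        return vm_fn()
    except VMError as err:
        if str(err).startswith("VM undefined"):
            return cek_fn()
        raise


def make_pair(log, message):
    """Build a VM version that fails after a side effect, and a CEK version that succeeds."""
    def vm_fn():
        log.append("write")      # side effect happens before the failure
        raise VMError(message)

    def cek_fn():
        log.append("write")      # same side effect, then the correct result
        return "ok"

    return vm_fn, cek_fn


if __name__ == "__main__":
    log = []
    vm_fn, cek_fn = make_pair(log, "mid-run miscompile")
    print(hook_rerun(vm_fn, cek_fn), log)
```

In the re-run model the result is correct but the log holds two writes; in the propagate model the caller sees the error and the log holds one, which is why the switch has to land together with fixing or excluding the functions that fail mid-run.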

---

## Guest-loop regression sweep + safe-default gate (2026-06-19, follow-up)

Host-loop verification found that enabling serving-mode JIT **globally**
regresses continuation-based guest interpreters (the epoch serving mode is the
shared command channel for every loop's conformance runner). Failure modes:
- **VmClosure not callable**: a JIT'd higher-order function returns its inner
  closure as a `VmClosure`; the native `callable?` predicate didn't list
  `VmClosure`, so `scheme-apply`'s `(callable? proc)` guard rejected it
  ("scheme-eval: not a procedure: <vm:anon>"). FIXED generally: `callable?`
  (all 4 bindings) now accepts `VmClosure`.
- **Continuation escape**: Scheme `call/cc`, Erlang receive, CL conditions,
  JS exceptions: a JIT'd frame can't transfer control through a CEK
  continuation.
- **Non-terminating miscompile (HANG)**: Erlang/Prolog/Haskell recursive
  evaluators miscompiled into an infinite loop (worse than an error: can't
  fall back).

### Mechanism
- `jit-exclude!` now accepts a trailing `*` wildcard → namespace-prefix
  exclusion (`Sx_types.jit_excluded_prefixes`, checked in
  `jit_compile_lambda` for both JIT entry points). One declaration per guest,
  more robust than name-lists (which missed e.g. the Erlang `vm/dispatcher`).
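
A minimal model of the name + prefix exclusion check, written in Python for illustration only. The class and method names are hypothetical; the real sets live in `Sx_types` and are consulted by `jit_compile_lambda`.

```python
class JitExclusions:
    """Toy model of an exclusion registry with exact names and `ns-*` prefixes."""

    def __init__(self):
        self.names = set()      # exact function names
        self.prefixes = []      # namespace prefixes, stored without the '*'

    def exclude(self, *patterns):
        """Register patterns; a trailing '*' marks a namespace prefix."""
        for pattern in patterns:
            if pattern.endswith("*"):
                self.prefixes.append(pattern[:-1])
            else:
                self.names.add(pattern)

    def is_excluded(self, name):
        """True when the function must stay interpret-only."""
        if name in self.names:
            return True
        return any(name.startswith(prefix) for prefix in self.prefixes)


if __name__ == "__main__":
    ex = JitExclusions()
    ex.exclude("er-*", "erlang-*", "pharo-test-class")
    for fn in ("er-eval", "erlang-receive", "pharo-test-class", "render-page"):
        print(fn, ex.is_excluded(fn))
```

A prefix declaration covers functions added to the guest later without touching the list, which is the robustness gain over enumerating names.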

### Per-guest exclusions added (in each guest's runtime, loaded with it)
| Guest | Declaration | Status under opt-in JIT |
|-------|-------------|--------------------------|
| smalltalk | name-list (dispatch core) + `pharo-test-class` | 847/847 == CEK |
| scheme | `(jit-exclude! "scheme-*" "scm-*")` | flow 166/166 == CEK |
| erlang | `(jit-exclude! "er-*" "erlang-*")` | 530/530 == CEK, no hang |
| prolog | `(jit-exclude! "pl-*")` | 590/590 == CEK |
| common-lisp | `(jit-exclude! "cl-*" "clos-*")` | residual: 6 fail (advanced suites) |
| js | `(jit-exclude! "js-*")` | (verifying) |
| haskell | `(jit-exclude! "hk-*")` | (verifying) |

Not JIT-related (fail identically on CEK and JIT, pre-existing): lua 0/16,
tcl 3/4. apl/datalog/forth/ocaml: clean under JIT as-is (no continuations).

### Safe-default gate
Serving-mode JIT is now **opt-in via `SX_SERVING_JIT=1` (default OFF)** in
`sx_server.ml`. Default behavior is unchanged (no JIT in epoch serving) ⇒
**zero regression** for every sibling loop's conformance. The content/Smalltalk
page server opts in. This bounds risk: guests are validated and excluded
incrementally; until then the default protects them. Common-Lisp's advanced
suites still need investigation before CL is opt-in-clean.
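
The gate and the `SX_JIT_DENY` / `SX_JIT_ONLY` debug filters could combine roughly as follows. This Python sketch makes several assumptions that the text does not confirm: that only the exact value `1` enables the gate, that `SX_JIT_ONLY` restricts JIT to the listed names, and that `SX_JIT_DENY` removes names from whatever is otherwise allowed. `should_jit` and `parse_names` are hypothetical helpers.

```python
import os


def parse_names(value):
    """Split a comma-separated env value into a set of non-empty names."""
    return {name.strip() for name in (value or "").split(",") if name.strip()}


def should_jit(name, environ=os.environ):
    """Decide whether a named lambda may be JIT-compiled in serving mode."""
    # Default OFF: without the opt-in, serving mode never JITs.
    if environ.get("SX_SERVING_JIT") != "1":
        return False
    only = parse_names(environ.get("SX_JIT_ONLY"))
    deny = parse_names(environ.get("SX_JIT_DENY"))
    # Assumed semantics: a non-empty ONLY list is an allow-list.
    if only and name not in only:
        return False
    return name not in deny


if __name__ == "__main__":
    env = {"SX_SERVING_JIT": "1", "SX_JIT_DENY": "pharo-test-class"}
    print(should_jit("st-parse", env), should_jit("pharo-test-class", env))
```

Checking the gate first means the debug variables have no effect unless serving-mode JIT has been switched on explicitly.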

---

## guard / handler-bind under JIT: central recursive PUSH_HANDLER scan (2026-06-20)

Combined-binary integration (my JIT + host render-page) surfaced a third
JIT-unsafe class beyond guest dispatch cores: **`guard`-based error handling**.
The VM's `OP_PUSH_HANDLER` (compiled `guard`) only intercepts a VM-level
`RAISE` (opcode 37); it does NOT catch the OCaml `Eval_error` the `error`
primitive throws from a CALL/CALL_PRIM in a callee frame. So a JIT-compiled
`guard` silently fails to catch; the thrown error escapes across the JIT frame.

- SOLID break: `host/wrap-errors -> dream-catch-with` (curried:
  `(fn (on-error) (fn (next) (fn (req) (guard ...))))`). Middleware suite
  7/9 under JIT (9/9 CEK), "kaboom" escaped as Unhandled exception, NOT
  fallback-saved (the guard is in an outer frame, the throw in an inner one).
- LATENT (turned out harmless): `host/blog--render-node`'s `guard`. It
  JIT-failed, then the hook RE-RAN it on CEK, where the guard caught (pure
  render, no duplicated effects). This is the double-execution residual firing
  live.

Fix: `code_uses_handler` scans a JIT candidate's bytecode **recursively**
(including nested closure code in the constant pool) for `OP_PUSH_HANDLER`;
`jit_compile_lambda` skips JIT for any match. The recursion is essential:
curried `dream-catch-with` has no PUSH_HANDLER in its own body; the guard is in
a nested `OP_CLOSURE`. Verified: direct + curried cross-frame guards catch
under JIT; host "kaboom" escapes 2 -> 0.
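
Why the scan has to recurse can be seen on a toy bytecode model. The Python sketch below is illustrative only: the `Code` record (a flat list of opcode names plus a constant pool that may hold nested `Code` objects) is an assumed simplification of the VM's real code objects, and this `code_uses_handler` is a model of the idea rather than the OCaml function.

```python
from dataclasses import dataclass, field


@dataclass
class Code:
    """Simplified code object: opcode names plus a constant pool."""
    ops: list
    constants: list = field(default_factory=list)


def code_uses_handler(code):
    """True if this code, or any closure nested in its constants, pushes a handler."""
    if "PUSH_HANDLER" in code.ops:
        return True
    return any(
        isinstance(const, Code) and code_uses_handler(const)
        for const in code.constants
    )


if __name__ == "__main__":
    # Curried shape: (fn (on-error) (fn (next) (fn (req) (guard ...))))
    innermost = Code(["PUSH_HANDLER", "CALL", "POP_HANDLER", "RETURN"])
    middle = Code(["CLOSURE", "RETURN"], [innermost])
    outer = Code(["CLOSURE", "RETURN"], [middle])
    print("PUSH_HANDLER" in outer.ops, code_uses_handler(outer))
```

A scan of `outer.ops` alone finds nothing, because the handler lives two closures down; only the recursive walk of the constant pool marks the outer function as interpret-only.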

### Remaining (documented, gated): the double-execution residual
The serving hook still re-runs a failed VM execution via CEK (correct result,
duplicated side effects if the function is impure and fails mid-run). The guard
fix removes the common trigger (guard functions no longer JIT). The clean
general fix is propagate-don't-rerun (run_tests' hook semantics) but that
surfaces genuine mid-run miscompiles as errors and must land with
fixing/excluding those; deferred (shared CEK/VM change). The default-OFF gate
makes all of this opt-in, so nothing regresses by default.

---

## common-lisp residual resolved: call/cc-caller exclusion (2026-06-28)

Investigated the 6 CL opt-in-JIT failures. Findings:
- **geometry / mop-trace (0/0) are NOT JIT regressions**: they error "Undefined
  symbol: refl-class-chain-depth-with" on BOTH CEK and JIT (the CLOS suites in
  `conformance.sh` don't preload `lib/guest/reflective/class-chain.sx`).
  Pre-existing harness gap; not counted in the 6.
- The **6 real failures** (parse-recover 4, interactive-debugger 2) were all
  condition-system continuation escape. `cl-restart-case`/`cl-handler-case`/
  `cl-handler-bind` wrap their body in `call/cc`. When an SX function driving
  the condition system (`parse-numbers`, `make-policy-debugger`) is
  JIT-compiled, the `call/cc` form runs in a NESTED `cek-run` where invoking the
  captured continuation runs to completion and returns instead of escaping →
  restart fails to abort, body falls through. Seen as accumulation
  (`(1 3 0 3)` vs `(1 3)`) and no-abort (999 sentinel). Also produced a +3
  double-execution over-count (490 vs 487).

Fix: a third interpret-only signal beyond name/prefix and PUSH_HANDLER.
`jit-exclude-callers-of!` registers call/cc-establishing/invoking form names;
`jit_compile_lambda` skips any function whose constant pool (recursively)
references one (`code_refs_escaping_caller`). Guarded so it's a no-op for guests
that don't register. CL registers `cl-restart-case`/`cl-handler-case`/
`cl-handler-bind` (establish) + `cl-invoke-restart`/`cl-invoke-debugger`/
`cl-signal`/`cl-error-with-debugger` (invoke). Result: **CL under
SX_SERVING_JIT=1 = 487/0, exactly matching CEK.**
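
The constant-pool reference check can be modelled in the same spirit. This Python sketch assumes that a call to a named form shows up as a string constant in the caller's pool, which is a simplification; the `Code` record and both function names are stand-ins for illustration, not the OCaml implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Code:
    """Simplified code object: opcode names plus a constant pool."""
    ops: list
    constants: list = field(default_factory=list)


def register_escaping_callers(registry, *names):
    """Model of `jit-exclude-callers-of!`: record escaping form names."""
    registry.update(names)


def code_refs_escaping_caller(code, registry):
    """True if the code (or a nested closure) references a registered form."""
    if not registry:
        return False  # no-op for guests that never registered anything
    for const in code.constants:
        if isinstance(const, str) and const in registry:
            return True
        if isinstance(const, Code) and code_refs_escaping_caller(const, registry):
            return True
    return False


if __name__ == "__main__":
    registry = set()
    register_escaping_callers(registry, "cl-restart-case", "cl-invoke-restart")
    body = Code(["GLOBAL_GET", "CALL", "RETURN"], ["cl-restart-case"])
    parse_numbers = Code(["CLOSURE", "CALL", "RETURN"], [body])
    print(code_refs_escaping_caller(parse_numbers, registry))
```

Unlike the prefix rule, this signal catches ordinary user functions such as `parse-numbers`, whose names say nothing about the condition system they drive.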

The three interpret-only signals now: (1) name / "ns-*" prefix [`jit-exclude!`],
(2) PUSH_HANDLER in bytecode [guard users, structural], (3) references a
registered escaping form [call/cc-establishing callers]. Together they cover the
continuation-unsafe surface without a deep VM continuation rewrite.