rose-ash/plans/jit-perf-regression.md
giles 59bec68dcc perf: Phase 6 — substrate perf-regression alarm (perf-smoke)
Replaces the watchdog-bump approach with an automated check. The next 5× (or
worse) substrate regression will trip the alarm at build time instead of
hiding behind a deadline bump and only being noticed weeks later.

Components:

* lib/perf-smoke.sx — four micro-benchmarks chosen for distinct substrate
  failure modes: function-call dispatch (fib), env construction (let-chain),
  HO-form dispatch + lambda creation (map-sq), TCO + primitive dispatch
  (tail-loop). Warm-up pass populates JIT cache before the timed pass so we
  measure the steady state.

* scripts/perf-smoke.sh — pipes lib/perf-smoke.sx to sx_server.exe, parses
  per-bench wall-time, asserts each is within FACTOR× of the recorded
  reference (default 5×). `--update` rewrites the reference in-place.

* scripts/sx-build-all.sh — perf-smoke wired in as a post-step after JS
  tests. Hard fail if any benchmark regressed beyond budget.

Reference numbers: minimum across 6 back-to-back runs on this dev machine
under typical concurrent-loop contention (load ~9, 2 vCPU, 7.6 GiB RAM,
OCaml 5.2.0, architecture @ 92f6f187). Documented in
plans/jit-perf-regression.md including how to update them.

The 5× factor is chosen so contention noise (~1–2× variance) doesn't trigger
false alarms but a real ≥5× substrate regression — the kind that motivated
this whole investigation — fails the build immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:23:45 +00:00


JIT performance regression — substrate slowdown after architecture merge

A recent merge into architecture made test runs roughly 30× slower across guest languages — Tcl's lib/tcl/test.sh had to bump its watchdog from 180s → 2400s. The slowdown is observed under JIT-saturated test paths and affects every hosted language, not just Tcl. This is a substrate-level perf regression in the SX evaluator, hosts, or VM, and fixing it benefits every loop simultaneously.

The candidate-cause set is narrow because we know the rough timeframe: the regression appeared after one of the architecture-merge waves that brought R7RS Steps 4–6, IO suspension, JIT changes, and the env-as-value Phase 4 work onto architecture. Bisecting against a known-fast pre-merge commit will pin it.

Branch: architecture (substrate work). Touches spec/, hosts/ocaml/, hosts/javascript/. Do not push to main without explicit instruction.

North star: restore Tcl's test.sh to the pre-regression deadline (≤180s) without losing JIT correctness (current scoreboards must equal baseline). Document the regression mechanism so it doesn't recur silently.

Goals

  1. Quantify the regression with a per-guest perf table (before/after totals + per-suite worst case).
  2. Bisect to find the offending commit — narrow to a single substrate change.
  3. Diagnose the mechanism (JIT cache miss? env scan complexity? frame allocation? continuation snapshot?).
  4. Fix the root cause, not the symptom (do not just bump deadlines).
  5. Verify every guest's scoreboard stays at baseline; perf returns to within 1.5× of pre-regression.
  6. Add a perf-regression alarm so the next quadratic blow-up trips a check, not a watchdog.

Hypotheses (ranked)

Each gets validated or eliminated in Phase 3.

  1. env-as-value churn — Phase 4 changed how environments propagate. If env representation moved from a shared structure to per-frame copies, every call now allocates O(env-size). Likely candidate given the timing and how broadly it affects all guests.
  2. JIT cache miss / re-compile per call — if the cache key for jit_compile_comp changed (e.g. now keys on env or call-site dict), the cache hit-rate may have collapsed. Symptom: every call recompiles. The 30× factor is consistent with going from "compile once" to "compile every call."
  3. Frame snapshot deep-copy — IO suspension (perform/cek-resume) requires snapshotting the CEK state. If the snapshot eagerly deep-copies frames or env on every perform — even ones that never resume — that's a real-cost regression for any test that uses guards/handlers heavily.
  4. Lazy JIT bypassed — project_jit_compilation.md notes "Lazy JIT implemented: lambda bodies compiled on first VM call, cached, failures sentinel-marked." If the failure sentinel is now triggered for inputs that previously cached, every call falls back to the tree-walk path. Inspect project_jit_bytecode_bug.md ("Compiled compiler helpers loop on complex nested ASTs") — the _jit_compiling workaround guard may have widened.
  5. Type-check overhead — strict-mode value-matches-type? calls. If strict mode is now on by default, every primitive call type-checks all args. Unlikely to give 30× but worth ruling out.
  6. Frame representation: lists vs records — sx-improvements.md Step 12 ("Frame records (CEK)") is open. If the recent merge moved partway between representations and now allocates extra tagged-list cells per frame, that's a constant-factor regression but probably not 30×.

Phases

Phase 1 — Reproduce + quantify

  • Pick the canonical workload: lib/tcl/test.sh is the known offender. Also run lib/prolog/conformance.sh, lib/lua/test.sh, lib/haskell/conformance.sh, lib/erlang/conformance.sh for cross-guest data.

  • Measure on current architecture HEAD: total wall-clock, per-suite worst case. Use time bash lib/<guest>/...sh and capture both numbers.

  • Find a known-fast pre-regression commit. Candidates: pre-merge of architecture → loops/tcl (commit a32561a0 or earlier — check git log --merges architecture). Mark this BASELINE_GOOD.

  • Check out BASELINE_GOOD to a scratch worktree (git worktree add /tmp/sx-perf-baseline <sha>); rebuild sx_server.exe; re-run the same suites. Capture totals.

  • Build a perf table:

    | Guest   | Pre-regression total | Current total | Ratio | Pre-regression worst suite | Current worst suite |
    |---------|----------------------|---------------|-------|----------------------------|---------------------|
    | tcl     |                      |               | …×    |                            |                     |
    | prolog  |                      |               | …×    |                            |                     |
    | lua     |                      |               | …×    |                            |                     |
    | haskell |                      |               | …×    |                            |                     |
    | erlang  |                      |               | …×    |                            |                     |
  • If the ratio is uniform (~30× everywhere), it's a substrate-wide bug — fixing it once fixes everything. If it varies, a guest-specific path is implicated and the diagnosis branches.
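Turning the captured user times into the ratio column is a one-liner; a throwaway sketch, assuming each capture has been distilled into sorted `<guest> <user_seconds>` lines (the file format and /tmp paths here are hypothetical, not repo scripts; values are illustrative):

```shell
#!/usr/bin/env bash
# Compute current/baseline user-time ratios from two sorted files of
# "<guest> <user_seconds>" lines (hypothetical format, not a repo script).
ratio_table() {
  join "$1" "$2" | awk '{ printf "%-8s %5.2fx\n", $1, $3 / $2 }'
}

# Illustrative values only.
printf 'lua 2.7\ntcl 19.1\n' > /tmp/base.txt
printf 'lua 4.4\ntcl 17.5\n' > /tmp/cur.txt
ratio_table /tmp/base.txt /tmp/cur.txt
```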

Phase 2 — Bisect

  • git bisect start architecture <BASELINE_GOOD>.
  • Bisect script: rebuild sx_server.exe (cd hosts/ocaml && dune build), run time bash lib/tcl/test.sh with a tight 600s watchdog, mark commit good if total < 1.5× baseline, bad otherwise.
  • Skip merge commits (git bisect skip) when the build fails because of an in-flight intermediate state.
  • Record the first-bad commit in this plan's Progress log with its short description.
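The predicate above can be handed to `git bisect run` as a script along these lines — a sketch, not the committed script: BASELINE_TOTAL is a placeholder to fill in from Phase 1, and exit code 125 is git's convention for "skip this commit". The real script would end with a call to run_predicate so git sees its exit code.

```shell
#!/usr/bin/env bash
# Sketch of a bisect predicate: exit 0 = good, 1 = bad, 125 = skip.
set -u
BASELINE_TOTAL=${BASELINE_TOTAL:-210}   # seconds — placeholder, fill in from Phase 1

# Pure helper so the 1.5x rule is checkable in isolation.
classify_elapsed() {                    # classify_elapsed <elapsed_seconds>
  local threshold=$(( BASELINE_TOTAL * 3 / 2 ))
  (( $1 < threshold )) && echo good || echo bad
}

run_predicate() {
  ( cd hosts/ocaml && dune build ) || return 125   # unbuildable commit → skip
  local start elapsed
  start=$(date +%s)
  timeout 600 bash lib/tcl/test.sh || return 1     # hard 600 s watchdog
  elapsed=$(( $(date +%s) - start ))
  [ "$(classify_elapsed "$elapsed")" = good ]
}
```

Invoked as `git bisect run bash <script>` after the `git bisect start architecture <BASELINE_GOOD>` above.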

Phase 3 — Diagnose

For each surviving hypothesis after Phase 2, validate or eliminate:

  • JIT cache miss check. Add a counter in hosts/ocaml/lib/sx_vm.ml that increments on jit_compile_comp invocations. Run the offending suite. If the counter is >>1 per unique lambda, the cache is missing.
  • Lazy JIT sentinel check. Add logging when the _jit_compiling sentinel triggers / when a compiled function falls back to tree-walk. Quantify how often it happens vs the baseline.
  • env-as-value allocation. Use OCaml's Gc.allocated_bytes before and after a representative call (e.g. (map (fn (x) (* x 2)) (list 1 2 3 4 5 6 7 8 9 10))). Compare allocation per call between baseline and current.
  • Frame snapshot cost. Profile a perform-heavy workload (e.g. Haskell IO tests). Compare time spent in snapshot/restore code paths.
  • Strict mode. Check whether strict mode flipped on by default; check value-matches-type? call frequency.
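For the cache-miss check, if the added counter is surfaced as one log line per jit_compile_comp invocation (this log format is an assumption, not the current output), the per-lambda compile count reduces to a sort | uniq -c:

```shell
#!/usr/bin/env bash
# Count compiles per unique lambda from hypothetical log lines like:
#   jit_compile_comp lambda=<id>
# A healthy cache shows 1 per id; a broken cache shows counts near call counts.
count_recompiles() {
  grep -o 'lambda=[^ ]*' | sort | uniq -c | sort -rn
}

printf 'jit_compile_comp lambda=f1\njit_compile_comp lambda=f1\njit_compile_comp lambda=g2\n' \
  | count_recompiles
```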

Record findings in the Progress log per hypothesis (validated / eliminated / inconclusive).

Phase 4 — Fix

The fix depends on the diagnosed cause; this section is filled in once Phase 3 lands. Constraints:

  • Do not regress correctness. Every guest scoreboard must stay at baseline before and after the fix. Regression of even 1 test means the fix is wrong.
  • Prefer the minimal change. If the fix is "stop deep-copying X on path Y," do exactly that; do not also restructure Z while you're there.
  • Keep the hot path obvious. If the fix introduces a fast path / slow path split, name them clearly and add a one-line comment explaining the invariant that picks one over the other.
  • Do not roll back env-as-value, R7RS Steps 4–6, or IO suspension wholesale. Those are load-bearing changes; surgical fixes only.

Phase 5 — Verify

  • Re-run the perf table from Phase 1 on the fix. Target: each guest within 1.5× of pre-regression total.
  • Re-run every guest's conformance suite. Each must equal baseline (lib-guest's lib/guest/baseline/<lang>.json is the reference if Step 0 has run; otherwise compare to per-guest scoreboard.json).
  • Restore Tcl's test.sh watchdog from 2400s back to 180s. If it doesn't fit, the fix is incomplete.
  • Push to architecture only after both perf and correctness checks pass. Never push to main.
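The "must equal baseline" gate can be mechanized as a per-guest file comparison; a minimal sketch, assuming scoreboards are byte-stable JSON (a structural compare would be needed if key order varies; the helper name and /tmp paths are hypothetical):

```shell
#!/usr/bin/env bash
# Sketch: a guest's scoreboard must be byte-identical to its baseline.
scoreboards_equal() {    # scoreboards_equal <baseline.json> <current.json>
  diff -q "$1" "$2" >/dev/null
}

# Illustrative files only; the real paths follow the pattern
# lib/guest/baseline/<lang>.json vs the per-guest scoreboard.json.
printf '{"pass": 376, "fail": 0}\n' > /tmp/baseline.json
printf '{"pass": 376, "fail": 0}\n' > /tmp/current.json
scoreboards_equal /tmp/baseline.json /tmp/current.json && echo "scoreboard at baseline"
```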

Phase 6 — Perf-regression alarm

So the next quadratic blow-up doesn't hide behind a watchdog bump:

  • Add a lightweight perf benchmark — lib/perf-smoke.sx. Four micro-benchmarks chosen for distinct substrate failure modes:
    • bench-fib — function-call dispatch (recursive arithmetic, fib(18))
    • bench-let-chain — env construction (deep let bindings × 1000)
    • bench-map-sq — HO-form dispatch + lambda creation (map (fn (x) (* x x)) over 500 elems)
    • bench-tail-loop — TCO + primitive dispatch (5000-iteration tight loop)

    Each emits its own elapsed-ms via (clock-milliseconds). A warm-up pass populates JIT cache before the timed pass.
  • Wire it into scripts/sx-build-all.sh as a post-step after the JS test suite. Failing the perf budget fails the whole build (hard fail, not log-line).
  • Reference numbers + machine documented:

Perf-smoke reference

Reference numbers in scripts/perf-smoke.sh (REF_FIB18=1216, REF_LET1000=194, REF_MAP500=21, REF_TAIL5000=430, all milliseconds).

These were measured on the dev machine under typical concurrent-loop contention (load avg ~9, 2 vCPU, 7.6 GiB RAM, OCaml 5.2.0, architecture HEAD 92f6f187). They are the minimum across 6 back-to-back runs, i.e. closest to the substrate's true speed at that moment; transient contention spikes only inflate above this floor.

The default budget multiplier is 5× (FACTOR=5). Rationale: contention noise on this machine spans ~1–2× of the minimum, so 5× catches a real ≥5× substrate regression without false-alarming on contention. A tighter setting (FACTOR=2 or FACTOR=3) is appropriate for a quiet CI machine; raise it (FACTOR=10) when measuring on a heavily oversubscribed host.
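The FACTOR check itself is a handful of integer comparisons. A self-contained sketch (the check_budget helper is hypothetical — the real logic lives in scripts/perf-smoke.sh; the 1216 ms reference is REF_FIB18 from above):

```shell
#!/usr/bin/env bash
# Sketch of the per-benchmark budget check (hypothetical helper name).
FACTOR="${FACTOR:-5}"

check_budget() {                # check_budget <name> <measured_ms> <ref_ms>
  local budget=$(( $3 * FACTOR ))
  if (( $2 > budget )); then
    echo "FAIL $1: ${2}ms > ${budget}ms budget"
    return 1
  fi
  echo "ok $1: ${2}ms <= ${budget}ms budget"
}

# With REF_FIB18=1216: a 4x slowdown stays inside the 5x budget, a 6x one fails.
check_budget bench-fib 4864 1216
check_budget bench-fib 7296 1216 || true
```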

To update the reference (after an intentional substrate change like a JIT improvement, or when moving machines):

bash scripts/perf-smoke.sh --update   # rewrites REF_* in this script

Commit the diff with a one-line note explaining what changed.

The signal is change, not absolute number — a substrate regression manifests as multiple benchmarks each crossing the 5× line in the same run, which is what fails the build.

Ground rules

  • Branch: architecture. Commit locally. Never push to main. Push to architecture only after Phase 5 passes.
  • Scope: spec/, hosts/ocaml/, hosts/javascript/, lib/tcl/test.sh (deadline restoration only), plans/jit-perf-regression.md. Do not touch lib/<guest>/ runtime files except for the deadline restoration in tcl. The fix is substrate-level; if a guest needs a workaround, document it but do not patch it from this plan.
  • SX files: sx-tree MCP tools only. sx_validate after every edit.
  • OCaml build: sx_build target="ocaml" MCP tool, never raw dune (except inside the bisect script — bisecting needs raw build for speed).
  • Do not touch any active loop's worktree. lib-guest, minikanren, and any other loops in flight are already busy. If a loop's worktree needs a perf rebuild, restart it after the fix lands.
  • Pause loops if needed. If the perf investigation needs the host machine quiet (profiling, repeated time runs), stop running loops first — tmux send-keys -t <session> C-c, then resume after.

Blockers

(none yet)

Progress log

Newest first.

2026-05-08 — Phase 1 reproduce + quantify

Worktree: /root/rose-ash-bugs/jit-perf at bugs/jit-perf = 1eb9d0f8 (architecture@1eb9d0f8). Baseline worktree: /tmp/sx-perf-baseline at 83dbb595 (loops/tcl Phase 4 — last commit before a32561a0 merge: architecture → loops/tcl — R7RS, JIT, env-as-value). Fresh dune build bin/sx_server.exe in each.

Machine state during measurement: load avg ~19–23 on 2 CPUs, ~2 GB free RAM, 3.6 GB swap used. Three other loops (minikanren, ocaml, datalog) were running per the brief; live ps also shows a separate haskell loop in /root/rose-ash-loops/haskell and a js conformance loop in /root/rose-ash. Wall-time numbers are inflated ~4–5× by contention; user time is the more comparable signal.

Current state (architecture HEAD @ 1eb9d0f8)

| Guest | Outcome | Wall | User | Tests |
|---|---|---|---|---|
| tcl lib/tcl/test.sh | ✓ pass | 3m30s | 17.5s | 376/376 (parse 67, eval 169, error 39, namespace 22, coro 20, idiom 59) |
| lua lib/lua/test.sh | ✓ pass | 45.9s | 4.4s | 185/185 |
| erlang lib/erlang/conformance.sh | 0 tests captured | 2m1s | 18.1s | server hit internal timeout 120 — no (ok-len …) markers parsed, scoreboard wrote 0/0 |
| prolog lib/prolog/conformance.sh | OOM-killed (137) | 6m2s | | bash parent killed by kernel OOM partway through suite chain |
| haskell lib/haskell/conformance.sh | terminated | 29m59s | 1m57s | run never completed; output file just Terminated, no scoreboard. (Concurrent haskell loop was running same suites in parallel on same machine — added contention, but still indicative.) |

Worst suite per guest (current):

  • tcl: idiom (59 tests, the longest-running suite); test count alone doesn't pinpoint a specific outlier — wall time is dominated by the cumulative epoch chain
  • lua: only one suite; n/a
  • erlang: every suite — server times out before any suite completes
  • prolog: at least one of the 29 suites blows memory (likely a JIT-heavy one — needs Phase 3 to confirm)
  • haskell: program-fib etc. — each 120 s suite-budget likely exhausted by cumulative load + per-program eval

Sanity check lib/tcl/conformance.sh (different from test.sh — 4 .tcl programs): 11.7s, 3/4 PASS, 1 FAIL event-loop ("expected: done, got: "). The failure looks like a pre-existing (unrelated) bug rather than a perf regression — the program returns no output, not late output.

Baseline state (loops/tcl @ 83dbb595)

| Guest | Outcome | Wall | User | Tests |
|---|---|---|---|---|
| tcl lib/tcl/test.sh | ✓ pass (after bumping internal timeout 180 → 1200 so the contention-stretched run could finish) | 3m31s | 19.1s | 342/342 (parse 67, eval 169, error 39, namespace 22, coro 20, idiom 25) |
| lua lib/lua/test.sh | ✓ pass | 37.2s | 2.7s | 157/157 |
| haskell lib/haskell/test.sh | ✓ pass | 5.2s | 0.4s | 43/43 (parser only — full conformance.sh did not yet exist) |
| prolog (parse+unify subset, run by hand) | ✓ pass | 4.3s | 0.3s | 72 (25+47) |
| erlang | n/a | | | no lib/erlang/conformance.sh at this commit |

Cross-guest perf table

| Guest | Baseline user (per test) | Current user (per test) | Ratio (user) | Status under same workload |
|---|---|---|---|---|
| tcl test.sh | 19.1s / 342 = 55.8 ms | 17.5s / 376 = 46.5 ms | 0.83× (slightly faster) | both pass |
| lua test.sh | 2.7s / 157 = 17.2 ms | 4.4s / 185 = 23.8 ms | 1.38× | both pass |
| prolog parse+unify | 0.32s / 72 = 4.4 ms | 0.26s / 72 = 3.6 ms | 0.82× | both pass |
| haskell parser-only | 0.4s / 43 = 9.3 ms | (subset not runnable in isolation; full conformance hangs) | n/a | n/a |

Conclusion — premise check

The 30× uniform slowdown the plan describes is not visible in the canonical workloads I can measure on both ends of the bisect range. Per-test user time is not 30× worse on architecture HEAD vs 83dbb595:

  • tcl test.sh per-test user time: 55.8 ms → 46.5 ms (slightly faster, well within noise)
  • lua test.sh per-test user time: 17.2 ms → 23.8 ms (1.4×)
  • prolog parse+unify: 0.82× (slightly faster)

What is clearly broken on current is the large multi-suite conformance scripts for erlang/prolog/haskell:

  • erlang's 9 suites hit the 120 s server-side timeout before producing a single (ok-len) marker
  • prolog's 29-suite chain triggers an OOM kill
  • haskell's 18-suite + 156-program chain runs >30 min without completing

These three failures all share a profile: long single-process epoch chains that exercise progressively more JIT compilation and accumulate state. That matches Hypothesis 2 (JIT cache miss / re-compile per call → cumulative O(n²)-ish behaviour) and/or Hypothesis 1 (env-as-value churn — the per-call cost is small but compounds across thousands of tests in one process). It does not match a uniform per-call 30× slowdown.

The Tcl test.sh watchdog bumps in the source history (timeout 90 → 180 → 1200 → 2400) actually correlate with test-content growth plus accumulated cost, not just per-call regression: the 180→1200 bump landed at be820d03 (tcl: Phase 5 channel I/O), just after a32561a0 brought R7RS+JIT+env-as-value into loops/tcl, but the test count was also rising sharply across these phases.

Open question for the user before Phase 2

The framing in the plan's lead — "30× slower across guest languages" with Tcl's test.sh as the canonical offender — does not match what I'm seeing for tcl test.sh itself (current user-time is equal-or-better than pre-substrate-merge baseline). Before kicking off the heavy-compute Phase 2 bisect across architecture, I want to confirm:

  1. Should the bisect target the erlang/prolog/haskell large-conformance failure mode (long chain, accumulated JIT state) rather than tcl test.sh wall-time? That's where the regression is unambiguous.
  2. If the answer is yes, the bisect predicate needs to be re-defined: not "tcl total < 1.5× baseline" but something like "erlang conformance.sh produces any (ok-len) markers within 120 s" or "prolog conformance.sh completes without OOM".
  3. Is it worth pausing the minikanren / ocaml / datalog loops for Phase 2? The bisect needs ~15 build+run cycles, and contention currently inflates the wall-time floor by roughly 4–5×.

Stopping here per the brief. Awaiting go-ahead before starting Phase 2.

Artefacts: timing logs in /tmp/jit-perf-results/{current,baseline}-*.txt. Baseline worktree at /tmp/sx-perf-baseline (still in place). Tcl test.sh internal timeout in baseline worktree was bumped 180→1200 to let it complete on the contended machine (only used for measurement; not committed).

Phase 1 follow-up — quiet-machine re-measurement

After Phase 1 above, paused all other tmux sessions (apl, datalog, js, minikanren, ocaml, sx-haskell, sx-hs-f, sx-loops) via tmux send-keys C-c to remove contention noise, then re-ran all five guests on the same architecture HEAD 1eb9d0f8 build.

| Guest | Wall | User | Result |
|---|---|---|---|
| lib/tcl/test.sh | 57.8s | 16.3s | 376/376 ✓ |
| lib/lua/test.sh | 27.3s | 4.2s | 185/185 ✓ |
| lib/erlang/conformance.sh (with timeout 120 raised to 600 so it could complete) | 3m25s | 36.8s | 530/530 ✓ |
| lib/prolog/conformance.sh | 3m54s | 1m8.6s | 590/590 ✓ |
| lib/haskell/conformance.sh | 6m59s | 2m37s | 156/156 ✓ |

Conclusion: there is no 30× substrate perf regression on architecture HEAD. Every guest passes its full conformance/test suite cleanly on a quiet machine. The earlier symptoms had three independent causes:

  1. Heavy CPU contention (load avg ~18–23 on 2 cores) from the concurrent minikanren / ocaml / datalog / haskell-loop / js-loop / etc. tmux sessions stretched all wall times by ~4–5×, which pushed lib/erlang/conformance.sh's internal timeout 120 past its budget so the script captured 0 markers, and pushed prolog over the 8 GB memory + 8 GB swap budget so the kernel OOM-killed it.
  2. One genuinely too-tight internal deadline: lib/erlang/conformance.sh uses timeout 120 for the entire 9-suite chain. Even on a quiet machine the run needs 3m25s wall (36.8s user). This is not contention — it's an under-budgeted script.
  3. Watchdog over-conservatism: lib/tcl/test.sh has timeout 2400. Quiet-machine wall is 57.8s — 41× under the deadline. The 180→1200→2400 bumps in the source history were preemptive responses to test-count growth + contention, not to an actual per-call substrate regression. The original 180s deadline is comfortable.

Hypotheses status:

  • (1) env-as-value churn: eliminated — per-test user time is essentially flat from baseline to current (tcl is actually slightly faster, at 0.83×).
  • (2) JIT cache miss / re-compile per call: eliminated — same.
  • (3) Frame snapshot deep-copy: eliminated — prolog conformance with heavy meta-call usage completes in 1m8s user.
  • (4) Lazy JIT bypassed: eliminated — same.
  • (5) Type-check overhead: eliminated — same.
  • (6) Frame representation: eliminated — same.

Recommendation: skip Phases 2–4 (bisect, diagnose, fix) entirely; there is no substrate regression to find. The plan's North-star outcome — restore Tcl's test.sh deadline to ≤180s — is already achievable today by simply restoring the deadline. Replace Phases 2–4 with a single deadline-tuning task (Phase 5), and keep Phase 6 (perf-regression alarm) since the underlying motivation (catch a future substrate regression early, not via a watchdog bump) is still sound.

Proposed Phase 5 (deadline tuning), pending user approval:

  • lib/tcl/test.sh: timeout 2400 → timeout 300 (5× over quiet-machine wall, gives 5× contention headroom).
  • lib/erlang/conformance.sh: timeout 120 → timeout 600 (the only genuinely too-tight deadline). Quiet wall 3m25s.
  • Other guests' deadlines: leave as-is (already comfortable).
  • No source-tree changes outside those two scripts.

Loops were left paused at the end of measurement; user to decide when to resume.