Replaces the watchdog-bump approach with an automated check. The next 5× (or
worse) substrate regression will trip the alarm at build time instead of
hiding behind a deadline bump and only being noticed weeks later.
Components:
* lib/perf-smoke.sx — four micro-benchmarks chosen for distinct substrate
failure modes: function-call dispatch (fib), env construction (let-chain),
HO-form dispatch + lambda creation (map-sq), TCO + primitive dispatch
(tail-loop). Warm-up pass populates JIT cache before the timed pass so we
measure the steady state.
* scripts/perf-smoke.sh — pipes lib/perf-smoke.sx to sx_server.exe, parses
per-bench wall-time, asserts each is within FACTOR× of the recorded
reference (default 5×). `--update` rewrites the reference in-place.
* scripts/sx-build-all.sh — perf-smoke wired in as a post-step after JS
tests. Hard fail if any benchmark regressed beyond budget.
Reference numbers: minimum across 6 back-to-back runs on this dev machine
under typical concurrent-loop contention (load ~9, 2 vCPU, 7.6 GiB RAM,
OCaml 5.2.0, architecture @ 92f6f187). Documented in
plans/jit-perf-regression.md including how to update them.
The 5× factor is chosen so contention noise (~1–2× variance) doesn't trigger
false alarms but a real ≥5× substrate regression — the kind that motivated
this whole investigation — fails the build immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JIT performance regression — substrate slowdown after architecture merge
A recent merge into architecture made test runs roughly 30× slower across guest languages — Tcl's lib/tcl/test.sh had to bump its watchdog from 180s → 2400s. The slowdown is observed under JIT-saturated test paths and affects every hosted language, not just Tcl. This is a substrate-level perf regression in the SX evaluator, hosts, or VM, and fixing it benefits every loop simultaneously.
The candidate-cause set is narrow because we know the rough timeframe: the regression appeared after one of the architecture-merge waves that brought R7RS Steps 4–6, IO suspension, JIT changes, and the env-as-value Phase 4 work onto architecture. Bisecting against a known-fast pre-merge commit will pin it.
Branch: architecture (substrate work). Touches spec/, hosts/ocaml/, hosts/javascript/. Do not push to main without explicit instruction.
North star: restore Tcl's test.sh to the pre-regression deadline (≤180s) without losing JIT correctness (current scoreboards must equal baseline). Document the regression mechanism so it doesn't recur silently.
Goals
- Quantify the regression with a per-guest perf table (before/after totals + per-suite worst case).
- Bisect to find the offending commit — narrow to a single substrate change.
- Diagnose the mechanism (JIT cache miss? env scan complexity? frame allocation? continuation snapshot?).
- Fix the root cause, not the symptom (do not just bump deadlines).
- Verify every guest's scoreboard stays at baseline; perf returns to within 1.5× of pre-regression.
- Add a perf-regression alarm so the next quadratic blow-up trips a check, not a watchdog.
Hypotheses (ranked)
Each gets validated or eliminated in Phase 3.
- env-as-value churn — Phase 4 changed how environments propagate. If env representation moved from a shared structure to per-frame copies, every call now allocates O(env-size). Likely candidate given the timing and how broadly it affects all guests.
- JIT cache miss / re-compile per call — if the cache key for `jit_compile_comp` changed (e.g. now keys on env or call-site dict), the cache hit-rate may have collapsed. Symptom: every call recompiles. The 30× factor is consistent with going from "compile once" to "compile every call."
- Frame snapshot deep-copy — IO suspension (`perform`/`cek-resume`) requires snapshotting the CEK state. If the snapshot eagerly deep-copies frames or env on every `perform` — even ones that never resume — that's a real-cost regression for any test that uses guards/handlers heavily.
- Lazy JIT bypassed — `project_jit_compilation.md` notes "Lazy JIT implemented: lambda bodies compiled on first VM call, cached, failures sentinel-marked." If the failure sentinel is now triggered for inputs that previously cached, every call falls back to the tree-walk path. Inspect `project_jit_bytecode_bug.md` ("Compiled compiler helpers loop on complex nested ASTs") — the workaround `_jit_compiling` guard may have widened.
- Type-check overhead — strict-mode `value-matches-type?` calls. If strict mode is now on by default, every primitive call type-checks all args. Unlikely to give 30× but worth ruling out.
- Frame representation: lists vs records — `sx-improvements.md` Step 12 ("Frame records (CEK)") is open. If the recent merge moved partway between representations and now allocates extra tagged-list cells per frame, that's a constant-factor regression but probably not 30×.
Phases
Phase 1 — Reproduce + quantify
- Pick the canonical workload: `lib/tcl/test.sh` is the known offender. Also run `lib/prolog/conformance.sh`, `lib/lua/test.sh`, `lib/haskell/conformance.sh`, `lib/erlang/conformance.sh` for cross-guest data.
- Measure on current `architecture` HEAD: total wall-clock, per-suite worst case. Use `time bash lib/<guest>/...sh` and capture both numbers.
- Find a known-fast pre-regression commit. Candidates: pre-merge of `architecture → loops/tcl` (commit `a32561a0` or earlier — check `git log --merges architecture`). Mark this `BASELINE_GOOD`.
- Check out `BASELINE_GOOD` to a scratch worktree (`git worktree add /tmp/sx-perf-baseline <sha>`); rebuild `sx_server.exe`; re-run the same suites. Capture totals.
- Build a perf table:

| Guest | Pre-regression total | Current total | Ratio | Pre-regression worst suite | Current worst suite |
|---|---|---|---|---|---|
| tcl | … | … | …× | … | … |
| prolog | … | … | …× | … | … |
| lua | … | … | …× | … | … |
| haskell | … | … | …× | … | … |
| erlang | … | … | …× | … | … |

- If the ratio is uniform (~30× everywhere), it's a substrate-wide bug — fixing it once fixes everything. If it varies, a guest-specific path is implicated and the diagnosis branches.
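Both numbers the perf table needs can be captured in one run with bash's `time` builtin; a minimal sketch (the `run_suite` wrapper is illustrative — the plan itself only mandates `time bash lib/<guest>/…` and recording both figures):

```shell
#!/usr/bin/env bash
# TIMEFORMAT is a bash builtin: %R = wall-clock seconds, %U = user CPU seconds.
TIMEFORMAT='wall=%Rs user=%Us'

run_suite() {
  # The suite's own output is discarded; the timing line arrives on stderr,
  # so redirect it to stdout for capture into the perf table.
  { time bash "$1" >/dev/null 2>&1 ; } 2>&1
}

run_suite lib/tcl/test.sh    # e.g. prints: wall=210.312s user=17.500s
```

Wall time shows contention; user time is the more comparable signal on a loaded machine, which is why the table keeps both.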
Phase 2 — Bisect
- `git bisect start architecture <BASELINE_GOOD>`.
- Bisect script: rebuild `sx_server.exe` (`cd hosts/ocaml && dune build`), run `time bash lib/tcl/test.sh` with a tight 600s watchdog; mark the commit good if total < 1.5× baseline, bad otherwise.
- Skip merge commits (`git bisect skip`) when the build fails because of an in-flight intermediate state.
- Record the first-bad commit in this plan's Progress log with its short description.
Phase 3 — Diagnose
For each surviving hypothesis after Phase 2, validate or eliminate:
- JIT cache miss check. Add a counter in `hosts/ocaml/lib/sx_vm.ml` that increments on `jit_compile_comp` invocations. Run the offending suite. If the counter is >>1 per unique lambda, the cache is missing.
- Lazy JIT sentinel check. Add logging when the `_jit_compiling` sentinel triggers / when a compiled function falls back to tree-walk. Quantify how often it happens vs the baseline.
- env-as-value allocation. Use OCaml's `Gc.allocated_bytes` before and after a representative call (e.g. `(map (fn (x) (* x 2)) (list 1 2 3 4 5 6 7 8 9 10))`). Compare allocation per call between baseline and current.
- Frame snapshot cost. Profile a `perform`-heavy workload (e.g. Haskell IO tests). Compare time spent in snapshot/restore code paths.
- Strict mode. Check whether strict mode flipped on by default; check `value-matches-type?` call frequency.
Record findings in the Progress log per hypothesis (validated / eliminated / inconclusive).
Phase 4 — Fix
The fix depends on the diagnosed cause; this section is filled in once Phase 3 lands. Constraints:
- Do not regress correctness. Every guest scoreboard must stay at baseline before and after the fix. Regression of even 1 test means the fix is wrong.
- Prefer the minimal change. If the fix is "stop deep-copying X on path Y," do exactly that; do not also restructure Z while you're there.
- Keep the hot path obvious. If the fix introduces a fast path / slow path split, name them clearly and add a one-line comment explaining the invariant that picks one over the other.
- Do not roll back env-as-value, R7RS Step 4–6, or IO suspension wholesale. Those are load-bearing changes; surgical fixes only.
Phase 5 — Verify
- Re-run the perf table from Phase 1 on the fix. Target: each guest within 1.5× of pre-regression total.
- Re-run every guest's conformance suite. Each must equal baseline (lib-guest's `lib/guest/baseline/<lang>.json` is the reference if Step 0 has run; otherwise compare to the per-guest `scoreboard.json`).
- Restore Tcl's `test.sh` watchdog from 2400s back to 180s. If it doesn't fit, the fix is incomplete.
- Push to `architecture` only after both perf and correctness checks pass. Never push to `main`.
Phase 6 — Perf-regression alarm
So the next quadratic blow-up doesn't hide behind a watchdog bump:
- Add a lightweight perf benchmark — `lib/perf-smoke.sx`. Four micro-benchmarks chosen for distinct substrate failure modes:
  - `bench-fib` — function-call dispatch (recursive arithmetic, fib(18))
  - `bench-let-chain` — env construction (deep let bindings × 1000)
  - `bench-map-sq` — HO-form dispatch + lambda creation (`map (fn (x) (* x x))` over 500 elems)
  - `bench-tail-loop` — TCO + primitive dispatch (5000-iteration tight loop)

  Each emits its own elapsed-ms via `(clock-milliseconds)`. A warm-up pass populates the JIT cache before the timed pass.
- Wire it into `scripts/sx-build-all.sh` as a post-step after the JS test suite. Failing the perf budget fails the whole build (hard fail, not a log line).
- Reference numbers + machine documented:
Perf-smoke reference
Reference numbers live in `scripts/perf-smoke.sh` (`REF_FIB18=1216`, `REF_LET1000=194`, `REF_MAP500=21`, `REF_TAIL5000=430`, all milliseconds).
These were measured on the dev machine under typical concurrent-loop contention (load avg ~9, 2 vCPU, 7.6 GiB RAM, OCaml 5.2.0, architecture HEAD 92f6f187). They are the minimum across 6 back-to-back runs, i.e. closest to the substrate's true speed at that moment; transient contention spikes only inflate above this floor.
The default budget multiplier is 5× (FACTOR=5). Rationale: contention noise on this machine spans ~1–2× of min, so 5× catches a real ≥5× substrate regression without false-alarming on contention. Tighter (FACTOR=2 or FACTOR=3) is appropriate for a quiet CI machine; raise it (FACTOR=10) for measuring on a heavily oversubscribed host.
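The per-bench assertion reduces to an integer comparison against `ref × FACTOR`. A sketch of that check — the `check_bench` name and message format are illustrative, not the script's actual code:

```shell
#!/usr/bin/env bash
# check_bench NAME ELAPSED_MS REF_MS — pass iff elapsed <= ref × FACTOR.
FACTOR="${FACTOR:-5}"

check_bench() {
  local name=$1 elapsed=$2 ref=$3
  local budget=$(( ref * FACTOR ))
  if [ "$elapsed" -le "$budget" ]; then
    echo "ok   $name ${elapsed}ms (budget ${budget}ms)"
  else
    echo "FAIL $name ${elapsed}ms exceeds budget ${budget}ms (ref ${ref}ms x ${FACTOR})"
    return 1
  fi
}

check_bench fib18 1530 1216                                  # ok: 1530 <= 6080
check_bench let1000 2100 194 || echo "would fail the build"  # FAIL: 2100 > 970
```

Integer milliseconds keep the comparison portable shell arithmetic — no `bc`/float parsing in the build path.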
To update the reference (after an intentional substrate change like a JIT improvement, or when moving machines):
bash scripts/perf-smoke.sh --update # rewrites REF_* in this script
Commit the diff with a one-line note explaining what changed.
The signal is change, not absolute number — a substrate regression manifests as multiple benchmarks each crossing the 5× line in the same run, which is what fails the build.
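The `--update` mode described above amounts to the script rewriting its own `REF_*` assignment lines in place. A conceptual sketch with `sed` — the `update_ref` helper is hypothetical; the real script's mechanics may differ:

```shell
#!/usr/bin/env bash
# update_ref FILE NAME NEW_MS — rewrite one REF_<NAME>= assignment in place.
update_ref() {
  local file=$1 name=$2 new_ms=$3
  sed -i "s/^REF_${name}=[0-9]*/REF_${name}=${new_ms}/" "$file"
}

# Demo on a scratch copy rather than the real scripts/perf-smoke.sh:
printf 'REF_FIB18=1216\nREF_LET1000=194\n' > /tmp/perf-refs.sh
update_ref /tmp/perf-refs.sh FIB18 1100
grep '^REF_FIB18=' /tmp/perf-refs.sh    # now shows the new value
```

Anchoring on `^REF_<NAME>=` means only the targeted assignment changes, so the resulting diff is exactly the one-line-per-benchmark change the commit note should explain.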
Ground rules
- Branch: `architecture`. Commit locally. Never push to `main`. Push to `architecture` only after Phase 5 passes.
- Scope: `spec/`, `hosts/ocaml/`, `hosts/javascript/`, `lib/tcl/test.sh` (deadline restoration only), `plans/jit-perf-regression.md`. Do not touch `lib/<guest>/` runtime files except for the deadline restoration in tcl. The fix is substrate-level; if a guest needs a workaround, document it but do not patch it from this plan.
- SX files: `sx-tree` MCP tools only. `sx_validate` after every edit.
- OCaml build: `sx_build target="ocaml"` MCP tool, never raw `dune` (except inside the bisect script — bisecting needs raw builds for speed).
- Do not touch any active loop's worktree. lib-guest, minikanren, and any other loops in flight are already busy. If a loop's worktree needs a perf rebuild, restart it after the fix lands.
- Pause loops if needed. If the perf investigation needs the host machine quiet (profiling, repeated `time` runs), stop running loops first — `tmux send-keys -t <session> C-c` — then resume after.
Blockers
(none yet)
Progress log
Newest first.
2026-05-08 — Phase 1 reproduce + quantify
Worktree: /root/rose-ash-bugs/jit-perf at bugs/jit-perf = 1eb9d0f8 (architecture@1eb9d0f8).
Baseline worktree: /tmp/sx-perf-baseline at 83dbb595 (loops/tcl Phase 4 — last commit before a32561a0 merge: architecture → loops/tcl — R7RS, JIT, env-as-value). Fresh dune build bin/sx_server.exe in each.
Machine state during measurement: load avg 19–23 on 2 CPUs, ~2 GB free RAM, 3.6 GB swap used. Three other loops (minikanren, ocaml, datalog) were running per the brief; live ps also shows a separate haskell loop in /root/rose-ash-loops/haskell and a js conformance loop in /root/rose-ash. Wall-time numbers are inflated 4–5× by contention; user-time is the more comparable signal.
Current state (architecture HEAD @ 1eb9d0f8)
| Guest | Outcome | Wall | User | Tests |
|---|---|---|---|---|
| tcl `lib/tcl/test.sh` | ✓ pass | 3m30s | 17.5s | 376/376 (parse 67, eval 169, error 39, namespace 22, coro 20, idiom 59) |
| lua `lib/lua/test.sh` | ✓ pass | 45.9s | 4.4s | 185/185 |
| erlang `lib/erlang/conformance.sh` | ✗ 0 tests captured | 2m1s | 18.1s | server hit internal timeout 120 — no (ok-len …) markers parsed; scoreboard wrote 0/0 |
| prolog `lib/prolog/conformance.sh` | ✗ OOM-killed (137) | 6m2s | — | bash parent killed by kernel OOM partway through the suite chain |
| haskell `lib/haskell/conformance.sh` | ✗ terminated | 29m59s | 1m57s | run never completed; output file just "Terminated", no scoreboard. (A concurrent haskell loop was running the same suites in parallel on the same machine — added contention, but still indicative.) |
Worst suite per guest (current):
- tcl: idiom (59 tests, the longest-running suite); test count alone doesn't pinpoint a specific outlier — wall time is dominated by the cumulative epoch chain
- lua: only one suite; n/a
- erlang: every suite — server times out before any suite completes
- prolog: at least one of the 29 suites blows memory (likely a JIT-heavy one — needs Phase 3 to confirm)
- haskell: `program-fib` etc. — each 120 s suite budget likely exhausted by cumulative load + per-program eval
Sanity check: `lib/tcl/conformance.sh` (different from `test.sh` — 4 .tcl programs) — 11.7s, 3/4 PASS, 1 FAIL (`event-loop`: "expected: done, got: "). The failure looks like a pre-existing (unrelated) bug rather than a perf regression — the program returns no output, not late output.
Baseline state (loops/tcl @ 83dbb595)
| Guest | Outcome | Wall | User | Tests |
|---|---|---|---|---|
| tcl `lib/tcl/test.sh` | ✓ pass (after bumping the internal timeout 180→1200 so the contention-stretched run could finish) | 3m31s | 19.1s | 342/342 (parse 67, eval 169, error 39, namespace 22, coro 20, idiom 25) |
| lua `lib/lua/test.sh` | ✓ pass | 37.2s | 2.7s | 157/157 |
| haskell `lib/haskell/test.sh` | ✓ pass | 5.2s | 0.4s | 43/43 (parser only — the full conformance.sh did not yet exist) |
| prolog (parse+unify subset, run by hand) | ✓ pass | 4.3s | 0.3s | 72 (25+47) |
| erlang | n/a | — | — | no lib/erlang/conformance.sh at this commit |
Cross-guest perf table
| Guest | Baseline user (per test) | Current user (per test) | Ratio (user) | Status under same workload |
|---|---|---|---|---|
| tcl `test.sh` | 19.1s / 342 = 55.8 ms | 17.5s / 376 = 46.5 ms | 0.83× (slightly faster) | both pass |
| lua `test.sh` | 2.7s / 157 = 17.2 ms | 4.4s / 185 = 23.8 ms | 1.38× | both pass |
| prolog parse+unify | 0.32s / 72 = 4.4 ms | 0.26s / 72 = 3.6 ms | 0.82× | both pass |
| haskell parser-only | 0.4s / 43 = 9.3 ms | (subset not runnable in isolation; full conformance hangs) | n/a | n/a |
Conclusion — premise check
The 30× uniform slowdown the plan describes is not visible in the canonical workloads I can measure on both ends of the bisect range. Per-test user time is not 30× worse on architecture HEAD vs 83dbb595:
- tcl `test.sh` per-test user time: 55.8 ms → 46.5 ms (slightly faster, well within noise)
- lua `test.sh` per-test user time: 17.2 ms → 23.8 ms (1.4×)
- prolog parse+unify: 0.82× (slightly faster)
What is clearly broken on current is the large multi-suite conformance scripts for erlang/prolog/haskell:
- erlang's 9 suites hit the 120 s server-side `timeout` before producing a single `(ok-len)` marker
- prolog's 29-suite chain triggers an OOM kill
- haskell's 18-suite + 156-program chain runs >30 min without completing
These three failures all share a profile: long single-process epoch chains that exercise progressively more JIT compilation and accumulate state. That matches Hypothesis 2 (JIT cache miss / re-compile per call → cumulative O(n²)-ish behaviour) and/or Hypothesis 1 (env-as-value churn — the per-call cost is small but compounds across thousands of tests in one process). It does not match a uniform per-call 30× slowdown.
The Tcl `test.sh` watchdog bumps in the source history (timeout 90 → 180 → 1200 → 2400) actually correlate with content growth + accumulated cost, not just per-call regression: the 180→1200 bump landed at `be820d03` ("tcl: Phase 5 channel I/O"), just after `a32561a0` brought R7RS+JIT+env-as-value into `loops/tcl` — but the test count was also rising sharply across these phases.
Open question for the user before Phase 2
The framing in the plan's lead — "30× slower across guest languages" with Tcl's test.sh as the canonical offender — does not match what I'm seeing for tcl test.sh itself (current user-time is equal-or-better than pre-substrate-merge baseline). Before kicking off the heavy-compute Phase 2 bisect across architecture, I want to confirm:
- Should the bisect target the erlang/prolog/haskell large-conformance failure mode (long chain, accumulated JIT state) rather than `tcl test.sh` wall-time? That's where the regression is unambiguous.
- If yes, the bisect predicate needs redefining: not "tcl total < 1.5× baseline" but something like "erlang conformance.sh produces any (ok-len) markers within 120 s" or "prolog conformance.sh completes without OOM".
- Is it worth pausing the minikanren / ocaml / datalog loops for Phase 2? The bisect needs ~15 build+run cycles, and contention currently inflates wall time by roughly 4–5× over the floor.
Stopping here per the brief. Awaiting go-ahead before starting Phase 2.
Artefacts: timing logs in /tmp/jit-perf-results/{current,baseline}-*.txt. Baseline worktree at /tmp/sx-perf-baseline (still in place). Tcl test.sh internal timeout in baseline worktree was bumped 180→1200 to let it complete on the contended machine (only used for measurement; not committed).
Phase 1 follow-up — quiet-machine re-measurement
After Phase 1 above, paused all other tmux sessions (apl, datalog, js, minikanren, ocaml, sx-haskell, sx-hs-f, sx-loops) via tmux send-keys C-c to remove contention noise, then re-ran all five guests on the same architecture HEAD 1eb9d0f8 build.
| Guest | Wall | User | Result |
|---|---|---|---|
| `lib/tcl/test.sh` | 57.8s | 16.3s | 376/376 ✓ |
| `lib/lua/test.sh` | 27.3s | 4.2s | 185/185 ✓ |
| `lib/erlang/conformance.sh` (with timeout 120 raised to 600 so it could complete) | 3m25s | 36.8s | 530/530 ✓ |
| `lib/prolog/conformance.sh` | 3m54s | 1m8.6s | 590/590 ✓ |
| `lib/haskell/conformance.sh` | 6m59s | 2m37s | 156/156 ✓ |
Conclusion: there is no 30× substrate perf regression on architecture HEAD. Every guest passes its full conformance/test suite cleanly on a quiet machine. The earlier symptoms had three independent causes:
- Heavy CPU contention (load avg 18–23 on 2 cores) from the concurrent minikanren / ocaml / datalog / haskell-loop / js-loop / etc. tmux sessions stretched all wall times by ~4–5×, which pushed `lib/erlang/conformance.sh`'s internal `timeout 120` past its budget (so the script captured 0 markers) and pushed prolog over the 8 GB memory + 8 GB swap budget (so the kernel OOM-killed it).
- One genuinely too-tight internal deadline: `lib/erlang/conformance.sh` uses `timeout 120` for the entire 9-suite chain. Even on a quiet machine the run needs 3m25s wall (36.8s user). This is not contention — it's an under-budgeted script.
- Watchdog over-conservatism: `lib/tcl/test.sh` has `timeout 2400`. Quiet-machine wall is 57.8s — 41× under the deadline. The 180→1200→2400 bumps in the source history were preemptive responses to test-count growth + contention, not to an actual per-call substrate regression. The original 180s deadline is comfortable.
Hypotheses status:
- (1) env-as-value churn: eliminated — per-test user time is essentially flat (or 0.83× actually faster) baseline → current.
- (2) JIT cache miss / re-compile per call: eliminated — same.
- (3) Frame snapshot deep-copy: eliminated — prolog conformance with heavy meta-call usage completes in 1m8s user.
- (4) Lazy JIT bypassed: eliminated — same.
- (5) Type-check overhead: eliminated — same.
- (6) Frame representation: eliminated — same.
Recommendation: skip Phases 2–4 (bisect, diagnose, fix) entirely; there is no substrate regression to find. The plan's North-star outcome — restore Tcl's test.sh deadline to ≤180s — is already achievable today by simply restoring the deadline. Replace Phases 2–4 with a single deadline-tuning task (Phase 5), and keep Phase 6 (perf-regression alarm) since the underlying motivation (catch a future substrate regression early, not via a watchdog bump) is still sound.
Proposed Phase 5 (deadline tuning), pending user approval:
- `lib/tcl/test.sh`: `timeout 2400` → `timeout 300` (5× over quiet-machine wall, giving 5× contention headroom).
- `lib/erlang/conformance.sh`: `timeout 120` → `timeout 600` (the only genuinely too-tight deadline; quiet wall 3m25s).
- Other guests' deadlines: leave as-is (already comfortable).
- No source-tree changes outside those two scripts.
Loops were left paused at the end of measurement; user to decide when to resume.