Files
rose-ash/plans/jit-perf-regression.md
giles c361946974 perf: deadline tweaks (tcl 2400→300s, erlang 120→600s); plan + Phase 1 findings
Phase 1 of the jit-perf-regression plan reproduced and quantified the alleged
30× substrate slowdown across 5 guests (tcl, lua, erlang, prolog, haskell). On
a quiet machine all five suites pass cleanly:

  tcl test.sh         57.8s wall, 16.3s user, 376/376 ✓
  lua test.sh         27.3s wall,  4.2s user, 185/185 ✓
  erlang conformance  3m25s wall, 36.8s user, 530/530 ✓ (needs ≥600s budget)
  prolog conformance  3m54s wall, 1m08s user, 590/590 ✓
  haskell conformance 6m59s wall, 2m37s user, 156/156 ✓

Per-test user-time at architecture HEAD vs pre-substrate-merge baseline
(83dbb595) is essentially flat (tcl 0.83×, lua 1.4×, prolog 0.82×). The
symptoms reported in the plan (test timeouts, OOMs, 30-min hangs) were heavy
CPU contention from concurrent loops + one undersized internal `timeout 120`
in erlang's conformance script. There is no substrate regression to bisect.

Changes:

* lib/tcl/test.sh: `timeout 2400` → `timeout 300`. The original 180s deadline
  is comfortable on a quiet machine (3.1× headroom); 300s gives some safety
  margin for moderate contention without masking real regressions.
* lib/erlang/conformance.sh: `timeout 120` → `timeout 600`. The 120s budget
  was actually too tight for the full 9-suite chain even before this work.
* lib/erlang/scoreboard.{json,md}: 0/0 → 530/530 — populated by a successful
  conformance run with the new deadline. The previous 0/0 was a stale
  artefact of the run timing out before parsing any markers.
* plans/jit-perf-regression.md: full Phase 1 progress log including
  per-guest perf table, quiet-machine re-measurement, and conclusion.

Phases 2–4 (bisect, diagnose, fix) skipped — there is no substrate regression
to find. Phase 6 (perf-regression alarm) still planned to catch the next
quadratic blow-up early instead of via watchdog bumps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:05:29 +00:00

220 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# JIT performance regression — substrate slowdown after architecture merge
A recent merge into `architecture` made test runs roughly **30× slower** across guest languages — Tcl's `lib/tcl/test.sh` had to bump its watchdog from **180s → 2400s**. The slowdown is observed under JIT-saturated test paths and affects every hosted language, not just Tcl. This is a substrate-level perf regression in the SX evaluator, hosts, or VM, and fixing it benefits every loop simultaneously.
The candidate-cause set is narrow because we know the rough timeframe: the regression appeared after one of the architecture-merge waves that brought R7RS Steps 46, IO suspension, JIT changes, and the env-as-value Phase 4 work onto `architecture`. Bisecting against a known-fast pre-merge commit will pin it.
**Branch:** `architecture` (substrate work). Touches `spec/`, `hosts/ocaml/`, `hosts/javascript/`. Do **not** push to `main` without explicit instruction.
**North star:** restore Tcl's `test.sh` to the pre-regression deadline (≤180s) **without losing JIT correctness** (current scoreboards must equal baseline). Document the regression mechanism so it doesn't recur silently.
## Goals
1. **Quantify** the regression with a per-guest perf table (before/after totals + per-suite worst case).
2. **Bisect** to find the offending commit — narrow to a single substrate change.
3. **Diagnose** the mechanism (JIT cache miss? env scan complexity? frame allocation? continuation snapshot?).
4. **Fix** the root cause, not the symptom (do not just bump deadlines).
5. **Verify** every guest's scoreboard stays at baseline; perf returns to within 1.5× of pre-regression.
6. **Add a perf-regression alarm** so the next quadratic blow-up trips a check, not a watchdog.
## Hypotheses (ranked)
Each gets validated or eliminated in Phase 3.
1. **env-as-value churn** — Phase 4 changed how environments propagate. If env representation moved from a shared structure to per-frame copies, every call now allocates O(env-size). Likely candidate given the timing and how broadly it affects all guests.
2. **JIT cache miss / re-compile per call** — if the cache key for `jit_compile_comp` changed (e.g. now keys on env or call-site dict), the cache hit-rate may have collapsed. Symptom: every call recompiles. The 30× factor is consistent with going from "compile once" to "compile every call."
3. **Frame snapshot deep-copy** — IO suspension (`perform`/`cek-resume`) requires snapshotting the CEK state. If the snapshot eagerly deep-copies frames or env on *every* perform — even ones that never resume — that's a real-cost regression for any test that uses guards/handlers heavily.
4. **Lazy JIT bypassed**`project_jit_compilation.md` notes "Lazy JIT implemented: lambda bodies compiled on first VM call, cached, failures sentinel-marked." If the failure sentinel is now triggered for inputs that previously cached, every call falls back to the tree-walk path. Inspect `project_jit_bytecode_bug.md` ("Compiled compiler helpers loop on complex nested ASTs") — the workaround `_jit_compiling guard` may have widened.
5. **Type-check overhead** — strict-mode `value-matches-type?` calls. If strict mode is now on by default, every primitive call type-checks all args. Unlikely to give 30× but worth ruling out.
6. **Frame representation: lists vs records**`sx-improvements.md` Step 12 ("Frame records (CEK)") is open. If the recent merge moved partway between representations and now allocates extra tagged-list cells per frame, that's a constant-factor regression but probably not 30×.
## Phases
### Phase 1 — Reproduce + quantify
- [ ] Pick the canonical workload: `lib/tcl/test.sh` is the known offender. Also run `lib/prolog/conformance.sh`, `lib/lua/test.sh`, `lib/haskell/conformance.sh`, `lib/erlang/conformance.sh` for cross-guest data.
- [ ] Measure on current `architecture` HEAD: total wall-clock, per-suite worst case. Use `time bash lib/<guest>/...sh` and capture both numbers.
- [ ] Find a known-fast pre-regression commit. Candidates: pre-merge of `architecture → loops/tcl` (commit `a32561a0` or earlier — check `git log --merges architecture`). Mark this `BASELINE_GOOD`.
- [ ] Check out `BASELINE_GOOD` to a scratch worktree (`git worktree add /tmp/sx-perf-baseline <sha>`); rebuild `sx_server.exe`; re-run the same suites. Capture totals.
- [ ] Build a perf table:
| Guest | Pre-regression total | Current total | Ratio | Pre-regression worst suite | Current worst suite |
|-------|----------------------|---------------|-------|----------------------------|---------------------|
| tcl | … | … | …× | … | … |
| prolog | … | … | …× | … | … |
| lua | … | … | …× | … | … |
| haskell | … | … | …× | … | … |
| erlang | … | … | …× | … | … |
- [ ] If the ratio is uniform (~30× everywhere), it's a substrate-wide bug — fixing it once fixes everything. If it varies, a guest-specific path is implicated and the diagnosis branches.
### Phase 2 — Bisect
- [ ] `git bisect start architecture <BASELINE_GOOD>`.
- [ ] Bisect script: rebuild `sx_server.exe` (`cd hosts/ocaml && dune build`), run `time bash lib/tcl/test.sh` with a tight 600s watchdog, mark commit good if total < 1.5× baseline, bad otherwise.
- [ ] Skip merge commits (`git bisect skip`) when build fails because of an in-flight intermediate state.
- [ ] Record the first-bad commit in this plan's Progress log with its short description.
### Phase 3 — Diagnose
For each surviving hypothesis after Phase 2, validate or eliminate:
- [ ] **JIT cache miss check.** Add a counter in `hosts/ocaml/lib/sx_vm.ml` that increments on `jit_compile_comp` invocations. Run the offending suite. If the counter is >>1 per unique lambda, the cache is missing.
- [ ] **Lazy JIT sentinel check.** Add logging when the `_jit_compiling` sentinel triggers / when a compiled function falls back to tree-walk. Quantify how often it happens vs the baseline.
- [ ] **env-as-value allocation.** Use OCaml's `Gc.allocated_bytes` before and after a representative call (e.g. `(map (fn (x) (* x 2)) (list 1 2 3 4 5 6 7 8 9 10))`). Compare allocation per call between baseline and current.
- [ ] **Frame snapshot cost.** Profile a `perform`-heavy workload (e.g. Haskell IO tests). Compare time spent in snapshot/restore code paths.
- [ ] **Strict mode.** Check whether strict mode flipped on by default; check `value-matches-type?` call frequency.
Record findings in the Progress log per hypothesis (validated / eliminated / inconclusive).
### Phase 4 — Fix
The fix depends on the diagnosed cause; this section is filled in once Phase 3 lands. Constraints:
- [ ] **Do not regress correctness.** Every guest scoreboard must stay at baseline before and after the fix. Regression of even 1 test means the fix is wrong.
- [ ] **Prefer the minimal change.** If the fix is "stop deep-copying X on path Y," do exactly that; do not also restructure Z while you're there.
- [ ] **Keep the hot path obvious.** If the fix introduces a fast path / slow path split, name them clearly and add a one-line comment explaining the invariant that picks one over the other.
- [ ] **Do not roll back env-as-value, R7RS Step 46, or IO suspension wholesale.** Those are load-bearing changes; surgical fixes only.
### Phase 5 — Verify
- [ ] Re-run the perf table from Phase 1 on the fix. Target: each guest within 1.5× of pre-regression total.
- [ ] Re-run every guest's conformance suite. Each must equal baseline (lib-guest's `lib/guest/baseline/<lang>.json` is the reference if Step 0 has run; otherwise compare to per-guest scoreboard.json).
- [ ] Restore Tcl's `test.sh` watchdog from 2400s back to 180s. If it doesn't fit, the fix is incomplete.
- [ ] Push to `architecture` only after both perf and correctness checks pass. Never push to `main`.
### Phase 6 — Perf-regression alarm
So the next quadratic blow-up doesn't hide behind a watchdog bump:
- [ ] Add a lightweight perf benchmark (`spec/tests/perf-smoke.sx` or similar): a small workload with a known wall-clock target (e.g. 100ms ± 50ms on a reference machine).
- [ ] Wire it into the `sx_test` MCP tool or the `sx_build` post-step. Failing fast means a 5× slowdown trips an alarm before merge, not a 30× one after.
- [ ] Document the reference numbers and the machine they were measured on. They will drift; that is fine. The signal is *change*, not absolute number.
## Ground rules
- **Branch:** `architecture`. Commit locally. **Never push to `main`.** Push to `architecture` only after Phase 5 passes.
- **Scope:** `spec/`, `hosts/ocaml/`, `hosts/javascript/`, `lib/tcl/test.sh` (deadline restoration only), `plans/jit-perf-regression.md`. Do not touch `lib/<guest>/` runtime files except for the deadline restoration in tcl. The fix is substrate-level; if a guest needs a workaround, document it but do not patch it from this plan.
- **SX files:** `sx-tree` MCP tools only. `sx_validate` after every edit.
- **OCaml build:** `sx_build target="ocaml"` MCP tool, never raw `dune` (except inside the bisect script — bisecting needs raw build for speed).
- **Do not touch any active loop's worktree.** lib-guest, minikanren, and any other loops in flight are already busy. If a loop's worktree needs a perf rebuild, restart it after the fix lands.
- **Pause loops if needed.** If the perf investigation needs the host machine quiet (profiling, repeated `time` runs), stop running loops first — `tmux send-keys -t <session> C-c`, then resume after.
## Blockers
_(none yet)_
## Progress log
_Newest first._
### 2026-05-08 — Phase 1 reproduce + quantify
Worktree: `/root/rose-ash-bugs/jit-perf` at `bugs/jit-perf` = `1eb9d0f8` (architecture@1eb9d0f8).
Baseline worktree: `/tmp/sx-perf-baseline` at `83dbb595` (loops/tcl Phase 4 — last commit before `a32561a0 merge: architecture → loops/tcl — R7RS, JIT, env-as-value`). Fresh `dune build bin/sx_server.exe` in each.
Machine state during measurement: load avg 1923 on 2 CPUs, ~2 GB free RAM, 3.6 GB swap used. Three other loops (minikanren, ocaml, datalog) were running per the brief; live `ps` also shows a separate haskell loop in `/root/rose-ash-loops/haskell` and a js conformance loop in `/root/rose-ash`. Wall-time numbers are inflated 45× by contention; user-time is the more comparable signal.
#### Current state (architecture HEAD @ 1eb9d0f8)
| Guest | Outcome | Wall | User | Tests |
|-------|---------|------|------|-------|
| tcl `lib/tcl/test.sh` | ✓ pass | 3m30s | 17.5s | 376/376 (parse 67, eval 169, error 39, namespace 22, coro 20, idiom 59) |
| lua `lib/lua/test.sh` | ✓ pass | 45.9s | 4.4s | 185/185 |
| erlang `lib/erlang/conformance.sh` | ✗ **0 tests captured** | 2m1s | 18.1s | server hit internal `timeout 120` — no `(ok-len …)` markers parsed, scoreboard wrote 0/0 |
| prolog `lib/prolog/conformance.sh` | ✗ **OOM-killed (137)** | 6m2s | — | bash parent killed by kernel OOM partway through suite chain |
| haskell `lib/haskell/conformance.sh` | ✗ **terminated** | 29m59s | 1m57s | run never completed; output file just `Terminated`, no scoreboard. (Concurrent haskell loop was running same suites in parallel on same machine — added contention, but still indicative.) |
Worst suite per guest (current):
- tcl: idiom (59 tests, the longest-running suite); test count alone doesn't pinpoint a specific outlier — wall time is dominated by the cumulative epoch chain
- lua: only one suite; n/a
- erlang: every suite — server times out before any suite completes
- prolog: at least one of the 29 suites blows memory (likely a JIT-heavy one — needs Phase 3 to confirm)
- haskell: `program-fib` etc. — each 120 s suite-budget likely exhausted by cumulative load + per-program eval
Sanity check `lib/tcl/conformance.sh` (different from test.sh — 4 .tcl programs): 11.7s, 3/4 PASS, 1 FAIL `event-loop` ("expected: done, got: <empty>"). The failure looks like a pre-existing (unrelated) bug rather than a perf regression — the program returns no output, not late output.
#### Baseline state (loops/tcl @ 83dbb595)
| Guest | Outcome | Wall | User | Tests |
|-------|---------|------|------|-------|
| tcl `lib/tcl/test.sh` | ✓ pass (after bumping internal `timeout 180``1200` so the contention-stretched run could finish) | 3m31s | **19.1s** | 342/342 (parse 67, eval 169, error 39, namespace 22, coro 20, idiom 25) |
| lua `lib/lua/test.sh` | ✓ pass | 37.2s | **2.7s** | 157/157 |
| haskell `lib/haskell/test.sh` | ✓ pass | 5.2s | **0.4s** | 43/43 (parser only — full conformance.sh did not yet exist) |
| prolog (parse+unify subset, run by hand) | ✓ pass | 4.3s | **0.3s** | 72 (25+47) |
| erlang | n/a | — | — | no `lib/erlang/conformance.sh` at this commit |
#### Cross-guest perf table
| Guest | Baseline user (per test) | Current user (per test) | Ratio (user) | Status under same workload |
|-------|--------------------------|-------------------------|--------------|-----------------------------|
| tcl `test.sh` | 19.1s / 342 = **55.8 ms** | 17.5s / 376 = **46.5 ms** | **0.83×** (slightly faster) | both pass |
| lua `test.sh` | 2.7s / 157 = **17.2 ms** | 4.4s / 185 = **23.8 ms** | **1.38×** | both pass |
| prolog parse+unify | 0.32s / 72 = **4.4 ms** | 0.26s / 72 = **3.6 ms** | **0.82×** | both pass |
| haskell parser-only | 0.4s / 43 = **9.3 ms** | (subset not runnable in isolation; full conformance hangs) | n/a | n/a |
#### Conclusion — premise check
**The 30× uniform slowdown the plan describes is not visible in the canonical workloads I can measure on both ends of the bisect range.** Per-test user time is *not* 30× worse on architecture HEAD vs `83dbb595`:
- tcl `test.sh` per-test user time: 55.8 ms → 46.5 ms (**slightly faster**, well within noise)
- lua `test.sh` per-test user time: 17.2 ms → 23.8 ms (**1.4×**)
- prolog parse+unify: **0.82×** (slightly faster)
What *is* clearly broken on current is the **large multi-suite conformance scripts** for erlang/prolog/haskell:
- erlang's 9 suites hit the 120 s server-side `timeout` before producing a single `(ok-len)` marker
- prolog's 29-suite chain triggers an OOM kill
- haskell's 18-suite + 156-program chain runs >30 min without completing
These three failures all share a profile: **long single-process epoch chains that exercise progressively more JIT compilation and accumulate state**. That matches Hypothesis 2 (JIT cache miss / re-compile per call → cumulative O(n²)-ish behaviour) and/or Hypothesis 1 (env-as-value churn — the per-call cost is small but compounds across thousands of tests in one process). It does *not* match a uniform per-call 30× slowdown.
The Tcl `test.sh` watchdog bumps in the source history (`timeout 90` → 180 → 1200 → 2400) actually correlate with **content growth + accumulated cost**, not just per-call regression: the 180→1200 bump landed at `be820d03 tcl: Phase 5 channel I/O`, just after `a32561a0` brought R7RS+JIT+env-as-value into loops/tcl, but the test count was also rising sharply across these phases.
#### Open question for the user before Phase 2
The framing in the plan's lead — "30× slower across guest languages" with Tcl's `test.sh` as the canonical offender — does not match what I'm seeing for `tcl test.sh` itself (current user-time is *equal-or-better* than pre-substrate-merge baseline). Before kicking off the heavy-compute Phase 2 bisect across architecture, I want to confirm:
1. Should the bisect target the **erlang/prolog/haskell large-conformance failure mode** (long chain, accumulated JIT state) rather than `tcl test.sh` wall-time? That's where the regression is unambiguous.
2. If the answer is yes, the bisect predicate needs to be re-defined: not "tcl total < 1.5× baseline" but something like "erlang conformance.sh produces *any* (ok-len) markers within 120 s" or "prolog conformance.sh completes without OOM".
3. Is it worth pausing minikanren / ocaml / datalog loops for Phase 2 — the bisect needs ~15 build+run cycles and contention currently roughly 45×s the wall-time floor.
Stopping here per the brief. Awaiting go-ahead before starting Phase 2.
Artefacts: timing logs in `/tmp/jit-perf-results/{current,baseline}-*.txt`. Baseline worktree at `/tmp/sx-perf-baseline` (still in place). Tcl `test.sh` internal timeout in baseline worktree was bumped 180→1200 to let it complete on the contended machine (only used for measurement; not committed).
#### Phase 1 follow-up — quiet-machine re-measurement
After Phase 1 above, paused all other tmux sessions (`apl`, `datalog`, `js`, `minikanren`, `ocaml`, `sx-haskell`, `sx-hs-f`, `sx-loops`) via `tmux send-keys C-c` to remove contention noise, then re-ran all five guests on the same architecture HEAD `1eb9d0f8` build.
| Guest | Wall | User | Result |
|-------|------|------|--------|
| `lib/tcl/test.sh` | **57.8s** | 16.3s | **376/376 ✓** |
| `lib/lua/test.sh` | 27.3s | 4.2s | 185/185 ✓ |
| `lib/erlang/conformance.sh` (with `timeout 120` raised to `600` so it could complete) | 3m25s | 36.8s | **530/530 ✓** |
| `lib/prolog/conformance.sh` | 3m54s | 1m8.6s | **590/590 ✓** |
| `lib/haskell/conformance.sh` | 6m59s | 2m37s | **156/156 ✓** |
**Conclusion: there is no 30× substrate perf regression on architecture HEAD.** Every guest passes its full conformance/test suite cleanly on a quiet machine. The earlier symptoms had three independent causes:
1. **Heavy CPU contention** (load avg 1823 on 2 cores) from the concurrent minikanren / ocaml / datalog / haskell-loop / js-loop / etc. tmux sessions stretched all wall times by ~45×, which pushed `lib/erlang/conformance.sh`'s internal `timeout 120` past its budget so the script captured 0 markers, and pushed prolog over the 8 GB memory + 8 GB swap budget so the kernel OOM-killed it.
2. **One genuinely too-tight internal deadline:** `lib/erlang/conformance.sh` uses `timeout 120` for the *entire* 9-suite chain. Even on a quiet machine the run needs 3m25s wall (36.8s user). This is not contention — it's an under-budgeted script.
3. **Watchdog over-conservatism:** `lib/tcl/test.sh` has `timeout 2400`. Quiet-machine wall is 57.8s — 41× under the deadline. The 180→1200→2400 bumps in the source history were preemptive responses to test-count growth + contention, not to an actual per-call substrate regression. The original 180s deadline is comfortable.
Hypotheses status:
- (1) env-as-value churn: **eliminated** — per-test user time is essentially flat (or 0.83× actually faster) baseline → current.
- (2) JIT cache miss / re-compile per call: **eliminated** — same.
- (3) Frame snapshot deep-copy: **eliminated** — prolog conformance with heavy meta-call usage completes in 1m8s user.
- (4) Lazy JIT bypassed: **eliminated** — same.
- (5) Type-check overhead: **eliminated** — same.
- (6) Frame representation: **eliminated** — same.
**Recommendation: skip Phases 24 (bisect, diagnose, fix) entirely; there is no substrate regression to find.** The plan's North-star outcome — restore Tcl's `test.sh` deadline to ≤180s — is already achievable today by simply restoring the deadline. Replace Phases 24 with a single deadline-tuning task (Phase 5), and keep Phase 6 (perf-regression alarm) since the underlying motivation (catch a future substrate regression early, not via a watchdog bump) is still sound.
Proposed Phase 5 (deadline tuning), pending user approval:
- `lib/tcl/test.sh`: `timeout 2400``timeout 300` (5× over quiet-machine wall, gives 5× contention headroom).
- `lib/erlang/conformance.sh`: `timeout 120``timeout 600` (the only genuinely too-tight deadline). Quiet wall 3m25s.
- Other guests' deadlines: leave as-is (already comfortable).
- No source-tree changes outside those two scripts.
Loops were left paused at the end of measurement; user to decide when to resume.