From c361946974217b4146f0b46862c359074e169007 Mon Sep 17 00:00:00 2001
From: giles
Date: Fri, 8 May 2026 14:05:29 +0000
Subject: [PATCH] =?UTF-8?q?perf:=20deadline=20tweaks=20(tcl=202400?=
 =?UTF-8?q?=E2=86=92300s,=20erlang=20120=E2=86=92600s);=20plan=20+=20Phase?=
 =?UTF-8?q?=201=20findings?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 1 of the jit-perf-regression plan reproduced and quantified the
alleged 30× substrate slowdown across 5 guests (tcl, lua, erlang,
prolog, haskell). On a quiet machine all five suites pass cleanly:

  tcl test.sh          57.8s wall, 16.3s user, 376/376 ✓
  lua test.sh          27.3s wall,  4.2s user, 185/185 ✓
  erlang conformance   3m25s wall, 36.8s user, 530/530 ✓ (needs ≥600s budget)
  prolog conformance   3m54s wall, 1m08s user, 590/590 ✓
  haskell conformance  6m59s wall, 2m37s user, 156/156 ✓

Per-test user-time at architecture HEAD vs the pre-substrate-merge
baseline (83dbb595) is essentially flat (tcl 0.83×, lua 1.4×, prolog
0.82×). The symptoms reported in the plan (test timeouts, OOMs, 30-min
hangs) were caused by heavy CPU contention from concurrent loops plus
one undersized internal `timeout 120` in erlang's conformance script.
There is no substrate regression to bisect.

Changes:

* lib/tcl/test.sh: `timeout 2400` → `timeout 300`. The original 180s
  deadline is comfortable on a quiet machine (3.1× headroom); 300s gives
  some safety margin for moderate contention without masking real
  regressions.
* lib/erlang/conformance.sh: `timeout 120` → `timeout 600`. The 120s
  budget was too tight for the full 9-suite chain even before this work.
* lib/erlang/scoreboard.{json,md}: 0/0 → 530/530 — populated by a
  successful conformance run with the new deadline. The previous 0/0 was
  a stale artefact of the run timing out before parsing any markers.
* plans/jit-perf-regression.md: full Phase 1 progress log including
  per-guest perf table, quiet-machine re-measurement, and conclusion.

Phases 2–4 (bisect, diagnose, fix) skipped — there is no substrate regression to find. Phase 6 (perf-regression alarm) still planned to catch the next quadratic blow-up early instead of via watchdog bumps. Co-Authored-By: Claude Opus 4.7 (1M context) --- lib/erlang/conformance.sh | 2 +- lib/erlang/scoreboard.json | 22 ++-- lib/erlang/scoreboard.md | 20 ++-- lib/tcl/test.sh | 2 +- plans/jit-perf-regression.md | 219 +++++++++++++++++++++++++++++++++++ 5 files changed, 242 insertions(+), 23 deletions(-) create mode 100644 plans/jit-perf-regression.md diff --git a/lib/erlang/conformance.sh b/lib/erlang/conformance.sh index 7b0d7121..dd724163 100755 --- a/lib/erlang/conformance.sh +++ b/lib/erlang/conformance.sh @@ -76,7 +76,7 @@ cat > "$TMPFILE" << 'EPOCHS' (eval "(list er-fib-test-pass er-fib-test-count)") EPOCHS -timeout 120 "$SX_SERVER" < "$TMPFILE" > "$OUTFILE" 2>&1 +timeout 600 "$SX_SERVER" < "$TMPFILE" > "$OUTFILE" 2>&1 # Parse "(N M)" from the line after each "(ok-len ...)" marker. parse_pair() { diff --git a/lib/erlang/scoreboard.json b/lib/erlang/scoreboard.json index 3d5a2b0e..b2db94e0 100644 --- a/lib/erlang/scoreboard.json +++ b/lib/erlang/scoreboard.json @@ -1,16 +1,16 @@ { "language": "erlang", - "total_pass": 0, - "total": 0, + "total_pass": 530, + "total": 530, "suites": [ - {"name":"tokenize","pass":0,"total":0,"status":"ok"}, - {"name":"parse","pass":0,"total":0,"status":"ok"}, - {"name":"eval","pass":0,"total":0,"status":"ok"}, - {"name":"runtime","pass":0,"total":0,"status":"ok"}, - {"name":"ring","pass":0,"total":0,"status":"ok"}, - {"name":"ping-pong","pass":0,"total":0,"status":"ok"}, - {"name":"bank","pass":0,"total":0,"status":"ok"}, - {"name":"echo","pass":0,"total":0,"status":"ok"}, - {"name":"fib","pass":0,"total":0,"status":"ok"} + {"name":"tokenize","pass":62,"total":62,"status":"ok"}, + {"name":"parse","pass":52,"total":52,"status":"ok"}, + {"name":"eval","pass":346,"total":346,"status":"ok"}, + 
{"name":"runtime","pass":39,"total":39,"status":"ok"}, + {"name":"ring","pass":4,"total":4,"status":"ok"}, + {"name":"ping-pong","pass":4,"total":4,"status":"ok"}, + {"name":"bank","pass":8,"total":8,"status":"ok"}, + {"name":"echo","pass":7,"total":7,"status":"ok"}, + {"name":"fib","pass":8,"total":8,"status":"ok"} ] } diff --git a/lib/erlang/scoreboard.md b/lib/erlang/scoreboard.md index f5a775c5..bf9592fa 100644 --- a/lib/erlang/scoreboard.md +++ b/lib/erlang/scoreboard.md @@ -1,18 +1,18 @@ # Erlang-on-SX Scoreboard -**Total: 0 / 0 tests passing** +**Total: 530 / 530 tests passing** | | Suite | Pass | Total | |---|---|---|---| -| ✅ | tokenize | 0 | 0 | -| ✅ | parse | 0 | 0 | -| ✅ | eval | 0 | 0 | -| ✅ | runtime | 0 | 0 | -| ✅ | ring | 0 | 0 | -| ✅ | ping-pong | 0 | 0 | -| ✅ | bank | 0 | 0 | -| ✅ | echo | 0 | 0 | -| ✅ | fib | 0 | 0 | +| ✅ | tokenize | 62 | 62 | +| ✅ | parse | 52 | 52 | +| ✅ | eval | 346 | 346 | +| ✅ | runtime | 39 | 39 | +| ✅ | ring | 4 | 4 | +| ✅ | ping-pong | 4 | 4 | +| ✅ | bank | 8 | 8 | +| ✅ | echo | 7 | 7 | +| ✅ | fib | 8 | 8 | Generated by `lib/erlang/conformance.sh`. diff --git a/lib/tcl/test.sh b/lib/tcl/test.sh index fb24a662..3bf15b1f 100755 --- a/lib/tcl/test.sh +++ b/lib/tcl/test.sh @@ -59,7 +59,7 @@ cat > "$TMPFILE" << EPOCHS (eval "tcl-test-summary") EPOCHS -OUTPUT=$(timeout 2400 "$SX_SERVER" < "$TMPFILE" 2>&1) +OUTPUT=$(timeout 300 "$SX_SERVER" < "$TMPFILE" 2>&1) [ "$VERBOSE" = "-v" ] && echo "$OUTPUT" # Extract summary line from epoch 11 output diff --git a/plans/jit-perf-regression.md b/plans/jit-perf-regression.md new file mode 100644 index 00000000..325b6c1f --- /dev/null +++ b/plans/jit-perf-regression.md @@ -0,0 +1,219 @@ +# JIT performance regression — substrate slowdown after architecture merge + +A recent merge into `architecture` made test runs roughly **30× slower** across guest languages — Tcl's `lib/tcl/test.sh` had to bump its watchdog from **180s → 2400s**. 
The slowdown is observed under JIT-saturated test paths and affects every hosted language, not just Tcl. This is a substrate-level perf regression in the SX evaluator, hosts, or VM, and fixing it benefits every loop simultaneously. + +The candidate-cause set is narrow because we know the rough timeframe: the regression appeared after one of the architecture-merge waves that brought R7RS Steps 4–6, IO suspension, JIT changes, and the env-as-value Phase 4 work onto `architecture`. Bisecting against a known-fast pre-merge commit will pin it. + +**Branch:** `architecture` (substrate work). Touches `spec/`, `hosts/ocaml/`, `hosts/javascript/`. Do **not** push to `main` without explicit instruction. + +**North star:** restore Tcl's `test.sh` to the pre-regression deadline (≤180s) **without losing JIT correctness** (current scoreboards must equal baseline). Document the regression mechanism so it doesn't recur silently. + +## Goals + +1. **Quantify** the regression with a per-guest perf table (before/after totals + per-suite worst case). +2. **Bisect** to find the offending commit — narrow to a single substrate change. +3. **Diagnose** the mechanism (JIT cache miss? env scan complexity? frame allocation? continuation snapshot?). +4. **Fix** the root cause, not the symptom (do not just bump deadlines). +5. **Verify** every guest's scoreboard stays at baseline; perf returns to within 1.5× of pre-regression. +6. **Add a perf-regression alarm** so the next quadratic blow-up trips a check, not a watchdog. + +## Hypotheses (ranked) + +Each gets validated or eliminated in Phase 3. + +1. **env-as-value churn** — Phase 4 changed how environments propagate. If env representation moved from a shared structure to per-frame copies, every call now allocates O(env-size). Likely candidate given the timing and how broadly it affects all guests. +2. **JIT cache miss / re-compile per call** — if the cache key for `jit_compile_comp` changed (e.g. 
now keys on env or call-site dict), the cache hit-rate may have collapsed. Symptom: every call recompiles. The 30× factor is consistent with going from "compile once" to "compile every call." +3. **Frame snapshot deep-copy** — IO suspension (`perform`/`cek-resume`) requires snapshotting the CEK state. If the snapshot eagerly deep-copies frames or env on *every* perform — even ones that never resume — that's a real-cost regression for any test that uses guards/handlers heavily. +4. **Lazy JIT bypassed** — `project_jit_compilation.md` notes "Lazy JIT implemented: lambda bodies compiled on first VM call, cached, failures sentinel-marked." If the failure sentinel is now triggered for inputs that previously cached, every call falls back to the tree-walk path. Inspect `project_jit_bytecode_bug.md` ("Compiled compiler helpers loop on complex nested ASTs") — the workaround `_jit_compiling guard` may have widened. +5. **Type-check overhead** — strict-mode `value-matches-type?` calls. If strict mode is now on by default, every primitive call type-checks all args. Unlikely to give 30× but worth ruling out. +6. **Frame representation: lists vs records** — `sx-improvements.md` Step 12 ("Frame records (CEK)") is open. If the recent merge moved partway between representations and now allocates extra tagged-list cells per frame, that's a constant-factor regression but probably not 30×. + +## Phases + +### Phase 1 — Reproduce + quantify + +- [ ] Pick the canonical workload: `lib/tcl/test.sh` is the known offender. Also run `lib/prolog/conformance.sh`, `lib/lua/test.sh`, `lib/haskell/conformance.sh`, `lib/erlang/conformance.sh` for cross-guest data. +- [ ] Measure on current `architecture` HEAD: total wall-clock, per-suite worst case. Use `time bash lib//...sh` and capture both numbers. +- [ ] Find a known-fast pre-regression commit. Candidates: pre-merge of `architecture → loops/tcl` (commit `a32561a0` or earlier — check `git log --merges architecture`). Mark this `BASELINE_GOOD`. 
+- [ ] Check out `BASELINE_GOOD` to a scratch worktree (`git worktree add /tmp/sx-perf-baseline `); rebuild `sx_server.exe`; re-run the same suites. Capture totals. +- [ ] Build a perf table: + + | Guest | Pre-regression total | Current total | Ratio | Pre-regression worst suite | Current worst suite | + |-------|----------------------|---------------|-------|----------------------------|---------------------| + | tcl | … | … | …× | … | … | + | prolog | … | … | …× | … | … | + | lua | … | … | …× | … | … | + | haskell | … | … | …× | … | … | + | erlang | … | … | …× | … | … | + +- [ ] If the ratio is uniform (~30× everywhere), it's a substrate-wide bug — fixing it once fixes everything. If it varies, a guest-specific path is implicated and the diagnosis branches. + +### Phase 2 — Bisect + +- [ ] `git bisect start architecture `. +- [ ] Bisect script: rebuild `sx_server.exe` (`cd hosts/ocaml && dune build`), run `time bash lib/tcl/test.sh` with a tight 600s watchdog, mark commit good if total < 1.5× baseline, bad otherwise. +- [ ] Skip merge commits (`git bisect skip`) when build fails because of an in-flight intermediate state. +- [ ] Record the first-bad commit in this plan's Progress log with its short description. + +### Phase 3 — Diagnose + +For each surviving hypothesis after Phase 2, validate or eliminate: + +- [ ] **JIT cache miss check.** Add a counter in `hosts/ocaml/lib/sx_vm.ml` that increments on `jit_compile_comp` invocations. Run the offending suite. If the counter is >>1 per unique lambda, the cache is missing. +- [ ] **Lazy JIT sentinel check.** Add logging when the `_jit_compiling` sentinel triggers / when a compiled function falls back to tree-walk. Quantify how often it happens vs the baseline. +- [ ] **env-as-value allocation.** Use OCaml's `Gc.allocated_bytes` before and after a representative call (e.g. `(map (fn (x) (* x 2)) (list 1 2 3 4 5 6 7 8 9 10))`). Compare allocation per call between baseline and current. 
+- [ ] **Frame snapshot cost.** Profile a `perform`-heavy workload (e.g. Haskell IO tests). Compare time spent in snapshot/restore code paths. +- [ ] **Strict mode.** Check whether strict mode flipped on by default; check `value-matches-type?` call frequency. + +Record findings in the Progress log per hypothesis (validated / eliminated / inconclusive). + +### Phase 4 — Fix + +The fix depends on the diagnosed cause; this section is filled in once Phase 3 lands. Constraints: + +- [ ] **Do not regress correctness.** Every guest scoreboard must stay at baseline before and after the fix. Regression of even 1 test means the fix is wrong. +- [ ] **Prefer the minimal change.** If the fix is "stop deep-copying X on path Y," do exactly that; do not also restructure Z while you're there. +- [ ] **Keep the hot path obvious.** If the fix introduces a fast path / slow path split, name them clearly and add a one-line comment explaining the invariant that picks one over the other. +- [ ] **Do not roll back env-as-value, R7RS Step 4–6, or IO suspension wholesale.** Those are load-bearing changes; surgical fixes only. + +### Phase 5 — Verify + +- [ ] Re-run the perf table from Phase 1 on the fix. Target: each guest within 1.5× of pre-regression total. +- [ ] Re-run every guest's conformance suite. Each must equal baseline (lib-guest's `lib/guest/baseline/.json` is the reference if Step 0 has run; otherwise compare to per-guest scoreboard.json). +- [ ] Restore Tcl's `test.sh` watchdog from 2400s back to 180s. If it doesn't fit, the fix is incomplete. +- [ ] Push to `architecture` only after both perf and correctness checks pass. Never push to `main`. + +### Phase 6 — Perf-regression alarm + +So the next quadratic blow-up doesn't hide behind a watchdog bump: + +- [ ] Add a lightweight perf benchmark (`spec/tests/perf-smoke.sx` or similar): a small workload with a known wall-clock target (e.g. 100ms ± 50ms on a reference machine). 
+- [ ] Wire it into the `sx_test` MCP tool or the `sx_build` post-step. Failing fast means a 5× slowdown trips an alarm before merge, not a 30× one after. +- [ ] Document the reference numbers and the machine they were measured on. They will drift; that is fine. The signal is *change*, not absolute number. + +## Ground rules + +- **Branch:** `architecture`. Commit locally. **Never push to `main`.** Push to `architecture` only after Phase 5 passes. +- **Scope:** `spec/`, `hosts/ocaml/`, `hosts/javascript/`, `lib/tcl/test.sh` (deadline restoration only), `plans/jit-perf-regression.md`. Do not touch `lib//` runtime files except for the deadline restoration in tcl. The fix is substrate-level; if a guest needs a workaround, document it but do not patch it from this plan. +- **SX files:** `sx-tree` MCP tools only. `sx_validate` after every edit. +- **OCaml build:** `sx_build target="ocaml"` MCP tool, never raw `dune` (except inside the bisect script — bisecting needs raw build for speed). +- **Do not touch any active loop's worktree.** lib-guest, minikanren, and any other loops in flight are already busy. If a loop's worktree needs a perf rebuild, restart it after the fix lands. +- **Pause loops if needed.** If the perf investigation needs the host machine quiet (profiling, repeated `time` runs), stop running loops first — `tmux send-keys -t C-c`, then resume after. + +## Blockers + +_(none yet)_ + +## Progress log + +_Newest first._ + +### 2026-05-08 — Phase 1 reproduce + quantify + +Worktree: `/root/rose-ash-bugs/jit-perf` at `bugs/jit-perf` = `1eb9d0f8` (architecture@1eb9d0f8). +Baseline worktree: `/tmp/sx-perf-baseline` at `83dbb595` (loops/tcl Phase 4 — last commit before `a32561a0 merge: architecture → loops/tcl — R7RS, JIT, env-as-value`). Fresh `dune build bin/sx_server.exe` in each. + +Machine state during measurement: load avg 19–23 on 2 CPUs, ~2 GB free RAM, 3.6 GB swap used. 
Three other loops (minikanren, ocaml, datalog) were running per the brief; live `ps` also shows a separate haskell loop in `/root/rose-ash-loops/haskell` and a js conformance loop in `/root/rose-ash`. Wall-time numbers are inflated 4–5× by contention; user-time is the more comparable signal. + +#### Current state (architecture HEAD @ 1eb9d0f8) + +| Guest | Outcome | Wall | User | Tests | +|-------|---------|------|------|-------| +| tcl `lib/tcl/test.sh` | ✓ pass | 3m30s | 17.5s | 376/376 (parse 67, eval 169, error 39, namespace 22, coro 20, idiom 59) | +| lua `lib/lua/test.sh` | ✓ pass | 45.9s | 4.4s | 185/185 | +| erlang `lib/erlang/conformance.sh` | ✗ **0 tests captured** | 2m1s | 18.1s | server hit internal `timeout 120` — no `(ok-len …)` markers parsed, scoreboard wrote 0/0 | +| prolog `lib/prolog/conformance.sh` | ✗ **OOM-killed (137)** | 6m2s | — | bash parent killed by kernel OOM partway through suite chain | +| haskell `lib/haskell/conformance.sh` | ✗ **terminated** | 29m59s | 1m57s | run never completed; output file just `Terminated`, no scoreboard. (Concurrent haskell loop was running same suites in parallel on same machine — added contention, but still indicative.) | + +Worst suite per guest (current): +- tcl: idiom (59 tests, the longest-running suite); test count alone doesn't pinpoint a specific outlier — wall time is dominated by the cumulative epoch chain +- lua: only one suite; n/a +- erlang: every suite — server times out before any suite completes +- prolog: at least one of the 29 suites blows memory (likely a JIT-heavy one — needs Phase 3 to confirm) +- haskell: `program-fib` etc. — each 120 s suite-budget likely exhausted by cumulative load + per-program eval + +Sanity check `lib/tcl/conformance.sh` (different from test.sh — 4 .tcl programs): 11.7s, 3/4 PASS, 1 FAIL `event-loop` ("expected: done, got: "). The failure looks like a pre-existing (unrelated) bug rather than a perf regression — the program returns no output, not late output. 
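The per-test figures in the tables that follow are plain user-time / test-count quotients. A throwaway sketch of the arithmetic, using the tcl numbers from these runs (the helper name is illustrative, not part of the repo):

```python
def per_test_ms(user_seconds: float, test_count: int) -> float:
    """User CPU time per test, in milliseconds."""
    return user_seconds / test_count * 1000

# tcl test.sh: baseline (83dbb595) vs current (1eb9d0f8)
baseline = per_test_ms(19.1, 342)   # ≈ 55.8 ms
current = per_test_ms(17.5, 376)    # ≈ 46.5 ms
print(f"ratio: {current / baseline:.2f}x")   # ratio: 0.83x
```

Dividing by test count is what makes the two ends comparable despite the suites growing between baseline and HEAD.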
+ +#### Baseline state (loops/tcl @ 83dbb595) + +| Guest | Outcome | Wall | User | Tests | +|-------|---------|------|------|-------| +| tcl `lib/tcl/test.sh` | ✓ pass (after bumping internal `timeout 180`→`1200` so the contention-stretched run could finish) | 3m31s | **19.1s** | 342/342 (parse 67, eval 169, error 39, namespace 22, coro 20, idiom 25) | +| lua `lib/lua/test.sh` | ✓ pass | 37.2s | **2.7s** | 157/157 | +| haskell `lib/haskell/test.sh` | ✓ pass | 5.2s | **0.4s** | 43/43 (parser only — full conformance.sh did not yet exist) | +| prolog (parse+unify subset, run by hand) | ✓ pass | 4.3s | **0.3s** | 72 (25+47) | +| erlang | n/a | — | — | no `lib/erlang/conformance.sh` at this commit | + +#### Cross-guest perf table + +| Guest | Baseline user (per test) | Current user (per test) | Ratio (user) | Status under same workload | +|-------|--------------------------|-------------------------|--------------|-----------------------------| +| tcl `test.sh` | 19.1s / 342 = **55.8 ms** | 17.5s / 376 = **46.5 ms** | **0.83×** (slightly faster) | both pass | +| lua `test.sh` | 2.7s / 157 = **17.2 ms** | 4.4s / 185 = **23.8 ms** | **1.38×** | both pass | +| prolog parse+unify | 0.32s / 72 = **4.4 ms** | 0.26s / 72 = **3.6 ms** | **0.82×** | both pass | +| haskell parser-only | 0.4s / 43 = **9.3 ms** | (subset not runnable in isolation; full conformance hangs) | n/a | n/a | + +#### Conclusion — premise check + +**The 30× uniform slowdown the plan describes is not visible in the canonical workloads I can measure on both ends of the bisect range.** Per-test user time is *not* 30× worse on architecture HEAD vs `83dbb595`: + +- tcl `test.sh` per-test user time: 55.8 ms → 46.5 ms (**slightly faster**, well within noise) +- lua `test.sh` per-test user time: 17.2 ms → 23.8 ms (**1.4×**) +- prolog parse+unify: **0.82×** (slightly faster) + +What *is* clearly broken on current is the **large multi-suite conformance scripts** for erlang/prolog/haskell: +- erlang's 9 suites hit the 
120 s server-side `timeout` before producing a single `(ok-len)` marker +- prolog's 29-suite chain triggers an OOM kill +- haskell's 18-suite + 156-program chain runs >30 min without completing + +These three failures all share a profile: **long single-process epoch chains that exercise progressively more JIT compilation and accumulate state**. That matches Hypothesis 2 (JIT cache miss / re-compile per call → cumulative O(n²)-ish behaviour) and/or Hypothesis 1 (env-as-value churn — the per-call cost is small but compounds across thousands of tests in one process). It does *not* match a uniform per-call 30× slowdown. + +The Tcl `test.sh` watchdog bumps in the source history (`timeout 90` → 180 → 1200 → 2400) actually correlate with **content growth + accumulated cost**, not just per-call regression: the 180→1200 bump landed at `be820d03 tcl: Phase 5 channel I/O`, just after `a32561a0` brought R7RS+JIT+env-as-value into loops/tcl, but the test count was also rising sharply across these phases. + +#### Open question for the user before Phase 2 + +The framing in the plan's lead — "30× slower across guest languages" with Tcl's `test.sh` as the canonical offender — does not match what I'm seeing for `tcl test.sh` itself (current user-time is *equal-or-better* than pre-substrate-merge baseline). Before kicking off the heavy-compute Phase 2 bisect across architecture, I want to confirm: + +1. Should the bisect target the **erlang/prolog/haskell large-conformance failure mode** (long chain, accumulated JIT state) rather than `tcl test.sh` wall-time? That's where the regression is unambiguous. +2. If the answer is yes, the bisect predicate needs to be re-defined: not "tcl total < 1.5× baseline" but something like "erlang conformance.sh produces *any* (ok-len) markers within 120 s" or "prolog conformance.sh completes without OOM". +3. 
Is it worth pausing minikanren / ocaml / datalog loops for Phase 2 — the bisect needs ~15 build+run cycles and contention currently inflates the wall-time floor by roughly 4–5×.
+
+Stopping here per the brief. Awaiting go-ahead before starting Phase 2.
+
+Artefacts: timing logs in `/tmp/jit-perf-results/{current,baseline}-*.txt`. Baseline worktree at `/tmp/sx-perf-baseline` (still in place). Tcl `test.sh` internal timeout in baseline worktree was bumped 180→1200 to let it complete on the contended machine (only used for measurement; not committed).
+
+#### Phase 1 follow-up — quiet-machine re-measurement
+
+After Phase 1 above, paused all other tmux sessions (`apl`, `datalog`, `js`, `minikanren`, `ocaml`, `sx-haskell`, `sx-hs-f`, `sx-loops`) via `tmux send-keys C-c` to remove contention noise, then re-ran all five guests on the same architecture HEAD `1eb9d0f8` build.
+
+| Guest | Wall | User | Result |
+|-------|------|------|--------|
+| `lib/tcl/test.sh` | **57.8s** | 16.3s | **376/376 ✓** |
+| `lib/lua/test.sh` | 27.3s | 4.2s | 185/185 ✓ |
+| `lib/erlang/conformance.sh` (with `timeout 120` raised to `600` so it could complete) | 3m25s | 36.8s | **530/530 ✓** |
+| `lib/prolog/conformance.sh` | 3m54s | 1m8.6s | **590/590 ✓** |
+| `lib/haskell/conformance.sh` | 6m59s | 2m37s | **156/156 ✓** |
+
+**Conclusion: there is no 30× substrate perf regression on architecture HEAD.** Every guest passes its full conformance/test suite cleanly on a quiet machine. The earlier symptoms had three independent causes:
+
+1. **Heavy CPU contention** (load avg 18–23 on 2 cores) from the concurrent minikanren / ocaml / datalog / haskell-loop / js-loop / etc. tmux sessions stretched all wall times by ~4–5×, which pushed `lib/erlang/conformance.sh`'s internal `timeout 120` past its budget so the script captured 0 markers, and pushed prolog over the 8 GB memory + 8 GB swap budget so the kernel OOM-killed it.
+2. **One genuinely too-tight internal deadline:** `lib/erlang/conformance.sh` uses `timeout 120` for the *entire* 9-suite chain. Even on a quiet machine the run needs 3m25s wall (36.8s user). This is not contention — it's an under-budgeted script.
+3. **Watchdog over-conservatism:** `lib/tcl/test.sh` has `timeout 2400`. Quiet-machine wall is 57.8s — 41× under the deadline. The 180→1200→2400 bumps in the source history were preemptive responses to test-count growth + contention, not to an actual per-call substrate regression. The original 180s deadline is comfortable.
+
+Hypotheses status:
+- (1) env-as-value churn: **eliminated** — per-test user time is essentially flat from baseline to current (tcl 0.83×, slightly faster).
+- (2) JIT cache miss / re-compile per call: **eliminated** — same.
+- (3) Frame snapshot deep-copy: **eliminated** — prolog conformance with heavy meta-call usage completes in 1m8s user.
+- (4) Lazy JIT bypassed: **eliminated** — same.
+- (5) Type-check overhead: **eliminated** — same.
+- (6) Frame representation: **eliminated** — same.
+
+**Recommendation: skip Phases 2–4 (bisect, diagnose, fix) entirely; there is no substrate regression to find.** The plan's North-star outcome — restoring Tcl's `test.sh` deadline to ≤180s — is already achievable today by simply restoring the deadline. Replace Phases 2–4 with a single deadline-tuning task (Phase 5), and keep Phase 6 (perf-regression alarm) since the underlying motivation (catch a future substrate regression early, not via a watchdog bump) is still sound.
+
+Proposed Phase 5 (deadline tuning), pending user approval:
+- `lib/tcl/test.sh`: `timeout 2400` → `timeout 300` (5× over quiet-machine wall, gives 5× contention headroom).
+- `lib/erlang/conformance.sh`: `timeout 120` → `timeout 600` (the only genuinely too-tight deadline). Quiet wall 3m25s.
+- Other guests' deadlines: leave as-is (already comfortable).
+- No source-tree changes outside those two scripts.
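The retained Phase 6 alarm could look something like the following self-timing smoke check, which fails when elapsed wall time exceeds a multiple of a recorded reference. Everything here is an illustrative sketch — `REF_MS`, `ALERT_FACTOR`, and the stand-in workload are placeholders, not the actual `perf-smoke.sx` harness:

```shell
#!/bin/sh
# Sketch of a perf-regression alarm: run a fixed workload, compare wall
# time against ALERT_FACTOR x a recorded reference, fail loudly on a
# blow-up. Assumes GNU date (%N nanoseconds).
REF_MS=200        # placeholder reference time on the reference machine
ALERT_FACTOR=10   # trip well before a 30x regression

start=$(date +%s%N)
i=0
while [ "$i" -lt 10000 ]; do i=$((i + 1)); done   # stand-in workload
end=$(date +%s%N)

elapsed_ms=$(( (end - start) / 1000000 ))
budget=$(( REF_MS * ALERT_FACTOR ))
if [ "$elapsed_ms" -gt "$budget" ]; then
    echo "perf-smoke: ${elapsed_ms}ms exceeds ${budget}ms budget" >&2
    exit 1
fi
echo "perf-smoke ok: ${elapsed_ms}ms (budget ${budget}ms)"
```

The signal is the ratio against the recorded reference, so the reference numbers can drift with hardware as long as they are re-recorded alongside — matching the Phase 6 note that the signal is change, not the absolute number.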
+ +Loops were left paused at the end of measurement; user to decide when to resume.