# W14 — Test gate & conformance infrastructure loop

Forge agent **ws-W14**. Role: build out **W14** from the SX review remediation plan
(`plans/sx-review/PLAN.md`, §"W14. Test gate & conformance infrastructure") —
*the enabler that makes every other fix verifiable*. One checklist item per fire.

You are on branch `loops/sx-ws-w14`, worktree `/root/rose-ash-loops/sx-ws-w14`.

## Hard guardrails (read every fire)

- **TEST-ONLY.** No semantics edits. Do NOT touch `spec/evaluator.sx`,
  `spec/primitives.sx`, `spec/parser.sx`, `spec/render.sx`, the OCaml kernel,
  or any host runtime. W14 pins behavior with tests and productionizes the
  *test/runner* surface; the actual fixes are other workstreams (W1–W12).
  A pin that *fails* means the finding regressed: do NOT relax the assertion;
  record it as a blocker.
- **NO PUSH.** Commit locally on `loops/sx-ws-w14` only. Never push; never touch
  `main` or `architecture`.
- **`.sx` files: use `sx-tree` MCP tools only** (a hook blocks Read/Write/Edit
  on `.sx`). `sx_write_file` takes params **`file`** and **`source`** (NOT
  `content` — a wrong key yields a `yojson … got null` error and no write).
  `.md`/`.sh`/`.ml` files: normal tools are fine.
- **Never `pkill`/`kill` `sx_server`** — sibling loops share the binary. Bound
  every run with `timeout` (e.g. `timeout 300 …`); if it hangs, let the timeout end it.
- **One item per fire, then stop.** No batching.
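
For harness code that spawns its own child process, the `timeout` rule above can be sketched in Python. This is a minimal illustration, not a script in the repo; `run_bounded` is a hypothetical helper. When the budget expires, `subprocess.run` kills only the child it started, so no shared process is signalled.

```python
import subprocess
import sys

def run_bounded(cmd, seconds):
    """Run cmd with a hard time limit; return (exit_code, timed_out).

    Hypothetical helper. On expiry, subprocess.run kills only the child
    it spawned itself, so sibling processes are left alone.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
        return done.returncode, False
    except subprocess.TimeoutExpired:
        return None, True

# A stand-in for a hung run: sleeps far longer than the budget allows.
hung = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_bounded(hung, 1))
```

The caller sees a timeout as data (`timed_out` is true) and can record it, rather than reaching for `kill`.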

## Per-iteration procedure

1. Pick the first unchecked `[ ]` in the checklist.
2. Implement (test file or runner/harness change), lifting minimal repros from
   the review lane files (`plans/sx-review/{core,hosts,conformance}.md`) — they
   are a ready-made corpus of confirmed repros.
3. Build + run the affected tests:
   `sx_build` (target ocaml) then
   `timeout 300 ./hosts/ocaml/_build/default/bin/run_tests.exe <test-name>`
   to run a single file. New `spec/tests/test-*.sx` files are auto-discovered.
4. Confirm green (a pin must PASS on current HEAD — the fix already landed).
5. Commit locally: `git add -A && git commit` with a `W14:`-prefixed message.
6. Tick the box, prepend one dated line to the Progress log, stop.
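
Steps 1 and 6 are mechanical enough to sketch. The following is a hedged illustration of "first unchecked item" and "tick the box" over this file's checklist syntax; `first_unchecked` and `tick` are hypothetical names, and the loop itself may well do this by hand.

```python
import re

# Matches a top-level checklist bullet such as "- [ ] F10 ..." or "- [x] K18 ...".
ITEM = re.compile(r"^- \[( |x)\] (.*)$")

def first_unchecked(markdown):
    """Return (line_index, label) of the first "- [ ]" item, or None."""
    for index, line in enumerate(markdown.splitlines()):
        match = ITEM.match(line)
        if match and match.group(1) == " ":
            return index, match.group(2)
    return None

def tick(markdown, index):
    """Return the text with the item on line `index` marked done."""
    lines = markdown.splitlines()
    lines[index] = lines[index].replace("- [ ]", "- [x]", 1)
    return "\n".join(lines)

doc = "- [x] done item\n- [ ] next item\n- [ ] later item"
print(first_unchecked(doc))   # (1, 'next item')
```

Only one item is ever selected per call, which matches the one-item-per-fire rule.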

## Checklist

### A. Test-debt pins — dc7aa709's landed fixes shipped without regression tests
Pin each confirmed-and-fixed finding with a minimal repro. Add suites to
`spec/tests/test-gate-pins.sx` (one `defsuite` per finding).

- [x] K18 [W7] — `expt` overflow now float-promotes (no 63-bit wrap)
- [x] K20 [W7] — `contains?` now supports dict key membership
- [x] K09/K11/K39 [W5] — longhand `unquote-splicing`, guard sentinel gensym, `do` IIFE-head
- [x] K49 [W8] — five void elements (area/base/embed/param/track) renderable
      (spec side; native regen drift → see Blocked). NB: the depth/cycle guard
      is K16 [W8], still OPEN — not a W14 pin target until its fix lands
- [x] crit-2 [W1] — signal-return kont pinned NON-VACUOUSLY (side-effect
      sentinel across two tests; a plain assert would inherit the vacuity)
- [x] C1/C1b [W3] — command-channel crash guards pinned
      (`scripts/test-protocol-gate.sh`, seed for section E's fuzz suite)
- [x] S4 [hosts] — soft error pages not cached (HTTP-mode pin in
      `scripts/test-protocol-gate.sh`; NB S4 lives in hosts.md, not
      conformance — "housekeeping" was a mislabel from F-15's tag)
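
The crit-2 item depends on why a plain assertion is vacuous, so here is a toy model of the sentinel pattern in Python rather than SX. Everything in it is an assumption made for illustration: `DroppedContinuation` stands in for the bug (control leaves the test body before the assertion runs) and `run_test` is a made-up runner that treats such an escape as a pass.

```python
class DroppedContinuation(Exception):
    """Models the crit-2 bug: control leaves the test body early."""

def run_test(body):
    """Toy runner: a test fails only when an assertion fails.

    An early escape is reported as a pass, which is exactly what makes a
    plain assertion vacuous under this class of bug.
    """
    try:
        body()
    except AssertionError:
        return "fail"
    except DroppedContinuation:
        return "pass"
    return "pass"

def run_pair(evaluate):
    """Run the two-test sentinel pin against `evaluate`."""
    flag = {"reached": False}

    def test_repro():
        assert evaluate() == 43
        flag["reached"] = True      # set only if control really came back

    def test_sentinel():
        assert flag["reached"], "test_repro never ran to completion"

    return run_test(test_repro), run_test(test_sentinel)

def healthy():
    return 43

def regressed():
    raise DroppedContinuation()

print(run_pair(healthy))     # ('pass', 'pass')
print(run_pair(regressed))   # ('pass', 'fail')
```

On regression the first test still reports a pass, and the second test is the one that fails, which is the behavior the two-test layout exists to guarantee.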

### B. Runner/production env unification
- [x] Audit runner-only bindings — inventory + bidirectional ledger in
      `scripts/test-env-parity.sh` (KNOWN_DRIFT: values, call-with-values,
      contains-char?, trim-right, sha3-256; consequence pin:
      canonical-serialize broken on server; BOTH runners' sha3-256 are FAKE
      stubs → test CIDs ≠ production CIDs)
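
The bidirectional ledger idea can be sketched as a pure function. This is an illustration under stated assumptions, not the contents of `scripts/test-env-parity.sh`: `check_ledger` is a hypothetical name, and the MUST_HAVE names in the usage lines are placeholders picked from primitives mentioned elsewhere in this file.

```python
def check_ledger(bound, must_have, known_drift):
    """Return failure messages; an empty list means the ledger holds.

    bound       -- names actually bound in the environment under test
    must_have   -- names that must be bound (a missing one is a regression)
    known_drift -- names recorded as absent (one appearing means a fix
                   landed and the ledger is stale)
    """
    failures = []
    for name in sorted(must_have - bound):
        failures.append(f"MUST_HAVE missing: {name}")
    for name in sorted(known_drift & bound):
        failures.append(f"KNOWN_DRIFT now bound, update ledger: {name}")
    return failures

KNOWN_DRIFT = {"values", "call-with-values", "contains-char?", "trim-right", "sha3-256"}
MUST_HAVE = {"contains?", "expt"}            # placeholder set for the example

print(check_ledger({"contains?", "expt"}, MUST_HAVE, KNOWN_DRIFT))
print(check_ledger({"contains?", "expt", "values"}, MUST_HAVE, KNOWN_DRIFT))
```

Failing in both directions is the point: a fix that lands silently turns the ledger red, which forces the consequence pin to be revisited.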

### C. Harness honesty
- [x] K19 — harness/runtime parity pinned (`scripts/test-harness-parity.sh`:
      drives mcp_tree sx_eval over JSON-RPC vs fresh sx_server over epoch,
      12-probe battery from the finding, errors compared by message)
- [x] C22/K104 — FIXED harness (spec/harness.sx make-interceptor: log entry
      appended before the mock runs, :result updated via dict-set!) + 3 pins
- [x] C21 — BUILT `harness-run-perform` (spec/harness.sx): drives real CEK
      suspend/resume, services performs from session mocks, C22-style
      logging; 5 pins incl. the S10 map-over-perform probe (CEK keeps all
      elements — the drop class is serving-JIT-side). Runner-only (needs
      cek-* driver bindings)
- [x] C23 — adapter-dom render-output tests
      (`web/tests/test-adapter-dom-render.sx`, 8 tests vs runner mock DOM;
      follow-up depth still open: boolean attrs, on-*/bind/ref/key,
      reactive attrs, hydration cursor)
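
The C22/K104 fix (append the log entry before the mock runs, then update the result in place) translates directly into a small Python sketch. It mirrors the described shape only; the real `make-interceptor` lives in `spec/harness.sx`, and the field names used here are assumptions.

```python
def make_interceptor(log, name, mock):
    """Wrap `mock` so every call is logged, even if the mock raises."""
    def intercepted(*args):
        entry = {"op": name, "args": list(args), "result": None}
        log.append(entry)               # logged BEFORE the mock runs
        entry["result"] = mock(*args)   # updated in place on normal return
        return entry["result"]
    return intercepted

def failing_mock(url):
    raise RuntimeError("backend down")

log = []
fetch_ok = make_interceptor(log, "fetch", lambda url: url.upper())
fetch_bad = make_interceptor(log, "fetch", failing_mock)

fetch_ok("a")
try:
    fetch_bad("b")
except RuntimeError:
    pass

print(len(log))             # 2
print(log[1]["result"])     # None
```

With the older ordering (call first, log afterwards) the throwing call would leave no trace; here it leaves an entry whose result is still the pending placeholder.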

### D. WASM corpus runner
- [x] F2 — BUILT `hosts/ocaml/browser/run_wasm_corpus.js` (one file per
      node process, shipped-kernel boot per test_wasm_native.js) +
      `scripts/test-wasm-corpus.sh` sweep driver with SKIP/KNOWN_FAIL
      ledger. Baseline: 83 files, 80 fully green, 5192 passes, 0 test
      failures; 3 partial load-errors (hash-table/r7rs/sets, opaque jsoo
      exception mid-file). Full sweep ~13 min — wiring into
      sx-build-all.sh left as maintainer call (gate definition D3)
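
A SKIP/KNOWN_FAIL ledger with green-flip detection can be sketched as a classifier over per-file outcomes. This does not reproduce `scripts/test-wasm-corpus.sh`; the function names and file names are hypothetical, and treating a green flip as a sweep failure is an assumption of this sketch.

```python
def classify(results, skip, known_fail):
    """Bucket per-file outcomes against the ledger.

    results maps file name -> True (fully green) or False (not green).
    """
    report = {"ok": [], "unexpected_fail": [], "green_flip": [], "skipped": []}
    for name in sorted(results):
        if name in skip:
            report["skipped"].append(name)
        elif results[name]:
            # A ledgered failure that now passes means the ledger is stale.
            bucket = "green_flip" if name in known_fail else "ok"
            report[bucket].append(name)
        elif name in known_fail:
            report["ok"].append(name)    # failing exactly as the ledger predicts
        else:
            report["unexpected_fail"].append(name)
    return report

def sweep_passes(report):
    """Assumed policy: new failures and stale ledger entries both fail."""
    return not report["unexpected_fail"] and not report["green_flip"]

results = {"test-a.sx": True, "test-b.sx": False, "test-c.sx": True}
report = classify(results, skip=set(), known_fail={"test-b.sx", "test-c.sx"})
print(report["green_flip"])    # ['test-c.sx']
print(sweep_passes(report))    # False
```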

### E. Epoch-loop protocol fuzz + skip-list
- [x] C3/C4/C5/C6/C7 — protocol-quirk ledger (pins current behavior,
      bidirectional) + seeded 60-line fuzz-liveness property in
      `scripts/test-protocol-gate.sh` (11/11)
- [ ] F10 — hs-upstream skip-list so browser-only FAILs mean something
- [ ] C9 — empty suite label

### F. Differential battery
- [ ] F8 — cross-host differential battery (same source, all hosts agree)
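
For F8, one possible shape of a differential battery is sketched here, borrowing the K19 approach of comparing errors by message. Hosts are modelled as plain callables from source text to a value; how each real host is driven (epoch protocol, JSON-RPC, Node) is outside this sketch, and all names are hypothetical.

```python
def outcome(host, source):
    """Evaluate on one host; errors are normalised to their message."""
    try:
        return ("ok", repr(host(source)))
    except Exception as err:
        return ("error", str(err))

def disagreements(hosts, battery):
    """Return {source: {host_name: outcome}} for probes where hosts differ."""
    diverging = {}
    for source in battery:
        results = {name: outcome(run, source) for name, run in hosts.items()}
        if len(set(results.values())) > 1:
            diverging[source] = results
    return diverging

def host_a(source):
    return len(source)

def host_b(source):
    # Deliberately diverges on one probe to show what a report looks like.
    return 99 if source == "(probe-2)" else len(source)

report = disagreements({"a": host_a, "b": host_b}, ["(probe-1)", "(probe-2)"])
print(sorted(report))    # ['(probe-2)']
```

Normalising errors before comparison means two hosts that fail identically count as agreeing, so the battery flags only real divergence.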

## Progress log (newest first)

- 2026-07-04 — **C3–C7 protocol fuzz suite (item E.1)**. All five findings
  are still OPEN server-side (sx_server.ml fixes are host-runtime work),
  so the suite pins CURRENT behavior as a bidirectional ledger — verified
  each live first: C3 stray io-response → extra Unknown-command reply
  (dead 13-vs-14-char guard); C4 malformed (epoch) → error reply + stale
  epoch tag (envelope changed since the finding: dc7aa709's guard now
  answers rather than ignores); C5 decreasing epoch accepted; C6 two
  commands one line → one error, neither runs; C7 vm-trace sans compiler →
  opaque "Not callable: nil". Plus a real fuzz property: 60
  deterministically-seeded hostile lines (unbalanced parens, control chars,
  unicode, 2KB lines, stray io-responses, epoch mutations) then a
  well-formed command — server must still answer and exit cleanly.
  protocol-gate now 11/11. When a server fix lands, the matching ledger
  pin fails loudly → update to assert the corrected behavior. Test-only.
- 2026-07-04 — **F2 WASM corpus runner (section D COMPLETE)**. The review's
  headline conformance gap: no runner ever fed spec/tests through the
  SHIPPED browser artifact (F-1/F-3 divergences existed undetected). Built
  `run_wasm_corpus.js` (boots sx_browser.bc.wasm.js headless in Node with
  the test_wasm_native.js stub block, loads the 23 web-stack modules,
  registers framework hooks, runs ONE file per process → parseable
  `CORPUS-RESULT` line; process isolation means a hung file can't kill the
  sweep) + `scripts/test-wasm-corpus.sh` (sweep driver, SKIP/KNOWN_FAIL
  ledger with green-flip detection). **Empirical baseline: 83 files, 80
  fully green, 5192 passes, ZERO test failures on the shipped kernel** —
  including test-gate-pins (29/29) and test-letrec-resume (the kernel
  provides cek-* driver bindings, broader than bare sx_server). 3 partial
  load-errors (test-hash-table 22p, test-r7rs 87p, test-sets 30p — opaque
  jsoo exception mid-file, diagnosing which form = follow-up). Full sweep
  ~13 min; CI wiring deferred to the D3 gate-definition decision. Test-only.
- 2026-07-04 — **C23 adapter-dom render-output tests (item C.4) — section C
  COMPLETE**. Key discovery: the "browser-only" exclusion of adapter-dom
  testing is FALSE for render output — `(import (web adapter-dom))`
  disk-resolves in the OCaml runner and `render-to-dom` works against its
  mock DOM (dom-* → host-* → mock elements). New
  `web/tests/test-adapter-dom-render.sx` (8 tests): tag/text-child-node,
  class+id, ordered children, void element, when-false empty FRAGMENT,
  when-true branch-in-fragment, map N-children-in-fragment, if inlines
  branch. Probed the adapter's output contract first (text = nodeType-3
  child; control flow = FRAGMENT wrapper; if inlines). Auto-included in
  default runs (not on the exclusion list) — first render-output coverage
  of the 1512-line adapter in the standard gate. Follow-up depth (boolean
  attrs, on-*/bind/ref/key, reactive, hydration) noted on the checklist.
  254/0 standalone. Test-only.
- 2026-07-04 — **C21 perform-mode harness (item C.3)**. Added
  `harness-run-perform` to spec/harness.sx (exported): drives
  `make-cek-state`/`cek-step-loop`, services each
  `(perform {:op X :args L})` suspension from the session's platform mocks
  (entry logged before invocation, C22-consistent), `cek-resume`s with the
  mock value, loops to terminal. Self-recursion via the `(self self …)`
  pattern (avoids letrec-injection K06 territory). Extracted the arity
  dispatch into shared `harness-invoke-mock`. 5 pins in
  `gate-C21-perform-mode-harness` — notably the **S10 probe**: `(map (fn (u)
  (perform …)) '("a" "b" "c"))` keeps ALL elements through 3 suspensions on
  the CEK path, confirming the element-drop class is serving-JIT-side, not
  CEK. Caveat noted in the docstring: needs the runner's cek-* driver
  bindings (absent on bare sx_server/MCP — the env-parity theme again).
  290/0. Test-infra-only.
- 2026-07-04 — **C22/K104 throwing-mock fix + pins (item C.2)**. First
  actual FIX of the loop — in scope because spec/harness.sx is W14-owned
  test infrastructure (PLAN approach item 4 assigns "log IO before invoking
  the mock" to W14). TDD: reproduced pre-fix (caught error, 0 log entries),
  then restructured `make-interceptor` to append the entry BEFORE the mock
  runs (`:result nil` while pending, `dict-set!` in place on return).
  Verified: throwing mock leaves entry, happy path updates result, mixed
  sequence counts all 3. Added suite `gate-C22-throwing-mock-logged`
  (3 tests). Harness self-suite (15) + test-relate-picker (only other
  harness consumer) green; 285/0 pins run. Tooling notes: replace/insert
  tools take `new_source` (not `replacement`); find_all paths still
  disagree with read_subtree/replace_node on define-library files —
  sx_write_file remains the reliable route. Test-infra-only.
- 2026-07-04 — **K19 harness-parity pin (item C.1)**. Authored
  `scripts/test-harness-parity.sh`: drives `mcp_tree.exe` `sx_eval` with
  raw JSON-RPC over stdio and a fresh `sx_server.exe` over the epoch
  protocol, running the finding's exact 12-probe battery (empty?/get/
  split/equal?/contains?/keyword-name/char-code/parse-number) through both
  and failing on ANY divergence. Errors normalized to their inner message
  so identical failures compare equal (`keyword-name :kw` errors the same
  way on both — keywords evaluate to strings before the call). Result:
  12/12 parity — dc7aa709's 8-entry stopgap alignment holds; this pin keeps
  it honest until the real fix (mcp_tree links sx_primitives) lands in the
  hosts lane. Test-only.
- 2026-07-04 — **Section B: env-parity audit + ledger**. Probed a fresh
  `sx_server` over the epoch protocol (`deps-check` + live eval). Confirmed
  runner-only drift: `values`/`call-with-values` (run_tests.ml:1131/1140),
  `contains-char?` (rt.ml:728 + rt.js:85), `trim-right` (**JS runner only**
  — absent even from the OCaml runner), `sha3-256` (rt.ml:745 + rt.js:88).
  Consequence verified live: `(canonical-serialize 42)` on the server →
  `Undefined symbol: contains-char?` (content addressing is broken for ANY
  numeric value outside the runners). **Worse than the finding**: BOTH runners'
  `sha3-256` are FAKE stubs (OCaml uses `Hashtbl.hash`!) while production
  has real `crypto-sha3-256` — every CID computed in tests differs from
  production CIDs. Authored `scripts/test-env-parity.sh` as a bidirectional
  ledger: MUST_HAVE regressions fail; a KNOWN_DRIFT binding *appearing*
  also fails (forces ledger + consequence-pin update when W5/W7/W12 land
  fixes). 7/7 green. Test-only.
- 2026-07-04 — **S4 error-page-cache pin (item A.7) — section A COMPLETE**.
  Extended `scripts/test-protocol-gate.sh` with an HTTP-mode case: fresh
  `sx_server.exe --http <random-port>` (timeout-bounded, own PID killed at
  end), GET the same nonexistent path twice, assert BOTH requests re-render
  (2 `[sx-http]` lines — pre-fix the 2nd was cache-served at 0.0005s) and
  the `[cache] … error page, not cached` is_err gate line appears. Findings
  from prototyping: standalone worktree renders ALL docs pages as soft error
  pages (no content), so a positive "real page IS cached" control is not
  assertable here — documented in the script; startup takes ~12-15s (poll
  loop, 40s budget). 5/5 protocol-gate green + 267/0 sx pins. Test-only.
- 2026-07-04 — **C1/C1b command-channel pins (item A.6)**. These are
  protocol-level, not .sx-suite pins: authored
  `scripts/test-protocol-gate.sh` — each case spawns its OWN timeout-bounded
  `sx_server.exe` (no shared process touched) and asserts three things: an
  `(error N "Malformed command line: ...")` response is emitted, the
  follow-up epoch still evaluates (process survived), and the process exits
  cleanly with no `Fatal error` escaping. Cases: C1 unterminated list (exact review repro),
  C1 plain-garbage line, C1b non-ASCII byte (`café`), plus a well-formed
  control session. 4/4 green. The script is deliberately structured to grow
  into section E's fuzz suite (C3–C7). Test-only.
- 2026-07-04 — **crit-2 non-vacuous pin (item A.5)**. The original bug's
  signature — handler value becomes the WHOLE program result, discarding
  every outer frame *including the covering test's own assert* — means a
  plain `(assert= repro expected)` pin would pass vacuously on regression.
  Added suite `gate-crit2-signal-return-kont` with a **side-effect sentinel**:
  test 1 runs both repros (`("outer" 43 "end")` list shape + `raise-continuable`
  → 143) then `set!`s a top-level flag; test 2 independently asserts the flag
  — if the continuation is ever dropped again, test 1 "passes" but test 2
  fails loudly. Third test pins the exact shipped-test expr (51). Verified
  both repro shapes live via sx_eval first. 267 passed / 0 failed. Test-only.
- 2026-07-03 — **K49 void-elements pin (item A.4) + regen-drift DISCOVERY**.
  Corrected the checklist label first: K49 is "five void elements
  unrenderable" (core.md:335), not the depth guard (that's K16, OPEN). Added
  suite `gate-K49-void-elements-renderable` (3 tests): spec `HTML_TAGS`
  contains all five; `(render-to-html '(base :href "x") (make-env))` →
  `<base href="x" />`; all five render self-closing. Runner-env gotchas:
  `current-env`/`symbol` are not bound in run_tests — use `(make-env)` and
  literal quoted forms. **Discovery:** the first draft pinned via the
  runner's native `render-html` and FAILED — `hosts/ocaml/lib/sx_render.ml`
  (generated) was never regenerated after dc7aa709's spec fix, so the native
  render path still errors on the five tags. Recorded under Blocked; live
  evidence for F13 (regen-diff gate). 264 passed / 0 failed. Test-only.
- 2026-07-03 — **K09/K11/K39 W5 special-form pins (item A.3)**. Three suites
  added to `spec/tests/test-gate-pins.sx`: `gate-K09-longhand-unquote-splicing`
  (R7RS longhand `(unquote-splicing X)` now splices, incl. empty-list case;
  shorthand still works), `gate-K11-guard-reraise-forgeable` (a body/clause
  value shaped like `(list '__guard-reraise__ X)` is returned as data, not
  misread as a re-raise — sentinel is now gensym'd), `gate-K39-do-iife-head`
  (`(do ((fn (x) x) 5) 99)` → 99, not a misparsed do-loop — exact core.md
  repro). Gotchas hit and fixed: quasiquoted bare idents are *symbols* not
  strings, and `assert=` compares with `=` (not `equal?`, which returns false
  on these spliced lists). 261 passed / 0 failed under OCaml run_tests. Test-only.
- 2026-07-03 — **K20 contains?-dict pin (item A.2)**. Mapped K-codes by
  core.md severity order (K17 append!, K18 expt, K19 harness-drift, K20
  contains?-dict). Added suite `gate-K20-contains-dict` to
  `spec/tests/test-gate-pins.sx` (4 tests): present dict key → true, missing
  key → false, list membership unchanged, string substring unchanged. Repro
  from core.md ("(contains? {:a 1} :a) threw `contains?: 2 args`"). 8/8 green
  across both suites under OCaml run_tests. Test-only.
- 2026-07-03 — **K18 expt-overflow pin (item A.1)**. Bootstrapped this briefing
  from PLAN.md §W14 (the referenced file did not exist yet). Added
  `spec/tests/test-gate-pins.sx` with suite `gate-K18-expt-overflow` (4 tests):
  small exponents stay exact (`2^0=1`, `2^10=1024`), `2^62 > 0` (no negative
  63-bit wrap), `2^100 > 0` (no wrap-to-zero), `2^100` is a number (float
  promotion). Verified 4/4 green under the OCaml run_tests kernel. Test-only.

## Blocked
- **K49 native path — sx_render.ml regen drift** (found 2026-07-03 while
  pinning A.4): dc7aa709 fixed HTML_TAGS in `spec/render.sx` but never re-ran
  `hosts/ocaml/bootstrap_render.py`, so the generated
  `hosts/ocaml/lib/sx_render.ml` still carries a stale `html_tags_list`
  without area/base/embed/param/track. The runner's native `render-html`
  convenience (and any native fast-path render) therefore STILL throws
  `Undefined symbol: base` — dc7aa709's "verified on the native binary" claim
  did not cover this path. Fix = regen (hosts lane, semantics-adjacent — out
  of scope for this test-only loop). This is a live instance of **F13**
  (regen-diff CI gate, section-B/D territory): a regen-diff check would have
  caught it at commit time. The K49 pin covers the spec side only; when the
  regen lands, extend the suite with `render-html`-path assertions.
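
A regen-diff check of the kind F13 calls for reduces to comparing the committed generated file with a fresh regeneration. The sketch below shows only the comparison step, using the standard `difflib`; how the fresh output is produced is not modelled, and `regen_drift` plus the sample strings are hypothetical.

```python
import difflib

def regen_drift(committed, regenerated, name="generated file"):
    """Return unified-diff lines; an empty list means no drift."""
    return list(difflib.unified_diff(
        committed.splitlines(),
        regenerated.splitlines(),
        fromfile=f"{name} (committed)",
        tofile=f"{name} (regenerated)",
        lineterm="",
    ))

# Made-up stand-ins for a stale committed file and its fresh regeneration.
stale = 'tags = ["br"; "hr"]'
fresh = 'tags = ["br"; "hr"; "base"]'

print(regen_drift(fresh, fresh))          # []
for line in regen_drift(stale, fresh):
    print(line)
```

A gate built on this would fail whenever the returned list is non-empty, which surfaces a skipped regeneration at commit time.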