rose-ash/plans/agent-briefings/sx-gate-loop.md
giles 5e0abced32 W14: F8 cross-host differential battery (test-only) — CHECKLIST COMPLETE
Committed replacement for the review's ephemeral 130-probe corpus:
spec/tests/differential-probes.txt (49 probes: F-1 int/float display, K18
overflow, F-3 apply + dict order, S-4 float printing, strings,
collections, special forms, error normalization) evaluated on the native
server (epoch protocol printer) and the SHIPPED WASM kernel
(eval_wasm_probes.js via guest sx-serialize), diffed by
scripts/test-differential.sh with a KNOWN_DIVERGENT heal-detecting ledger.

Result: 46/49 agree. All 3 divergences share one root cause, verified
live: bare sx_server's `apply` does not spread its argument list —
(apply + (list 1 2 3)) errors "Expected number, got list", (apply str l)
returns the serialized list; the WASM kernel spreads correctly and the
test runner masks the bug with its own apply binding (F-7 class).

Finding refinement: F-1's float-display divergence (0.3 vs
0.30000000000000004) is a K.eval JS-boundary artifact — guest-serialized
output agrees across hosts; the battery therefore compares guest
serialization.

This completes the W14 checklist: 7 pin suites, 6 gate scripts/runners,
2 harness capabilities, C9 label cleanup, adapter-dom render coverage.

Test-only: no semantics edits, no push.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 05:03:03 +00:00


W14 — Test gate & conformance infrastructure loop

Forge agent ws-W14. Role: build out W14 from the SX review remediation plan (plans/sx-review/PLAN.md, §"W14. Test gate & conformance infrastructure") — the enabler that makes every other fix verifiable. One checklist item per fire.

You are on branch loops/sx-ws-w14, worktree /root/rose-ash-loops/sx-ws-w14.

Hard guardrails (read every fire)

  • TEST-ONLY. No semantics edits. Do NOT touch spec/evaluator.sx, spec/primitives.sx, spec/parser.sx, spec/render.sx, the OCaml kernel, or any host runtime. W14 pins behavior with tests and productionizes the test/runner surface; the actual fixes are other workstreams (W1–W12). A pin that fails means the finding regressed — do NOT relax the assertion; record it as a blocker.
  • NO PUSH. Commit locally on loops/sx-ws-w14 only. Never push; never touch main or architecture.
  • .sx files: use sx-tree MCP tools only (a hook blocks Read/Write/Edit on .sx). sx_write_file takes params file and source (NOT content — a wrong key yields a yojson … got null error and no write). .md/.sh/.ml files: normal tools are fine.
  • Never pkill/kill sx_server — sibling loops share the binary. Bound every run with timeout (e.g. timeout 300 …); if it hangs, let the timeout end it.
  • One item per fire, then stop. No batching.
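
The timeout guardrail can be mirrored in a driver script. A minimal sketch in Python, assuming a hypothetical helper named `run_bounded` (the loop's own gate scripts are shell and use `timeout`). It relies on `subprocess.run(..., timeout=...)`, which terminates only the child it started, so no sibling process is signalled.

```python
import subprocess
import sys

def run_bounded(cmd, seconds):
    """Run cmd with a hard time budget; return (status, stdout).

    status is the exit code, or the string "timeout" if the budget ran out.
    Only the child started here is terminated; nothing else is signalled.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps its own child before re-raising.
        return "timeout", ""
    return done.returncode, done.stdout

# Stand-in commands (hypothetical): one that finishes, one that would hang.
fast = run_bounded([sys.executable, "-c", "print('ok')"], 30)
hung = run_bounded([sys.executable, "-c", "import time; time.sleep(60)"], 1)
```

A hung command comes back as `("timeout", "")` instead of blocking the fire, which is the same contract as letting `timeout 300 …` end the run.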

Per-iteration procedure

  1. Pick the first unchecked [ ] in the checklist.
  2. Implement (test file or runner/harness change), lifting minimal repros from the review lane files (plans/sx-review/{core,hosts,conformance}.md) — they are a ready-made corpus of confirmed repros.
  3. Build + run the affected tests: sx_build (target ocaml) then timeout 300 ./hosts/ocaml/_build/default/bin/run_tests.exe <test-name> to run a single file. New spec/tests/test-*.sx files are auto-discovered.
  4. Confirm green (a pin must PASS on current HEAD — the fix already landed).
  5. Commit locally: git add -A && git commit with a W14: prefix.
  6. Tick the box, prepend one dated line to the Progress log, stop.

Checklist

A. Test-debt pins — dc7aa709's landed fixes shipped without regression tests

Pin each confirmed-and-fixed finding with a minimal repro. Add suites to spec/tests/test-gate-pins.sx (one defsuite per finding).

  • K18 [W7] — expt overflow now float-promotes (no 63-bit wrap)
  • K20 [W7] — contains? now supports dict key membership
  • K09/K11/K39 [W5] — longhand unquote-splicing, guard sentinel gensym, do IIFE-head
  • K49 [W8] — five void elements (area/base/embed/param/track) renderable (spec side; native regen drift → see Blocked). NB: the depth/cycle guard is K16 [W8], still OPEN — not a W14 pin target until its fix lands
  • crit-2 [W1] — signal-return kont pinned NON-VACUOUSLY (side-effect sentinel across two tests; a plain assert would inherit the vacuity)
  • C1/C1b [W3] — command-channel crash guards pinned (scripts/test-protocol-gate.sh, seed for section E's fuzz suite)
  • S4 [hosts] — soft error pages not cached (HTTP-mode pin in scripts/test-protocol-gate.sh; NB S4 lives in hosts.md, not conformance — "housekeeping" was a mislabel from F-15's tag)
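
Why crit-2 needs a side-effect sentinel can be shown with a toy model. This is a sketch in Python, not the SX suite: `Escape`, `signal_return`, and `run_test` are hypothetical stand-ins for a runtime in which the handler value abandons every outer frame, including the test's own assert.

```python
class Escape(Exception):
    """Models the bug: the handler value abandons every outer frame."""

def signal_return(value, buggy):
    # Correct: the value flows back into the caller's continuation.
    # Buggy: control jumps straight out of the whole program.
    if buggy:
        raise Escape(value)
    return value

def run_test(body):
    """Toy runner: a test fails only if an assertion actually fires."""
    try:
        body()
    except Escape:
        return "pass"  # outer frames, and their asserts, were skipped
    except AssertionError:
        return "fail"
    return "pass"

def run_suite(buggy):
    flag = {"reached": False}

    def plain_pin():
        assert signal_return(42, buggy) + 1 == 43

    def sentinel_step():
        assert signal_return(42, buggy) + 1 == 43
        flag["reached"] = True  # only set if the continuation survived

    def sentinel_check():
        assert flag["reached"]

    return [run_test(plain_pin), run_test(sentinel_step), run_test(sentinel_check)]
```

With the bug switched on, the plain pin still reports a pass because its assert never runs; only the separate sentinel check goes red.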

B. Runner/production env unification

  • Audit runner-only bindings — inventory + bidirectional ledger in scripts/test-env-parity.sh (KNOWN_DRIFT: values, call-with-values, contains-char?, trim-right, sha3-256; consequence pin: canonical-serialize broken on server; BOTH runners' sha3-256 are FAKE stubs → test CIDs ≠ production CIDs)
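
The bidirectional ledger idea can be sketched as set arithmetic. This is a Python illustration under assumptions: `check_env_parity` is a hypothetical name, and the real check lives in scripts/test-env-parity.sh. The KNOWN_DRIFT names are the ones listed above; the MUST_HAVE names in the example are placeholders.

```python
def check_env_parity(bound, must_have, known_drift):
    """Return a list of ledger violations; an empty list means green."""
    problems = []
    # Direction 1: a binding the server must provide has gone missing.
    for name in sorted(must_have - bound):
        problems.append(f"regression: {name} is no longer bound")
    # Direction 2: a ledgered gap has closed, so the ledger is stale.
    for name in sorted(known_drift & bound):
        problems.append(f"healed: {name} is now bound, update the ledger")
    return problems

KNOWN_DRIFT = {"values", "call-with-values", "contains-char?", "trim-right", "sha3-256"}
MUST_HAVE = {"map", "str"}  # placeholder names, for illustration only
```

Failing in both directions is what forces the ledger and its consequence pin to be updated when a fix lands, rather than silently drifting out of date.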

C. Harness honesty

  • K19 — harness/runtime parity pinned (scripts/test-harness-parity.sh: drives mcp_tree sx_eval over JSON-RPC vs fresh sx_server over epoch, 12-probe battery from the finding, errors compared by message)
  • C22/K104 — FIXED harness (spec/harness.sx make-interceptor: log entry appended before the mock runs, :result updated via dict-set!) + 3 pins
  • C21 — BUILT harness-run-perform (spec/harness.sx): drives real CEK suspend/resume, services performs from session mocks, C22-style logging; 5 pins incl. the S10 map-over-perform probe (CEK keeps all elements — the drop class is serving-JIT-side). Runner-only (needs cek-* driver bindings)
  • C23 — adapter-dom render-output tests (web/tests/test-adapter-dom-render.sx, 8 tests vs runner mock DOM; follow-up depth still open: boolean attrs, on-*/bind/ref/key, reactive attrs, hydration cursor)
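
The C22 log-first ordering is easiest to see in miniature. A sketch in Python, assuming a hypothetical `make_interceptor(op, mock, log)` signature; the real make-interceptor is SX code in spec/harness.sx and may differ in shape.

```python
def make_interceptor(op, mock, log):
    """Wrap mock so every call is logged even if the mock raises."""
    def intercepted(*args):
        entry = {"op": op, "args": list(args), "result": None}
        log.append(entry)               # recorded BEFORE the mock runs
        entry["result"] = mock(*args)   # updated in place on normal return
        return entry["result"]
    return intercepted

def throwing_mock(*_args):
    raise RuntimeError("mock failed")

log = []
fetch = make_interceptor("fetch", lambda url: url.upper(), log)
boom = make_interceptor("boom", throwing_mock, log)

fetch("a")
try:
    boom("b")
except RuntimeError:
    pass  # the entry for this call is already in the log
fetch("c")
```

After the mixed sequence the log holds three entries, and the throwing call's entry keeps its pending `None` result.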

D. WASM corpus runner

  • F2 — BUILT hosts/ocaml/browser/run_wasm_corpus.js (one file per node process, shipped-kernel boot per test_wasm_native.js) + scripts/test-wasm-corpus.sh sweep driver with SKIP/KNOWN_FAIL ledger. Baseline: 83 files, 80 fully green, 5192 passes, 0 test failures; 3 partial load-errors (hash-table/r7rs/sets, opaque jsoo exception mid-file). Full sweep ~13 min — wiring into sx-build-all.sh left as maintainer call (gate definition D3)

E. Epoch-loop protocol fuzz + skip-list

  • C3/C4/C5/C6/C7 — protocol-quirk ledger (pins current behavior, bidirectional) + seeded 60-line fuzz-liveness property in scripts/test-protocol-gate.sh (11/11)
  • F10 — expected-failures BASELINE GATE instead of a skip-list (scripts/test-suite-baseline.sh + spec/tests/known-failures.txt, 273 pinned: 271 hs-* + 2 empty-suite-label entries → C9 evidence). New failure OR vanished failure = red; hs loops' scoreboards untouched
  • C9 — empty suite labels ELIMINATED: 6 files had suite-less top-level deftests (chars 43, import-bind 14, ports 12, let-match 8, math nested-deftests, 4 hs strays) — wrapped/restructured into defsuites; baseline identities updated in the same commit
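
The F10 baseline gate reduces to a two-way set difference over failure identities. A sketch in Python under assumptions: failures arrive as `(suite, name, message)` tuples, and `baseline_gate` is a hypothetical name (the real gate is scripts/test-suite-baseline.sh reading spec/tests/known-failures.txt). The suite and test names in the example are invented.

```python
def identity(suite, name):
    """Stable identity for a failure; the error message is left out on purpose."""
    return f"{suite} > {name}"

def baseline_gate(failures, known):
    """failures: iterable of (suite, name, message); known: set of identities."""
    current = {identity(suite, name) for suite, name, _message in failures}
    new = sorted(current - known)        # regression: red
    vanished = sorted(known - current)   # fix landed: red until the baseline is trimmed
    return {"green": not new and not vanished, "new": new, "vanished": vanished}

# Invented example data.
known = {"hs-demo > toggles class", "hs-demo > waits for event"}
same = [("hs-demo", "toggles class", "msg v1"), ("hs-demo", "waits for event", "msg v2")]
regressed = same + [("math", "sin", "boom")]
fixed = same[:1]
```

Because the message is not part of the identity, churn in error text does not disturb the gate; only a change in which tests fail does.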

F. Differential battery

  • F8 — cross-host differential battery: spec/tests/differential-probes.txt (49 probes) × native server vs shipped WASM kernel via scripts/test-differential.sh + eval_wasm_probes.js. 46 agree, 3 ledgered KNOWN_DIVERGENT (F-3: bare-server apply does not spread — runner masks it, F-7 class). Refinement: the F-1 float-display divergence is a K.eval JS-boundary artifact — guest sx-serialize output AGREES across hosts
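
The root cause behind the three ledgered divergences, an `apply` that passes the list as a single argument instead of spreading it, has a direct Python analogy. This is an illustration only, not the server's code; `add`, `concat`, `apply_spread`, and `apply_no_spread` are hypothetical names.

```python
def add(*nums):
    """Variadic sum that rejects non-numbers, like a strict + primitive."""
    for n in nums:
        if not isinstance(n, (int, float)):
            raise TypeError(f"Expected number, got {type(n).__name__}")
    return sum(nums)

def concat(*parts):
    """Variadic string builder, standing in for a str-like primitive."""
    return "".join(str(p) for p in parts)

def apply_spread(f, args):
    return f(*args)   # correct: f(1, 2, 3)

def apply_no_spread(f, args):
    return f(args)    # bug class: f([1, 2, 3])

def error_of(thunk):
    """Return the error message a call raises, or None if it succeeds."""
    try:
        thunk()
    except TypeError as exc:
        return str(exc)
    return None
```

The non-spreading variant reproduces both symptoms from the finding: the numeric call fails with a type error on the list, and the string call returns the serialized list instead of the joined elements.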

CHECKLIST COMPLETE 2026-07-04 — all W14 items delivered. Open handoffs: sx_render.ml regen drift (Blocked, hosts lane), adapter-dom depth tests, 3 WASM load-error bisects (hash-table/r7rs/sets), CI wiring of the four gate scripts (D3 maintainer decision).

Progress log (newest first)

  • 2026-07-04 — F8 differential battery — CHECKLIST COMPLETE. Committed replacement for the review's ephemeral 130-probe corpus: spec/tests/differential-probes.txt (49 probes across F-1 int/float display, K18 overflow, F-3 apply + dict order, S-4 float printing, strings/collections/special forms/error cases) evaluated on the native server (epoch protocol) and the shipped WASM kernel (eval_wasm_probes.js, guest sx-serialize), diffed by scripts/test-differential.sh with a KNOWN_DIVERGENT ledger (heal → red → delete entry). Result: 46/49 agree; 3 divergences, all one root cause — bare sx_server's apply does not spread its arg list ((apply + (list 1 2 3)) → "Expected number, got list"; WASM spreads correctly; the test runner masks it with its own apply — F-7 class). Finding refinement: F-1's float-display divergence (0.3 vs 0.3000…4) is purely a K.eval JS-boundary artifact — guest-serialized output agrees. W14 delivered: 7 pin suites (spec/tests/test-gate-pins.sx, 29 tests), 6 gate scripts/runners (protocol-gate 11, env-parity 7, harness-parity 12, wasm-corpus 83-file, suite-baseline 273-pin, differential 49-probe), 2 harness capabilities (C22 log-first, C21 perform-mode), C9 label cleanup, adapter-dom render coverage. Test-only throughout.
  • 2026-07-04 — C9 empty-suite labels (item E.3) — section E COMPLETE. The sweep found the defect much wider than the finding: SIX files carried suite-less top-level deftests (test-chars 43, test-import-bind 14, test-ports 12, test-let-match 8, test-math as deftest-nested-in-deftest, test-hyperscript-conformance 4 strays between suites). Fixes: file-level defsuite wraps (validated via sx_validate after mechanical wrap), test-math restructured deftest→defsuite (labels now "math > sin"), hs strays wrapped in section-comment-named suites (hs-compat- blockLiteral/cookies/some/where). The two baseline-visible identities renamed in known-failures.txt in the SAME commit. Full-gate validated GREEN (5798p/273f — 2 passes are the wrapper deftests that no longer self-report; fail set byte-identical). Test-only.
  • 2026-07-04 — F10 baseline gate (item E.2). Deliberately NOT a skip-list: skip-listing the hs red band in the runner would rewrite the hs loops' scoreboards mid-flight. Instead scripts/test-suite-baseline.sh diffs the full suite's FAIL set against checked-in spec/tests/known-failures.txt (273 entries: 271 hs-* + 2 with EMPTY suite labels — live C9 evidence, can-map-an-array "map with block" and string->number 2-arg, the "r7rs radix shadow"). Red on a NEW failure (regression) and red on a VANISHED failure (fix landed — delete from baseline, locking in the win). Identity = "suite > name" with error text stripped (messages churn). Current suite: 5800p/273f (up 38 passes from dc7aa709's 5762 — sections A–D added pins). Validated end-to-end: GREEN, exit 0, ~12 min runtime. Test-only.
  • 2026-07-04 — C3–C7 protocol fuzz suite (item E.1). All five findings are still OPEN server-side (sx_server.ml fixes are host-runtime work), so the suite pins CURRENT behavior as a bidirectional ledger — verified each live first: C3 stray io-response → extra Unknown-command reply (dead 13-vs-14-char guard); C4 malformed (epoch) → error reply + stale epoch tag (envelope changed since the finding: dc7aa709's guard now answers rather than ignores); C5 decreasing epoch accepted; C6 two commands one line → one error, neither runs; C7 vm-trace sans compiler → opaque "Not callable: nil". Plus a real fuzz property: 60 deterministically-seeded hostile lines (unbalanced parens, control chars, unicode, 2KB lines, stray io-responses, epoch mutations) then a well-formed command — server must still answer and exit cleanly. protocol-gate now 11/11. When a server fix lands, the matching ledger pin fails loudly → update to assert the corrected behavior. Test-only.
  • 2026-07-04 — F2 WASM corpus runner (section D COMPLETE). The review's headline conformance gap: no runner ever fed spec/tests through the SHIPPED browser artifact (F-1/F-3 divergences existed undetected). Built run_wasm_corpus.js (boots sx_browser.bc.wasm.js headless in Node with the test_wasm_native.js stub block, loads the 23 web-stack modules, registers framework hooks, runs ONE file per process → parseable CORPUS-RESULT line; process isolation means a hung file can't kill the sweep) + scripts/test-wasm-corpus.sh (sweep driver, SKIP/KNOWN_FAIL ledger with green-flip detection). Empirical baseline: 83 files, 80 fully green, 5192 passes, ZERO test failures on the shipped kernel — including test-gate-pins (29/29) and test-letrec-resume (the kernel provides cek-* driver bindings, broader than bare sx_server). 3 partial load-errors (test-hash-table 22p, test-r7rs 87p, test-sets 30p — opaque jsoo exception mid-file, diagnosing which form = follow-up). Full sweep ~13 min; CI wiring deferred to the D3 gate-definition decision. Test-only.
  • 2026-07-04 — C23 adapter-dom render-output tests (item C.4) — section C COMPLETE. Key discovery: the "browser-only" exclusion of adapter-dom testing is FALSE for render output — (import (web adapter-dom)) disk-resolves in the OCaml runner and render-to-dom works against its mock DOM (dom-* → host-* → mock elements). New web/tests/test-adapter-dom-render.sx (8 tests): tag/text-child-node, class+id, ordered children, void element, when-false empty FRAGMENT, when-true branch-in-fragment, map N-children-in-fragment, if inlines branch. Probed the adapter's output contract first (text = nodeType-3 child; control flow = FRAGMENT wrapper; if inlines). Auto-included in default runs (not on the exclusion list) — first render-output coverage of the 1512-line adapter in the standard gate. Follow-up depth (boolean attrs, on-*/bind/ref/key, reactive, hydration) noted on the checklist. 254/0 standalone. Test-only.
  • 2026-07-04 — C21 perform-mode harness (item C.3). Added harness-run-perform to spec/harness.sx (exported): drives make-cek-state/cek-step-loop, services each (perform {:op X :args L}) suspension from the session's platform mocks (entry logged before invocation, C22-consistent), cek-resumes with the mock value, loops to terminal. Self-recursion via the (self self …) pattern (avoids letrec-injection K06 territory). Extracted the arity dispatch into shared harness-invoke-mock. 5 pins in gate-C21-perform-mode-harness — notably the S10 probe: (map (fn (u) (perform …)) '("a" "b" "c")) keeps ALL elements through 3 suspensions on the CEK path, confirming the element-drop class is serving-JIT-side, not CEK. Caveat noted in the docstring: needs the runner's cek-* driver bindings (absent on bare sx_server/MCP — the env-parity theme again). 290/0. Test-infra-only.
  • 2026-07-04 — C22/K104 throwing-mock fix + pins (item C.2). First actual FIX of the loop — in scope because spec/harness.sx is W14-owned test infrastructure (PLAN approach item 4 assigns "log IO before invoking the mock" to W14). TDD: reproduced pre-fix (caught error, 0 log entries), then restructured make-interceptor to append the entry BEFORE the mock runs (:result nil while pending, dict-set! in place on return). Verified: throwing mock leaves entry, happy path updates result, mixed sequence counts all 3. Added suite gate-C22-throwing-mock-logged (3 tests). Harness self-suite (15) + test-relate-picker (only other harness consumer) green; 285/0 pins run. Tooling notes: replace/insert tools take new_source (not replacement); find_all paths still disagree with read_subtree/replace_node on define-library files — sx_write_file remains the reliable route. Test-infra-only.
  • 2026-07-04 — K19 harness-parity pin (item C.1). Authored scripts/test-harness-parity.sh: drives mcp_tree.exe sx_eval with raw JSON-RPC over stdio and a fresh sx_server.exe over the epoch protocol, running the finding's exact 12-probe battery (empty?/get/split/equal?/contains?/keyword-name/char-code/parse-number) through both and failing on ANY divergence. Errors normalized to their inner message so identical failures compare equal (keyword-name :kw errors the same way on both — keywords evaluate to strings before the call). Result: 12/12 parity — dc7aa709's 8-entry stopgap alignment holds; this pin keeps it honest until the real fix (mcp_tree links sx_primitives) lands in the hosts lane. Test-only.
  • 2026-07-04 — Section B: env-parity audit + ledger. Probed a fresh sx_server over the epoch protocol (deps-check + live eval). Confirmed runner-only drift: values/call-with-values (run_tests.ml:1131/1140), contains-char? (rt.ml:728 + rt.js:85), trim-right (JS runner only — absent even from the OCaml runner), sha3-256 (rt.ml:745 + rt.js:88). Consequence verified live: (canonical-serialize 42) on the server → Undefined symbol: contains-char? (content addressing broken for ANY number outside the runners). Worse than the finding: BOTH runners' sha3-256 are FAKE stubs (OCaml uses Hashtbl.hash!) while production has real crypto-sha3-256 — every CID computed in tests differs from production CIDs. Authored scripts/test-env-parity.sh as a bidirectional ledger: MUST_HAVE regressions fail; a KNOWN_DRIFT binding appearing also fails (forces ledger + consequence-pin update when W5/W7/W12 land fixes). 7/7 green. Test-only.
  • 2026-07-04 — S4 error-page-cache pin (item A.7) — section A COMPLETE. Extended scripts/test-protocol-gate.sh with an HTTP-mode case: fresh sx_server.exe --http <random-port> (timeout-bounded, own PID killed at end), GET the same nonexistent path twice, assert BOTH requests re-render (2 [sx-http] lines — pre-fix the 2nd was cache-served at 0.0005s) and the [cache] … error page, not cached is_err gate line appears. Findings from prototyping: standalone worktree renders ALL docs pages as soft error pages (no content), so a positive "real page IS cached" control is not assertable here — documented in the script; startup takes ~12-15s (poll loop, 40s budget). 5/5 protocol-gate green + 267/0 sx pins. Test-only.
  • 2026-07-04 — C1/C1b command-channel pins (item A.6). These are protocol-level, not .sx-suite pins: authored scripts/test-protocol-gate.sh — each case spawns its OWN timeout-bounded sx_server.exe (no shared process touched) and asserts three things: an (error N "Malformed command line: ...") response is emitted, the follow-up epoch still evaluates (process survived), and no Fatal error escapes / exit is clean. Cases: C1 unterminated list (exact review repro), C1 plain-garbage line, C1b non-ASCII byte (café), plus a well-formed control session. 4/4 green. The script is deliberately structured to grow into section E's fuzz suite (C3–C7). Test-only.
  • 2026-07-04 — crit-2 non-vacuous pin (item A.5). The original bug's signature — handler value becomes the WHOLE program result, discarding every outer frame including the covering test's own assert — means a plain (assert= repro expected) pin would pass vacuously on regression. Added suite gate-crit2-signal-return-kont with a side-effect sentinel: test 1 runs both repros (("outer" 43 "end") list shape + raise-continuable → 143) then set!s a top-level flag; test 2 independently asserts the flag — if the continuation is ever dropped again, test 1 "passes" but test 2 fails loudly. Third test pins the exact shipped-test expr (51). Verified both repro shapes live via sx_eval first. 267 passed / 0 failed. Test-only.
  • 2026-07-03 — K49 void-elements pin (item A.4) + regen-drift DISCOVERY. Corrected the checklist label first: K49 is "five void elements unrenderable" (core.md:335), not the depth guard (that's K16, OPEN). Added suite gate-K49-void-elements-renderable (3 tests): spec HTML_TAGS contains all five; (render-to-html '(base :href "x") (make-env)) → <base href="x" />; all five render self-closing. Runner-env gotchas: current-env/symbol are not bound in run_tests — use (make-env) and literal quoted forms. Discovery: the first draft pinned via the runner's native render-html and FAILED — hosts/ocaml/lib/sx_render.ml (generated) was never regenerated after dc7aa709's spec fix, so the native render path still errors on the five tags. Recorded under Blocked; live evidence for F13 (regen-diff gate). 264 passed / 0 failed. Test-only.
  • 2026-07-03 — K09/K11/K39 W5 special-form pins (item A.3). Three suites added to spec/tests/test-gate-pins.sx: gate-K09-longhand-unquote-splicing (R7RS longhand (unquote-splicing X) now splices, incl. empty-list case; shorthand still works), gate-K11-guard-reraise-forgeable (a body/clause value shaped like (list '__guard-reraise__ X) is returned as data, not misread as a re-raise — sentinel is now gensym'd), gate-K39-do-iife-head ((do ((fn (x) x) 5) 99) → 99, not a misparsed do-loop — exact core.md repro). Gotchas hit and fixed: quasiquoted bare idents are symbols not strings, and assert= compares with = (not equal?, which returns false on these spliced lists). 261 passed / 0 failed under OCaml run_tests. Test-only.
  • 2026-07-03 — K20 contains?-dict pin (item A.2). Mapped K-codes by core.md severity order (K17 append!, K18 expt, K19 harness-drift, K20 contains?-dict). Added suite gate-K20-contains-dict to spec/tests/test-gate-pins.sx (4 tests): present dict key → true, missing key → false, list membership unchanged, string substring unchanged. Repro from core.md ("(contains? {:a 1} :a) threw contains?: 2 args"). 8/8 green across both suites under OCaml run_tests. Test-only.
  • 2026-07-03 — K18 expt-overflow pin (item A.1). Bootstrapped this briefing from PLAN.md §W14 (the referenced file did not exist yet). Added spec/tests/test-gate-pins.sx with suite gate-K18-expt-overflow (4 tests): small exponents stay exact (2^0=1, 2^10=1024), 2^62 > 0 (no negative 63-bit wrap), 2^100 > 0 (no wrap-to-zero), 2^100 is a number (float promotion). Verified 4/4 green under the OCaml run_tests kernel. Test-only.
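
The seeded fuzz corpus described in the item E.1 entry above can be sketched as follows. This is a Python illustration under assumptions: `hostile_lines` is a hypothetical name, and the payload shapes (the `eval`, `io-response`, and `epoch` forms) are invented stand-ins for the six categories, not the real protocol grammar. The point is the fixed seed, which makes every run feed the server the same 60 lines.

```python
import random

# Control characters minus LF/CR, so each fuzz case stays on one line.
CONTROL = [chr(c) for c in range(1, 32) if c not in (10, 13)]

def hostile_lines(seed, count=60):
    """Build a reproducible list of hostile single-line inputs."""
    rng = random.Random(seed)  # same seed, same corpus, every run
    makers = [
        lambda: "(" * rng.randint(1, 8) + "eval 1",                # unbalanced parens
        lambda: "".join(rng.choice(CONTROL) for _ in range(6)),    # control chars
        lambda: '(eval "café ☃")',                                 # unicode
        lambda: "x" * 2048,                                        # 2KB line
        lambda: "(io-response " + str(rng.randint(0, 99)) + ")",   # stray io-response
        lambda: "(epoch " + str(rng.randint(-3, 3)) + ")",         # epoch mutation
    ]
    return [rng.choice(makers)() for _ in range(count)]

corpus = hostile_lines(14)
```

The liveness property is then checked by sending the corpus followed by one well-formed command and asserting that the server still answers and exits cleanly.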

Blocked

  • K49 native path — sx_render.ml regen drift (found 2026-07-03 while pinning A.4): dc7aa709 fixed HTML_TAGS in spec/render.sx but never re-ran hosts/ocaml/bootstrap_render.py, so the generated hosts/ocaml/lib/sx_render.ml still carries a stale html_tags_list without area/base/embed/param/track. The runner's native render-html convenience (and any native fast-path render) therefore STILL throws Undefined symbol: base — dc7aa709's "verified on the native binary" claim did not cover this path. Fix = regen (hosts lane, semantics-adjacent — out of scope for this test-only loop). This is a live instance of F13 (regen-diff CI gate, section-B/D territory): a regen-diff check would have caught it at commit time. The K49 pin covers the spec side only; when the regen lands, extend the suite with render-html-path assertions.