Files
rose-ash/plans/agent-briefings/sx-gate-loop.md
giles 01e5f876bc W14: pin K19 MCP-harness/runtime primitive parity (test-only)
mcp_tree.ml's parallel primitive table drifted from sx_primitives.ml —
the spec-mandated harness verification path silently produced false
findings ((get {:a 1} :a 99) -> nil vs 1, char-class vs substring split,
etc.). dc7aa709 aligned 8 entries as a stopgap; the real fix (linking
sx_primitives) is hosts-lane.

Add scripts/test-harness-parity.sh: drives mcp_tree.exe sx_eval via raw
JSON-RPC and a fresh sx_server.exe via the epoch protocol, runs the
finding's 12-probe battery through both, fails on any divergence (errors
compared by inner message). 12/12 parity today — the stopgap holds and
can no longer rot silently.

Test-only: no semantics edits, no push.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 01:41:07 +00:00

193 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# W14 — Test gate & conformance infrastructure loop
Forge agent **ws-W14**. Role: build out **W14** from the SX review remediation plan
(`plans/sx-review/PLAN.md`, §"W14. Test gate & conformance infrastructure") —
*the enabler that makes every other fix verifiable*. One checklist item per fire.
You are on branch `loops/sx-ws-w14`, worktree `/root/rose-ash-loops/sx-ws-w14`.
## Hard guardrails (read every fire)
- **TEST-ONLY.** No semantics edits. Do NOT touch `spec/evaluator.sx`,
`spec/primitives.sx`, `spec/parser.sx`, `spec/render.sx`, the OCaml kernel,
or any host runtime. W14 pins behavior with tests and productionizes the
*test/runner* surface; the actual fixes are other workstreams (W1W12).
A pin that *fails* means the finding regressed — do NOT relax the assertion,
record it as a blocker.
- **NO PUSH.** Commit locally on `loops/sx-ws-w14` only. Never push; never touch
`main` or `architecture`.
- **`.sx` files: use `sx-tree` MCP tools only** (a hook blocks Read/Write/Edit
on `.sx`). `sx_write_file` takes params **`file`** and **`source`** (NOT
`content` — a wrong key yields a `yojson … got null` error and no write).
`.md`/`.sh`/`.ml` files: normal tools are fine.
- **Never `pkill`/`kill` `sx_server`** — sibling loops share the binary. Bound
every run with `timeout` (e.g. `timeout 300 …`); if it hangs, let the timeout end it.
- **One item per fire, then stop.** No batching.
## Per-iteration procedure
1. Pick the first unchecked `[ ]` in the checklist.
2. Implement (test file or runner/harness change), lifting minimal repros from
the review lane files (`plans/sx-review/{core,hosts,conformance}.md`) — they
are a ready-made corpus of confirmed reprs.
3. Build + run the affected tests:
`sx_build` (target ocaml) then
`timeout 300 ./hosts/ocaml/_build/default/bin/run_tests.exe <test-name>`
to run a single file. New `spec/tests/test-*.sx` files are auto-discovered.
4. Confirm green (a pin must PASS on current HEAD — the fix already landed).
5. Commit locally: `git add -A && git commit` with a `W14:` prefix.
6. Tick the box, prepend one dated line to the Progress log, stop.
## Checklist
### A. Test-debt pins — dc7aa709's landed fixes shipped without regression tests
Pin each confirmed-and-fixed finding with a minimal repro. Add suites to
`spec/tests/test-gate-pins.sx` (one `defsuite` per finding).
- [x] K18 [W7] — `expt` overflow now float-promotes (no 63-bit wrap)
- [x] K20 [W7] — `contains?` now supports dict key membership
- [x] K09/K11/K39 [W5] — longhand `unquote-splicing`, guard sentinel gensym, `do` IIFE-head
- [x] K49 [W8] — five void elements (area/base/embed/param/track) renderable
(spec side; native regen drift → see Blocked). NB: the depth/cycle guard
is K16 [W8], still OPEN — not a W14 pin target until its fix lands
- [x] crit-2 [W1] — signal-return kont pinned NON-VACUOUSLY (side-effect
sentinel across two tests; a plain assert would inherit the vacuity)
- [x] C1/C1b [W3] — command-channel crash guards pinned
(`scripts/test-protocol-gate.sh`, seed for section E's fuzz suite)
- [x] S4 [hosts] — soft error pages not cached (HTTP-mode pin in
`scripts/test-protocol-gate.sh`; NB S4 lives in hosts.md, not
conformance — "housekeeping" was a mislabel from F-15's tag)
### B. Runner/production env unification
- [x] Audit runner-only bindings — inventory + bidirectional ledger in
`scripts/test-env-parity.sh` (KNOWN_DRIFT: values, call-with-values,
contains-char?, trim-right, sha3-256; consequence pin:
canonical-serialize broken on server; BOTH runners' sha3-256 are FAKE
stubs → test CIDs ≠ production CIDs)
### C. Harness honesty
- [x] K19 — harness/runtime parity pinned (`scripts/test-harness-parity.sh`:
drives mcp_tree sx_eval over JSON-RPC vs fresh sx_server over epoch,
12-probe battery from the finding, errors compared by message)
- [ ] C22/K104 — harness logs IO *before* invoking the mock (throwing-mock pin)
- [ ] C21 — real perform/suspend mode in harness
- [ ] C23 — adapter-dom render-output tests
### D. WASM corpus runner
- [ ] F2 — promote conformance's `run_wasm.js` prototype into CI
### E. Epoch-loop protocol fuzz + skip-list
- [ ] C3/C4/C5/C6/C7 — epoch protocol fuzz suite
- [ ] F10 — hs-upstream skip-list so browser-only FAILs mean something
- [ ] C9 — empty suite label
### F. Differential battery
- [ ] F8 — cross-host differential battery (same source, all hosts agree)
## Progress log (newest first)
- 2026-07-04 — **K19 harness-parity pin (item C.1)**. Authored
`scripts/test-harness-parity.sh`: drives `mcp_tree.exe` `sx_eval` with
raw JSON-RPC over stdio and a fresh `sx_server.exe` over the epoch
protocol, running the finding's exact 12-probe battery (empty?/get/
split/equal?/contains?/keyword-name/char-code/parse-number) through both
and failing on ANY divergence. Errors normalized to their inner message
so identical failures compare equal (`keyword-name :kw` errors the same
way on both — keywords evaluate to strings before the call). Result:
12/12 parity — dc7aa709's 8-entry stopgap alignment holds; this pin keeps
it honest until the real fix (mcp_tree links sx_primitives) lands in the
hosts lane. Test-only.
- 2026-07-04 — **Section B: env-parity audit + ledger**. Probed a fresh
`sx_server` over the epoch protocol (`deps-check` + live eval). Confirmed
runner-only drift: `values`/`call-with-values` (run_tests.ml:1131/1140),
`contains-char?` (rt.ml:728 + rt.js:85), `trim-right` (**JS runner only**
— absent even from the OCaml runner), `sha3-256` (rt.ml:745 + rt.js:88).
Consequence verified live: `(canonical-serialize 42)` on the server →
`Undefined symbol: contains-char?` (content addressing broken for ANY
number outside the runners). **Worse than the finding**: BOTH runners'
`sha3-256` are FAKE stubs (OCaml uses `Hashtbl.hash`!) while production
has real `crypto-sha3-256` — every CID computed in tests differs from
production CIDs. Authored `scripts/test-env-parity.sh` as a bidirectional
ledger: MUST_HAVE regressions fail; a KNOWN_DRIFT binding *appearing*
also fails (forces ledger + consequence-pin update when W5/W7/W12 land
fixes). 7/7 green. Test-only.
- 2026-07-04 — **S4 error-page-cache pin (item A.7) — section A COMPLETE**.
Extended `scripts/test-protocol-gate.sh` with an HTTP-mode case: fresh
`sx_server.exe --http <random-port>` (timeout-bounded, own PID killed at
end), GET the same nonexistent path twice, assert BOTH requests re-render
(2 `[sx-http]` lines — pre-fix the 2nd was cache-served at 0.0005s) and
the `[cache] … error page, not cached` is_err gate line appears. Findings
from prototyping: standalone worktree renders ALL docs pages as soft error
pages (no content), so a positive "real page IS cached" control is not
assertable here — documented in the script; startup takes ~12-15s (poll
loop, 40s budget). 5/5 protocol-gate green + 267/0 sx pins. Test-only.
- 2026-07-04 — **C1/C1b command-channel pins (item A.6)**. These are
protocol-level, not .sx-suite pins: authored
`scripts/test-protocol-gate.sh` — each case spawns its OWN timeout-bounded
`sx_server.exe` (no shared process touched) and asserts three things: an
`(error N "Malformed command line: ...")` response is emitted, the
follow-up epoch still evaluates (process survived), and no `Fatal error`
escapes / exit is clean. Cases: C1 unterminated list (exact review repro),
C1 plain-garbage line, C1b non-ASCII byte (`café`), plus a well-formed
control session. 4/4 green. The script is deliberately structured to grow
into section E's fuzz suite (C3C7). Test-only.
- 2026-07-04 — **crit-2 non-vacuous pin (item A.5)**. The original bug's
signature — handler value becomes the WHOLE program result, discarding
every outer frame *including the covering test's own assert* — means a
plain `(assert= repro expected)` pin would pass vacuously on regression.
Added suite `gate-crit2-signal-return-kont` with a **side-effect sentinel**:
test 1 runs both repros (`("outer" 43 "end")` list shape + `raise-continuable`
→ 143) then `set!`s a top-level flag; test 2 independently asserts the flag
— if the continuation is ever dropped again, test 1 "passes" but test 2
fails loudly. Third test pins the exact shipped-test expr (51). Verified
both repro shapes live via sx_eval first. 267 passed / 0 failed. Test-only.
- 2026-07-03 — **K49 void-elements pin (item A.4) + regen-drift DISCOVERY**.
Corrected the checklist label first: K49 is "five void elements
unrenderable" (core.md:335), not the depth guard (that's K16, OPEN). Added
suite `gate-K49-void-elements-renderable` (3 tests): spec `HTML_TAGS`
contains all five; `(render-to-html '(base :href "x") (make-env))`
`<base href="x" />`; all five render self-closing. Runner-env gotchas:
`current-env`/`symbol` are not bound in run_tests — use `(make-env)` and
literal quoted forms. **Discovery:** the first draft pinned via the
runner's native `render-html` and FAILED — `hosts/ocaml/lib/sx_render.ml`
(generated) was never regenerated after dc7aa709's spec fix, so the native
render path still errors on the five tags. Recorded under Blocked; live
evidence for F13 (regen-diff gate). 264 passed / 0 failed. Test-only.
- 2026-07-03 — **K09/K11/K39 W5 special-form pins (item A.3)**. Three suites
added to `spec/tests/test-gate-pins.sx`: `gate-K09-longhand-unquote-splicing`
(R7RS longhand `(unquote-splicing X)` now splices, incl. empty-list case;
shorthand still works), `gate-K11-guard-reraise-forgeable` (a body/clause
value shaped like `(list '__guard-reraise__ X)` is returned as data, not
misread as a re-raise — sentinel is now gensym'd), `gate-K39-do-iife-head`
(`(do ((fn (x) x) 5) 99)` → 99, not a misparsed do-loop — exact core.md
repro). Gotchas hit and fixed: quasiquoted bare idents are *symbols* not
strings, and `assert=` compares with `=` (not `equal?`, which returns false
on these spliced lists). 261 passed / 0 failed under OCaml run_tests. Test-only.
- 2026-07-03 — **K20 contains?-dict pin (item A.2)**. Mapped K-codes by
core.md severity order (K17 append!, K18 expt, K19 harness-drift, K20
contains?-dict). Added suite `gate-K20-contains-dict` to
`spec/tests/test-gate-pins.sx` (4 tests): present dict key → true, missing
key → false, list membership unchanged, string substring unchanged. Repro
from core.md ("(contains? {:a 1} :a) threw `contains?: 2 args`"). 8/8 green
across both suites under OCaml run_tests. Test-only.
- 2026-07-03 — **K18 expt-overflow pin (item A.1)**. Bootstrapped this briefing
from PLAN.md §W14 (the referenced file did not exist yet). Added
`spec/tests/test-gate-pins.sx` with suite `gate-K18-expt-overflow` (4 tests):
small exponents stay exact (`2^0=1`, `2^10=1024`), `2^62 > 0` (no negative
63-bit wrap), `2^100 > 0` (no wrap-to-zero), `2^100` is a number (float
promotion). Verified 4/4 green under the OCaml run_tests kernel. Test-only.
## Blocked
- **K49 native path — sx_render.ml regen drift** (found 2026-07-03 while
pinning A.4): dc7aa709 fixed HTML_TAGS in `spec/render.sx` but never re-ran
`hosts/ocaml/bootstrap_render.py`, so the generated
`hosts/ocaml/lib/sx_render.ml` still carries a stale `html_tags_list`
without area/base/embed/param/track. The runner's native `render-html`
convenience (and any native fast-path render) therefore STILL throws
`Undefined symbol: base` — dc7aa709's "verified on the native binary" claim
did not cover this path. Fix = regen (hosts lane, semantics-adjacent — out
of scope for this test-only loop). This is a live instance of **F13**
(regen-diff CI gate, section-B/D territory): a regen-diff check would have
caught it at commit time. The K49 pin covers the spec side only; when the
regen lands, extend the suite with `render-html`-path assertions.