Merge loops/conformance into architecture: A1 conformance-driver migration

Migrate 4 hand-rolled conformance.sh onto the shared driver (lib/guest/conformance.sh)
with verified count parity, exclude 5 foreign-program runners, and extend the driver to
support per-suite counter names + per-suite preloads.

Migrated:
  common-lisp  counters 487/487 (+182 the old timeout-30 silently dropped)
  erlang       dict 761/761
  feed         counters 189/189 (+ lib/feed/test-harness.sx)
  go           dict 609/609

Excluded (foreign runners, coverage would be lost):
  forth      (Hayes core.fr via awk+python)
  js         (test262 .js vs .expected)
  ocaml      (scrapes test.sh + .ml baseline)
  smalltalk  (scrapes test.sh + .st corpus)
  tcl        (.tcl vs # expected:)

Driver: MODE=counters gains backward-compatible per-suite fields
name:file[:pass-var:fail-var[:extra-preload ...]] (verified non-regressing against the
existing haskell counters path).
2026-06-07 14:11:28 +00:00
parent 24349d2d52 0061db393c
commit db76cc8c65
16 changed files with 557 additions and 623 deletions
--- a/plans/agent-briefings/conformance-loop.md
+++ b/plans/agent-briefings/conformance-loop.md
@@ -0,0 +1,192 @@
+# A1 conformance-driver migration loop
+
+Role: migrate every remaining subsystem that hand-rolls its own `conformance.sh`
+onto the **shared conformance driver** (`lib/guest/conformance.sh` + `lib/guest/conformance.sx`),
+one subsystem per iteration, **verifying test-count parity before every commit**.
+This executes item **A1** from the radar backlog (`plans/abstractions.md`, read-only
+context). You are an implementer, not a scout.
+
+You are on branch `loops/conformance`, worktree `/root/rose-ash-loops/conformance`.
+
+## Hard safety rails (read every time)
+
+- **NEVER push to `main` or `architecture`.** Push only to `origin/loops/conformance`.
+- **NEVER `pkill`/`kill` `sx_server` or any shared process** — sibling loops share the
+  binary. Bound every test run with `timeout` (e.g. `timeout 600 bash …`). If a run
+  hangs, let the timeout end it; never kill globally.
+- **One subsystem per iteration, then stop.** No batching.
+- **Never commit a regression.** If post-migration test counts don't match the baseline
+  (or an error appears), REVERT (`git checkout -- lib/<x>/conformance.sh` and
+  `rm -f lib/<x>/conformance.conf`) and record the blocker — do not commit.
+- `.sx` files: use the `sx-tree` MCP tools, never Read/Write/Edit. `.sh`/`.conf`/`.md`
+  files: normal tools are fine.
+- Preserve the `bash lib/<x>/conformance.sh` entry point (the shim keeps it working) so
+  no other loop is disrupted.
+
+## The candidate worklist
+
+Remaining hand-rolled `conformance.sh` (from radar A1): **common-lisp, erlang, feed,
+forth, go, js, ocaml, smalltalk, tcl**. Already migrated (do not touch): acl, apl,
+datalog, haskell, mod, prolog. Already excluded (different harness): lua.
+
+Work them roughly simplest-first. Track status in the checklist at the bottom.
+
+## What "fits the driver" means — classify FIRST
+
+The shared driver works for subsystems whose tests are **SX test-suites loaded over the
+epoch protocol** and run by an expression that emits a counter/dict scoreboard. It does
+NOT fit subsystems that run **foreign source programs** through a separate runner
+(e.g. lua walks `*.lua` via Python; smalltalk runs `*.st` via `test.sh`).
+
+Per candidate, before migrating, decide:
+- **Migratable** — its `conformance.sh` epoch-loads SX preloads and evals SX test suites
+  → proceed to migrate.
+- **Excluded** — it shells out to a foreign program runner / scrapes a `test.sh` →
+  DO NOT migrate. Record the exclusion (one line in the checklist + a `git`-free note in
+  this briefing's Progress log) with the reason, and move on. Excluding is a valid,
+  honest result — a forced migration that loses coverage is worse than none.
+
+## Per-iteration procedure
+
+1. **Pick** the next `[ ]` candidate in the checklist.
+2. **Read** its `lib/<x>/conformance.sh` in full. Read the two recipe templates —
+   `lib/haskell/conformance.conf` (MODE=counters) and `lib/prolog/conformance.conf`
+   (MODE=dict) — and skim `lib/guest/conformance.sh` + `lib/guest/conformance.sx`.
+3. **Classify** (above). If Excluded → record reason, tick as excluded, stop.
+4. **Baseline:** `timeout 600 bash lib/<x>/conformance.sh`, then read
+   `lib/<x>/scoreboard.json` and record the pass/total. This is the parity target.
+5. **Author `lib/<x>/conformance.conf`:**
+   - `LANG_NAME=<x>`
+   - `MODE=dict` or `MODE=counters` (match how the old script counted)
+   - `PRELOADS=( … )` — the lib files in load order, lifted from the old script
+   - `SUITES=( "name:lib/<x>/tests/<file>:(<run-expr>)" … )` — one per suite, with the
+     exact run expression the old script used
+   - If counters mode needs counter definitions, add a small `test-harness.sx` preload
+     (author it with `sx_write_file`).
+6. **Replace `lib/<x>/conformance.sh`** with the 3-line shim:
+   ```bash
+   #!/usr/bin/env bash
+   # Thin wrapper — see lib/guest/conformance.sh and lib/<x>/conformance.conf.
+   exec bash "$(dirname "$0")/../guest/conformance.sh" "$(dirname "$0")/conformance.conf" "$@"
+   ```
+7. **Verify parity:** `timeout 600 bash lib/<x>/conformance.sh` again. Read
+   `scoreboard.json`. The pass/total MUST equal the baseline (a *higher* count is only
+   acceptable if you can explain it — e.g. the old extractor under-counted, as happened
+   with apl's `pipeline`; document it in the commit). Any mismatch/error → **revert**
+   (see "Never commit a regression" in the safety rails above) and record the blocker.
+8. **Commit** on `loops/conformance`:
+   `conformance: migrate <x> onto shared driver (<mode>, <pass>/<total> parity)`
+   then `git push origin loops/conformance`.
+9. **Update** this file: tick the checklist box and add one dated line to the Progress
+   log (newest first). Then stop.
+
+If a candidate is genuinely blocked (driver lacks a needed mode/feature), record it under
+Blocked with specifics and move to the next candidate next iteration.
+
+## Checklist
+
+- [x] common-lisp — migrated 487/487 (counters; driver extended for per-suite counters+preloads)
+- [x] erlang — migrated 761/761 (dict; pass/count → :failed = count-pass)
+- [x] feed — migrated 189/189 (counters; test-harness.sx preload for counters+helper)
+- [~] forth — excluded: foreign Forth corpus (Hayes core.fr) via awk+python preprocessing
+- [x] go — migrated 609/609 (dict; pass/count → :failed = count-pass, like erlang)
+- [~] js — excluded: foreign test262 .js fixtures vs .expected files (python escape, substring match)
+- [~] ocaml — excluded: scrapes lib/ocaml/test.sh (per-assertion epoch runner) + foreign .ml baseline
+- [~] smalltalk — excluded: scrapes lib/smalltalk/test.sh + walks foreign *.st corpus (per briefing)
+- [~] tcl — excluded: foreign *.tcl programs vs `# expected:` annotations (python escape, bash compare)
+
+(Mark `[x] <x> — migrated N/N` or `[~] <x> — excluded: <reason>` or
+`[!] <x> — blocked: <reason>`.)
+
+## Progress log (newest first)
+
+- 2026-06-07 — tcl: EXCLUDED (foreign-runner, like lua/js/forth) — and WORKLIST COMPLETE.
+  conformance.sh walks foreign lib/tcl/tests/programs/*.tcl files, reads each first line's
+  `# expected: VALUE` annotation, uses python3 to escape the Tcl source into an SX helper,
+  evaluates via (tcl-eval-string …), and string-compares got vs expected in bash. No SX
+  test suites, no SX counter/dict scoreboard — the driver can't drive a
+  foreign-program-vs-expected-annotation harness. Left conformance.sh untouched. Not migrated.
+  >>> A1 worklist now fully classified: 4 migrated (common-lisp, erlang, feed, go),
+  5 excluded as foreign runners (forth, js, ocaml, smalltalk, tcl). Loop done.
+- 2026-06-07 — smalltalk: EXCLUDED (the briefing's own classification example —
+  "smalltalk runs *.st via test.sh"). conformance.sh catalogs foreign
+  lib/smalltalk/tests/programs/*.st programs, runs `bash lib/smalltalk/test.sh -v`, and
+  scrapes its output (final "OK 403/403" summary + per-file pass counts via awk). It loads
+  no SX test suites directly and emits no SX counter/dict scoreboard — the bash layer
+  derives all numbers by text-scraping test.sh. Same "scrapes a test.sh" exclusion as
+  ocaml/lua. Left conformance.sh untouched. Not migrated.
+- 2026-06-07 — ocaml: EXCLUDED (scrapes a test.sh — the briefing's named exclusion
+  criterion). conformance.sh runs `bash lib/ocaml/test.sh -v`, scrapes its human-readable
+  ok/FAIL lines, and re-classifies each test into suites via bash description-matching
+  heuristics; it also scrapes `lib/ocaml/baseline/run.sh` (foreign .ml programs). The
+  underlying test.sh is a per-assertion epoch runner — hundreds of individual
+  (ocaml-test-...) evals, one epoch each, with NO suite-level counter variables or dict
+  runners — so there's nothing the driver's counter/dict-scoreboard model can point at
+  without a full rewrite of the test harness. test.sh's own header notes it "Mirrors
+  lib/lua/test.sh" (the canonical excluded case). Left conformance.sh untouched. Not migrated.
+- 2026-06-07 — js: EXCLUDED (foreign-runner, like lua/forth/smalltalk). conformance.sh
+  walks lib/js/test262-slice/**/*.js (foreign test262 fixtures), reads each .js + its
+  sibling .expected file, escapes the JS source with python3, evaluates via (js-eval),
+  and compares output to .expected by substring match — counting pass/fail in bash against
+  a ≥50% target. It loads no SX test suites and emits no SX counter/dict scoreboard (no
+  scoreboard.json at all). The shared driver only epoch-loads SX preloads + evals SX test
+  suites; it can't drive a foreign-fixture-vs-expected comparison harness. Left
+  conformance.sh untouched. Not migrated.
+- 2026-06-07 — go: migrated to `MODE=dict`, 609/609 exact parity (lex 129, parse 179,
+  types 102, eval 106, runtime 40, stdlib 41, e2e 12). Same shape as erlang — one-session
+  load, per-suite pass + *count* (total) counters — so each suite's dict-literal runner
+  computes `:failed (- count pass)`. No driver change; conformance.conf + shim only.
+  Kept historical scoreboard schema (language/total_pass/total/suites[name,pass,total,status]).
+- 2026-06-07 — forth: EXCLUDED (foreign-runner, like lua/smalltalk). Its conformance.sh
+  reads a foreign Forth corpus (lib/forth/ans-tests/core.fr, the gerryjackson Hayes Core
+  suite), preprocesses it with awk (strip `\` / `( )` comments + TESTING lines), splits it
+  into `}T` chunks via an external python3 script that generates a chunks.sx of raw source
+  strings, then runs them through the interpreter via (hayes-run-all) → {:pass :fail :error
+  :total}. The shared driver only epoch-loads SX preloads + evals SX test suites; it can't
+  reproduce the awk+python preprocessing of a foreign .fr corpus. No SX `tests/*.sx` suites
+  exist to point the driver at. Left conformance.sh untouched. Not migrated.
+- 2026-06-07 — feed: migrated to `MODE=counters`, 189/189 exact parity (basic 30,
+  fanout 29, rank 24, integration 22, content 15, notify 8, home 6, dedupe 9, trending 11,
+  mute 9, page 14, thread 12). Canonical counters shape: fresh session per suite, shared
+  preloads, single feed-test-pass/feed-test-fail pair. Lifted the old script's inline
+  epoch-2 counter+helper defs into lib/feed/test-harness.sx (preloaded last). No driver
+  change — only conformance.conf + test-harness.sx + shim. Kept historical scoreboard
+  schema (suites{name:{pass,fail}}, total_pass/total_fail/total).
+- 2026-06-07 — erlang: migrated to `MODE=dict`, 761/761 exact parity (tokenize 62,
+  parse 52, eval 408, runtime 93, ring 4, ping-pong 4, bank 8, echo 7, fib 8, ffi 37,
+  vm 78). Erlang exposes pass + *count* (total) counters, not pass/fail, so each suite's
+  dict-literal runner computes `:failed (- count pass)`. Loads in one session (matches
+  dict mode), so no driver change needed — only conformance.conf + shim. Kept historical
+  scoreboard schema (language/total_pass/total/suites[name,pass,total,status]).
+- 2026-06-07 — common-lisp: UNBLOCKED + migrated. Extended the shared driver's
+  `MODE=counters` (lib/guest/conformance.sh) with a backward-compatible SUITES format
+  `name:file[:pass-var:fail-var[:extra-preload ...]]` — optional per-suite counter
+  symbols and per-suite preload chains. Authored lib/common-lisp/conformance.conf (12
+  suites, 8 distinct counter pairs, per-suite preloads, base PRELOADS=stdlib+prefix;
+  kept historical scoreboard schema) and replaced conformance.sh with the shim.
+  Result 487/487 (0 fail), HIGHER than the 305-passed/0-failed baseline. Explained: the old
+  script's per-suite `timeout 30` was too tight for the slow `eval` suite (~15–25s under
+  contention), silently recording it as 0; the driver's 180s budget recovers its true 182
+  (305 + 182 = 487). geometry/mop-trace remain 0/0 (pre-existing `refl-class-chain-depth-with`
+  load error; counter vars defined as 0 → clean gc-result, no fail-fallback). Regression
+  check: haskell backward-compat path verified (fib/sieve/quicksort 2/2/5, matches committed).
+- 2026-06-07 — common-lisp: classified migratable-in-kind (SX suites over epoch) but
+  BLOCKED on driver feature gaps. Baseline `bash lib/common-lisp/conformance.sh` =
+  305 passed / 0 failed across 12 suites (3 — evaluator/geometry/mop-trace — already
+  emit 0/0, a pre-existing extraction quirk). Not a foreign runner, so not Excluded.
+  Did NOT migrate (parity unachievable under current modes); left conformance.sh
+  untouched. See Blocked. Driver left unchanged (out of strict per-iteration scope).
+
+## Blocked
+
+- (none)
+
+## Resolved blockers
+
+- **common-lisp** (resolved 2026-06-07) — needed per-suite counter names + per-suite
+  preload chains, unsupported by the original `MODE=counters` (single global counter +
+  fixed PRELOADS). Resolved by extending the shared driver: `MODE=counters` now accepts
+  `name:file[:pass-var:fail-var[:extra-preload ...]]` (backward-compatible). **This same
+  extension is available to later candidates** — erlang/forth/etc. with per-suite
+  counter names or preload chains can now migrate via the extended format instead of
+  blocking.
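
To make the extended `MODE=counters` entry format from the resolved blocker above concrete, here is a minimal sketch of how one `name:file[:pass-var:fail-var[:extra-preload ...]]` entry could be split into its fields. This is not the code in `lib/guest/conformance.sh`: the function name `parse_suite_spec`, the fallback counter names `test-pass`/`test-fail`, and the `lib/demo/...` paths are hypothetical, and it assumes that extra preloads are space-separated inside the final field.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the shared driver's parser: split one SUITES entry
#   name:file[:pass-var:fail-var[:extra-preload ...]]
# into its fields.

parse_suite_spec() {
  local name file pass_var fail_var extras
  # Split on ':'; the last variable absorbs whatever is left, so a
  # space-separated preload list (an assumption) stays in one piece.
  IFS=: read -r name file pass_var fail_var extras <<EOF
$1
EOF
  # Two-field (old-style) entries leave the counter fields empty; the
  # fallback names below are placeholders, not the driver's real defaults.
  printf 'name=%s file=%s pass=%s fail=%s extras=%s\n' \
    "$name" "$file" "${pass_var:-test-pass}" "${fail_var:-test-fail}" "$extras"
}

parse_suite_spec "basic:lib/demo/tests/basic.sx"
parse_suite_spec "eval:lib/demo/tests/eval.sx:eval-pass:eval-fail:lib/demo/a.sx lib/demo/b.sx"
```

Because the optional fields are simply empty when absent, an old two-field entry and a new five-field entry go through the same code path, which is one way the format could stay backward-compatible.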