rose-ash/plans/agent-briefings/conformance-loop.md
giles 0061db393c
conformance: exclude tcl (foreign *.tcl programs vs expected annotations) — A1 worklist complete
tcl conformance.sh walks foreign lib/tcl/tests/programs/*.tcl files, reads each
first line's '# expected: VALUE' annotation, uses python3 to escape the Tcl
source into an SX helper, evaluates via (tcl-eval-string ...), and string-compares
got vs expected in bash. No SX test suites and no SX counter/dict scoreboard, so
the shared driver can't drive it (same category as lua/js/forth). Left
conformance.sh untouched; recorded the exclusion.

This completes the A1 worklist: 4 migrated onto the shared driver (common-lisp,
erlang, feed, go) and 5 excluded as foreign runners (forth, js, ocaml,
smalltalk, tcl).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 13:03:45 +00:00

# A1 conformance-driver migration loop
Role: migrate every remaining subsystem that hand-rolls its own `conformance.sh`
onto the **shared conformance driver** (`lib/guest/conformance.sh` + `lib/guest/conformance.sx`),
one subsystem per iteration, **verifying test-count parity before every commit**.
This executes item **A1** from the radar backlog (`plans/abstractions.md`, read-only
context). You are an implementer, not a scout.
You are on branch `loops/conformance`, worktree `/root/rose-ash-loops/conformance`.
## Hard safety rails (read every time)
- **NEVER push to `main` or `architecture`.** Push only to `origin/loops/conformance`.
- **NEVER `pkill`/`kill` `sx_server` or any shared process** — sibling loops share the
binary. Bound every test run with `timeout` (e.g. `timeout 600 bash …`). If a run
hangs, let the timeout end it; never kill globally.
- **One subsystem per iteration, then stop.** No batching.
- **Never commit a regression.** If post-migration test counts don't match the baseline
(or an error appears), REVERT (`git checkout -- lib/<x>/conformance.sh` and
`rm -f lib/<x>/conformance.conf`) and record the blocker — do not commit.
- `.sx` files: use the `sx-tree` MCP tools, never Read/Write/Edit. `.sh`/`.conf`/`.md`
files: normal tools are fine.
- Preserve the `bash lib/<x>/conformance.sh` entry point (the shim keeps it working) so
no other loop is disrupted.
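The `timeout` rail can be wrapped in a small helper so that a hung run is ended by its own limit and never by a global kill. This is a minimal sketch: `run_bounded` is a hypothetical name, not something in the repo, and it assumes GNU `timeout`, which exits with status 124 when the limit is reached.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): run a command under `timeout`
# and say so when the limit, rather than the command itself, ended the run.
run_bounded() {
  local limit="$1"
  shift
  local rc=0
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    # GNU timeout exits 124 when the time limit was reached.
    echo "timed out after ${limit}s: $*" >&2
  fi
  return "$rc"
}
```

A call such as `run_bounded 600 bash lib/<x>/conformance.sh` then behaves like the `timeout 600 bash …` form above, with the exit status passed through unchanged.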
## The candidate worklist
Remaining hand-rolled `conformance.sh` (from radar A1): **common-lisp, erlang, feed,
forth, go, js, ocaml, smalltalk, tcl**. Already migrated (do not touch): acl, apl,
datalog, haskell, mod, prolog. Already excluded (different harness): lua.
Work them roughly simplest-first. Track status in the checklist at the bottom.
## What "fits the driver" means — classify FIRST
The shared driver works for subsystems whose tests are **SX test-suites loaded over the
epoch protocol** and run by an expression that emits a counter/dict scoreboard. It does
NOT fit subsystems that run **foreign source programs** through a separate runner
(e.g. lua walks `*.lua` via Python; smalltalk runs `*.st` via `test.sh`).
Per candidate, before migrating, decide:
- **Migratable** — its `conformance.sh` epoch-loads SX preloads and evals SX test suites
→ proceed to migrate.
- **Excluded** — it shells out to a foreign program runner / scrapes a `test.sh`
→ DO NOT migrate. Record the exclusion (one line in the checklist + a `git`-free note in
this briefing's Progress log) with the reason, and move on. Excluding is a valid,
honest result — a forced migration that loses coverage is worse than none.
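A quick textual scan can give a first hint before the full read. The sketch below is an assumption-laden heuristic, not part of the repo: `classify_hint` is a hypothetical name, and it only looks for the two exclusion signals described above (a Python-driven foreign runner, or a scraped `test.sh`). It does not replace reading the script in full.

```bash
#!/usr/bin/env bash
# Hypothetical first-pass hint (not part of the repo). It only flags the two
# exclusion signals named above; reading the script in full still decides.
classify_hint() {
  local script="$1"
  if grep -Eq 'python3|test\.sh' "$script"; then
    echo "likely-excluded"
  else
    echo "possibly-migratable"
  fi
}
```

For example, `classify_hint lib/<x>/conformance.sh` prints one of the two labels; a "possibly-migratable" result still needs the epoch-load and scoreboard checks done by hand.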
## Per-iteration procedure
1. **Pick** the next `[ ]` candidate in the checklist.
2. **Read** its `lib/<x>/conformance.sh` in full. Read the two recipe templates —
`lib/haskell/conformance.conf` (MODE=counters) and `lib/prolog/conformance.conf`
(MODE=dict) — and skim `lib/guest/conformance.sh` + `lib/guest/conformance.sx`.
3. **Classify** (above). If Excluded → record reason, tick as excluded, stop.
4. **Baseline:** `timeout 600 bash lib/<x>/conformance.sh`, then read
`lib/<x>/scoreboard.json` and record the pass/total. This is the parity target.
5. **Author `lib/<x>/conformance.conf`:**
- `LANG_NAME=<x>`
- `MODE=dict` or `MODE=counters` (match how the old script counted)
- `PRELOADS=( … )` — the lib files in load order, lifted from the old script
- `SUITES=( "name:lib/<x>/tests/<file>:(<run-expr>)" … )` — one per suite, with the
exact run expression the old script used
- If counters mode needs counter definitions, add a small `test-harness.sx` preload
(author it with `sx_write_file`).
6. **Replace `lib/<x>/conformance.sh`** with the 3-line shim:
```bash
#!/usr/bin/env bash
# Thin wrapper — see lib/guest/conformance.sh and lib/<x>/conformance.conf.
exec bash "$(dirname "$0")/../guest/conformance.sh" "$(dirname "$0")/conformance.conf" "$@"
```
7. **Verify parity:** `timeout 600 bash lib/<x>/conformance.sh` again. Read
`scoreboard.json`. The pass/total MUST equal the baseline (a *higher* count is only
acceptable if you can explain it — e.g. the old extractor under-counted, as happened
with apl's `pipeline`; document it in the commit). Any mismatch/error → **revert**
(using the revert commands in the hard safety rails above) and record the blocker.
8. **Commit** on `loops/conformance`:
`conformance: migrate <x> onto shared driver (<mode>, <pass>/<total> parity)`
then `git push origin loops/conformance`.
9. **Update** this file: tick the checklist box and add one dated line to the Progress
log (newest first). Then stop.
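As an illustration of step 5, a dict-mode `conformance.conf` might look roughly like the fragment below. Only the four variable names and the `name:file:(run-expr)` entry shape come from this briefing; the subsystem name `demo`, every path, and every run expression are hypothetical placeholders.

```bash
# lib/demo/conformance.conf (hypothetical example; "demo", the paths, and the
# run expressions are placeholders, not files or functions in the repo).
LANG_NAME=demo
MODE=dict

# Library files in load order, lifted from the old conformance.sh.
PRELOADS=(
  lib/demo/lexer.sx
  lib/demo/parser.sx
)

# One entry per suite: name, test file, and the exact run expression
# the old script evaluated for that suite.
SUITES=(
  "lex:lib/demo/tests/lex.sx:(demo-lex-run-tests)"
  "parse:lib/demo/tests/parse.sx:(demo-parse-run-tests)"
)
```

The real templates to copy from remain `lib/haskell/conformance.conf` and `lib/prolog/conformance.conf`.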
If a candidate is genuinely blocked (driver lacks a needed mode/feature), record it under
Blocked with specifics and move to the next candidate next iteration.
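The parity comparison in steps 4 and 7 can be sketched as a small script. It assumes the scoreboard carries top-level integer `total_pass` and `total` keys, as in the schemas quoted in the Progress log; `json_int` and `check_parity` are hypothetical helpers, and the grep-based extraction is a dependency-free stand-in for a real JSON parser.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repo).

# json_int KEY FILE: print the first integer value stored under "KEY".
json_int() {
  grep -Eo "\"$1\"[[:space:]]*:[[:space:]]*[0-9]+" "$2" | head -n 1 | grep -Eo '[0-9]+$'
}

# check_parity BASELINE AFTER: fail on any difference in total_pass or total.
check_parity() {
  local base="$1" after="$2" key b a
  for key in total_pass total; do
    b="$(json_int "$key" "$base")"
    a="$(json_int "$key" "$after")"
    if [ "$b" != "$a" ]; then
      echo "MISMATCH $key: baseline=$b after=$a" >&2
      return 1
    fi
  done
  echo "parity OK: $(json_int total_pass "$after")/$(json_int total "$after")"
}
```

In use, copy `lib/<x>/scoreboard.json` aside after the baseline run, migrate, re-run, then call `check_parity` on the saved copy and the new file. The sketch flags a higher count as a mismatch too; whether that increase is explainable is still a human call.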
## Checklist
- [x] common-lisp — migrated 487/487 (counters; driver extended for per-suite counters+preloads)
- [x] erlang — migrated 761/761 (dict; pass/count → :failed = count-pass)
- [x] feed — migrated 189/189 (counters; test-harness.sx preload for counters+helper)
- [~] forth — excluded: foreign Forth corpus (Hayes core.fr) via awk+python preprocessing
- [x] go — migrated 609/609 (dict; pass/count → :failed = count-pass, like erlang)
- [~] js — excluded: foreign test262 .js fixtures vs .expected files (python escape, substring match)
- [~] ocaml — excluded: scrapes lib/ocaml/test.sh (per-assertion epoch runner) + foreign .ml baseline
- [~] smalltalk — excluded: scrapes lib/smalltalk/test.sh + walks foreign *.st corpus (per briefing)
- [~] tcl — excluded: foreign *.tcl programs vs `# expected:` annotations (python escape, bash compare)
(Mark `[x] <x> — migrated N/N` or `[~] <x> — excluded: <reason>` or
`[!] <x> — blocked: <reason>`.)
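The markers above are regular enough to tally mechanically. A minimal sketch, assuming checklist entries always start at column one with `- [x]`, `- [~]`, `- [!]`, or `- [ ]`; `tally_checklist` is a hypothetical helper, not part of the repo.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): count checklist entries by marker.
tally_checklist() {
  local file="$1" migrated excluded blocked pending
  # grep -c prints 0 but exits non-zero when nothing matches, hence `|| true`.
  migrated="$(grep -c '^- \[x] ' "$file" || true)"
  excluded="$(grep -c '^- \[~] ' "$file" || true)"
  blocked="$(grep -c '^- \[!] ' "$file" || true)"
  pending="$(grep -c '^- \[ ] ' "$file" || true)"
  echo "migrated=$migrated excluded=$excluded blocked=$blocked pending=$pending"
}
```

Run against this briefing, the legend line is not counted because it does not start with `- [`.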
## Progress log (newest first)
- 2026-06-07 — tcl: EXCLUDED (foreign-runner, like lua/js/forth) — and WORKLIST COMPLETE.
conformance.sh walks foreign lib/tcl/tests/programs/*.tcl files, reads each first line's
`# expected: VALUE` annotation, uses python3 to escape the Tcl source into an SX helper,
evaluates via (tcl-eval-string …), and string-compares got vs expected in bash. No SX
test suites, no SX counter/dict scoreboard — the driver can't drive a
foreign-program-vs-expected-annotation harness. Left conformance.sh untouched. Not migrated.
>>> A1 worklist now fully classified: 4 migrated (common-lisp, erlang, feed, go),
5 excluded as foreign runners (forth, js, ocaml, smalltalk, tcl). Loop done.
- 2026-06-07 — smalltalk: EXCLUDED (the briefing's own classification example —
"smalltalk runs *.st via test.sh"). conformance.sh catalogs foreign
lib/smalltalk/tests/programs/*.st programs, runs `bash lib/smalltalk/test.sh -v`, and
scrapes its output (final "OK 403/403" summary + per-file pass counts via awk). It loads
no SX test suites directly and emits no SX counter/dict scoreboard — the bash layer
derives all numbers by text-scraping test.sh. Same "scrapes a test.sh" exclusion as
ocaml/lua. Left conformance.sh untouched. Not migrated.
- 2026-06-07 — ocaml: EXCLUDED (scrapes a test.sh — the briefing's named exclusion
criterion). conformance.sh runs `bash lib/ocaml/test.sh -v`, scrapes its human-readable
ok/FAIL lines, and re-classifies each test into suites via bash description-matching
heuristics; it also scrapes `lib/ocaml/baseline/run.sh` (foreign .ml programs). The
underlying test.sh is a per-assertion epoch runner — hundreds of individual
(ocaml-test-...) evals, one epoch each, with NO suite-level counter variables or dict
runners — so there's nothing the driver's counter/dict-scoreboard model can point at
without a full rewrite of the test harness. test.sh's own header notes it "Mirrors
lib/lua/test.sh" (the canonical excluded case). Left conformance.sh untouched. Not migrated.
- 2026-06-07 — js: EXCLUDED (foreign-runner, like lua/forth/smalltalk). conformance.sh
walks lib/js/test262-slice/**/*.js (foreign test262 fixtures), reads each .js + its
sibling .expected file, escapes the JS source with python3, evaluates via (js-eval),
and compares output to .expected by substring match — counting pass/fail in bash against
a ≥50% target. It loads no SX test suites and emits no SX counter/dict scoreboard (no
scoreboard.json at all). The shared driver only epoch-loads SX preloads + evals SX test
suites; it can't drive a foreign-fixture-vs-expected comparison harness. Left
conformance.sh untouched. Not migrated.
- 2026-06-07 — go: migrated to `MODE=dict`, 609/609 exact parity (lex 129, parse 179,
types 102, eval 106, runtime 40, stdlib 41, e2e 12). Same shape as erlang — one-session
load, per-suite pass + *count* (total) counters — so each suite's dict-literal runner
computes `:failed (- count pass)`. No driver change; conformance.conf + shim only.
Kept historical scoreboard schema (language/total_pass/total/suites[name,pass,total,status]).
- 2026-06-07 — forth: EXCLUDED (foreign-runner, like lua/smalltalk). Its conformance.sh
reads a foreign Forth corpus (lib/forth/ans-tests/core.fr, the gerryjackson Hayes Core
suite), preprocesses it with awk (strip `\` / `( )` comments + TESTING lines), splits it
into `}T` chunks via an external python3 script that generates a chunks.sx of raw source
strings, then runs them through the interpreter via (hayes-run-all) → {:pass :fail :error
:total}. The shared driver only epoch-loads SX preloads + evals SX test suites; it can't
reproduce the awk+python preprocessing of a foreign .fr corpus. No SX `tests/*.sx` suites
exist to point the driver at. Left conformance.sh untouched. Not migrated.
- 2026-06-07 — feed: migrated to `MODE=counters`, 189/189 exact parity (basic 30,
fanout 29, rank 24, integration 22, content 15, notify 8, home 6, dedupe 9, trending 11,
mute 9, page 14, thread 12). Canonical counters shape: fresh session per suite, shared
preloads, single feed-test-pass/feed-test-fail pair. Lifted the old script's inline
epoch-2 counter+helper defs into lib/feed/test-harness.sx (preloaded last). No driver
change — only conformance.conf + test-harness.sx + shim. Kept historical scoreboard
schema (suites{name:{pass,fail}}, total_pass/total_fail/total).
- 2026-06-07 — erlang: migrated to `MODE=dict`, 761/761 exact parity (tokenize 62,
parse 52, eval 408, runtime 93, ring 4, ping-pong 4, bank 8, echo 7, fib 8, ffi 37,
vm 78). Erlang exposes pass + *count* (total) counters, not pass/fail, so each suite's
dict-literal runner computes `:failed (- count pass)`. Loads in one session (matches
dict mode), so no driver change needed — only conformance.conf + shim. Kept historical
scoreboard schema (language/total_pass/total/suites[name,pass,total,status]).
- 2026-06-07 — common-lisp: UNBLOCKED + migrated. Extended the shared driver's
`MODE=counters` (lib/guest/conformance.sh) with a backward-compatible SUITES format
`name:file[:pass-var:fail-var[:extra-preload ...]]` — optional per-suite counter
symbols and per-suite preload chains. Authored lib/common-lisp/conformance.conf (12
suites, 8 distinct counter pairs, per-suite preloads, base PRELOADS=stdlib+prefix;
kept historical scoreboard schema) and replaced conformance.sh with the shim.
Result 487/487 (0 fail) — HIGHER than the 305/0 baseline, explained: the old script's
per-suite `timeout 30` was too tight for the slow `eval` suite (~15-25s under
contention), silently recording it as 0; the driver's 180s budget recovers its true
182. geometry/mop-trace remain 0/0 (pre-existing `refl-class-chain-depth-with` load
error; counter vars defined as 0 → clean gc-result, no fail-fallback). Regression:
haskell backward-compat path verified (fib/sieve/quicksort 2/2/5, matches committed).
- 2026-06-07 — common-lisp: classified migratable-in-kind (SX suites over epoch) but
BLOCKED on driver feature gaps. Baseline `bash lib/common-lisp/conformance.sh` =
305 passed / 0 failed across 12 suites (3 — evaluator/geometry/mop-trace — already
emit 0/0, a pre-existing extraction quirk). Not a foreign runner, so not Excluded.
Did NOT migrate (parity unachievable under current modes); left conformance.sh
untouched. See Blocked. Driver left unchanged (out of strict per-iteration scope).
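The per-suite figures in the log entries above can be cross-checked against the stated totals with plain shell arithmetic. The helper names `sum_counts` and `failed_of` are hypothetical; the numbers are copied from the entries, and `failed_of` is just a bash rendering of the `(- count pass)` expression used for `:failed`.

```bash
#!/usr/bin/env bash
# Sum the per-suite pass counts quoted in the log entries above.
sum_counts() {
  local total=0 n
  for n in "$@"; do
    total=$((total + n))
  done
  echo "$total"
}

# Bash rendering of the SX expression (- count pass) used for :failed.
failed_of() {
  local pass="$1" count="$2"
  echo $((count - pass))
}

sum_counts 62 52 408 93 4 4 8 7 8 37 78      # erlang -> 761
sum_counts 129 179 102 106 40 41 12          # go     -> 609
sum_counts 30 29 24 22 15 8 6 9 11 9 14 12   # feed   -> 189
sum_counts 305 182                           # common-lisp: old baseline + recovered eval -> 487
failed_of 609 609                            # go at full parity -> 0
```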
## Blocked
- (none)
## Resolved blockers
- **common-lisp** (resolved 2026-06-07) — needed per-suite counter names + per-suite
preload chains, unsupported by the original `MODE=counters` (single global counter +
fixed PRELOADS). Resolved by extending the shared driver: `MODE=counters` now accepts
`name:file[:pass-var:fail-var[:extra-preload ...]]` (backward-compatible). **This same
extension is available to later candidates** — erlang/forth/etc. with per-suite
counter names or preload chains can now migrate via the extended format instead of
blocking.
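To make the extended entry shape concrete, here is an illustrative parser for one counters-mode `SUITES` entry. It is a sketch under stated assumptions and not the driver's code: `parse_suite` and the `demo` paths are hypothetical, and because this briefing does not say how several extra preloads are separated, everything after `fail-var` is kept as one raw field.

```bash
#!/usr/bin/env bash
# Illustrative parser for one counters-mode SUITES entry of the form
#   name:file[:pass-var:fail-var[:extra-preload ...]]
# This is NOT the driver's code. How the driver separates several extra
# preloads is not stated above, so the remainder is kept as one raw field.
parse_suite() {
  local name file pass_var fail_var extras
  IFS=: read -r name file pass_var fail_var extras <<EOF
$1
EOF
  echo "name=$name"
  echo "file=$file"
  echo "pass_var=${pass_var:-<driver default>}"
  echo "fail_var=${fail_var:-<driver default>}"
  echo "extras=${extras:-<none>}"
}

# Old two-field form (still accepted) and the extended form.
parse_suite "fib:lib/demo/tests/fib.sx"
parse_suite "eval:lib/demo/tests/eval.sx:demo-pass:demo-fail:lib/demo/extra.sx"
```

The two-field call shows why the extension is backward-compatible: the optional fields simply come back empty and fall through to the defaults.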