rose-ash/plans/agent-briefings/conformance-loop.md
giles 0061db393c
conformance: exclude tcl (foreign *.tcl programs vs expected annotations) — A1 worklist complete
tcl conformance.sh walks foreign lib/tcl/tests/programs/*.tcl files, reads each
first line's '# expected: VALUE' annotation, uses python3 to escape the Tcl
source into an SX helper, evaluates via (tcl-eval-string ...), and string-compares
got vs expected in bash. No SX test suites and no SX counter/dict scoreboard, so
the shared driver can't drive it (same category as lua/js/forth). Left
conformance.sh untouched; recorded the exclusion.

This completes the A1 worklist: 4 migrated onto the shared driver (common-lisp,
erlang, feed, go) and 5 excluded as foreign runners (forth, js, ocaml,
smalltalk, tcl).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 13:03:45 +00:00


A1 conformance-driver migration loop

Role: migrate every remaining subsystem that hand-rolls its own conformance.sh onto the shared conformance driver (lib/guest/conformance.sh + lib/guest/conformance.sx), one subsystem per iteration, verifying test-count parity before every commit. This executes item A1 from the radar backlog (plans/abstractions.md, read-only context). You are an implementer, not a scout.

You are on branch loops/conformance, worktree /root/rose-ash-loops/conformance.

Hard safety rails (read every time)

  • NEVER push to main or architecture. Push only to origin/loops/conformance.
  • NEVER pkill/kill sx_server or any shared process — sibling loops share the binary. Bound every test run with timeout (e.g. timeout 600 bash …). If a run hangs, let the timeout end it; never kill globally.
  • One subsystem per iteration, then stop. No batching.
  • Never commit a regression. If post-migration test counts don't match the baseline (or an error appears), REVERT (git checkout -- lib/<x>/conformance.sh and rm -f lib/<x>/conformance.conf) and record the blocker — do not commit.
  • .sx files: use the sx-tree MCP tools, never Read/Write/Edit. .sh/.conf/.md files: normal tools are fine.
  • Preserve the bash lib/<x>/conformance.sh entry point (the shim keeps it working) so no other loop is disrupted.
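
The timeout and revert rails above can be sketched as two small helpers. This is only an illustration: `run_bounded` and `revert_migration` are hypothetical names, not part of the shared driver, and the sketch assumes GNU `timeout` is on the PATH.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the shared driver.

# Run a command under a time limit. GNU timeout exits with status 124 when
# the limit is hit; nothing is ever killed by name, so this never amounts to
# a global pkill of sx_server.
run_bounded() {
  local seconds="$1"
  shift
  timeout "$seconds" "$@"
}

# Undo a half-finished migration of subsystem $1, mirroring the revert rail:
# restore the old script and drop the newly authored conf.
revert_migration() {
  local x="$1"
  git checkout -- "lib/${x}/conformance.sh"
  rm -f "lib/${x}/conformance.conf"
}
```

A typical call would be `run_bounded 600 bash lib/<x>/conformance.sh`, followed by `revert_migration <x>` if the run times out or the counts regress.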

The candidate worklist

Remaining hand-rolled conformance.sh (from radar A1): common-lisp, erlang, feed, forth, go, js, ocaml, smalltalk, tcl. Already migrated (do not touch): acl, apl, datalog, haskell, mod, prolog. Already excluded (different harness): lua.

Work them roughly simplest-first. Track status in the checklist at the bottom.

What "fits the driver" means — classify FIRST

The shared driver works for subsystems whose tests are SX test-suites loaded over the epoch protocol and run by an expression that emits a counter/dict scoreboard. It does NOT fit subsystems that run foreign source programs through a separate runner (e.g. lua walks *.lua via Python; smalltalk runs *.st via test.sh).

Per candidate, before migrating, decide:

  • Migratable — its conformance.sh epoch-loads SX preloads and evals SX test suites → proceed to migrate.
  • Excluded — it shells out to a foreign program runner / scrapes a test.sh → DO NOT migrate. Record the exclusion (one line in the checklist + a git-free note in this briefing's Progress log) with the reason, and move on. Excluding is a valid, honest result — a forced migration that loses coverage is worse than none.
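
As a first-pass triage before reading a script in full, a grep-based hint along these lines may help. It is a rough sketch, not the classification itself: `classify_hint` is a hypothetical helper, and the patterns assume that foreign-runner scripts mention `python3` or another `test.sh`, which is how the excluded cases in this briefing happen to look. A script can match these patterns and still be migratable, so the result only suggests where to look first.

```bash
#!/usr/bin/env bash
# Triage hint only (hypothetical helper): a human still has to read the script.
# Assumption: foreign-runner scripts tend to mention python3 or a test.sh.
classify_hint() {
  local script="$1"
  if grep -Eq 'python3|test\.sh' "$script"; then
    echo "likely-excluded"
  else
    echo "likely-migratable"
  fi
}
```

For example, `classify_hint lib/<x>/conformance.sh` prints one of the two labels; the final Migratable/Excluded decision still follows the criteria above.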

Per-iteration procedure

  1. Pick the next [ ] candidate in the checklist.
  2. Read its lib/<x>/conformance.sh in full. Read the two recipe templates — lib/haskell/conformance.conf (MODE=counters) and lib/prolog/conformance.conf (MODE=dict) — and skim lib/guest/conformance.sh + lib/guest/conformance.sx.
  3. Classify (above). If Excluded → record reason, tick as excluded, stop.
  4. Baseline: timeout 600 bash lib/<x>/conformance.sh, then read lib/<x>/scoreboard.json and record the pass/total. This is the parity target.
  5. Author lib/<x>/conformance.conf:
    • LANG_NAME=<x>
    • MODE=dict or MODE=counters (match how the old script counted)
    • PRELOADS=( … ) — the lib files in load order, lifted from the old script
    • SUITES=( "name:lib/<x>/tests/<file>:(<run-expr>)" … ) — one per suite, with the exact run expression the old script used
    • If counters mode needs counter definitions, add a small test-harness.sx preload (author it with sx_write_file).
  6. Replace lib/<x>/conformance.sh with the 3-line shim:
    #!/usr/bin/env bash
    # Thin wrapper — see lib/guest/conformance.sh and lib/<x>/conformance.conf.
    exec bash "$(dirname "$0")/../guest/conformance.sh" "$(dirname "$0")/conformance.conf" "$@"
    
  7. Verify parity: run timeout 600 bash lib/<x>/conformance.sh again and read scoreboard.json. The pass/total MUST equal the baseline. A higher count is acceptable only if you can explain it (e.g. the old extractor under-counted, as happened with apl's pipeline), and the explanation must be documented in the commit. On any mismatch or error, revert as described in the safety rails and record the blocker.
  8. Commit on loops/conformance: conformance: migrate <x> onto shared driver (<mode>, <pass>/<total> parity) then git push origin loops/conformance.
  9. Update this file: tick the checklist box and add one dated line to the Progress log (newest first). Then stop.
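
The parity decision in steps 4 and 7 can be sketched as a comparison of two recorded "pass/total" strings. `check_parity` is a hypothetical helper; it assumes you have already read both values out of scoreboard.json by hand, since the scoreboard schema differs between subsystems.

```bash
#!/usr/bin/env bash
# Sketch only: check_parity is a hypothetical helper, not part of the driver.
# Usage: check_parity BASELINE AFTER, each given as "pass/total".
# Prints parity | higher | mismatch and returns 0 | 2 | 1 respectively.
check_parity() {
  local base="$1" after="$2"
  local base_pass="${base%%/*}" base_total="${base##*/}"
  local after_pass="${after%%/*}" after_total="${after##*/}"

  if [ "$after_pass" -eq "$base_pass" ] && [ "$after_total" -eq "$base_total" ]; then
    echo "parity"      # safe to commit
    return 0
  fi
  if [ "$after_pass" -gt "$base_pass" ] && [ "$after_total" -ge "$base_total" ]; then
    echo "higher"      # commit only with an explanation in the message
    return 2
  fi
  echo "mismatch"      # revert and record the blocker
  return 1
}
```

Returning a distinct status for the "higher" case keeps it from being silently treated as either a pass or a regression.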

If a candidate is genuinely blocked (driver lacks a needed mode/feature), record it under Blocked with specifics and move to the next candidate next iteration.

Checklist

  • [x] common-lisp — migrated 487/487 (counters; driver extended for per-suite counters+preloads)
  • [x] erlang — migrated 761/761 (dict; pass/count → :failed = count-pass)
  • [x] feed — migrated 189/189 (counters; test-harness.sx preload for counters+helper)
  • [~] forth — excluded: foreign Forth corpus (Hayes core.fr) via awk+python preprocessing
  • [x] go — migrated 609/609 (dict; pass/count → :failed = count-pass, like erlang)
  • [~] js — excluded: foreign test262 .js fixtures vs .expected files (python escape, substring match)
  • [~] ocaml — excluded: scrapes lib/ocaml/test.sh (per-assertion epoch runner) + foreign .ml baseline
  • [~] smalltalk — excluded: scrapes lib/smalltalk/test.sh + walks foreign *.st corpus (per briefing)
  • [~] tcl — excluded: foreign *.tcl programs vs # expected: annotations (python escape, bash compare)

(Mark [x] <x> — migrated N/N or [~] <x> — excluded: <reason> or [!] <x> — blocked: <reason>.)

Progress log (newest first)

  • 2026-06-07 — tcl: EXCLUDED (foreign-runner, like lua/js/forth) — and WORKLIST COMPLETE. conformance.sh walks foreign lib/tcl/tests/programs/*.tcl files, reads each first line's # expected: VALUE annotation, uses python3 to escape the Tcl source into an SX helper, evaluates via (tcl-eval-string …), and string-compares got vs expected in bash. No SX test suites, no SX counter/dict scoreboard — the driver can't drive a foreign-program-vs-expected-annotation harness. Left conformance.sh untouched. Not migrated.

    A1 worklist now fully classified: 4 migrated (common-lisp, erlang, feed, go), 5 excluded as foreign runners (forth, js, ocaml, smalltalk, tcl). Loop done.

  • 2026-06-07 — smalltalk: EXCLUDED (the briefing's own classification example: "smalltalk runs *.st via test.sh"). conformance.sh catalogs foreign lib/smalltalk/tests/programs/*.st programs, runs bash lib/smalltalk/test.sh -v, and scrapes its output (final "OK 403/403" summary + per-file pass counts via awk). It loads no SX test suites directly and emits no SX counter/dict scoreboard; the bash layer derives all numbers by text-scraping test.sh. Same "scrapes a test.sh" exclusion as ocaml/lua. Left conformance.sh untouched. Not migrated.
  • 2026-06-07 — ocaml: EXCLUDED (scrapes a test.sh — the briefing's named exclusion criterion). conformance.sh runs bash lib/ocaml/test.sh -v, scrapes its human-readable ok/FAIL lines, and re-classifies each test into suites via bash description-matching heuristics; it also scrapes lib/ocaml/baseline/run.sh (foreign .ml programs). The underlying test.sh is a per-assertion epoch runner — hundreds of individual (ocaml-test-...) evals, one epoch each, with NO suite-level counter variables or dict runners — so there's nothing the driver's counter/dict-scoreboard model can point at without a full rewrite of the test harness. test.sh's own header notes it "Mirrors lib/lua/test.sh" (the canonical excluded case). Left conformance.sh untouched. Not migrated.
  • 2026-06-07 — js: EXCLUDED (foreign-runner, like lua/forth/smalltalk). conformance.sh walks lib/js/test262-slice/**/*.js (foreign test262 fixtures), reads each .js + its sibling .expected file, escapes the JS source with python3, evaluates via (js-eval), and compares output to .expected by substring match — counting pass/fail in bash against a ≥50% target. It loads no SX test suites and emits no SX counter/dict scoreboard (no scoreboard.json at all). The shared driver only epoch-loads SX preloads + evals SX test suites; it can't drive a foreign-fixture-vs-expected comparison harness. Left conformance.sh untouched. Not migrated.
  • 2026-06-07 — go: migrated to MODE=dict, 609/609 exact parity (lex 129, parse 179, types 102, eval 106, runtime 40, stdlib 41, e2e 12). Same shape as erlang — one-session load, per-suite pass + count (total) counters — so each suite's dict-literal runner computes :failed (- count pass). No driver change; conformance.conf + shim only. Kept historical scoreboard schema (language/total_pass/total/suites[name,pass,total,status]).
  • 2026-06-07 — forth: EXCLUDED (foreign-runner, like lua/smalltalk). Its conformance.sh reads a foreign Forth corpus (lib/forth/ans-tests/core.fr, the gerryjackson Hayes Core suite), preprocesses it with awk (strip \ / ( ) comments + TESTING lines), splits it into }T chunks via an external python3 script that generates a chunks.sx of raw source strings, then runs them through the interpreter via (hayes-run-all) → {:pass :fail :error :total}. The shared driver only epoch-loads SX preloads + evals SX test suites; it can't reproduce the awk+python preprocessing of a foreign .fr corpus. No SX tests/*.sx suites exist to point the driver at. Left conformance.sh untouched. Not migrated.
  • 2026-06-07 — feed: migrated to MODE=counters, 189/189 exact parity (basic 30, fanout 29, rank 24, integration 22, content 15, notify 8, home 6, dedupe 9, trending 11, mute 9, page 14, thread 12). Canonical counters shape: fresh session per suite, shared preloads, single feed-test-pass/feed-test-fail pair. Lifted the old script's inline epoch-2 counter+helper defs into lib/feed/test-harness.sx (preloaded last). No driver change — only conformance.conf + test-harness.sx + shim. Kept historical scoreboard schema (suites{name:{pass,fail}}, total_pass/total_fail/total).
  • 2026-06-07 — erlang: migrated to MODE=dict, 761/761 exact parity (tokenize 62, parse 52, eval 408, runtime 93, ring 4, ping-pong 4, bank 8, echo 7, fib 8, ffi 37, vm 78). Erlang exposes pass + count (total) counters, not pass/fail, so each suite's dict-literal runner computes :failed (- count pass). Loads in one session (matches dict mode), so no driver change needed — only conformance.conf + shim. Kept historical scoreboard schema (language/total_pass/total/suites[name,pass,total,status]).
  • 2026-06-07 — common-lisp: UNBLOCKED + migrated. Extended the shared driver's MODE=counters (lib/guest/conformance.sh) with a backward-compatible SUITES format name:file[:pass-var:fail-var[:extra-preload ...]] — optional per-suite counter symbols and per-suite preload chains. Authored lib/common-lisp/conformance.conf (12 suites, 8 distinct counter pairs, per-suite preloads, base PRELOADS=stdlib+prefix; kept historical scoreboard schema) and replaced conformance.sh with the shim. Result 487/487 (0 fail) — HIGHER than the 305/0 baseline, explained: the old script's per-suite timeout 30 was too tight for the slow eval suite (~1525s under contention), silently recording it as 0; the driver's 180s budget recovers its true 182. geometry/mop-trace remain 0/0 (pre-existing refl-class-chain-depth-with load error; counter vars defined as 0 → clean gc-result, no fail-fallback). Regression: haskell backward-compat path verified (fib/sieve/quicksort 2/2/5, matches committed).
  • 2026-06-07 — common-lisp: classified migratable-in-kind (SX suites over epoch) but BLOCKED on driver feature gaps. Baseline bash lib/common-lisp/conformance.sh = 305 passed / 0 failed across 12 suites (3 — evaluator/geometry/mop-trace — already emit 0/0, a pre-existing extraction quirk). Not a foreign runner, so not Excluded. Did NOT migrate (parity unachievable under current modes); left conformance.sh untouched. See Blocked. Driver left unchanged (out of strict per-iteration scope).
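
The per-suite figures in the log entries above can be cross-checked against the reported totals with a few lines of arithmetic. The helper names are hypothetical; `failed_from` mirrors the `:failed = count - pass` rule used for the erlang and go dict runners.

```bash
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, for checking the log by hand.

# Sum per-suite pass counts given as separate arguments.
sum_suites() {
  local total=0 n
  for n in "$@"; do
    total=$((total + n))
  done
  echo "$total"
}

# Derive a failed count from a pass counter and a total (count) counter.
failed_from() {
  local pass="$1" count="$2"
  echo $((count - pass))
}

sum_suites 129 179 102 106 40 41 12              # go     -> 609
sum_suites 62 52 408 93 4 4 8 7 8 37 78          # erlang -> 761
sum_suites 30 29 24 22 15 8 6 9 11 9 14 12       # feed   -> 189
```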

Blocked

  • (none)

Resolved blockers

  • common-lisp (resolved 2026-06-07) — needed per-suite counter names + per-suite preload chains, unsupported by the original MODE=counters (single global counter + fixed PRELOADS). Resolved by extending the shared driver: MODE=counters now accepts name:file[:pass-var:fail-var[:extra-preload ...]] (backward-compatible). This same extension is available to later candidates — erlang/forth/etc. with per-suite counter names or preload chains can now migrate via the extended format instead of blocking.
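
To make the extended counters format concrete, here is a rough sketch of how an entry of the shape name:file[:pass-var:fail-var[:extra-preload ...]] could be split into fields. It is not the driver's actual parser: `parse_counters_suite` is a hypothetical name, and the sketch assumes that extra preloads are further colon-separated fields and that no field contains a colon itself.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical parser, not the code in lib/guest/conformance.sh.
# Assumption: extra preloads are additional colon-separated fields.
# Prints: name|file|pass-var|fail-var|extra preloads (space-separated)
parse_counters_suite() {
  local entry="$1"
  local -a parts
  IFS=: read -r -a parts <<< "$entry"

  local name="${parts[0]:-}" file="${parts[1]:-}"
  local pass_var="${parts[2]:-}" fail_var="${parts[3]:-}"
  local extras="${parts[*]:4}"   # empty when no extra preloads are given

  printf '%s|%s|%s|%s|%s\n' "$name" "$file" "$pass_var" "$fail_var" "$extras"
}
```

A two-field entry leaves the counter fields empty, which is where a driver would fall back to its global counter pair; that fallback is what keeps the older name:file entries working.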