W14: F10 expected-failures baseline gate (test-only)

The OCaml suite's permanent ~273-failure band (in-progress hs-* + the
r7rs radix shadow) has become accepted as normal, so real regressions
hide in red noise (conformance.md F-10). A runner skip-list would rewrite
the hs loops' scoreboards mid-flight; instead, pin the band:

scripts/test-suite-baseline.sh runs the full suite and diffs its FAIL set
against spec/tests/known-failures.txt (273 entries, identity =
"suite > name", error text stripped). Red on a NEW failure (regression)
AND red on a vanished failure (fix landed — delete it from the baseline,
locking in the win). The band still prints as FAIL lines for the teams
working through it; nothing in the runner changes.
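
A minimal sketch of the comparison described above, run against made-up data rather than the real suite. The test names, the sample log lines, and the leading indent on FAIL lines are all assumptions for illustration; only the grep/sed/sort/comm pipeline mirrors the gate.

```bash
#!/bin/bash
# Sketch only: sample data is hypothetical, not taken from the real suite.
set -u
work=$(mktemp -d)

# Hypothetical baseline: two known failures in "suite > name" form.
printf '%s\n' 'hs-demo > alpha' 'r7rs-demo > radix' | sort -u > "$work/baseline.txt"

# Hypothetical runner log: alpha still fails, radix now passes, beta is new.
cat > "$work/run.log" <<'EOF'
  PASS: r7rs-demo > radix
  FAIL: hs-demo > alpha: expected 1, got 2
  FAIL: hs-demo > beta: timeout after 3.2s
EOF

# Keep only the identity: drop the FAIL prefix and everything from the first ": ".
grep '^ *FAIL: ' "$work/run.log" | sed 's/^ *FAIL: //; s/: .*$//' | sort -u > "$work/current.txt"

# comm needs sorted input: -13 keeps lines only in the second file,
# -23 keeps lines only in the first.
new_failures=$(comm -13 "$work/baseline.txt" "$work/current.txt")
vanished=$(comm -23 "$work/baseline.txt" "$work/current.txt")

echo "new:      $new_failures"
echo "vanished: $vanished"
rm -rf "$work"
```

Because the error text is stripped before sorting, a failure whose message changes between runs still counts as the same baseline entry.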

Bonus capture: 2 of the 273 have EMPTY suite labels (can-map-an-array,
string->number) — live evidence for C9, the next checklist item.
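
One way such entries could be spotted, sketched against a hypothetical excerpt. How an empty suite label is actually serialized in known-failures.txt is not shown here, so the assumption that the line begins with "> " (optionally preceded by whitespace) should be checked against the real file first.

```bash
#!/bin/bash
# Sketch only: the excerpt is hypothetical, and the assumption that an empty
# suite label leaves the line starting with "> " is unverified.
set -u
baseline=$(mktemp)
printf '%s\n' \
  ' > can-map-an-array' \
  'hs-demo > some-test' \
  ' > string->number' > "$baseline"

# Match entries with nothing (or only whitespace) before the "> " separator.
empty_labels=$(grep -E '^[[:space:]]*> ' "$baseline")
empty_count=$(grep -cE '^[[:space:]]*> ' "$baseline")

echo "$empty_count entries with an empty suite label:"
echo "$empty_labels"
rm -f "$baseline"
```

The pattern is anchored at the start of the line, so the "->" inside a name such as string->number does not produce a false match.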

Validated end-to-end: GREEN on current tree (5800p/273f — 38 net passes
above dc7aa709's 5762 from this loop's added pins). Runtime ~12 min.

Test-only: no semantics edits, no push.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 04:10:55 +00:00
parent ca4ad404f1
commit 8ba68e0365
3 changed files with 350 additions and 1 deletions

scripts/test-suite-baseline.sh Executable file

@@ -0,0 +1,61 @@
#!/bin/bash
# test-suite-baseline.sh — W14/F10: make FAIL mean something again.
#
# The review (conformance.md F-10): the OCaml suite is not green — a
# permanent ~273-failure band (in-progress hs-* + r7rs radix shadow) is
# normalized, so real regressions hide inside the red noise and nobody can
# tell a new failure from the band.
#
# This gate pins the band instead of ignoring it: the full suite's FAIL
# set is diffed against the checked-in baseline
# (spec/tests/known-failures.txt). Two red conditions, both loud:
#   NEW failure      -> a real regression: fix it (or, if intentional,
#                       justify + add to the baseline in the same commit)
#   VANISHED failure -> something got fixed: delete it from the baseline
#                       so the win is locked in
# Neither touches the runner or the hs loops' scoreboards — the band still
# prints as FAIL lines for the teams working through it.
#
# Usage: bash scripts/test-suite-baseline.sh
# Runtime: full suite, ~5-15 min. Exit 0 = fail set identical to baseline.
set -uo pipefail
cd "$(dirname "$0")/.."
RUNNER=hosts/ocaml/_build/default/bin/run_tests.exe
BASELINE=spec/tests/known-failures.txt
[[ -x "$RUNNER" ]] || { echo "SKIP: $RUNNER not built" >&2; exit 2; }
[[ -f "$BASELINE" ]] || { echo "SKIP: $BASELINE missing" >&2; exit 2; }
log=$(mktemp)
timeout 3000 "$RUNNER" > "$log" 2>&1
rc=$?
if [[ $rc -ne 0 && $rc -ne 1 ]]; then
  echo "RED: runner exited $rc (timeout/crash)"; tail -5 "$log"; rm -f "$log"; exit 1
fi
# Normalize: keep the stable test identity (suite > name), drop messages
# (error text may contain addresses/timings that churn).
current=$(mktemp)
grep '^ FAIL: ' "$log" | sed 's/^ FAIL: //; s/: .*$//' | sort -u > "$current"
new_failures=$(comm -13 <(sort -u "$BASELINE") "$current")
vanished=$(comm -23 <(sort -u "$BASELINE") "$current")
summary=$(grep '^Results:' "$log" | tail -1)
red=0
if [[ -n "$new_failures" ]]; then
  echo "RED: NEW failures not in baseline:"
  sed 's/^/ + /' <<<"$new_failures"
  red=1
fi
if [[ -n "$vanished" ]]; then
  echo "RED: baseline entries now PASSING (delete them from $BASELINE):"
  sed 's/^/ - /' <<<"$vanished"
  red=1
fi
if [[ $red -eq 0 ]]; then
  echo "GREEN: fail set identical to baseline ($(wc -l < "$BASELINE") known failures)"
fi
echo "$summary"
rm -f "$log" "$current"
exit $red