17 test-only commits delivering the full W14 workstream (PLAN.md §W14 —
the enabler every other sx-review fix verifies against):
- spec/tests/test-gate-pins.sx: 7 pin suites (29 tests) for dc7aa709's
landed fixes — K18, K20, K09/K11/K39, K49 (spec side), crit-2
(non-vacuous via side-effect sentinel), plus C21/C22 harness pins
- 6 gate scripts, all bidirectional ledgers (a healed KNOWN entry also
fails): test-protocol-gate (C1/C1b/S4 + C3-C7 quirk ledger + seeded
fuzz-liveness, 11), test-env-parity (runner-only bindings, 7),
test-harness-parity (mcp_tree vs sx_server, 12), test-wasm-corpus
(shipped kernel: 80/83 files green, 5192 passes), test-suite-baseline
(273-failure band pinned in spec/tests/known-failures.txt),
test-differential (49 probes native vs WASM, 3 ledgered)
- spec/harness.sx: C22 fix (IO logged before the mock runs) + C21
harness-run-perform (real CEK suspend/resume mode); W14-assigned per
PLAN approach item 4 — see merge note in the briefing re: the forge
briefing's stricter wording
- C9: empty suite labels eliminated across 6 test files
- web/tests/test-adapter-dom-render.sx: first render-output coverage of
the DOM adapter (the browser-only exclusion was false)
Confirmed handoffs recorded in the briefing: bare-server apply does not
spread args (F-3, runner masks it); both runners' sha3-256 are fake
stubs (test CIDs != production CIDs); generated sx_render.ml is regen-
stale (misses dc7aa709's HTML_TAGS fix); canonical-serialize broken on
bare server for any number.
Verified post-merge in this checkout: gate pins 275/0, protocol-gate
11/0, env-parity 7/0, harness-parity 12/0, differential 49/0.
Briefing conflict (add/add) resolved: kept the loop's completed version
with a merge note preserving the forge briefing's context (8181421c
landed after the worktree branched).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Committed replacement for the review's ephemeral 130-probe corpus:
spec/tests/differential-probes.txt (49 probes: F-1 int/float display, K18
overflow, F-3 apply + dict order, S-4 float printing, strings,
collections, special forms, error normalization) evaluated on the native
server (epoch protocol printer) and the SHIPPED WASM kernel
(eval_wasm_probes.js via guest sx-serialize), diffed by
scripts/test-differential.sh with a KNOWN_DIVERGENT heal-detecting ledger.
Result: 46/49 agree. All 3 divergences share one root cause, verified
live: bare sx_server's `apply` does not spread its argument list —
(apply + (list 1 2 3)) errors "Expected number, got list", (apply str l)
returns the serialized list; the WASM kernel spreads correctly and the
test runner masks the bug with its own apply binding (F-7 class).
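A minimal Python analogue of the two `apply` behaviours described above. This is a sketch, not the sx_server code: `add`, `apply_spreading` and `apply_not_spreading` are hypothetical names, and only the error text is taken from the probe output.

```python
# Hypothetical analogue of the probe (apply + (list 1 2 3)); not sx_server code.

def add(*nums):
    """Variadic '+' that rejects non-numbers, mirroring the probe's error text."""
    for n in nums:
        if isinstance(n, bool) or not isinstance(n, (int, float)):
            raise TypeError(f"Expected number, got {type(n).__name__}")
    return sum(nums)


def apply_spreading(f, args):
    """Spreading apply: each list element becomes one positional argument."""
    return f(*args)


def apply_not_spreading(f, args):
    """Non-spreading apply: the whole list arrives as a single argument."""
    return f(args)
```

With these definitions `apply_spreading(add, [1, 2, 3])` sums the elements, while `apply_not_spreading(add, [1, 2, 3])` raises because `add` receives one list argument.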
Finding refinement: F-1's float-display divergence (0.3 vs
0.30000000000000004) is a K.eval JS-boundary artifact — guest-serialized
output agrees across hosts; the battery therefore compares guest
serialization.
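As a small illustration of why the serializer matters, one and the same double can print two ways. Which formatting each host actually applies is an assumption here; the sketch only shows that the difference can live in the printer rather than in the value.

```python
# One double, two printed forms. The formats chosen are illustrative only.
x = 0.1 + 0.2

shortest_round_trip = repr(x)        # shortest string that round-trips the double
fifteen_digits = format(x, ".15g")   # rounded to 15 significant digits

same_value = float(shortest_round_trip) == x
```

Comparing the output of a single serializer on both hosts, as the battery does, takes printer differences of this kind out of the diff.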
This completes the W14 checklist: 7 pin suites, 6 gate scripts/runners,
2 harness capabilities, C9 label cleanup, adapter-dom render coverage.
Test-only: no semantics edits, no push.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The OCaml suite's permanent ~273-failure band (in-progress hs-* + the
r7rs radix shadow) has become accepted as normal, so real regressions hide
in red noise (conformance.md F-10). A runner skip-list would rewrite the hs
loops' scoreboards mid-flight, so pin the band instead:
scripts/test-suite-baseline.sh runs the full suite and diffs its FAIL set
against spec/tests/known-failures.txt (273 entries, identity =
"suite > name", error text stripped). Red on a NEW failure (regression)
AND red on a vanished failure (fix landed — delete it from the baseline,
locking in the win). The band still prints as FAIL lines for the teams
working through it; nothing in the runner changes.
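The baseline check described above can be sketched as a two-way set difference. This is a hedged illustration, not the contents of scripts/test-suite-baseline.sh: the function name and the (suite, name, error) tuple shape are assumptions.

```python
# Sketch of a bidirectional baseline diff. Input shape is hypothetical:
# each failure is a (suite, name, error_text) tuple.

def diff_against_baseline(failures, known):
    """Return (regressions, healed, green) for a FAIL set vs. a baseline.

    Identity is "suite > name"; the error text is ignored so that a changed
    message does not look like a new failure.
    """
    current = {f"{suite} > {name}" for suite, name, _err in failures}
    regressions = sorted(current - known)   # new failures: red
    healed = sorted(known - current)        # vanished failures: also red
    return regressions, healed, not regressions and not healed
```

A healed entry fails the run on purpose: deleting it from the baseline is what locks in the fix.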
Bonus capture: 2 of the 273 have EMPTY suite labels (can-map-an-array,
string->number) — live evidence for C9, the next checklist item.
Validated end-to-end: GREEN on current tree (5800p/273f — 38 net passes
above dc7aa709's 5762 from this loop's added pins). Runtime ~12 min.
Test-only: no semantics edits, no push.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
All five protocol quirks are OPEN server-side, so the suite pins CURRENT
behavior (verified live) as a bidirectional ledger in
scripts/test-protocol-gate.sh:
- C3: stray (io-response ...) answered as Unknown command (dead guard)
- C4: malformed (epoch) errors and leaves the epoch stale (envelope
changed since the finding: the dc7aa709 guard answers rather than kills)
- C5: decreasing epoch accepted silently (no monotonic enforcement)
- C6: two commands on one line -> one error, neither executed
- C7: vm-trace without compiler -> opaque "Not callable: nil"
Plus the fuzz property that matters: 60 deterministically-seeded hostile
lines (unbalanced parens, control chars, unicode, 2KB lines, stray
io-responses, epoch mutations) followed by a well-formed command — the
server must still answer and exit cleanly. protocol-gate: 11/11.
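A sketch of what "deterministically seeded" buys: the same seed always yields the same hostile lines, so a failure can be replayed. The generator below is an assumption about shape only; the line templates are hypothetical and are not the ones in scripts/test-protocol-gate.sh.

```python
import random


def hostile_lines(seed, count=60):
    """Generate `count` hostile single-line inputs, reproducible from `seed`."""
    rng = random.Random(seed)  # private generator: no global state involved
    makers = [
        lambda: "(" * rng.randint(1, 8) + "epoch",                 # unbalanced parens
        lambda: "".join(chr(rng.randint(1, 8))                     # control chars
                        for _ in range(rng.randint(1, 16))),       # (no \n or \r)
        lambda: '(eval "' + chr(rng.randint(0x400, 0x4FF)) * 4 + '")',  # unicode
        lambda: "x" * 2048,                                        # 2KB line
        lambda: "(io-response %d)" % rng.randint(0, 99),           # stray io-response
        lambda: "(epoch %d)" % -rng.randint(1, 99),                # epoch mutation
    ]
    return [rng.choice(makers)() for _ in range(count)]
```

A driver would feed these lines to a fresh server, send one well-formed command afterwards, and assert on the answer and the exit status.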
When a server-side fix lands, the matching ledger pin fails loudly; the
ledger must then be updated to assert the corrected behavior.
Test-only: no semantics edits, no push.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
conformance.md F-2: no runner fed spec/tests through the shipped
sx_browser.bc.wasm.js — the F-1/F-3 native/WASM divergences existed
undetected because of exactly this gap.
Add hosts/ocaml/browser/run_wasm_corpus.js: boots the shipped kernel
headless in Node (stub block + module preload mirroring
test_wasm_native.js, the blessed boot path), registers the test-framework
hooks, runs ONE test file per process and emits a parseable CORPUS-RESULT
line — process isolation means a hanging file is killed by the driver's
per-file timeout without ending the sweep.
Add scripts/test-wasm-corpus.sh: sweeps spec/tests, applies a SKIP /
KNOWN_FAIL ledger (green-flip on a KNOWN_FAIL fails the run so the ledger
cannot rot), gates on everything else.
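The one-file-per-process design can be sketched as follows. The exact CORPUS-RESULT line format is not given in this message, so the `pass=`/`fail=` layout below is a labeled assumption, as are the function names.

```python
import re
import subprocess

# Hypothetical result-line layout; the real CORPUS-RESULT format may differ.
RESULT_RE = re.compile(r"^CORPUS-RESULT pass=(\d+) fail=(\d+)$", re.M)


def run_one(cmd, timeout_s):
    """Run one test file in its own process; a hang only costs this file."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return ("timeout", 0, 0)     # child is killed, the sweep continues
    match = RESULT_RE.search(proc.stdout)
    if match is None:
        return ("no-result", 0, 0)
    passed, failed = int(match.group(1)), int(match.group(2))
    return ("green" if failed == 0 else "red", passed, failed)


def gate(status, is_known_fail):
    """True if this file lets the run pass under the KNOWN_FAIL ledger."""
    if is_known_fail:
        return status != "green"     # a green-flip fails the run
    return status == "green"
```

Because `subprocess.run` kills the child when the timeout expires, a hanging file produces a `timeout` status instead of stalling the driver.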
Empirical baseline (2026-07-04): 83 files, 80 fully green, 5192 passes,
zero test failures on the shipped kernel — including test-gate-pins
(29/29). KNOWN_FAIL: test-hash-table, test-r7rs and test-sets each hit an
opaque jsoo load-error mid-file (after 22, 87 and 30 passing tests
respectively). Full sweep ~13 min;
sx-build-all.sh wiring deferred to the D3 gate-definition decision.
Test-only: no semantics edits, no push.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
mcp_tree.ml's parallel primitive table drifted from sx_primitives.ml —
the spec-mandated harness verification path silently produced false
findings ((get {:a 1} :a 99) -> nil vs 1, char-class vs substring split,
etc.). dc7aa709 aligned 8 entries as a stopgap; the real fix (linking
sx_primitives) is hosts-lane.
Add scripts/test-harness-parity.sh: drives mcp_tree.exe sx_eval via raw
JSON-RPC and a fresh sx_server.exe via the epoch protocol, runs the
finding's 12-probe battery through both, fails on any divergence (errors
compared by inner message). 12/12 parity today — the stopgap holds and
can no longer rot silently.
Test-only: no semantics edits, no push.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Section-B audit, all verified live over the epoch protocol. Runner-only
bindings absent from production: values, call-with-values (run_tests.ml
:1131/:1140), contains-char? (rt.ml:728 + rt.js:85), trim-right (JS runner
ONLY — absent even from the OCaml runner), sha3-256 (rt.ml:745 + rt.js:88;
production's real primitive is crypto-sha3-256).
Consequences pinned: (canonical-serialize 42) on a fresh server errors
"Undefined symbol: contains-char?" — content addressing broken for ANY
number outside the runners. And BOTH runners' sha3-256 are FAKE stubs
(OCaml: Hashtbl.hash), so every test-computed CID differs from production.
scripts/test-env-parity.sh is a bidirectional ledger: MUST_HAVE bindings
going missing fail; a KNOWN_DRIFT binding APPEARING also fails with
instructions to move it to MUST_HAVE and flip the consequence pin — the
ledger cannot rot silently in either direction. 7/7 green.
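The two-list ledger can be sketched like this. The binding names come from the audit above; the function and its message wording are hypothetical, not the script's actual output.

```python
# Sketch of a MUST_HAVE / KNOWN_DRIFT check over the set of bound names.

def check_env(bound, must_have, known_drift):
    """Return a list of problems; an empty list means the ledger is green."""
    problems = []
    for name in sorted(must_have - bound):
        problems.append(f"MUST_HAVE missing: {name}")
    for name in sorted(known_drift & bound):
        problems.append(
            f"KNOWN_DRIFT appeared: {name} "
            "(move it to MUST_HAVE and flip its consequence pin)"
        )
    return problems
```

For example, if a production fix binds `contains-char?`, the check turns red until the entry is moved, which is how drift in either direction gets noticed.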
Test-only: no semantics edits, no push.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pre-fix, a routing-failure page was stored in the HTTP response cache as
200 and served byte-identically to every later visitor until restart
(cold 2s -> warm 0.0005s). dc7aa709 made http_render_page return
(html, is_error) and gated cache insertion on `not is_err`.
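A minimal sketch of the gating idea, assuming a dict-backed cache. `render_page` and `handle` are hypothetical stand-ins, not the OCaml http_render_page; only the (html, is_error) return shape and the `not is_err` gate come from the fix described above.

```python
# Sketch: cache only successful renders, so error pages are re-rendered.

def render_page(path, routes):
    """Return (html, is_error); unknown paths yield an error page."""
    if path in routes:
        return routes[path], False
    return "<h1>not found</h1>", True


def handle(path, routes, cache, render_log):
    if path in cache:
        return cache[path]
    html, is_err = render_page(path, routes)
    render_log.append(path)        # one entry per actual render
    if not is_err:                 # the gate: error pages never enter the cache
        cache[path] = html
    return html
```

Requesting the same nonexistent path twice then produces two render-log entries, which is the property the HTTP-mode case asserts.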
Extend scripts/test-protocol-gate.sh with an HTTP-mode case: fresh
sx_server.exe --http on a random port (timeout-bounded, own child killed),
GET the same nonexistent path twice, assert both requests re-render (two
[sx-http] render lines) and the "[cache] ... error page, not cached" gate
line appears. Standalone-worktree caveat (all docs pages render as soft
error pages, so no positive cache control) documented in the script.
5/5 protocol-gate green; 267/0 sx gate pins. All seven section-A test-debt
pins now landed (K18, K20, K09/K11/K39, K49, crit-2, C1/C1b, S4).
Test-only: no semantics edits, no push.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pre-fix, one malformed or non-ASCII line on sx_server's top-level command
channel raised an uncaught Parse_error and killed the whole shared process
(bridges + conformance runners). dc7aa709 guards the parse; the server now
answers (error N "Malformed command line: ...") and keeps serving.
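The shape of the guard can be sketched with a toy parser. Everything here is hypothetical except the `(error N "Malformed command line: ...")` reply form quoted above; in particular the `(ok N)` reply and the parser rules are placeholders.

```python
# Toy model of "guard the parse, answer, keep serving". Not sx_server code.

class ParseError(Exception):
    pass


def parse_command(line):
    """Accept only ASCII, paren-balanced lines that start with '('."""
    if not line.isascii():
        raise ParseError("non-ASCII byte")
    if not line.startswith("("):
        raise ParseError("expected list")
    depth = 0
    for ch in line:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ParseError("unexpected )")
    if depth != 0:
        raise ParseError("unterminated list")
    return line


def serve(lines, epoch=0):
    """Answer every line; a malformed one yields an error reply, not a crash."""
    replies = []
    for line in lines:
        try:
            parse_command(line)
            replies.append(f"(ok {epoch})")          # placeholder success reply
        except ParseError as exc:
            replies.append(f'(error {epoch} "Malformed command line: {exc}")')
    return replies
```

The key property is that the loop continues after the bad line, so the follow-up command is still answered.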
Add scripts/test-protocol-gate.sh: per case, spawn a fresh timeout-bounded
sx_server.exe (never touches a shared process) and assert the error
response, the follow-up epoch still evaluating, and a clean exit. Cases:
C1 unterminated list + garbage line, C1b non-ASCII byte (exact review
repros from plans/sx-review/hosts.md), plus a well-formed control. 4/4
green. Structured to grow into W14 section E's protocol fuzz suite (C3-C7).
Test-only: no semantics edits, no push.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The forge already DRIVES sessions (briefing → tmux launch, sx-fix-up.sh).
This records what comes BACK, making the forge a true system of record:
- sx-fix-writeback.sh <forge-agent> [kind] [base-ref]: reads new commits on
loops/sx-<slug>, appends a record per commit to writeback.sxsrc (idempotent,
matched by sha), then rebuilds the forge + replays them as agentic-sx
commit!s on agents/<forge-agent> and re-dumps forge.sxdata.
- forge-build.sxsrc: fb-writeback-records / fb-replay-writeback / fb-do-writeback
— each real-git commit becomes an agentic-sx commit whose tree is a small
commit.sx pointer (sha/branch/message/files); real git holds the code, the
forge holds the index, so the CID stays small.
- writeback.sxsrc: the append-only record log (source of truth for what's
been recorded); replayed chronologically so agent branch heads advance right.
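The idempotent append and in-order replay can be sketched as below. The record fields (`sha`, `agent`, `message`) and function names are assumptions for illustration; the real record format lives in writeback.sxsrc.

```python
# Sketch of an append-only log with sha-matched idempotence.

def append_new(log, commits):
    """Append commits whose sha is not yet recorded; return how many were added."""
    seen = {rec["sha"] for rec in log}
    added = 0
    for commit in commits:
        if commit["sha"] not in seen:
            log.append(dict(commit))
            seen.add(commit["sha"])
            added += 1
    return added


def replay_heads(log):
    """Replay records in log order; the last record per agent is its head."""
    heads = {}
    for rec in log:
        heads[rec["agent"]] = rec["sha"]
    return heads
```

Running the append twice over the same commits adds nothing the second time, which is what makes re-running the writeback script safe.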
Verified live: the sx-gate loop's first real commit (f09368e1, "pin K18
expt-overflow float-promotion") is now recorded as a test-kind agentic-sx
commit on agents/ws-W14 (session log: spawn → finding → writeback), its
commit.sx pointing back at the real-git sha.
Loop closed: forge → tmux (drive) and tmux → real-git → forge (record).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
First live test of the sx-forge technology driving a real work session:
- sx-fix-up.sh <forge-agent> <briefing.md>: reads the agent's briefing FROM
the rose-ash/sx-review forge (agentic-sx branch), materialises a git
worktree + branch (loops/sx-<slug>), and spins up a tmux+claude session
briefed from the forge. Commits are LOCAL by default (no push).
- sx-fix-down.sh [--clean]: stop the sx-fix session; --clean removes worktrees.
- plans/agent-briefings/sx-gate-loop.md: W14 (test gate) briefing — the safe
first payload (test-only, cannot regress the 5762p/274f baseline), scoped
commit-no-push with hard guardrails.
Verified live: launcher read the W14 briefing from the forge, created worktree
/root/rose-ash-loops/sx-ws-w14 on loops/sx-ws-w14, booted claude, and the agent
picked up the briefing. Watch: tmux a -t sx-fix. Note: MCP servers need /mcp
auth in a fresh worktree (agent works via Bash meanwhile).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replaces the watchdog-bump approach with an automated check. The next 5× (or
worse) substrate regression will trip the alarm at build time instead of
hiding behind a deadline bump and only being noticed weeks later.
Components:
* lib/perf-smoke.sx — four micro-benchmarks chosen for distinct substrate
failure modes: function-call dispatch (fib), env construction (let-chain),
HO-form dispatch + lambda creation (map-sq), TCO + primitive dispatch
(tail-loop). Warm-up pass populates JIT cache before the timed pass so we
measure the steady state.
* scripts/perf-smoke.sh — pipes lib/perf-smoke.sx to sx_server.exe, parses
per-bench wall-time, asserts each is within FACTOR× of the recorded
reference (default 5×). `--update` rewrites the reference in-place.
* scripts/sx-build-all.sh — perf-smoke wired in as a post-step after JS
tests. Hard fail if any benchmark regressed beyond budget.
Reference numbers: minimum across 6 back-to-back runs on this dev machine
under typical concurrent-loop contention (load ~9, 2 vCPU, 7.6 GiB RAM,
OCaml 5.2.0, architecture @ 92f6f187). Documented in
plans/jit-perf-regression.md including how to update them.
The 5× factor is chosen so contention noise (~1–2× variance) doesn't trigger
false alarms but a real ≥5× substrate regression — the kind that motivated
this whole investigation — fails the build immediately.
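The budget check reduces to a per-benchmark comparison against a minimum-of-runs reference. A hedged sketch follows; the timings in the usage note are made up, and the function names are not those of scripts/perf-smoke.sh.

```python
# Sketch of the FACTOR-times budget check.

def reference_from_runs(runs):
    """Reference = per-benchmark minimum across back-to-back runs."""
    return {bench: min(run[bench] for run in runs) for bench in runs[0]}


def regressions(measured, reference, factor=5.0):
    """Benchmarks whose wall time exceeds factor x reference, sorted by name."""
    return sorted(bench for bench, seconds in measured.items()
                  if seconds > factor * reference[bench])
```

With a reference of 0.10 s for fib, a contended run at 0.45 s stays inside the 5× budget, while a run above 0.50 s would fail the build.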
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/extract-upstream-tests.py — new walker that scrapes
/tmp/hs-upstream/test/**/*.js for test('name', ...) patterns. Uses
brace-counting that handles strings, regex, comments, and template
literals. Two modes:
- merge (default): preserves existing test bodies, only adds new tests
- --replace: discards old bodies, fully re-extracts (use when bodies
drift due to upstream cleanup)
Merge mode is what we want for an incremental sync — the old snapshot
had bodies that had been hand-tuned for our auto-translator; raw
re-extraction loses those tweaks and regresses ~250 working tests
back to SKIP (untranslated).
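The difference between the two modes, sketched over name-to-body dicts. This is an illustration under assumptions: the snapshot is really JSON with more fields, and whether merge mode keeps tests that vanished upstream is not stated here (the sketch keeps them).

```python
# Sketch of merge vs. --replace over {test_name: body} mappings.

def sync(old, extracted, replace=False):
    """Merge keeps existing bodies and only adds new names; replace re-extracts."""
    if replace:
        return dict(extracted)
    merged = dict(old)
    for name, body in extracted.items():
        merged.setdefault(name, body)   # never overwrite a hand-tuned body
    return merged
```

In merge mode a hand-tuned body survives even when upstream's raw body for the same test has changed.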
Snapshot updated: spec/tests/hyperscript-upstream-tests.json grows
from 1496 → 1514 tests. All 18 new tests are documented as either
manual bodies (2) or skips (16):
Manual bodies (2):
- on resize from window — dispatches via host-global "window"
- toggle between followed by for-in loop works — direct test
Skips for architectural reasons (16):
- 13× core/tokenizer — upstream exposes a streaming token API
(matchToken, peekToken, consumeUntil, pushFollow…) that our
tokenizer doesn't surface. Implementing it = a token-stream
wrapper primitive over hs-tokenize output.
- 2× ext/component — template-based components via
<script type="text/hyperscript-template">. We use defcomp directly;
no template-bootstrap path.
- 1× toggle does not consume a following for-in loop — parser
ambiguity in 'toggle .foo for <X>'. Parser must distinguish
'for <duration>ms' from 'for <ident> in <expr>'. The 'toggle
between' variant works (different parse path).
Net per-suite status: every individual suite passes 100% on counted
tests (skips excluded). 1496 runnable / 1514 total = 100% on what runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Plans + briefings for four new language loops, each with a delcc/JIT
showcase that the runtime already supports natively:
- common-lisp — conditions + restarts on delimited continuations
- apl — rank-polymorphic primitives + 6 operators on the JIT
- ruby — fibers as delcc, blocks/yield as escape continuations
- tcl — uplevel/upvar via first-class env chain, the Dodekalogue
Launcher scripts now spawn 12 windows (was 8).
Previous version ran all 7 claude sessions in the main working tree on
branch 'architecture'. That would race on git operations and cross-
contaminate commits between languages even though their file scopes
don't overlap. Now each session runs in /root/rose-ash-loops/<lang> on
branch loops/<lang>, created from the current architecture HEAD.
sx-loops-down.sh gains --clean to remove the worktrees; loops/<lang>
branches stay unless explicitly deleted.
Also: second Enter keystroke after the /loop command, since Claude's
input box sometimes interprets the first newline as a soft break.
sx-loops-up.sh spawns a tmux session 'sx-loops' with 7 windows (lua,
prolog, forth, erlang, haskell, js, hs). Each window runs 'claude'
and then /loop against its briefing at plans/agent-briefings/<x>-loop.md.
Optional arg is the interval (e.g. 15m); omit for model-self-paced.
Each loop does ONE iteration per fire: pick the first unchecked [ ] item,
implement, test, commit, tick, log — then stop. Commits push to
origin/loops/<lang> (safe; not main).
sx-loops-down.sh sends /exit to each window and kills the session.
Attach with: tmux a -t sx-loops
- scripts/loop-guard.sh — atomic claim with 30-min staleness overtake,
appends NDJSON event to .loop-logs/<lang>.ndjson. Exit 0 = go ahead,
exit 1 = another run is live, skip.
- scripts/loop-release.sh — clear lock, log release with exit status.
Intended for 7 per-language /schedule routines firing every 15 minutes.
Lock detects overlap so tight cadences are safe; stale lock (>30 min)
overtaken automatically if an agent dies mid-run.
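A sketch of the claim-or-skip logic using an exclusive-create lock file. It is not scripts/loop-guard.sh: the event field names are assumptions, and the overtake path here is not itself atomic (two agents could both overtake a stale lock at the same instant).

```python
import json
import os
import time

STALE_S = 30 * 60  # staleness threshold from the description above


def claim(lock_path, log_path, now=None):
    """Return 0 if the lock was claimed (go ahead), 1 if another run is live."""
    now = time.time() if now is None else now
    event = "claim"
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        if now - os.path.getmtime(lock_path) <= STALE_S:
            return 1                       # fresh lock: skip this fire
        fd = os.open(lock_path, os.O_WRONLY | os.O_TRUNC)
        event = "overtake"                 # stale lock: take it over
    with os.fdopen(fd, "w") as lock:
        lock.write(str(os.getpid()))
    os.utime(lock_path, (now, now))        # stamp the claim time
    with open(log_path, "a") as log:
        log.write(json.dumps({"event": event, "ts": now}) + "\n")
    return 0
```

The `now` parameter exists only so the staleness rule can be exercised without waiting 30 minutes.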
Python + shell tooling used to split grouped index.sx files into
one-directory-per-page layout (see the hyperscript gallery migration).
name-mapping.json records the rename table; strip_names.py is a helper
for extracting component names from .sx sources.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 Step 1 of the architecture roadmap. The old cssx.sx
(cssx-resolve, cssx-process-token, cssx-template, old tw function)
is superseded by the ~tw component system in tw.sx.
- Delete shared/sx/templates/cssx.sx
- Remove cssx.sx from all load lists (sx_server.ml, run_tests.ml,
mcp_tree.ml, compile-modules.js, bundle.sh, sx-build-all.sh)
- Replace (tw "tokens") inline style calls with (~tw :tokens "tokens")
in layouts.sx and not-found.sx
- Remove _css-hash / init-css-tracking / SX-Css header plumbing
(dead code — ~tw/flush + flush-collected-styles handle CSS now)
- Remove sx-css-classes param and meta tag from shell template
- Update stale data-cssx references to data-sx-css in tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
signal-add-sub! called (append! subscribers f), which returns a new list
when the List is immutable, and discarded the result. Once
signal-remove-sub! had replaced the subscribers list via dict-set!,
re-adding a subscriber silently failed, so the counter island only
worked once (0→1, then stuck).
Fix: use (dict-set! s "subscribers" (append ...)) to explicitly update
the dict field, matching signal-remove-sub!'s pattern.
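A Python analogue of the bug, mapping the mutable and immutable list cases onto `list` and `tuple`. That mapping is an assumption made for illustration; the function names are hypothetical and this is not the signal implementation.

```python
# Analogue: an in-place append that silently becomes a no-op once the
# stored collection is immutable and the returned value is discarded.

def append_bang(xs, x):
    """Mutate a list in place; for a tuple, return a new tuple instead."""
    if isinstance(xs, list):
        xs.append(x)
        return xs
    return xs + (x,)


def add_sub_buggy(signal, f):
    append_bang(signal["subscribers"], f)          # result discarded


def add_sub_fixed(signal, f):
    signal["subscribers"] = append_bang(signal["subscribers"], f)


def remove_sub(signal, f):
    # Replaces the field with a fresh immutable collection.
    signal["subscribers"] = tuple(s for s in signal["subscribers"] if s is not f)
```

The buggy version works until the first removal swaps in the immutable collection, which matches the "worked once, then stuck" symptom.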
Build pipeline fixes:
- sx-build-all.sh now bundles spec→dist and recompiles .sxbc bytecode
- compile-modules.js syncs .sx source files alongside .sxbc to wasm/sx/
- Per-file cache busting: wasm, platform JS, and sxbc each get own hash
- bundle.sh adds cssx.sx to dist
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The deployed sx_browser.bc.wasm.js was actually the js_of_ocaml output
(pure JS), not the wasm_of_ocaml loader. Nothing synced the correct
build output from _build/ to shared/static/wasm/.
- sx_build target=ocaml now auto-syncs WASM kernel + JS fallback + assets
- sx-build-all.sh syncs after dune build
- Correct 68KB WASM loader replaces 3.6MB JS imposter
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>