rose-ash/plans/lib-guest-scheduler.md

# lib/guest/scheduler — extraction plan

Two distinct concurrency models — Erlang's addressed processes + mailboxes, and
Go's anonymous channels + goroutines — sit on the same underlying machinery:
a fork/yield/block/resume scheduler over CEK io-suspended continuations. This
plan captures that machinery as `lib/guest/scheduler/` so language N+1 with a
new concurrency model costs ~200 lines of model-specific code instead of
re-inventing the scheduler.

Reference: `plans/lib-guest.md` (parent — two-language rule, stratification),
`plans/erlang-on-sx.md` (first consumer, in production), Go-on-SX (second
consumer, see `plans/go-on-sx.md` once that lands).

**Branch:** `architecture`. SX files via `sx-tree` MCP only.

## Thesis

The substrate already provides what a scheduler needs: CEK io-suspension
(`make-cek-suspended`, `cek-resume`) gives suspendable execution; first-class
environments give each unit of execution its own scope; the trampolined
evaluator means we never blow the host stack. What every guest with concurrency
*re-implements* on top of this is the **fork/yield/block/resume protocol** —
the bookkeeping that decides which suspended computation runs next.

Two concrete consumers, two different concurrency vocabularies, sharing one
underlying scheduler, is the proof. If only Erlang lives on it, "scheduler kit"
is a euphemism for "Erlang scheduler with a Go skin." The two-language rule
is the gate.

## Current state (2026-05-26)

- **Erlang-on-SX** has the full pattern in production: 729/729 conformance,
  spawn/send/receive, selective receive, monitor/link, hot reload. The
  scheduler logic is currently coupled to Erlang-shaped concepts (PIDs,
  mailboxes, links) — extraction-blocking but not extraction-defeating.
- **Go-on-SX** does not exist yet. `plans/go-on-sx.md` is the umbrella plan
  (TBD); this scheduler plan is a sibling/dependency.
- **lib/guest/scheduler/** does not exist. The two-language rule blocks
  extraction until Go-on-SX independently implements its scheduler.

**Status: Phase 0 (Erlang shape capture).** No code change in this plan yet.

## Why the two models actually share a kit

The non-obvious claim is that Erlang processes and Go goroutines really do
share machinery beneath their different vocabularies. The mapping:

| Concept | Erlang | Go | Common kit name |
|---|---|---|---|
| Unit of execution | process (PID-addressed) | goroutine (anonymous) | **task** |
| Spawn | `spawn(Fun)` → PID | `go expr` → nothing | `task-spawn` |
| Block target | mailbox match | channel send/recv | `task-block` |
| Wake condition | message arrives | counterpart ready | `task-resume` predicate |
| Yield | `receive` with no match | channel blocked | scheduler hands off |
| Termination | exit reason → linked tasks | panic / return | task lifecycle |
| Selection | selective `receive` | `select` statement | both = "wait for any of N predicates" |

What the kit owns:
- The **task table** (token → suspended CEK continuation + status).
- The **runnable queue** + scheduling policy (round-robin v1; pluggable).
- The **block→resume protocol**: a blocked task registers a predicate; when
  any task changes state, blocked tasks are re-polled; first whose predicate
  fires becomes runnable.
- The **fairness/preemption budget** — gas per step before forced yield.

What each language owns:
- The semantics layer on top: Erlang's PID→task map + mailbox per task +
  selective-receive predicates; Go's channel value → blocked-task list per
  channel + send/recv pairing + select multiplexing.
- The language-visible API (`spawn`/`!`/`receive` vs `go`/`<-`/`select`).

This is exactly the lib/guest pattern: extract the dispatch skeleton, keep
the rules in the language layer.
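
A minimal sketch of the kit-owned state, written in Python rather than SX so it can be run standalone. Everything here is an assumption for illustration: Python generators stand in for CEK io-suspended continuations, and the `Scheduler` class, its method names, and the "predicate returns `False` when not ready" convention are hypothetical, not the eventual kit API. Gas budgeting is left out.

```python
from collections import deque

class Scheduler:
    """Toy model: generators stand in for suspended CEK continuations."""
    def __init__(self):
        self.tasks = {}           # task table: token -> generator
        self.status = {}          # token -> "runnable" | "blocked" | "finished"
        self.runnable = deque()   # round-robin queue of (token, resume value)
        self.blocked = {}         # token -> resume predicate
        self.next_token = 0

    def spawn(self, body):
        token = self.next_token
        self.next_token += 1
        self.tasks[token] = body()
        self.status[token] = "runnable"
        self.runnable.append((token, None))
        return token

    def _repoll(self):
        # Re-poll blocked tasks; any whose predicate fires becomes runnable.
        for token, pred in list(self.blocked.items()):
            value = pred()
            if value is not False:
                del self.blocked[token]
                self.status[token] = "runnable"
                self.runnable.append((token, value))

    def run(self):
        while True:
            self._repoll()
            if not self.runnable:
                return            # everything finished, or deadlocked
            token, value = self.runnable.popleft()
            try:
                request = self.tasks[token].send(value)
            except StopIteration:
                self.status[token] = "finished"
                continue
            if request is None:   # bare yield: voluntary yield, requeue at tail
                self.runnable.append((token, None))
            else:                 # yielded a predicate: block until it fires
                self.blocked[token] = request
                self.status[token] = "blocked"

inbox, log = [], []

def consumer():
    msg = yield (lambda: inbox.pop(0) if inbox else False)
    log.append(("got", msg))

def producer():
    log.append("producing")
    yield                         # voluntary yield
    inbox.append("hello")

sched = Scheduler()
sched.spawn(consumer)
sched.spawn(producer)
sched.run()
```

The consumer blocks first, the producer yields once and then fills the inbox, and the next re-poll wakes the consumer with the message as its resume value.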

## API surface (proposed — design only, not yet implemented)

```
(make-scheduler &key gas-per-step ;; default 1000
                     policy)      ;; :round-robin | :fifo
  -> scheduler-handle

(task-spawn sched body-thunk) -> task-token
  ;; body-thunk is a 0-arg fn whose body runs as the task.
  ;; Returns immediately; task is enqueued runnable.

(task-current sched) -> task-token
  ;; Inside a task, the token of the running task. Useful for self-reference.

(task-yield sched) -> nil
  ;; Voluntary yield. Caller is re-enqueued at the tail of runnable.

(task-block sched resume-predicate) -> any
  ;; Caller suspends. Predicate is (fn () -> resume-value-or-#f).
  ;; When predicate returns non-#f, caller resumes with that value.
  ;; Predicate is polled on every scheduler tick when there's nothing
  ;; obviously runnable. (Optimisation: language layer can wake explicitly —
  ;; see task-wake.)

(task-wake sched task) -> nil
  ;; Hint to the scheduler: re-poll this task's resume-predicate now.
  ;; Used by sender-side when a receiver might unblock.

(task-status sched task) -> :runnable | :blocked | :finished | :crashed

(task-result sched task) -> value | {:error reason}
  ;; After :finished or :crashed.

(scheduler-step sched) -> :ran | :idle | :all-done
  ;; Run at most gas-per-step instructions of one task. Caller drives the
  ;; loop.

(scheduler-run sched) -> nil
  ;; Run until :all-done. Equivalent to (until (= :all-done (scheduler-step
  ;; sched))).
```
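
The `scheduler-step` contract above (run at most `gas-per-step` instructions of one task, with the caller driving the loop) can be modelled in a few lines of Python. This is a sketch under stated assumptions: each `next()` on a generator counts as one "instruction", the dict-based scheduler and function names are hypothetical, and the toy has no blocked tasks, so it never returns an idle status.

```python
from collections import deque

def make_scheduler(gas_per_step=1000):
    return {"gas": gas_per_step, "runnable": deque()}

def task_spawn(sched, body_thunk):
    sched["runnable"].append(body_thunk())    # body_thunk returns a generator

def scheduler_step(sched):
    """Run at most gas-per-step instructions of one task."""
    if not sched["runnable"]:
        return "all-done"
    task = sched["runnable"].popleft()
    for _ in range(sched["gas"]):
        try:
            next(task)                        # one "instruction"
        except StopIteration:
            return "ran"                      # task finished inside its budget
    sched["runnable"].append(task)            # budget exhausted: forced yield
    return "ran"

trace = []

def counter(name, n):
    def body():
        for i in range(n):
            trace.append((name, i))
            yield
    return body

sched = make_scheduler(gas_per_step=2)
task_spawn(sched, counter("a", 3))
task_spawn(sched, counter("b", 3))

results = []
while True:
    r = scheduler_step(sched)
    if r == "all-done":
        break
    results.append(r)
```

With a budget of 2, neither counter can monopolise the loop: `a` runs two iterations, then `b` runs two, then each finishes.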

Notes on the design:
- `task-block` with a resume-predicate is the universal blocking primitive.
  Erlang's `receive` is `(task-block sched (fn () (mailbox-match self pat)))`.
  Go's `<-ch` is `(task-block sched (fn () (channel-recv-ready ch)))`.
- `task-wake` is the optimisation: instead of polling every blocked task
  every step, the language layer wakes the specific task whose predicate
  is now likely true. v1 can omit it; performance work later.
- `gas-per-step` gives fairness without true preemption. Tasks that don't
  yield within their gas budget are force-yielded by the CEK loop. (CEK
  io-suspension already does this for IO; gas budget extends to plain
  instructions.)
- No priority/affinity in v1. Both Erlang and Go default to non-priority
  scheduling; specialised cases (Erlang's high-priority processes) are
  language-layer concerns.
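
To make the "universal blocking primitive" note concrete, here is a Python sketch in which an Erlang-style selective receive and a Go-style channel receive both block through the same predicate mechanism. It is illustrative only: generators stand in for CEK continuations, every `yield` is treated as a `task-block`, and `mailbox_match` / `channel_recv_ready` are hypothetical stand-ins for the language-layer helpers named above. Payloads that are literally `False` would be misread as "not ready" in this toy.

```python
from collections import deque

runnable, blocked = deque(), []    # toy kit state: queue + (task, predicate) pairs

def spawn(gen):
    runnable.append((gen, None))

def run():
    while runnable or blocked:
        for entry in list(blocked):            # re-poll blocked predicates
            value = entry[1]()
            if value is not False:
                blocked.remove(entry)
                runnable.append((entry[0], value))
        if not runnable:
            break                              # deadlock: nothing can progress
        task, value = runnable.popleft()
        try:
            pred = task.send(value)            # every yield here is a task-block
            blocked.append((task, pred))
        except StopIteration:
            pass

# Erlang-style layer: mailbox + selective-receive predicate.
def mailbox_match(mailbox, wanted_tag):
    def pred():
        for i, msg in enumerate(mailbox):
            if msg[0] == wanted_tag:
                return mailbox.pop(i)          # skip over non-matching messages
        return False
    return pred

# Go-style layer: channel buffer + recv-ready predicate.
def channel_recv_ready(buf):
    return lambda: buf.pop(0) if buf else False

mailbox, chan, seen = [("noise", 1)], [], []

def erlang_proc():
    msg = yield mailbox_match(mailbox, "ping")
    seen.append(("erlang", msg))
    chan.append("from-erlang")                 # a "send" on the Go-style channel

def goroutine():
    v = yield channel_recv_ready(chan)
    seen.append(("go", v))

def sender():
    mailbox.append(("ping", 42))
    yield (lambda: True)                       # always-ready predicate = plain yield

spawn(erlang_proc()); spawn(goroutine()); spawn(sender())
run()
```

The scheduler loop never inspects mailboxes or channels; it only calls predicates, which is the property the kit depends on.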

## Build order — phases

This is a long-running plan paced against Go-on-SX. Phases are not loop-style
"one commit per phase" — they're milestone gates.

### Phase 0 — Erlang shape capture (doc-only) ⬜
- Read `lib/erlang/runtime.sx` scheduler code (currently coupled to Erlang
  vocabulary).
- Write a 1-page summary of what's actually a scheduler and what's actually
  Erlang. Identify the boundary.
- **Acceptance:** summary committed to this plan as a new section "Erlang
  scheduler shape (captured 2026-MM-DD)". No code change.
- **Output:** clear-eyed mental model. Without this, we'll merge Erlang's
  scheduler shape into the kit and pretend it generalises.

### Phase 1 — Go scheduler independent implementation ✅
- During Go-on-SX, implement `lib/go/sched.sx` from scratch. Do NOT look at
  Erlang's scheduler while doing this. (Or read it once, then close it.)
- Pass Go's channel + goroutine + select conformance tests.
- **Acceptance:** Go scheduler green, `lib/go/scoreboard.json` includes
  scheduler tests, two-consumer rule now passable.
- **Output:** two independent, working implementations of the same idea.
- **Status (2026-05-28):** Done. `lib/go/sched.sx` ships channels as
  closure-bundles `(:go-chan SEND RECV CLOSED? CLOSE! LEN)` sharing a
  mutable buffer + closed flag. Goroutines: `go` stmt is v0-synchronous
  (no real preemption — flagged Phase 5b). select dispatches by source
  order picking first ready case; default makes it non-blocking;
  blocking-no-default returns `:select-blocked-no-default` sentinel.
  40 runtime tests + 12 e2e programs use the scheduler primitives.
  **Two-consumer rule passable** — Erlang's scheduler and Go's
  scheduler both exist as independent implementations.

### Phase 2 — Diff and proposed kit ⬜
- Side-by-side diff: Erlang's scheduler vs Go's scheduler. Where do they
  agree? Where does each have language-specific bookkeeping?
- The diff is the kit. Things in *both* go in `lib/guest/scheduler/`; things
  in only one stay in `lib/erlang/` or `lib/go/`.
- Draft `lib/guest/scheduler/api.sx` (signatures only, no body) reflecting the
  proposed surface.
- **Acceptance:** API draft circulated for review; agreement that the surface
  covers both consumers; no merge yet.

### Phase 3 — Implement `lib/guest/scheduler/` ⬜
- Implement the kit per the agreed API. New file(s) in `lib/guest/scheduler/`.
- The kit has its own tests in `lib/guest/scheduler/tests/` — agnostic of any
  particular language vocabulary.
- **Acceptance:** kit tests pass. Erlang and Go conformance scoreboards
  unchanged (the language implementations still use their own scheduler —
  we haven't refactored yet).

### Phase 4 — Refactor Erlang to use the kit ⬜
- `lib/erlang/runtime.sx` scheduler logic deleted; replaced with calls into
  `lib/guest/scheduler/`. Erlang's PID table, mailbox-per-PID, selective
  receive stay in `lib/erlang/`.
- **No-regression gate:** Erlang conformance holds at current pass count
  (currently 729/729). Hard requirement.
- **Acceptance:** Erlang scoreboard unchanged; `lib/erlang/runtime.sx`
  meaningfully smaller (the scheduler code is gone).

### Phase 5 — Refactor Go to use the kit ⬜
- Same exercise for Go. `lib/go/sched.sx` shrinks to channel/goroutine
  bookkeeping + delegation.
- **No-regression gate:** Go conformance scoreboard at its current pass
  count.
- **Acceptance:** Go scoreboard unchanged; `lib/go/sched.sx` meaningfully
  smaller.

### Phase 6 — Documentation + design-diary close ⬜
- Document `lib/guest/scheduler/` API in `lib/guest/README.md` (or wherever
  the lib/guest API index lives).
- Capture the chiselling diary: what *almost* went in the kit but ended up
  language-specific, and why. This is the load-bearing knowledge for the
  third consumer when it arrives.
- **Acceptance:** API documented; diary section added to this plan.

## Two-language rule — gating

**The rule is hard.** No code in `lib/guest/scheduler/` lands until BOTH
Phase 1 (Go independent) AND Phase 0 (Erlang capture) are complete AND a
review confirms the two implementations actually share machinery in a way
the kit captures.

If, during Phase 2 diff, we discover that the agreement is shallow (e.g.,
both have a runnable queue but the policies are fundamentally incompatible),
the **right outcome is to NOT extract**. Add a "rejected extraction" note to
this plan documenting what we learned and close it. That outcome is fine:
it tells us the two concurrency models aren't actually siblings, which is a
real result.

## Open questions

- **Preemption.** v1 is cooperative; gas-per-step gives fairness but not
  hard preemption. Erlang's BEAM preempts by reduction counting (a process is
  descheduled once its reduction budget runs out), which is close in spirit
  to gas-per-step. Go has used asynchronous, signal-driven preemption since
  1.14, which has no direct analogue in a cooperative scheduler over CEK. Is
  gas-per-step + voluntary yield enough? Probably for v1; revisit if a guest
  needs hard real-time.
- **Priority/affinity.** Both Erlang and Go can run without it. Defer.
- **Distribution.** Erlang nodes are a language-specific layer on top of the
  local scheduler. Go has no built-in distributed channels, so any networked
  channel layer would sit in the language layer too. Out of scope.
- **Cancellation.** Go has `context.Context` (cooperative: the goroutine must
  observe `ctx.Done()`); Erlang has `exit/2` (an exit signal sent to the
  target process). Both can be modelled as "deliver a cancellation to a
  task." Worth modelling? Probably as a kit primitive
  `(task-cancel sched task reason)` that delivers an exception via CEK
  exception machinery, with the language layer wrapping it.
- **Third consumer.** If/when JS-on-SX gets a proper async/await + Promise
  scheduler, that'd be a great third consumer to validate the kit didn't
  over-fit to Erlang+Go.
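
For the cancellation question, a sketch of what "deliver an exception to a task" could look like, using Python generators as a stand-in for suspended CEK continuations. `task_cancel`, `Cancelled`, and the returned status tuples are hypothetical names chosen for illustration. The proposed `(task-cancel sched task reason)` also takes a scheduler argument, which this toy omits.

```python
class Cancelled(Exception):
    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason

def task_cancel(task, reason):
    """Deliver an exception into a suspended task (hypothetical primitive)."""
    try:
        task.throw(Cancelled(reason))
    except Cancelled as exc:
        return ("crashed", exc.reason)     # task did not handle it
    except StopIteration as stop:
        return ("finished", stop.value)    # task caught it and returned
    return ("runnable", None)              # task caught it and suspended again

def worker(log):
    try:
        while True:
            yield
    finally:
        log.append("cleanup")              # cleanup still runs on cancel

def stubborn():
    try:
        yield
    except Cancelled as exc:
        return "handled " + exc.reason

log = []
w = worker(log); next(w)                   # advance to first suspension point
s = stubborn(); next(s)

r1 = task_cancel(w, "shutdown")
r2 = task_cancel(s, "timeout")
```

The language layer decides what each outcome means: an Erlang wrapper might turn `"crashed"` into exit-signal propagation to linked tasks, while a Go wrapper might only ever deliver the signal at points where the goroutine checks its context.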

## Progress log

_Newest first. Append one dated entry per milestone landed._

- 2026-05-28 — **Go-on-SX consumer-side surface fully landed (609/609
  tests across 7 suites).** This is the Phase-10 cross-reference
  entry: with all of Go's lex+parse+types+eval+sched+stdlib+e2e
  proven independent of the eventual kit, the scheduler-kit
  surface that emerged from this consumer is:

    **Primitives (locked in):**
    1. `(:go-chan SEND RECV CLOSED? CLOSE! LEN)` — closures-over-
       mutable-state channel. Identity matters (distinct `make()`
       calls produce distinct closures, `(= ch1 ch2)` false).
    2. `(:go-defer CALLEE FROZEN-ARGS)` — frame-attached cleanup
       record. Args evaluated at defer-time; call deferred to
       frame exit.
    3. `__go-defer-stack` — frame-local mutable list of
       defer records. Drained LIFO at frame exit by `go-run-defers!`.
    4. `__go-panic-cell` (STATE V) — frame-attached out-of-band
       channel. STATE ∈ {:none, :raised, :recovered}. `recover()`
       walks env chain to find the outermost :raised cell.
    5. `(:go-panic V)` — propagating sentinel.
    6. v0 stub `after(d)` — channel already buffered with `:tick`.
       Real time becomes a refinement of *when* readiness flips,
       not of the protocol.

    **Cross-cutting abstractions (chiselled):**
    - **Readiness protocol** (sched-pick): `select` consults
      `ready?` over its cases; send/recv/timer/etc. all factor
      through one predicate. See 2026-05-27 entry.
    - **Frame-cleanup queue vs scheduler ready-queue**: distinct,
      orthogonal slots. Conflating them was an early temptation, so the
      distinction stays explicit in the design.
    - **Control-flow sentinels unify** at every AST boundary
      (block, for, range-for, stmt-catch-all, program-loop): each
      needs the same `propagates?` predicate inline. Kit should
      expose ONE helper instead of N inline arms.

    **v0 limitations the kit must lift** (durable in commit trail):
    - Real preemption (Phase 5b — needs reified execution state)
    - Buffered/unbuffered channel distinction (currently unbounded)
    - select fairness (currently source-order; the Go spec requires a
      uniform pseudo-random choice among the ready cases)
    - Real-time clocks for `after`

    The next step owned by this plan is Phase 2 (diff + propose kit),
    with Erlang's existing scheduler as the other consumer.

- 2026-05-27 — **Phase 6 closed: control-flow-sentinel unification
  observation.** After wiring panic propagation through 4 sites
  (go-eval-block, go-eval-for, go-eval-stmt's catch-all, go-eval-
  program-loop), a clear pattern emerged: every control-flow boundary
  needs the same dispatch arm — check for `:return-value`, `:break`,
  `:continue`, `:eval-error`, `(:go-panic ...)` — in the same order.
  Adding a new sentinel (say `:goroutine-killed` from a real
  preemption model) means hunting for every site and adding another
  arm. This is precisely the kind of cross-cutting concern a
  scheduler kit should abstract.

  **Concrete kit hint:** define ONE `propagates?`-style predicate
  (named `control-sentinel?` in this sketch) + helper:

    ```
    (define (control-sentinel? r)
      (or (terminal-return? r)
          (break? r) (continue? r)
          (raised-error? r) (raised-panic? r)
          (goroutine-killed? r)))
    ```

  Every control-flow site calls this once. New sentinel = one place
  to add an arm, not one per site. The kit's `frame-driver` should expose this
  primitive so guest evaluators (Go, Erlang, future targets) all
  share the dispatch logic and only differ on which sentinels they
  emit.

  This is the second cross-cutting abstraction (after panic cell +
  defer queue) the Go consumer has chiselled out. The pattern is:
  scheduler kit primitives = "things every guest evaluator's control-
  flow boundary needs once" — not "things only the scheduler runtime
  needs." The scheduler runtime is the *driver*; the boundary
  primitives are kit-grade shared infrastructure.
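
The single-predicate idea in this entry can be modelled in Python as follows. This is a sketch, not the Go-on-SX evaluator: statements are modelled as thunks, sentinels as tagged tuples, and `control_sentinel`, `eval_block`, and `eval_for` are hypothetical names.

```python
# Tags that must propagate out of a control-flow boundary.
SENTINEL_TAGS = {"return-value", "break", "continue", "eval-error", "go-panic"}

def control_sentinel(result):
    """True if result must propagate out of the current boundary."""
    return isinstance(result, tuple) and len(result) > 0 and result[0] in SENTINEL_TAGS

def eval_block(stmts):
    """Run statements (thunks) in order; stop at the first sentinel."""
    result = None
    for stmt in stmts:
        result = stmt()
        if control_sentinel(result):
            return result
    return result

def eval_for(body, iterations):
    for _ in range(iterations):
        result = eval_block(body)
        if control_sentinel(result):
            if result[0] == "break":
                return None            # the loop consumes break
            if result[0] == "continue":
                continue               # the loop consumes continue
            return result              # everything else keeps propagating
    return None

ran = []
body = [lambda: ran.append("a"),
        lambda: ("go-panic", "boom"),
        lambda: ran.append("never")]
out = eval_for(body, 3)

# Adding a new sentinel touches one place; both boundaries pick it up.
SENTINEL_TAGS.add("goroutine-killed")
killed = eval_block([lambda: ("goroutine-killed",), lambda: ran.append("never")])
```

Each boundary still decides which sentinels it consumes (`eval_for` eats `break` and `continue`), but the question "does this result propagate at all?" has a single answer in a single place.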

- 2026-05-27 — **Phase 6: panic/recover shape lands.** The panic
  cell is the missing piece. It's a per-frame mutable record of
  shape `(STATE VALUE)` carrying one of `:none` / `:raised` /
  `:recovered`. Three properties matter for the scheduler kit:

    1. **It survives the function boundary** via env-chain lookup —
       when a deferred call's own frame creates a shadowing cell,
       `recover()` walks past it to find the OUTER frame's cell (the
       one that's `:raised`). This is the same mechanism the
       scheduler will need when a panic-unwinding goroutine has
       multiple frames each carrying their own state, and the
       "current panic" must be locatable from any depth.

    2. **It flips state in place** (`set-nth!`) so that the change
       made by `recover()` deep in a defer chain is visible to the
       enclosing frame's exit check. The scheduler kit needs the
       same pattern: a goroutine's "termination reason" must be
       writable by any frame in its stack.

    3. **It's distinct from the return-value channel.** A frame can
       carry both `(:go-panic V)` from its body AND a recovery
       commitment in its panic cell; they're checked in sequence.
       For the scheduler this maps to: a goroutine carries both its
       running-state (channel-blocked, ready, sleeping) AND its
       termination-record (panic V / clean exit / killed) — two
       orthogonal slots, not one tag.

  Concrete kit hint: every frame record should expose
  `frame-panic-cell` alongside `frame-defer-queue`. The scheduler's
  exit-path becomes: drain defers (cell may flip :raised→:recovered)
  → consult cell → either propagate or return clean. Erlang's
  `try/catch/after` decomposes identically: `after` is the defer
  queue, `catch` is the recover-via-cell mechanism.
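
A Python sketch of the exit path described in this entry (drain defers, consult the cell, then propagate or return clean). It is a simplification under stated assumptions: `run_frame` and its callback signature are hypothetical, the cell is a two-element list mutated in place, and `recover` closes over its own frame's cell instead of walking an env chain, so property 1 is not modelled.

```python
def run_frame(body):
    """Run body, drain defers LIFO, then consult the panic cell."""
    cell = ["none", None]                  # (STATE VALUE), mutated in place
    defers = []

    def recover():
        if cell[0] == "raised":
            cell[0] = "recovered"          # flip state in place
            return cell[1]
        return None

    result = body(defers.append, recover)
    if isinstance(result, tuple) and result[0] == "go-panic":
        cell[0], cell[1] = "raised", result[1]

    while defers:
        defers.pop()()                     # LIFO; a defer may call recover()

    if cell[0] == "raised":
        return ("go-panic", cell[1])       # still raised: keep propagating
    if cell[0] == "recovered":
        return None                        # recovered: frame exits cleanly
    return result

seen = []

def recovering(defer, recover):
    defer(lambda: seen.append(recover()))
    return ("go-panic", "boom")

def unrecovered(defer, recover):
    defer(lambda: seen.append("cleanup"))
    return ("go-panic", "bang")

def clean(defer, recover):
    defer(lambda: seen.append(recover()))  # recover() outside a panic yields None
    return 7

r1 = run_frame(recovering)
r2 = run_frame(unrecovered)
r3 = run_frame(clean)
```

The body's return value and the cell are checked in sequence, which mirrors the "two orthogonal slots" point: the panic sentinel travels through the result, while the recovery decision lives in the cell.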

- 2026-05-27 — **Phase 6 first slice: defer + LIFO observation.**
  Go's defer is a *frame-local cleanup queue* — a list of (callee,
  pre-evaluated-args) records appended on `defer`, drained LIFO at
  frame exit. The scheduler kit needs the same shape because: (a) a
  panicking goroutine must run its frame's defers before unwinding to
  the next frame; (b) a goroutine that exits cleanly still runs them;
  (c) `select` cases that own resources (an acquired send slot, a
  buffer reservation) need a cleanup hook on the case-not-taken path.
  All three reduce to the same primitive: **"hand the frame a list
  of thunks; call them LIFO before the frame is gone."**

  Concretely the kit should expose `frame-defer!` (push) and an
  internal `frame-teardown!` (drained by the scheduler on exit / by
  the panic unwinder on abort). The scheduler's exit-path becomes:

    1. Mark frame done.
    2. Call `frame-teardown!` — run defers LIFO. A defer that itself
       panics: capture the new panic, continue running the rest
       (matches Go spec).
    3. Release frame slot.

  Crucially the defer queue is *not* the same as the scheduler's
  ready-queue — confusing the two was an early temptation. The defer
  queue is per-frame and synchronous-on-exit; the ready-queue is
  global and async. Phase 5b will need to keep these distinct when
  real preemption lands.

  Test signal that drove the shape: SX assignment shadows rather than
  mutates, so the only observable side-effect channel for deferred
  calls is `(append! buf ...)` on a value with stable identity (e.g.
  a channel). That maps cleanly to "deferred work emits its effects
  through capabilities the frame held, not through enclosing-env
  mutation" — which is also how the scheduler kit's deferred work
  should communicate with the rest of the system. No magic; just
  capabilities the frame can hand to its defers.
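
The defer-queue shape in this entry, sketched in Python. `Frame`, `frame_defer`, and `frame_teardown` are hypothetical stand-ins for the proposed `frame-defer!` / `frame-teardown!`, and Python exceptions stand in for Go panics.

```python
class Frame:
    def __init__(self):
        self.defers = []       # frame-local cleanup queue
        self.panic = None      # most recent panic captured during teardown

def frame_defer(frame, callee, *args):
    # Args are captured now (defer-time); the call happens at frame exit.
    frame.defers.append((callee, args))

def frame_teardown(frame):
    """Run defers LIFO; a defer that panics does not stop the rest."""
    while frame.defers:
        callee, args = frame.defers.pop()
        try:
            callee(*args)
        except Exception as exc:
            frame.panic = exc  # capture the new panic, keep draining
    return frame.panic

def boom():
    raise RuntimeError("defer panicked")

f = Frame()
out = []                       # stable-identity buffer the defers write into
x = 1
frame_defer(f, out.append, x)  # x is frozen as 1 here
x = 2
frame_defer(f, boom)
frame_defer(f, out.append, "last")
err = frame_teardown(f)
```

The deferred calls communicate only through `out`, a value with stable identity that the frame handed them, which matches the capability-passing observation at the end of the entry.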

- 2026-05-27 — **Phase 5 acceptance crossed (40 runtime tests).**
  Final shape observation: *time-as-readiness-flip*. The Go side
  added an `after(d)` builtin that returns a channel **already
  holding** a tick value — duration is ignored in v0. The select
  loop doesn't care that the channel got its value "via time"; it
  only consults `ready?`. This separates two concerns the kit design
  had been conflating:

    1. **The wake-up protocol** — what `select` asks of every case:
       "are you ready right now?" Channel-recv answers via "buffer
       non-empty or closed"; channel-send via "buffer has room";
       timer via "deadline reached." All three flatten to a single
       `ready?` predicate.

    2. **The scheduling oracle** — *when* a case's `ready?` flips
       from false to true. For channels this is driven by other
       goroutines sending/receiving; for timers it's driven by a
       wall-clock or monotonic source.

  v0 collapses #2 (timer = ready immediately, sends always ready,
  recvs ready iff buffer non-empty) and exposes #1 as the only
  thing the dispatcher needs to know. Phase 5b refines #2 with
  blocking semantics and real time, but #1 stays the same shape.

  Concretely: the kit's `select-case` should take `:ready?-fn` per
  case, not three different "is-this-a-send-or-recv-or-timer" tags.
  Send/recv/timer become factory functions that produce a
  `(:ready? FN :commit! FN)` record — the dispatcher walks cases,
  picks the first whose `ready?` returns true, calls `commit!` to
  extract the value (and side-effect: drain buffer, fire timer).
  This is the same shape as an STM transaction over case-set, and
  matches Erlang's `receive` clauses too (each pattern is a
  ready-predicate + commit-action over the mailbox head).

  Ping-pong remains impossible in v0 because the synchronous spawn
  collapses the `ready?`-flip oracle to "always immediate" — the
  spawned goroutine can never park waiting for the parent to send.
  Phase 5b must restore the wake-up dimension; until then the kit
  spec should encode the readiness-protocol design even though the
  oracle is degenerate.
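
The readiness protocol from this entry, sketched in Python. The case factories return a record with a `ready` predicate and a `commit` action, and the dispatcher never asks what kind of case it is holding. All names (`recv_case`, `send_case`, `timer_case`, `select`) and the explicit `capacity` and `now` parameters are assumptions for illustration. The dispatcher uses the v0 source-order policy, not the pseudo-random choice the Go spec calls for.

```python
def recv_case(buf):
    return {"ready": lambda: bool(buf),
            "commit": lambda: buf.pop(0)}

def send_case(buf, value, capacity):
    return {"ready": lambda: len(buf) < capacity,
            "commit": lambda: buf.append(value)}

def timer_case(now, deadline):
    # The "scheduling oracle" (when ready flips) is the injected clock.
    return {"ready": lambda: now() >= deadline,
            "commit": lambda: "tick"}

def select(cases, default=None):
    """Pick the first ready case in source order (v0 policy)."""
    for index, case in enumerate(cases):
        if case["ready"]():
            return index, case["commit"]()
    if default is not None:
        return "default", default()        # default makes select non-blocking
    return "blocked", None                 # a real scheduler would park here

a, b = [], ["x"]
clock = [0]
cases = [recv_case(a), recv_case(b), timer_case(lambda: clock[0], 5)]

first = select(cases)                      # b is ready
second = select(cases)                     # nothing ready yet
clock[0] = 5
third = select(cases)                      # the timer's ready? has flipped
fallback = select([recv_case(a)], default=lambda: "none ready")

full = ["v"]
send_blocked = select([send_case(full, "w", capacity=1)])
send_ok = select([send_case(full, "w", capacity=2)])
```

Swapping the fake clock for a real monotonic source changes only when `ready` flips, not the dispatcher, which is the separation the entry argues for.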

- 2026-05-27 — From Go-on-SX Phase 5 first slice: the channel
  primitive landed as closures-over-mutable-state in
  `lib/go/sched.sx`. Concrete shape:

  ```
  (list :go-chan SEND-FN RECV-FN CLOSED?-FN CLOSE!-FN)
  ```

  Each closure captures a shared `buf` (a mutable list) and `closed`
  flag (a let-bound boolean mutated via `set!`). Identity: two
  `make()` calls produce distinct closures, matching the Go spec's
  comparison rule that two channel values are equal only if they were
  created by the same call to `make` (or are both `nil`).

  **Design insight for the kit**: the channel-as-closure-bundle shape
  is the right scheduler-kit primitive — implementation-hide the
  buffer behind opaque accessor closures, so the underlying storage
  can be swapped (linked list → ring buffer → segmented array) without
  changing the API. Erlang's mailboxes will need the same trick.

  **v0 limitation logged**: no real preemption. SX doesn't expose
  first-class continuations to guest code, so v0 runs `go f()`
  synchronously and relies on the spawned goroutine completing before
  the main goroutine receives. Real concurrent semantics — blocking
  send on full buffer, blocking recv on empty — needs the
  scheduler kit to ship the suspension/resumption machinery (or for
  Phase 5b to bake CEK-style trampolining into the eval layer).

  Cross-ref: the `:select-case` uniform shape from the parser-side
  diary entry pairs with this — the kit's `sched-select` should
  accept a list of channel-op cases (built from the closures-over-
  state primitives logged here) and pick a ready one. Source:
  Go-on-SX commit landing `lib/go/sched.sx` first cut.
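
A Python model of the closure-bundle channel, to show how the buffer stays hidden behind accessor closures. This is not the code in `lib/go/sched.sx`: `make_chan` is a hypothetical name, the tuple mirrors the later six-slot shape with `LEN`, and receiving from an empty open channel returns `(None, False)` here, whereas a real Go receive would block.

```python
def make_chan():
    buf = []          # shared mutable buffer, invisible to callers
    closed = False    # shared closed flag

    def send(v):
        if closed:
            raise RuntimeError("send on closed channel")
        buf.append(v)

    def recv():
        # Returns (value, ok), loosely mirroring Go's two-result receive.
        if buf:
            return buf.pop(0), True
        return None, False

    def is_closed():
        return closed

    def close():
        nonlocal closed
        closed = True

    def length():
        return len(buf)

    return ("go-chan", send, recv, is_closed, close, length)

ch1, ch2 = make_chan(), make_chan()
_, send, recv, is_closed, close, length = ch1

send("a")
send("b")
n = length()
got_a = recv()
close()
got_b = recv()        # buffered values survive close
drained = recv()      # closed and empty
```

Because callers only hold the closures, swapping `buf` for a ring buffer would not change any call site, and two bundles from separate `make_chan()` calls never compare equal.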

- 2026-05-27 — Follow-up from same Phase 2 work: **`select` AST shape**
  landed. Each case is `(list :select-case COMM-STMT BODY)` where
  COMM-STMT is one of `:send`, `:short-decl` (recv into new var),
  `:assign` (recv into existing var), or a bare receive expression
  `(:app (:var "<-") [chan])`. The shape is uniform across all four
  comm-stmt kinds — the kit's `sched-select` primitive should accept a
  list of cases each described by `(direction chan value-target?)` and
  let the kit's runtime pick a ready case. That uniformity is what
  makes a single kit primitive cover all four Go case shapes.

  Also: Go's `select` with `default` makes the multiplexer non-blocking;
  without default it blocks until a case is ready. The kit primitive
  should mirror this — present-or-absent default determines blocking
  semantics. Erlang's `receive ... after Timeout -> ...` is a similar
  pattern with a timeout case rather than default; the kit primitive
  should handle both as instances of "non-blocking-fallback case."
  Source: Go-on-SX commit `parse.sx — switch + select`.

- 2026-05-27 — From Go-on-SX Phase 2 (parser side, ahead of scheduler
  implementation): the **parsed AST shapes** for Go's concurrency
  primitives have landed and are worth recording before Phase 5 builds
  the scheduler.

  ```
  go EXPR              → (list :go EXPR)
  defer EXPR           → (list :defer EXPR)
  ch <- v              → (list :send CHAN VALUE)
  <-ch                 → (list :app (:var "<-") [CHAN])   ; unary recv
  for range COLL { }   → (list :range-for nil nil nil COLL BODY)
  for k, v := range C  → (list :range-for :short-decl KEY VAL COLL BODY)
  ```

  **Design insight for the kit**: the `:go` and `:defer` shapes are
  pleasingly minimal — both wrap a single expression. Erlang's
  `spawn(Mod, Fun, Args)` will produce something more elaborate; the
  scheduler kit primitive `(sched-spawn task)` should accept a thunk so
  both languages reduce to a uniform spawn API.

  The `:send` shape carries CHAN + VALUE, symmetric with channel-recv
  as the unary `<-` form. Once the scheduler has channel primitives,
  both shapes reduce to a single `(chan-op direction chan value)`
  abstraction.

  Range over channels (`for v := range ch`) is currently parsed as
  range-for with `coll = ch`; the scheduler kit will dispatch on the
  type of `coll` at execution time (channels yield via receive,
  collections via iteration). This dispatch is the right place for the
  scheduler kit to express the channel-receive ⇄ iteration polymorphism.
  Source: Go-on-SX commit `parse.sx — go/defer/send/range`.

- 2026-05-26 — Plan drafted. Phase 0 unstarted. Awaiting Go-on-SX to begin
  Phase 1.