Merge loops/fed-sx-m2 into architecture: federation milestone 2

m2 lands multi-actor + cross-instance federation on the fed-sx
substrate. Feature-complete except 8b-timer (retry-loop wiring,
gated on erlang:send_after substrate primitive in loops/erlang).

Highlights:
- Multi-actor gen_server kernel (one nx_kernel handles N actors)
- Per-actor HTTP routes /actors/<id>/{inbox,outbox} + actor-doc
- Inbound signature verify + peer-AS cache + auto-Accept publish
- Outbound delivery_set with audience expansion + delivery_worker
- Native httpc:request/4 BIF wrapper + live HTTP dispatch
- Discovery: peer-actor fetch + cache on demand
- Backfill on Follow accept (in-process + paginated outbox)
- Two-instance smoke test passes 6/6 (real cross-host HTTP flow)

Substrate fixes carried in this merge (textually identical to
upstream-arrived copies, will conflict on scoreboard files only):
- Blockers #1: er-bif-http-listen marshaller bridge rewrite
- Blockers #4: er-sched-step-alive! :pending-args extension
  (lets receive in a kernel-aware route suspend+resume cleanly)

Conformance 761/761 still green on m2 tip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Conflicts:
#	lib/erlang/runtime.sx
2026-06-28 16:57:55 +00:00
151 changed files with 20041 additions and 25 deletions


@@ -0,0 +1,197 @@
# fed-prims handler-mutex deadlock fix (one-shot)
Role: fix the SX runtime mutex deadlock in `bin/sx_server.ml`'s
`http-listen` handler that blocks every `gen_server:call` from inside
an Erlang route. Documented as **Blockers #4** in
`/root/rose-ash-loops/fed-sx-m1/plans/fed-sx-milestone-2.md`.
```
description: fed-prims handler-mutex deadlock fix
subagent_type: general-purpose
run_in_background: true
isolation: worktree
```
## Worktree + branch
Already provisioned at `/root/rose-ash-loops/fed-prims` on branch
`loops/fed-prims` (the fed-prims phases A-J are landed; this is a
follow-up fix). Start there. Never push to `main` or `architecture`.
If `.mcp.json` shows a non-absolute `mcp_tree` path or
`.claude/scheduled_tasks.lock` is dirty, just leave them alone: they're
harness state. Stash if you must, but don't commit them.
## The problem (verified by fed-sx-m2 loop, 2026-06-07)
Native `http-listen` in `hosts/ocaml/bin/sx_server.ml:735+`
serialises handler calls with `Mutex.lock mtx` / `Mutex.unlock mtx`
so the SX runtime isn't re-entered concurrently:
```ocaml
Mutex.lock mtx;
let resp =
(try Sx_runtime.sx_call handler [Dict req]
with e -> Mutex.unlock mtx; raise e) in
Mutex.unlock mtx;
```
When the Erlang handler does `gen_server:call(nx_kernel, ...)` from
any kernel-aware route (`actor_doc_response_for/3`,
`actor_outbox_response_for/3`, `handle_inbox_post`,
`nx_kernel:state_for/1`, etc.), the gen_server's reply needs the SX
runtime scheduler to run — but the calling handler is sitting on the
runtime mutex. Deadlock; curl hangs until `--max-time` fires.
**Verification recipe (reproduces deterministically):**
```bash
PORT=51920
cat > /tmp/boot.sx <<'SX'
(epoch 1)
(load "lib/erlang/tokenizer.sx") (load "lib/erlang/parser.sx")
(load "lib/erlang/parser-core.sx") (load "lib/erlang/parser-expr.sx")
(load "lib/erlang/parser-module.sx") (load "lib/erlang/transpile.sx")
(load "lib/erlang/runtime.sx") (load "lib/erlang/vm/dispatcher.sx")
(epoch 2)
(eval "(er-load-gen-server!)")
(eval "(get (erlang-load-module (file-read \"next/kernel/envelope.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/log.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/pipeline.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/term_codec.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/outbox.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/nx_kernel.erl\")) :name)")
(eval "(get (erlang-load-module (file-read \"next/kernel/http_server.erl\")) :name)")
(epoch 20)
(eval "(erlang-eval-ast \"AK = <<1,1,1,1>>, AKS = [{key_id,k1},{algorithm,ed25519},{value,AK}], AAS = [{public_keys,[[{id,k1},{created,0},{value,AK}]]}], nx_kernel:start_link(alice, AKS, AAS), http_server:start(51920, [{kernel, nx_kernel}])\")")
SX
mkfifo /tmp/fifo
( cat /tmp/boot.sx; sleep 120 ) > /tmp/fifo &
hosts/ocaml/_build/default/bin/sx_server.exe < /tmp/fifo > /tmp/log 2>&1 &
sleep 60 # boot takes ~30-45s cold
curl -sv --max-time 5 "http://127.0.0.1:$PORT/" >/dev/null # OK: 200
curl -sv --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox" # HANGS
```
The `next/kernel/*.erl` files referenced live in the fed-sx-m2
worktree at `/root/rose-ash-loops/fed-sx-m1/next/kernel/`. You can
read them there for context but do NOT edit them — Erlang-side
work is m2's loop. This loop only touches `hosts/ocaml/bin/sx_server.ml`.
## Two fix patterns
Pick **one**. Both are independent enough to evaluate alone; commit
the one that lands first.
### Pattern A — release the mutex around the SX call
The mutex exists to serialise SX runtime mutation. But once the
runtime hands the call off to the gen_server (which has its own
scheduler frame), the calling thread is just waiting on a reply
message; it doesn't need the mutex. The fix is to scope the mutex
*only* over the runtime entry, not the entire handler invocation.
This may require restructuring `Sx_runtime.sx_call handler [Dict req]`
so the call yields to the scheduler instead of blocking — verify by
reading `hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` lives).
If `sx_call` is fully synchronous and re-entry is genuinely unsafe,
fall back to Pattern B.
### Pattern B — spawn handler in a fresh er-process
Erlang processes already have their own scheduler frame. Have the
handler closure trampoline through `er-spawn-fun` (or equivalent —
check `lib/erlang/runtime.sx`'s existing process primitives) so the
gen_server reply runs in a different frame from the http-listen
accept-loop thread.
This may be cleaner if it can be done entirely at the SX/Erlang
layer (in `er-bif-http-listen` in `lib/erlang/runtime.sx`), in which
case **this is m2 scope** and you should hand it back rather than
edit OCaml. Read the BIF body first — if a pure-Erlang spawn
suffices, document that and stop without committing OCaml changes.
The BIF body is at `lib/erlang/runtime.sx:1581-1632` (in the
fed-sx-m2 worktree); the m2 loop just rewrote its inbound/outbound
marshallers (commit `8d33d02f`). The handler is invoked inside
`(http-listen port sx-handler)` — figure out whether you can
`er-spawn-fun` around the body of `sx-handler` such that the
spawned process's gen_server:call doesn't fight the parent's
runtime mutex.
## Acceptance — the unblock target
`next/tests/http_server_tcp.sh` 5/5 stays green (the existing simple
GET / + capabilities + 404 + 401 surface). PLUS:
A kernel-touching request over real HTTP must return without
hanging. The minimal smoke for this is:
```bash
# In the verification recipe above, after boot:
curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox"
# Expected: "outbox: alice\ntip: 0\n" or similar (200 with body),
# NOT a timeout.
```
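If the smoke check is scripted, curl's exit status is one way to tell the deadlock apart from other failures: 28 is curl's "operation timed out" status (what `--max-time` firing looks like) and 7 means nothing accepted the connection. A minimal sketch; `classify_curl_exit` is a hypothetical helper name, not something that exists in the repo:

```bash
#!/usr/bin/env bash
# Map a curl exit status to a verdict for the kernel-route smoke check.
#   0  -> a response arrived
#   28 -> curl timed out (how the handler deadlock shows up from outside)
#   7  -> connection refused (server not listening yet)
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    28) echo "hang" ;;
    7)  echo "not-listening" ;;
    *)  echo "error:$1" ;;
  esac
}

# Intended use against the recipe's server (left commented out here):
#   curl -s --max-time 5 "http://127.0.0.1:$PORT/actors/alice/outbox" >/dev/null
#   classify_curl_exit "$?"
```

A verdict of `hang` on the outbox route while `/` reports `ok` matches the symptom described in the problem section.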
If you want a one-shot script, save the recipe above as a regression
test inside the fed-prims worktree:
`hosts/ocaml/test/handler_kernel_unblock.sh` (new file). Make it
pass deterministically with a generous timeout (≥120s for the cold
boot).
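For the "generous timeout" part, polling until the server answers is usually more deterministic than a fixed `sleep 60`. A sketch of a polling helper under that assumption; `wait_until` is a hypothetical name and the one-second poll interval is arbitrary:

```bash
#!/usr/bin/env bash
# Poll a command once per second until it succeeds or the time budget runs out.
# Usage: wait_until <timeout-seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 once the budget is spent.
wait_until() {
  local budget start now
  budget="$1"; shift
  start="$(date +%s)"
  while ! "$@" >/dev/null 2>&1; do
    now="$(date +%s)"
    if [ $(( now - start )) -ge "$budget" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Intended use in the regression script (left commented out here):
#   wait_until 120 curl -sf --max-time 2 "http://127.0.0.1:$PORT/" || exit 1
```

The boot wait then ends as soon as `GET /` succeeds, and the 120s figure only bounds the worst case.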
## Ground rules (hard)
- **Scope:** `hosts/ocaml/bin/sx_server.ml` and adjacent
`hosts/ocaml/lib/sx_runtime.ml` (or wherever `sx_call` is
defined). Do NOT touch `next/**` or `plans/fed-sx-milestone-2.md`
(m2's loop owns those). Do NOT touch `lib/erlang/**` (Erlang
substrate / loops/erlang owns that).
- **No-regression gate:**
  - `dune build bin/sx_server.exe` (native) green
  - `bash hosts/ocaml/browser/test_boot.sh` (WASM kernel) green
  - `bash lib/erlang/conformance.sh` 761/761
  - `bash next/tests/http_server_tcp.sh` 5/5
- **WASM safety:** Pattern A may need Thread / Mutex juggling
that isn't WASM-safe. The `http-listen` primitive is already
native-only, so changes to its handler code don't need to
build under WASM — but anything in `lib/sx_runtime.ml` does.
If your change has to add `Thread`/`Mutex` to `lib/`, you've
picked the wrong fix; back out.
- **Builds are slow.** `dune build` ≥600s timeout. `conformance.sh`
≥400s. `test_boot.sh` ≥60s.
- **Commit granularity:** one fix, one commit. Title like:
`fed-prims: release runtime mutex around gen_server:call (Blockers #4)`.
- **No `.sx` edits.** All work is `.ml` (or `.sh` for the
regression test). sx-tree MCP is not needed.
- **Worktree:** commit, push `origin/loops/fed-prims`. Never
`main`, never `architecture`. The user merges to architecture
separately.
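The no-regression gate and its timeouts could be wrapped in one runner so nothing is skipped before the commit. This is a sketch, not an existing script: `run_gate` and `run_all_gates` are hypothetical names, the 120s limit for `http_server_tcp.sh` is an assumption (the rules give no figure for it), and the commands are copied from the list as written, so working directories may need adjusting.

```bash
#!/usr/bin/env bash
# Run one gate under a wall-clock limit (seconds) and report the outcome.
# Usage: run_gate <label> <limit-seconds> <command> [args...]
run_gate() {
  local label limit
  label="$1"; limit="$2"; shift 2
  if timeout "$limit" "$@" >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Gates and minimum timeouts from the ground rules; defined but not invoked.
# Stops at the first failing gate.
run_all_gates() {
  run_gate "native build" 600 dune build bin/sx_server.exe &&
  run_gate "wasm boot"     60 bash hosts/ocaml/browser/test_boot.sh &&
  run_gate "conformance"  400 bash lib/erlang/conformance.sh &&
  run_gate "http tcp"     120 bash next/tests/http_server_tcp.sh
}
```

This only checks exit status and the time limit. The 761/761 and 5/5 counts still have to be read from each script's output.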
## What to write back
Append one dated line to `plans/fed-sx-host-primitives.md`'s
Progress log (newest first):
```
- 2026-06-07 — Resolved fed-sx-m2 Blockers #4 (handler mutex
deadlock). <one-sentence description of the fix>. Verified via
hosts/ocaml/test/handler_kernel_unblock.sh + http_server_tcp.sh
5/5 + conformance 761/761 + WASM boot.
```
Once landed, the fed-sx-m2 loop will pick up the fix on its next
tick and unblock Step 12 — you don't need to coordinate.
## If it's not Pattern A or Pattern B
If you discover the deadlock is something else entirely
(e.g., a gen_server config issue, a different lock in
`Sx_runtime`, a bug in `er-load-gen-server!`'s scheduler frame),
document what you found in a fresh Blockers entry on
`plans/fed-sx-host-primitives.md` and stop. The m2 loop will
re-check on its next tick. **Do not invent a Pattern C without
clear evidence** — the deadlock is reproducible and the two
patterns above cover the obvious fix shapes.
Go. Reproduce the deadlock first. Pick a pattern. Land it. Push.


@@ -0,0 +1,228 @@
# fed-sx Milestone 2 loop agent (single agent, step-ordered)
Role: iterates `plans/fed-sx-milestone-2.md` forever. Builds multi-actor +
federation on top of the M1 closeout. One feature per commit.
```
description: fed-sx Milestone 2 federation loop
subagent_type: general-purpose
run_in_background: true
isolation: worktree
```
## Prompt
You are the sole background agent working `plans/fed-sx-milestone-2.md`.
You run in an isolated git worktree on branch `loops/fed-sx-m2` at
`/root/rose-ash-loops/fed-sx-m2`. You work the plan's Steps in dependency
order (1→12), forever, one commit per feature. Push to
`origin/loops/fed-sx-m2` after every commit. Never `main`, never
`architecture`.
## Restart baseline — check before iterating
1. Read `plans/fed-sx-milestone-2.md` — Build order + Progress log
(append a Progress log at the bottom if one isn't there yet —
newest first).
2. `ls next/kernel/` — every M1 kernel module should still be present
(14 files: nx_cid, envelope, log, log_server, term_codec, registry,
pipeline, projection, outbox, bootstrap, define_registry, sandbox,
nx_kernel, http_server). If any are missing or have regressed, the
prior M1 closeout did not survive — Blockers entry + stop.
3. Erlang substrate must be green:
`cd lib/erlang && bash conformance.sh 2>&1 | tail -2` → expect at
least `761 / 761`. (M1 closeout left us at 761; further substrate
work on `loops/erlang` may have raised the count — anything ≥ 761
is fine.) If broken and not by your edits, Blockers entry + stop.
4. M1 test suites must be green:
`for t in next/tests/*.sh; do bash "$t" 2>&1 | tail -1; done` — every
one should report `ok N/N passed`. If anything fails and not by your
edits, Blockers entry + stop.
5. Read the §13 federation section of `plans/fed-sx-design.md` — it
is the authoritative reference for delivery semantics, Follow
lifecycle, audience resolution, and backfill modes. The plan refers
to it; honour it.
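Checks 3 and 4 can be made mechanical by parsing the summary lines rather than eyeballing them. The sketch below assumes the summary formats are exactly as quoted in those steps (`761 / 761` for conformance, `ok N/N passed` for each suite); the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Succeeds when $1 is "N/M" with N == M and N >= $2 (floor, default 0).
all_passed() {
  local pair floor passed total
  pair="$1"; floor="${2:-0}"
  case "$pair" in */*) ;; *) return 1 ;; esac
  passed="${pair%%/*}"; total="${pair#*/}"
  case "$passed" in ''|*[!0-9]*) return 1 ;; esac
  case "$total"  in ''|*[!0-9]*) return 1 ;; esac
  [ "$passed" -eq "$total" ] && [ "$passed" -ge "$floor" ]
}

# Conformance summary line such as "761 / 761"; floor defaults to 761.
conformance_ok() {
  all_passed "$(printf '%s' "$1" | tr -d '[:space:]')" "${2:-761}"
}

# Suite summary line such as "ok 5/5 passed".
suite_ok() {
  local rest
  case "$1" in "ok "*" passed"*) ;; *) return 1 ;; esac
  rest="${1#ok }"          # "5/5 passed"
  rest="${rest%% passed*}" # "5/5"
  all_passed "$rest"
}
```

A count above 761 passes `conformance_ok` as long as passed equals total, which matches the "anything ≥ 761 is fine" rule.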
## The build queue
Each Step has concrete deliverables + tests + acceptance check in the
plan. Within a Step, pick the smallest unchecked sub-deliverable. Don't
batch Steps.
- **Step 1** — Per-actor state buckets in nx_kernel
- **Step 2** — Actor lifecycle activities (Person / Service / Group)
- **Step 3** — Key rotation via Update + actor-state projection
- **Step 4** — Multi-actor HTTP routing (per-actor outbox / inbox URLs)
- **Step 5** — POST /inbox: peer signature verify + ingestion
- **Step 6** — Follow lifecycle (Follow / Accept / Reject / Undo)
- **Step 7** — Audience-resolving delivery set computation
- **Step 8** — Outbound delivery queue + retry / backoff
- **Step 9** — Backfill modes on Follow accept
- **Step 10** — Discovery: webfinger + actor doc fetch
- **Step 11** — Rich verbs as runtime artifacts (Note, Announce, Endorse)
- **Step 12** — Two-instance smoke test (`smoke_federate.sh`)
The iteration:
implement → run step's tests → run no-regression gates (M1 tests +
Erlang conformance) → commit → tick the `[ ]` in the plan → append one
dated line to the Progress log → push → stop.
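The "append one dated line, newest first" step amounts to inserting directly under the log heading. A sketch assuming the heading is literally `## Progress log` (the plan may spell it differently); `prepend_log_line` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Insert one line directly under the first "## Progress log" heading so the
# newest entry stays on top. Fails without touching the file if the heading
# is absent.
# Usage: prepend_log_line <file> <line>
prepend_log_line() {
  local file line tmp
  file="$1"; line="$2"
  grep -q '^## Progress log *$' "$file" || return 1
  tmp="$(mktemp)" || return 1
  # Pass the entry through the environment so awk does not interpret
  # backslashes in it.
  LOG_ENTRY="$line" awk '
    { print }
    !inserted && /^## Progress log *$/ { print ENVIRON["LOG_ENTRY"]; inserted = 1 }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Intended use (left commented out here):
#   prepend_log_line plans/fed-sx-milestone-2.md "- $(date -u +%F) Step 1a landed"
```

Ticking the `[ ]` checkbox is left to a normal edit, since the deliverable text varies per Step.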
## How fed-sx-m2 code lives in this repo
Same patterns as M1. Recap:
1. **Kernel modules as `.erl` source files** at `next/kernel/*.erl`.
Loaded at boot via `code:load_binary(Mod, Filename, SourceString)`.
Example: `next/kernel/follower_graph.erl` with
`-module(follower_graph). -export([fold/2, ...]).`
2. **Genesis bundle entries** at `next/genesis/**/*.sx`. These ARE
small SX expressions per the design (`DefineActivity{}`,
`DefineProjection{}`, etc.). New verbs introduced in Step 11
(Note, Announce, Endorse) live here.
3. **Test scripts** at `next/tests/*.sh`. Each one feeds an epoch
protocol script to `hosts/ocaml/_build/default/bin/sx_server.exe`
that loads kernel modules, drives them, and asserts on output.
4. **Two-instance test scripts** (Step 12) live at
`next/scripts/start_pair.sh`, `next/scripts/stop_pair.sh`. They
manage the lifecycle of two kernel instances on distinct ports.
The `epoch` protocol pattern (unchanged from M1):
```bash
printf '(epoch 1)\n(load "lib/erlang/runtime.sx")\n(epoch 2)\n<test-expr>\n' \
| hosts/ocaml/_build/default/bin/sx_server.exe
```
## Substrate available to you
M1 left us with a fully wired Erlang-on-SX runtime: 761/761 conformance,
50+ test suites, kernel state + HTTP layer + outbox/projection
infrastructure ready to extend. The notable substrate-level capabilities
relevant to m2 are:
- **All Phase 8 BIFs** — `crypto:hash/2`, `cid:from_bytes/1`,
`cid:to_string/1`, `file:*`, `code:load_binary/3`.
- **Erlang term codec** — `binary_to_list/1`, `list_to_binary/1`,
`atom_to_list/1` and `integer_to_list/1` returning Erlang charlists.
- **gen_server-grade processes** — `gen_server:start_link/2`,
`gen_server:call/2`, `gen_server:cast/2`, registered names via
`erlang:register/2`.
- **TCP HTTP server** — `http:listen/2` BIF wrapper with SX-dict ↔
Erlang-proplist marshalling (Step 8b-bridge from M1).
Native HTTP **client** primitive (registered in `bin/sx_server.ml`):
- `http-request` — exposed at the SX layer, currently native-only.
For Step 8 (delivery queue) you'll need to expose this as an Erlang BIF.
Following M1's precedent: this is the m2 equivalent of M1 Step 8a's
`http:listen/2` BIF wrapper, and is the one allowed scope exception to
`lib/erlang/runtime.sx` for this loop. Add it as `httpc:request/4` (URL,
Method, Headers, Body) → `{ok, Status, RespHeaders, RespBody} |
{error, Reason}`. Flag the exception explicitly in the commit message.
**Blocked primitives** (do NOT use, m2 doesn't need them):
- `sqlite:*` — SQLite (deferred storage backend).
- TLS — m2 is plaintext localhost only.
## Ground rules (hard)
- **Scope:** only `next/**` and `plans/fed-sx-milestone-2.md`. Single
allowed exception: an `httpc:request/4` BIF wrapper in
`lib/erlang/runtime.sx` for Step 8 (one commit, clearly flagged).
Do **not** touch `lib/erlang/` otherwise, `hosts/ocaml/`, `spec/`,
`shared/`, or other `lib/<lang>/`.
- **M1 baseline immutable.** Every existing `next/tests/*.sh` from M1
must continue to pass. Add new tests as `next/tests/m2_*.sh` *or*
with the same naming convention (`http_*`, `outbox_*`,
`nx_kernel_*` etc.) as long as they don't collide with existing
files.
- **Erlang-on-SX is the substrate.** Kernel modules are `.erl` source
loaded via `code:load_binary/3`. Don't reach for pure SX or Python.
- **No new opam deps.** No new host primitives. If you find yourself
wanting a new primitive (beyond the one `httpc:request/4` exception),
that's a Blockers entry — `loops/fed-prims` owns primitives, not
this loop.
- **No-regression gates:**
  - After every commit, `bash lib/erlang/conformance.sh` must report
    ≥ 761/761.
  - After every commit, **every** M1 `next/tests/*.sh` must still
    pass. New m2 tests are additive.
  - Test all of the above before pushing.
- **Builds are slow.** `dune build` (if you ever need it — you
shouldn't) gets `timeout: 600000`. Conformance gate: `timeout:
400000`. If a build genuinely hangs > 10min, Blockers entry + stop.
- **Commit granularity:** one feature per commit. Short factual
messages: `fed-sx-m2: Step 1a — actor-bucket schema + 12 nx_kernel tests`.
Update plan checkboxes + Progress log in the SAME commit as the
feature.
- **`.erl` / `.sh` / `.md` files:** ordinary `Read` / `Edit` / `Write`.
The hook only blocks `.sx` / `.sxc`. For `.sx` files (Step 11 rich
verbs in `next/genesis/runtime-verbs/`) use `sx-tree` MCP tools
and `sx_write_file` exclusively.
- **If blocked** for two iterations on the same issue: Blockers entry
in the plan, move to the next independent Step. Step dependencies
in the plan's build order table.
## Two-instance test harness
Step 12's `smoke_federate.sh` needs two kernel instances running
concurrently on different ports. The technique:
1. Start instance A as a background bash process:
`(SX_SERVER_PORT=9999 bash next/scripts/start_one.sh alice &)`.
2. Start instance B the same way on port 9998 with `bob`.
3. Drive them both with curl.
4. Stop with `kill %1 %2` or by pidfile.
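Of the two stop options, the pidfile route tends to be the more robust one in a script: the `( ... &)` form backgrounds the process inside a subshell, so the calling shell has no job entry for `%1` to refer to. A sketch of pidfile-based start/stop; `start_bg` and `stop_bg` are hypothetical helpers, and `next/scripts/start_one.sh` is the script named in step 1:

```bash
#!/usr/bin/env bash
# Start a command in the background and record its PID in a pidfile.
# Usage: start_bg <pidfile> <command> [args...]
start_bg() {
  local pidfile
  pidfile="$1"; shift
  "$@" >/dev/null 2>&1 &
  echo "$!" > "$pidfile"
}

# Stop the process recorded in a pidfile, reap it, and remove the file.
# A missing pidfile is treated as "already stopped".
stop_bg() {
  local pidfile pid
  pidfile="$1"
  [ -f "$pidfile" ] || return 0
  pid="$(cat "$pidfile")"
  kill "$pid" 2>/dev/null || true
  wait "$pid" 2>/dev/null || true
  rm -f "$pidfile"
}

# Intended use (left commented out here):
#   start_bg /tmp/alice.pid env SX_SERVER_PORT=9999 bash next/scripts/start_one.sh alice
#   start_bg /tmp/bob.pid   env SX_SERVER_PORT=9998 bash next/scripts/start_one.sh bob
#   ... drive both with curl ...
#   stop_bg /tmp/alice.pid; stop_bg /tmp/bob.pid
```

Note that this kills only the recorded PID. If the start script spawns `sx_server.exe` as a child, the script needs to `exec` it or forward the signal.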
The kernel `bootstrap:start/3` already takes ActorId + KeySpec +
ActorState, so each instance can be spun up by piping its boot script
into a separate `sx_server.exe` process:
```bash
# No port flag here: the port is chosen inside the piped boot script.
printf '(load "lib/erlang/runtime.sx")\n...' \
| hosts/ocaml/_build/default/bin/sx_server.exe &
```
`sx_server.exe` doesn't (yet) take a `-port` flag. The actual
listening happens via `http_server:start/1`, which is called inside
your Erlang setup, so you'll need to pass the port as an env var that
the boot script reads. Implement that in Step 12.
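One possible shape for the env-var handoff, as a sketch only: the boot script resolves `SX_SERVER_PORT` (the variable used in the harness steps), validates it, and splices it into the Erlang call. `resolve_port` and `listen_call` are hypothetical helpers, the 9999 default is an assumption, and the two-argument `http_server:start(Port, [{kernel, nx_kernel}])` form is borrowed from the fed-prims verification recipe rather than confirmed for Step 12.

```bash
#!/usr/bin/env bash
# Print the port to listen on: SX_SERVER_PORT if set, else $1, else 9999.
# Fails unless the value is an integer in 1..65535.
resolve_port() {
  local port
  port="${SX_SERVER_PORT:-${1:-9999}}"
  case "$port" in
    ''|*[!0-9]*) echo "invalid port: $port" >&2; return 1 ;;
  esac
  if [ "$port" -ge 1 ] && [ "$port" -le 65535 ]; then
    echo "$port"
  else
    echo "invalid port: $port" >&2
    return 1
  fi
}

# Emit the Erlang listen call with the resolved port spliced in.
listen_call() {
  local port
  port="$(resolve_port "$@")" || return 1
  printf 'http_server:start(%s, [{kernel, nx_kernel}])\n' "$port"
}
```

The output of `listen_call` would then be embedded in the `erlang-eval-ast` line of the boot script, so each instance binds its own port without any change to `sx_server.exe`.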
## Specific gotchas (M1 + new ones)
- **Erlang port quirks** (M1-era, still apply):
  - `<<"...">>` string-literal segments truncate to one byte; use
    integer-segment binaries.
  - `fun name/arity` reference syntax unsupported; wrap with
    `fun (X) -> name(X) end`.
  - `?MODULE` macro unsupported; use literal atoms.
  - Open `Class:Reason` exception patterns unsupported; enumerate
    `throw:R / error:R / exit:R` explicitly.
  - Spawned processes don't persist across separate `erlang-eval-ast`
    calls; tests inline `start_link` with operations.
- **gen_server:start_link returns raw Pid** not `{ok, Pid}` (M1 §5b).
- **HTTP request bodies are binaries**, not JSON-decoded structures.
Either: (a) the receiver parses, (b) the publisher serialises into
an SX dict and the receiver uses cid:to_string round-trip.
Pick one and stay consistent for the m2 wire format. Probably (b)
for v2 since we have no JSON BIF.
- **Federation IS HTTP** — no special internal protocol. Every
inter-instance call is a real HTTP POST through the same
`http_server` / `http:listen` machinery already wired. This means
the http_listen handler closures need access to the kernel state.
Cfg-based handler injection (M1 §8c-post-auth) is the pattern.
## Style
- No comments in `.erl` unless non-obvious. Cite design §-numbers
when a decision is non-obvious to a reader.
- No new planning docs — update `plans/fed-sx-milestone-2.md`
inline. Add a "Progress log" section at the bottom on first
iteration.
- One Step (or sub-deliverable for the big Steps 5-8) per iteration.
Implement. Test. Gate. Commit. Log. Push. Next.
Go. Read the plan. Run the restart baseline. Find the first unchecked
deliverable in Step 1. Implement it. Remember: no commit without the
step's acceptance tests passing AND M1 baseline preserved AND Erlang
conformance ≥ 761/761.