fed-sx-m2: Step 8b-timer — live retry-loop wiring on send_after
Some checks failed
Test, Build, and Deploy / test-build-deploy (push) Failing after 44s

Wires the delivery_worker's retry loop on top of the
erlang:send_after / cancel_timer primitives just landed on
loops/erlang (3709460d, 98b0104c, 779e53b2 — cherry-picked here
since origin/architecture hasn't caught up yet).

Surface:
- new :timers [{Cid, Ref}] state field tracks live timer refs
- handle_call(flush): drain (existing semantics) + arm_retry_timer
  per retried Cid (computes backoff slot from the now-bumped attempt
  count, sets next_retry_at, send_after self-cast). Reply shape
  unchanged.
- handle_info({retry, Cid}, S): redrives that one Cid through
  deliver_one_pure. Success → record_success_pure + clear pending.
  Failure → schedule_retry_for (which bumps attempts, dead-letters on
  slot 6, or arms next slot).
- cancel_timer_for/2 before arming a new timer so stale timers don't
  keep the scheduler's run loop alive after the work is done.
- state_srv/1 + timer_ref_for/2 for test introspection.

5/5 in new delivery_retry_timer.sh; existing delivery_worker.sh
17/17 and delivery_retry.sh 11/11 still green. Conformance gate
771/771 (was 761/761; the +10 is the cherry-picked send_after
suite).

Closes Blockers #3. m2 is now feature-complete.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-30 14:05:31 +00:00
parent 779e53b2a8
commit 4da2a98c30
3 changed files with 327 additions and 9 deletions

@@ -562,10 +562,24 @@ a dead-letter list visible via `/admin/dead-letter`.
is cleared from `:next_retry`. `record_success_pure` clears
both. `next_due_pure` returns cids whose retry time has
passed. 11 cases in `delivery_retry.sh`.
- - [ ] **8b-timer** — Erlang-side timer wiring (`erlang:send_after`
- self-cast or equivalent). Needs the same substrate primitive
- that `gen_server` uses for `timeout` returns. Defer behind
- substrate gap discovery for now — see Blockers.
+ - [x] **8b-timer** — Erlang-side timer wiring on the
+ `delivery_worker` gen_server. handle_call(flush) drains then
+ arms a `send_after` self-cast per retried Cid (backoff from
+ the now-bumped attempt counter); handle_info({retry, Cid})
+ redrives that single Cid through deliver_one_pure. Success
+ clears bookkeeping via record_success; failure bumps attempts
+ via record_failure_pure and arms the next backoff slot — or
+ promotes to dead-letter on the 6th attempt and stops arming.
+ A `:timers [{Cid, Ref}]` state field tracks live refs so
+ schedule_retry_for can cancel the previous one before arming
+ the next (otherwise stale timers keep the scheduler's run
+ loop alive long after the work is done). 5/5 in
+ `delivery_retry_timer.sh`: T1 timer scheduled, T2 attempts=1,
+ T3 retry fires + attempts=2, T4 next timer rearmed, T5 ets-
+ counter dispatch (fail/fail/ok) lands in 3 attempts and
+ clears state. Substrate dependency landed via cherry-pick
+ from `loops/erlang` (3709460d / 98b0104c / 779e53b2) until
+ `loops/erlang` → architecture catches up.
- [x] **8c** — Delivery-state projection
(`next/kernel/delivery_state.erl`). Folds delivery events into
per-peer worker-shaped snapshots so the outbound queue survives
@@ -1105,8 +1119,16 @@ proceed.
through `delivery_worker`) and Step 10c (peer-actor doc
fetch in `peer_actors`) are now unblocked.
- 3. **`erlang:send_after`-style timer primitive** — discovered
- during Step 8b prep. The retry loop needs a way for the
+ 3. **`erlang:send_after`-style timer primitive** — ~~discovered
+ during Step 8b prep~~ **RESOLVED 2026-06-30** via the
+ `loops/erlang` `send_after`/`cancel_timer`/`monotonic_time`
+ work landing on `origin/loops/erlang` (commits 3709460d,
+ 98b0104c, b10e55f0; 766/766 → 771/771). m2 cherry-picked all
+ three onto this branch so 8b-timer could land without waiting
+ for `loops/erlang` → architecture; the cherry-picks fall away
+ as no-op duplicates when architecture catches up. Original
+ diagnosis preserved below for the audit trail.
+ The retry loop needs a way for the
delivery_worker to wake itself up after `backoff_for(N)`
seconds. Erlang's `erlang:send_after/3` is the standard
primitive; this port doesn't seem to register it (looked at
@@ -1241,6 +1263,31 @@ proceed.
Newest first.
+ - **2026-06-30** — Step 8b-timer closed. Cherry-picked the three
+ `loops/erlang` send_after commits onto m2 (3709460d, 98b0104c,
+ 779e53b2 — the substrate landed standalone on origin/loops/erlang
+ earlier and hadn't propagated to origin/architecture yet). Wired
+ the live timer loop in `next/kernel/delivery_worker.erl`: a
+ `:timers [{Cid, Ref}]` state field; `handle_call(flush)` drains
+ then arms a `send_after` self-cast per retried Cid; the new
+ `handle_info({retry, Cid})` callback redrives that one Cid through
+ `deliver_one_pure` and either records success / clears state, or
+ bumps and arms the next backoff slot (or dead-letters on the 6th
+ attempt). Two arm-paths split — `arm_retry_timer` (post-drain,
+ attempts already bumped) vs `schedule_retry_for` (post-retry
+ attempt, needs to bump). `cancel_timer_for/1` clears the previous
+ timer before arming the next so stale timers don't keep the
+ scheduler's run loop alive after the work is done. Two new public
+ APIs for tests: `state_srv/1` returns the worker's full state,
+ `timer_ref_for/2` looks up a Cid's live ref. 5/5 in new
+ `delivery_retry_timer.sh` (T1 timer scheduled, T2 attempts=1, T3
+ retry fires + attempts=2, T4 next timer rearmed, T5 ets-counter
+ dispatch fail/fail/ok lands in 3 attempts and clears state).
+ Existing `delivery_worker.sh` 17/17 and `delivery_retry.sh` 11/11
+ still green. Conformance gate 771/771 (was 761/761; the +10 is
+ the cherry-picked send_after suite). Blockers #3 RESOLVED.
+ Reply shape of `flush` unchanged; no caller updates needed.
- **2026-06-28** — Merge-prep pass. Conformance 761/761 still green
on m2 tip `cd0de8cb`. Both smoke tests still pass cold:
`next/tests/smoke_kernel_route.sh` 6/6 (port 54471, listener up