rose-ash/plans/otel-loop.md
2026-07-01 18:20:46 +00:00

# OpenTelemetry in SX — loop briefing
**Goal:** self-hosting observability for the SX host — traces/spans/metrics in **pure SX**, a
**live SVG waterfall dashboard** (reactive island), and **OTLP-JSON export** for interop with
real backends (Jaeger/Grafana). Reference shape: nektro/zig-tracer `src/otel.zig` (the OTLP span
struct + HTTP emit) — that's just the export step here.
**The key insight — a TRACE is a COMPOSITION.** A span has `{name, start, end, parent, attrs}`,
so a trace is a *tree of spans* — the same shape as an object's `:body` composition. So reuse the
existing fold machinery in `lib/host/compose.sx` (render-fold) and `lib/host/execute.sx`
(execute-fold): a span is a *timed effect*; a waterfall is a *render-fold over the span tree*;
OTLP export is an *export-fold*; metrics are an *aggregate-fold*. Don't reinvent — fold.
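
To make the "one walk, many folds" idea concrete, here is a minimal sketch in Python rather than SX. It is an illustration of the shape only, not the code in `lib/host/compose.sx`: the `Span` class, the `children` list (the real span dict stores a `:parent` link instead), and the names `fold_spans`, `waterfall_rects`, and `total_by_name` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    # Hypothetical stand-in for the SX span dict; children are stored
    # directly here, whereas the real dict records a :parent link.
    name: str
    t0: int
    t1: int
    children: list = field(default_factory=list)


def fold_spans(span, visit, depth=0):
    """Pre-order walk: yield visit(span, depth) for the span and each descendant."""
    yield visit(span, depth)
    for child in span.children:
        yield from fold_spans(child, visit, depth + 1)


def waterfall_rects(root, px_per_ns=1.0, row_h=12):
    """Render-fold: one rect per span, x from the offset to the trace start,
    width from the duration, y from the nesting depth."""
    return list(fold_spans(root, lambda s, d: {
        "name": s.name,
        "x": (s.t0 - root.t0) * px_per_ns,
        "width": (s.t1 - s.t0) * px_per_ns,
        "y": d * row_h,
    }))


def total_by_name(root):
    """Aggregate-fold: total duration per span name, using the same walk."""
    totals = {}
    for name, dur in fold_spans(root, lambda s, d: (s.name, s.t1 - s.t0)):
        totals[name] = totals.get(name, 0) + dur
    return totals
```

Both consumers share the single `fold_spans` walk and differ only in the `visit` function, which is the property the plan wants to inherit from the existing fold machinery.
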
**Base:** this worktree is branched off `loops/host` (has the composition machinery + Parts A/C:
type-block grammar + type-def editor). You are on branch `loops/otel` in
`/root/rose-ash-loops/otel`.
## Rules
- **Test-first.** Write the failing test, then implement to green.
- **Fast tests via the warm server:** `bash lib/host/warm-conf.sh run <suite>` (starts a warm
persistent server; `run` alone = full conformance; `eval "<expr>"` for a REPL probe). New suite
→ add it to the runner the same way `lib/host/tests/*.sx` are wired.
- **Do NOT deploy to the live container.** blog.rose-ash.com is bind-mounted from
`/root/rose-ash-loops/host` (a *different* worktree). Build + test only; integration/deploy
happens when this branch is merged. (If you want a live smoke, ask — don't recreate the shared
container.)
- **`.sx` editing:** prefer `sx_write_file` (validates on parse); if the sx-tree WRITE tools raise
a yojson-null error in this worktree, fall back to the `Write` tool + `sx_validate`.
- Commit each increment to `loops/otel` with a short factual message. Never push to `main`.
- **Cheap by construction:** spans go in a **bounded in-memory ring buffer**, NOT the durable KV
(persisting every span would hammer persist like the old `relations/relate` re-saturation bug).
Sample + export on demand.
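
The bounded, drop-oldest buffer from the last rule can be sketched as follows. This is a Python illustration under assumptions, not the SX implementation: the class name `SpanRing` and its method names are hypothetical and only mirror `otel/record!`, `otel/recent`, and `set-cap!`.

```python
from collections import deque


class SpanRing:
    """Bounded in-memory buffer: appending past the cap drops the oldest span."""

    def __init__(self, cap=1000):
        # deque(maxlen=...) discards from the left once it is full.
        self._buf = deque(maxlen=cap)

    def record(self, span):
        self._buf.append(span)

    def recent(self, n=None):
        """Return the newest n spans, oldest first (all of them if n is None)."""
        items = list(self._buf)
        if n is None:
            return items
        return items[max(0, len(items) - n):]

    def set_cap(self, cap):
        # Rebuilding from the old deque keeps only the newest `cap` entries.
        self._buf = deque(self._buf, maxlen=cap)
```

Memory stays fixed at the cap regardless of request volume, and nothing touches the durable KV unless an export is requested.
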
## Roadmap — do ONE unchecked `[ ]` per iteration, test, commit, tick the box.
- [x] **P1 — span model + API.** `lib/host/otel.sx`: a span dict `{:trace :span :parent :name
:t0 :t1 :attrs :events}`; `otel/with-span name attrs thunk` (records t0/t1, pushes/pops a
dynamic parent stack so nesting builds the tree); a bounded ring buffer (`otel/record!`,
`otel/recent`, cap ~1000, drop-oldest); `otel/current-span`/`otel/current-trace`. Tests:
nested with-span builds parent links; ring caps at N.
- [x] **P2 — monotonic clock.** Find/confirm a time prim on the OCaml host (the warm-conf
profiler + response cache already measure time; grep `lib/host` + the OCaml bridge). Wrap as
`otel/now-ns`. Tests: monotonic non-decreasing, non-negative, a `with-span` has `t1 >= t0`.
- [ ] **P3 — auto-instrument the handlers.** Wrap route handlers at the `host/make-app` / router
seam (see `lib/host/server.sx`) so every HTTP request becomes a trace: a root span per request
named by method+route, with `{:http.method :http.route :http.status}` attrs. Tests: a request
through the app produces one trace with the right span name + status attr.
- [ ] **P4 — render-fold → SVG waterfall.** A trace → an inline `<svg>` timeline: one `<rect>`
per span, `x` ∝ (t0 - trace.t0), `width` ∝ duration, `y` ∝ depth, a label. Reuse the
compose-fold walk shape. Tests: N spans → N rects; nested spans get increasing y.
- [ ] **P5 — metrics (aggregate-fold).** Fold recent spans → per-route counters (request count)
+ latency histogram (p50/p95/p99 from durations). Tests: known spans → expected counts +
percentiles.
- [ ] **P6 — live dashboard.** `GET /otel` — a reactive island (signals + an SSE stream of new
traces) that renders the waterfall of the latest trace + the metrics strip, updating live
without reload. Reuse the reactive runtime (`sx/sx/reactive-runtime.sx`, `web/`) + Dream
SSE/streaming already in `lib/host`. Tests: the island SSRs; the SSE endpoint emits a span
event; the page lists recent traces.
- [ ] **P7 — OTLP-JSON export.** Serialize spans to the OTLP/JSON schema (resourceSpans →
scopeSpans → spans with traceId/spanId/parentSpanId/name/startTimeUnixNano/endTimeUnixNano/
attributes). `otel/export-otlp traces` → the JSON; POST to an OTLP HTTP collector via an
**injected transport** (so it's testable without a live collector). Tests: OTLP shape matches
the spec for a known trace; the transport receives the payload.
- [ ] **P8 — context propagation + errors.** Parse/emit the W3C `traceparent` header so a trace
spans services (fed with the host's inter-service calls); mark error spans (`:status :error`
+ an event). Tests: traceparent round-trips; an error thunk yields an error span.
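
For P7 and P8, the wire shapes can be sketched like this. It is a Python illustration, not the planned SX code, and makes several assumptions: the function names, the `sx-host` service name, and the `otel.sx` scope name are hypothetical; every attribute value is stringified for brevity (OTLP also has typed values such as `intValue`); ids are assumed to already be lowercase hex of the right length. The transport is any callable taking `(path, body)`, which is what lets a test stand in for a live collector.

```python
import json
import re


def to_otlp(spans, service_name="sx-host"):
    """Map span dicts to the OTLP/JSON nesting: resourceSpans > scopeSpans > spans."""
    def attr(key, value):
        return {"key": key, "value": {"stringValue": str(value)}}

    def one(s):
        out = {
            "traceId": s["trace"],
            "spanId": s["span"],
            "name": s["name"],
            # 64-bit integers are carried as decimal strings in OTLP/JSON.
            "startTimeUnixNano": str(s["t0"]),
            "endTimeUnixNano": str(s["t1"]),
            "attributes": [attr(k, v) for k, v in s.get("attrs", {}).items()],
        }
        if s.get("parent"):
            out["parentSpanId"] = s["parent"]
        return out

    return {"resourceSpans": [{
        "resource": {"attributes": [attr("service.name", service_name)]},
        "scopeSpans": [{"scope": {"name": "otel.sx"},
                        "spans": [one(s) for s in spans]}],
    }]}


def export_otlp(spans, transport):
    """Serialize and hand off to the injected transport (path, body)."""
    return transport("/v1/traces", json.dumps(to_otlp(spans)))


# W3C traceparent, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    m = TRACEPARENT.match(header.strip())
    if not m:
        return None
    trace_id, parent_id, flags = m.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are invalid per the spec
    return {"trace": trace_id, "parent": parent_id,
            "sampled": (int(flags, 16) & 1) == 1}


def emit_traceparent(trace_id, span_id, sampled=True):
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

A test transport can be as small as `lambda path, body: sent.append((path, body))`, after which the captured body is decoded and compared against the expected shape.
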
## Progress log (newest first)
- 2026-07-01 — P2 done. Host time prim is `clock-milliseconds` (OCaml `Unix.gettimeofday`, epoch ms; no dedicated nano/monotonic prim). `otel/now-ns` wraps it as epoch NANOSECONDS (×1e6, the OTLP unit) with a high-water clamp so it never steps backwards → durations non-negative across NTP steps. P1 placeholder counter removed. Suite 23/23 (added: non-negative, monotonic non-decreasing, ns-scale, real with-span t1≥t0 + ns-scale t0).
- 2026-07-01 — P1 done. `lib/host/otel.sx`: span dict + `otel/with-span` (dynamic parent stack builds the trace tree), monotonic id/clock placeholders (P2 replaces now-ns), bounded ring buffer (`record!`/`recent`/`set-cap!`, drop-oldest), `current-span`/`current-trace`, `reset!`. Suite `lib/host/tests/otel.sx` wired into conformance — 18/18 (nested parent links, attrs, ring caps at N drops oldest).
- (append one dated line per iteration)
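
The high-water clamp described in the P2 entry can be sketched as below. This is a Python illustration of the technique, not the SX code: `MonotonicNs` is a hypothetical name, and the injectable `now_ms` callable stands in for the host's `clock-milliseconds` prim so that backward clock steps can be simulated.

```python
import time


class MonotonicNs:
    """Epoch nanoseconds from a millisecond wall clock, clamped so the
    returned value never steps backwards."""

    def __init__(self, now_ms=lambda: time.time() * 1000.0):
        self._now_ms = now_ms  # assumed source of epoch milliseconds
        self._high = 0         # highest value returned so far

    def __call__(self):
        ns = int(self._now_ms() * 1_000_000)  # 1 ms = 1e6 ns
        if ns > self._high:
            self._high = ns
        return self._high
```

While the wall clock is behind the high-water mark, the clamp repeats the last value, so a span that straddles a backward step gets a zero duration rather than a negative one.
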