# OpenTelemetry in SX: loop briefing

**Goal:** self-hosting observability for the SX host: traces, spans, and metrics in **pure SX**, a
**live SVG waterfall dashboard** (reactive island), and **OTLP-JSON export** for interop with
real backends (Jaeger/Grafana). Reference shape: nektro/zig-tracer `src/otel.zig` (the OTLP span
struct + HTTP emit), which covers only the export step here.

**The key insight: a TRACE is a COMPOSITION.** A span has `{name, start, end, parent, attrs}`,
so a trace is a *tree of spans*, the same shape as an object's `:body` composition. So reuse the
existing fold machinery in `lib/host/compose.sx` (render-fold) and `lib/host/execute.sx`
(execute-fold): a span is a *timed effect*; a waterfall is a *render-fold over the span tree*;
OTLP export is an *export-fold*; metrics are an *aggregate-fold*. Don't reinvent; fold.

**Base:** this worktree is branched off `loops/host` (has the composition machinery + Parts A/C:
type-block grammar + type-def editor). You are on branch `loops/otel` in
`/root/rose-ash-loops/otel`.
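
To make the trace-as-composition idea from the key insight concrete, here is a minimal Python sketch (not the SX implementation, and not the API of `lib/host/compose.sx`). The names `Span`, `fold_span`, `waterfall_rows`, and `span_count` are hypothetical; the point is only that one generic tree fold can yield both a render-style result and an aggregate-style result.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, TypeVar

R = TypeVar("R")


@dataclass
class Span:
    name: str
    t0: int  # start time, ns
    t1: int  # end time, ns
    children: list[Span] = field(default_factory=list)


def fold_span(span: Span, step: Callable[[Span, int, list[R]], R], depth: int = 0) -> R:
    """Bottom-up fold: combine a span with the already-folded results of its children."""
    folded = [fold_span(child, step, depth + 1) for child in span.children]
    return step(span, depth, folded)


def waterfall_rows(root: Span) -> list[tuple[str, int, int, int]]:
    """Render-style fold: (name, offset from trace start, duration, depth), parent first."""
    def step(span: Span, depth: int, kids: list[list[tuple[str, int, int, int]]]):
        row = (span.name, span.t0 - root.t0, span.t1 - span.t0, depth)
        return [row] + [r for rows in kids for r in rows]
    return fold_span(root, step)


def span_count(root: Span) -> int:
    """Aggregate-style fold: count the spans in the tree."""
    return fold_span(root, lambda span, depth, kids: 1 + sum(kids))


# Hypothetical trace: a request root with two child spans.
trace = Span("GET /feed", 0, 100, [Span("db", 10, 40), Span("render", 50, 90)])
print(waterfall_rows(trace))  # [('GET /feed', 0, 100, 0), ('db', 10, 30, 1), ('render', 50, 40, 1)]
print(span_count(trace))      # 3
```

Only the `step` function differs between the two results; the walk itself is shared, which is the reuse the briefing asks for.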

## Rules

- **Test-first.** Write the failing test, then implement to green.
- **Fast tests via the warm server:** `bash lib/host/warm-conf.sh run <suite>` (starts a warm
  persistent server; `run` alone runs the full conformance; `eval "<expr>"` for a REPL probe).
  For a new suite, add it to the runner the same way the `lib/host/tests/*.sx` suites are wired.
- **Do NOT deploy to the live container.** blog.rose-ash.com is bind-mounted from
  `/root/rose-ash-loops/host` (a *different* worktree). Build + test only; integration/deploy
  happens when this branch is merged. (If you want a live smoke test, ask; don't recreate the
  shared container.)
- **`.sx` editing:** prefer `sx_write_file` (validates on parse); if the sx-tree WRITE tools raise
  a yojson-null error in this worktree, fall back to the `Write` tool + `sx_validate`.
- Commit each increment to `loops/otel` with a short factual message. Never push to `main`.
- **Cheap by construction:** spans go in a **bounded in-memory ring buffer**, NOT the durable KV
  (persisting every span would hammer persist like the old `relations/relate` re-saturation bug).
  Sample + export on demand.
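
The "cheap by construction" rule can be illustrated with a small Python sketch of a bounded, drop-oldest buffer. This is an illustration under assumptions, not the SX code: the class `SpanRing` and its method names are hypothetical, and returning spans oldest-first is a choice made here, not something the briefing specifies.

```python
from collections import deque


class SpanRing:
    """Bounded in-memory span store: once full, each new span evicts the oldest."""

    def __init__(self, cap: int = 1000):
        # deque(maxlen=cap) discards from the left when an append would exceed cap.
        self._buf = deque(maxlen=cap)

    def record(self, span: dict) -> None:
        self._buf.append(span)

    def recent(self) -> list[dict]:
        # Snapshot copy, oldest first (an assumption for this sketch).
        return list(self._buf)


ring = SpanRing(cap=3)
for i in range(5):
    ring.record({"name": f"span-{i}"})

print([s["name"] for s in ring.recent()])  # ['span-2', 'span-3', 'span-4']
```

Memory stays constant regardless of request volume, and nothing touches durable storage until an export is requested.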

## Roadmap: do ONE unchecked `[ ]` per iteration, test, commit, tick the box.

- [x] **P1: span model + API.** `lib/host/otel.sx`: a span dict
  `{:trace :span :parent :name :t0 :t1 :attrs :events}`; `otel/with-span name attrs thunk`
  (records t0/t1, pushes/pops a dynamic parent stack so nesting builds the tree); a bounded ring
  buffer (`otel/record!`, `otel/recent`, cap ~1000, drop-oldest);
  `otel/current-span`/`otel/current-trace`. Tests: nested with-span builds parent links; ring
  caps at N.
- [x] **P2: monotonic clock.** Find/confirm a time prim on the OCaml host (the warm-conf
  profiler + response cache already measure time; grep `lib/host` + the OCaml bridge). Wrap as
  `otel/now-ns`. Tests: monotonic non-decreasing, non-negative, a `with-span` has `t1 >= t0`.
- [x] **P3: auto-instrument the handlers.** Wrap route handlers at the `host/make-app` / router
  seam (see `lib/host/server.sx`) so every HTTP request becomes a trace: a root span per request
  named by method+route, with `{:http.method :http.route :http.status}` attrs. Tests: a request
  through the app produces one trace with the right span name + status attr.
- [x] **P4: render-fold → SVG waterfall.** A trace → an inline `<svg>` timeline: one `<rect>`
  per span, `x` ∝ (t0 − trace.t0), `width` ∝ duration, `y` ∝ depth, a label. Reuse the
  compose-fold walk shape. Tests: N spans → N rects; nested spans get increasing y.
- [x] **P5: metrics (aggregate-fold).** Fold recent spans → per-route counters (request count)
  and a latency histogram (p50/p95/p99 from durations). Tests: known spans → expected counts
  and percentiles.
- [x] **P6: live dashboard.** `GET /otel`: a reactive island (signals + an SSE stream of new
  traces) that renders the waterfall of the latest trace + the metrics strip, updating live
  without reload. Reuse the reactive runtime (`sx/sx/reactive-runtime.sx`, `web/`) + Dream
  SSE/streaming already in `lib/host`. Tests: the island SSRs; the SSE endpoint emits a span
  event; the page lists recent traces.
- [x] **P7: OTLP-JSON export.** Serialize spans to the OTLP/JSON schema (resourceSpans →
  scopeSpans → spans with
  traceId/spanId/parentSpanId/name/startTimeUnixNano/endTimeUnixNano/attributes).
  `otel/export-otlp traces` → the JSON; POST to an OTLP HTTP collector via an
  **injected transport** (so it's testable without a live collector). Tests: OTLP shape matches
  the spec for a known trace; the transport receives the payload.
- [x] **P8: context propagation + errors.** Parse/emit the W3C `traceparent` header so a trace
  spans services (fed with the host's inter-service calls); mark error spans (`:status :error`
  plus an event). Tests: traceparent round-trips; an error thunk yields an error span.
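
As an illustration of the P8 header handling, the following Python sketch formats and parses a W3C `traceparent` value (`version-traceid-parentid-flags`, lowercase hex with widths 2/32/16/2). It is a sketch under assumptions rather than the SX code: the function names and the returned dict keys are hypothetical, and it accepts only the four-field form.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Emit a version-00 header; the low bit of the flags byte is the sampled flag."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str):
    """Return the header's fields as a dict, or None if it is malformed."""
    m = _TRACEPARENT.fullmatch(header.strip())
    if m is None:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # The spec forbids version ff and all-zero trace or parent ids.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 1),
    }


header = format_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(header)  # 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
print(parse_traceparent(header)["sampled"])  # True
print(parse_traceparent("00-abc-def-01"))    # None
```

Returning `None` for anything malformed lets the caller fall back to starting a fresh trace instead of failing the request.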

## Progress log (newest first)

- 2026-07-01 — P8 done — **ROADMAP COMPLETE (P1–P8, 124/124)**. `otel/format-traceparent`/`otel/current-traceparent` emit W3C `00-<32hex trace>-<16hex span>-01`; `otel/parse-traceparent` → `{:version :trace-id :parent-id :flags :sampled}`, nil on malformed/bad-width — round-trips. `otel/-timed` now GUARDS the thunk: success spans get top-level `:status "ok"` (attrs untouched), a raised error records a span with `:status "error"` + an `{:name "exception" :message}` event, pops the stack, and propagates. 20 new tests (traceparent round-trip + current + malformed; error span status/name/event/message + clean stack; success=ok). GOTCHA (saved to memory): an explicit `(raise e)` inside a guard handler RE-ENTERS the same guard and hangs — propagate instead via a clause whose TEST does the side-effect and returns `false`, letting R7RS guard auto-reraise to the outer handler.
- 2026-07-01 — P7 done. `otel/export-otlp spans` folds → the OTLP/JSON envelope `{:resourceSpans [{:resource … :scopeSpans [{:scope … :spans […]}]}]}`; each span has hex `traceId`(32)/`spanId`(16)/`parentSpanId` (from `otel/-pad-hex` of the numeric id suffix via `string->number`+`number->string _ 16`), uint64-as-string `startTimeUnixNano`/`endTimeUnixNano`, typed `attributes` (number→`intValue`, else `stringValue`), and `kind` (2 SERVER if http.method, else 1 INTERNAL); root omits `parentSpanId`. `otel/export-otlp-json` → `dream-json-encode`. `otel/post-otlp endpoint spans transport` POSTs `{:method :url :headers :body}` through an INJECTED transport (tests pass a recorder; real deploy passes http POST). Suite 104/104 (26 new: nesting depth, hex widths+values, string timestamps, kinds, typed attrs, parentSpanId link, json+transport, empty envelope). All needed prims (`string->number`,`number->string`radix,`split`,`keys`,`assoc`,`has-key?`,`dream-json-encode`) are real (not server-env), so conformance-safe.
- 2026-07-01 — P6 done. `GET /otel` (`otel/dashboard-route`) SSRs `otel/dashboard`: metrics strip (table) + latest-trace waterfall `<svg>` + recent-traces `<ul>`, on a root carrying Datastar-style `data-on-load="@get('/otel/stream')"`. `GET /otel/stream` (`otel/stream-route`) emits an SSE frame `event: otel.span\ndata: <sxtp event>` — `otel/span-event` wraps a span as an SXTP `event` (the host's Datastar-borrowed wire format), `otel/-stream-body` frames the latest. Plus `otel/recent-traces` (newest-first {:trace :name :spans}) + `otel/latest-trace`. `otel/routes` mounts via make-app. Suite 78/78 (17 new: recent-traces order, SSR svg+strip+id+sub, SSE event-stream/framing/name, GET /otel via make-app, empty-ring placeholder). DECISION: SSR + declarative reactive attrs + SSE patches IS the reactive-island model here (sxtp = Datastar); SSRs via `render-to-html` (plain HTML tags, not `render-page` which is a server-env prim unavailable in conformance). Live client hydration = deploy concern, out of build+test scope.
- 2026-07-01 — P5 done. `otel/metrics spans` → `{:total-requests N :routes (…)}`; each route = `{:route :count :p50 :p95 :p99}`, route key = `:http.route` attr (falls back to span name). Nearest-rank percentiles (rank=ceil(p/100·N), 1-based) over per-route durations; needed a hand-rolled `otel/-insert`/`otel/-sort-nums` (no `sort` prim) + order-preserving `otel/-distinct`. `otel/metrics-recent` = over the ring. Suite 61/61 (11 new: total, 2 routes, feed count + p50=30/p95=50/p99=50 from [10..50], single-sample p50, sort helper, empty→zeroed). Note: `/` is float division here so `ceil(p/100·N)` is exact.
- 2026-07-01 — P4 done. `otel/waterfall-rects` folds a trace's spans → rect geometry (x ∝ t0−trace.t0, width ∝ duration, y ∝ depth via `otel/-depth` parent-link ancestor count; zero-dur spans get a 1px sliver). `otel/waterfall` folds those into an inline `(svg … (g (rect …) (text …)) …)` — one rect+label per span — which `render-to-html` emits as real SVG (verified: nested `db` span at y=22 below its `GET /feed` root at y=4). Suite 50/50 (13 new: rect-per-span, depth 0/1/2, increasing-y with nesting, positive widths, svg head + rect/label counts via `otel/-tree-count`, empty-trace). GOTCHA: this evaluator's quasiquote splice symbol is `splice-unquote`, NOT `unquote-splicing` (plain `unquote` is fine) — the wrong name serialised literally and produced 0 rects.
- 2026-07-01 — P3 done. `otel/instrument-routes` wraps each flattened Dream route's handler in a timed span "METHOD /route" with `{:http.method :http.route :http.status}`; `host/make-app` applies it (seam) so every matched request is a trace. Refactored `with-span` onto a shared `otel/-timed` core with a `finalize` fn for result-derived attrs (http.status is only known post-handler; bare-string handler results coerced → 200). Suite 37/37; server 13/13 unchanged. NOTE: cold `conformance.sh feed|relations|blog` currently fail at test-file load with `Undefined symbol: parse-safe`/`render-page` — these are `bind`-registered server-env prims in `sx_server.ml` not resolving in the current shared binary's epoch context; **pre-existing & environmental** (reproduces with my P3 changes stashed), NOT caused by this work. otel/server/page suites unaffected.
- 2026-07-01 — P2 done. Host time prim is `clock-milliseconds` (OCaml `Unix.gettimeofday`, epoch ms; no dedicated nano/monotonic prim). `otel/now-ns` wraps it as epoch NANOSECONDS (×1e6, the OTLP unit) with a high-water clamp so it never steps backwards → durations non-negative across NTP steps. P1 placeholder counter removed. Suite 23/23 (added: non-negative, monotonic non-decreasing, ns-scale, real with-span t1≥t0 + ns-scale t0).
- 2026-07-01 — P1 done. `lib/host/otel.sx`: span dict + `otel/with-span` (dynamic parent stack builds the trace tree), monotonic id/clock placeholders (P2 replaces now-ns), bounded ring buffer (`record!`/`recent`/`set-cap!`, drop-oldest), `current-span`/`current-trace`, `reset!`. Suite `lib/host/tests/otel.sx` wired into conformance — 18/18 (nested parent links, attrs, ring caps at N drops oldest).
- (append one dated line per iteration)
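
The nearest-rank percentile rule recorded in the P5 entry (rank = ceil(p/100·N), 1-based) can be checked with a short Python sketch. The function name `nearest_rank` is hypothetical, and Python's built-in `sorted` stands in for the hand-rolled sort the SX version needed; the sample values are the `[10..50]` durations quoted in that entry.

```python
import math


def nearest_rank(durations: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at 1-based rank ceil(p/100 * N) after sorting."""
    if not durations:
        return 0  # empty input is reported as zero, as in the P5 log entry
    ordered = sorted(durations)
    # Multiply before dividing so whole-number ranks are not nudged up by float error.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


samples = [10, 20, 30, 40, 50]
print(nearest_rank(samples, 50))  # 30
print(nearest_rank(samples, 95))  # 50
print(nearest_rank(samples, 99))  # 50
```

With five samples the ranks are ceil(2.5) = 3, ceil(4.75) = 5, and ceil(4.95) = 5, which matches the p50=30/p95=50/p99=50 values recorded in the log.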