OpenTelemetry in SX — loop briefing

Goal: self-hosting observability for the SX host — traces/spans/metrics in pure SX, a live SVG waterfall dashboard (reactive island), and OTLP-JSON export for interop with real backends (Jaeger/Grafana). Reference shape: nektro/zig-tracer src/otel.zig (the OTLP span struct + HTTP emit) — that's just the export step here.

The key insight — a TRACE is a COMPOSITION. A span has {name, start, end, parent, attrs}, so a trace is a tree of spans — the same shape as an object's :body composition. So reuse the existing fold machinery in lib/host/compose.sx (render-fold) and lib/host/execute.sx (execute-fold): a span is a timed effect; a waterfall is a render-fold over the span tree; OTLP export is an export-fold; metrics are an aggregate-fold. Don't reinvent — fold.
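
To make the "trace is a tree, so fold it" idea concrete, here is a minimal sketch in Python rather than SX. It is not the render-fold or execute-fold from lib/host; the function name `fold_trace`, the visitor signature, and the dict keys are assumptions chosen to mirror the span shape described above. It shows how one depth-first walk over parent-linked spans can serve both a rendering-style fold and an aggregate-style fold.

```python
from collections import defaultdict

def fold_trace(spans, visit, acc):
    """Depth-first fold over a flat list of spans linked by 'parent'.

    visit(acc, span, depth) returns the new accumulator.
    """
    # Index children by parent id; root spans have parent None.
    children = defaultdict(list)
    for s in spans:
        children[s["parent"]].append(s)

    def walk(span, depth, acc):
        acc = visit(acc, span, depth)
        for child in sorted(children[span["span"]], key=lambda c: c["t0"]):
            acc = walk(child, depth + 1, acc)
        return acc

    for root in sorted(children[None], key=lambda c: c["t0"]):
        acc = walk(root, 0, acc)
    return acc

# Hypothetical trace: one request with two child spans.
spans = [
    {"span": "a", "parent": None, "name": "GET /feed", "t0": 0, "t1": 100},
    {"span": "b", "parent": "a", "name": "db", "t0": 10, "t1": 60},
    {"span": "c", "parent": "a", "name": "render", "t0": 60, "t1": 95},
]

# Rendering-style fold: collect (depth, name) rows in tree order.
outline = fold_trace(spans, lambda acc, s, d: acc + [(d, s["name"])], [])

# Aggregate-style fold: sum every span's duration.
total = fold_trace(spans, lambda acc, s, d: acc + (s["t1"] - s["t0"]), 0)
```

Only the visitor changes between the two uses, which is the property the waterfall, export, and metrics steps are meant to share.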

Base: this worktree is branched off loops/host (has the composition machinery + Parts A/C: type-block grammar + type-def editor). You are on branch loops/otel in /root/rose-ash-loops/otel.

Rules

  • Test-first. Write the failing test, then implement to green.
  • Fast tests via the warm server: bash lib/host/warm-conf.sh run <suite> (starts a warm persistent server; run alone = full conformance; eval "<expr>" for a REPL probe). New suite → add it to the runner the same way lib/host/tests/*.sx are wired.
  • Do NOT deploy to the live container. blog.rose-ash.com is bind-mounted from /root/rose-ash-loops/host (a different worktree). Build + test only; integration/deploy happens when this branch is merged. (If you want a live smoke, ask — don't recreate the shared container.)
  • .sx editing: prefer sx_write_file (validates on parse); if the sx-tree WRITE tools raise a yojson-null error in this worktree, fall back to the Write tool + sx_validate.
  • Commit each increment to loops/otel with a short factual message. Never push to main.
  • Cheap by construction: spans go in a bounded in-memory ring buffer, NOT the durable KV (persisting every span would hammer persist like the old relations/relate re-saturation bug). Sample + export on demand.
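
The "cheap by construction" rule can be sketched as a fixed-capacity buffer that silently drops its oldest entry. This is a Python illustration, not the SX implementation; the class name `SpanRing` and its methods are hypothetical stand-ins for the ring-buffer API the roadmap describes, and the default cap of 1000 follows the roadmap's "cap ~1000".

```python
from collections import deque

class SpanRing:
    """Bounded in-memory span store with drop-oldest semantics."""

    def __init__(self, cap=1000):
        # deque(maxlen=cap) discards from the left once it is full.
        self._buf = deque(maxlen=cap)

    def record(self, span):
        self._buf.append(span)

    def recent(self, n=None):
        """Return the newest n spans, oldest first (all spans if n is None)."""
        items = list(self._buf)
        if n is None:
            return items
        return items[max(len(items) - n, 0):]

ring = SpanRing(cap=3)
for i in range(5):
    ring.record({"span": i})
```

After five records into a cap-3 ring, only spans 2, 3, and 4 remain, so memory stays constant regardless of request volume and nothing touches the durable KV.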

Roadmap — do ONE unchecked [ ] per iteration, test, commit, tick the box.

  • P1 — span model + API. lib/host/otel.sx: a span dict {:trace :span :parent :name :t0 :t1 :attrs :events}; otel/with-span name attrs thunk (records t0/t1, pushes/pops a dynamic parent stack so nesting builds the tree); a bounded ring buffer (otel/record!, otel/recent, cap ~1000, drop-oldest); otel/current-span and otel/current-trace. Tests: nested with-span builds parent links; ring caps at N.
  • P2 — monotonic clock. Find/confirm a time prim on the OCaml host (the warm-conf profiler + response cache already measure time; grep lib/host + the OCaml bridge). Wrap as otel/now-ns. Tests: monotonic non-decreasing, non-negative, a with-span has t1 >= t0.
  • P3 — auto-instrument the handlers. Wrap route handlers at the host/make-app / router seam (see lib/host/server.sx) so every HTTP request becomes a trace: a root span per request named by method+route, with {:http.method :http.route :http.status} attrs. Tests: a request through the app produces one trace with the right span name + status attr.
  • P4 — render-fold → SVG waterfall. A trace → an inline <svg> timeline: one <rect> per span, x ∝ (t0 − trace.t0), width ∝ duration, y ∝ depth, a label. Reuse the compose-fold walk shape. Tests: N spans → N rects; nested spans get increasing y.
  • P5 — metrics (aggregate-fold). Fold recent spans → per-route counters (request count) + latency histogram (p50/p95/p99 from durations). Tests: known spans → expected counts + percentiles.
  • P6 — live dashboard. GET /otel — a reactive island (signals + an SSE stream of new traces) that renders the waterfall of the latest trace + the metrics strip, updating live without reload. Reuse the reactive runtime (sx/sx/reactive-runtime.sx, web/) + Dream SSE/streaming already in lib/host. Tests: the island SSRs; the SSE endpoint emits a span event; the page lists recent traces.
  • P7 — OTLP-JSON export. Serialize spans to the OTLP/JSON schema (resourceSpans → scopeSpans → spans with traceId/spanId/parentSpanId/name/startTimeUnixNano/endTimeUnixNano/attributes). otel/export-otlp traces → the JSON; POST to an OTLP HTTP collector via an injected transport (so it's testable without a live collector). Tests: OTLP shape matches the spec for a known trace; the transport receives the payload.
  • P8 — context propagation + errors. Parse/emit the W3C traceparent header so a trace spans services (fed with the host's inter-service calls); mark error spans (:status :error + an event). Tests: traceparent round-trips; an error thunk yields an error span.
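
As an illustration of the P7 step, the sketch below shows the OTLP/JSON nesting and the injected-transport idea in Python rather than SX. The function names (`export_otlp`, `send`), the `service_name` and scope name values, and the transport signature are assumptions made for this example. It also assumes span and trace ids are already hex strings of the right length. Two details come from the OTLP/JSON encoding itself: 64-bit integers (timestamps, intValue) are serialized as decimal strings, and `/v1/traces` is the default OTLP/HTTP traces path.

```python
import json

def to_otlp_value(v):
    """Map a Python attribute value to an OTLP AnyValue."""
    # bool must be checked before int: bool is a subclass of int in Python.
    if isinstance(v, bool):
        return {"boolValue": v}
    if isinstance(v, int):
        return {"intValue": str(v)}  # int64 is a string in OTLP/JSON
    if isinstance(v, float):
        return {"doubleValue": v}
    return {"stringValue": str(v)}

def export_otlp(spans, service_name="sx-host"):
    """Wrap spans as resourceSpans -> scopeSpans -> spans."""
    otlp_spans = []
    for s in spans:
        item = {
            "traceId": s["trace"],
            "spanId": s["span"],
            "name": s["name"],
            "startTimeUnixNano": str(s["t0"]),
            "endTimeUnixNano": str(s["t1"]),
            "attributes": [
                {"key": k, "value": to_otlp_value(v)}
                for k, v in s["attrs"].items()
            ],
        }
        if s["parent"] is not None:  # root spans omit parentSpanId
            item["parentSpanId"] = s["parent"]
        otlp_spans.append(item)
    return {"resourceSpans": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": service_name}},
        ]},
        "scopeSpans": [{"scope": {"name": "otel.sx"}, "spans": otlp_spans}],
    }]}

def send(spans, transport):
    """POST the payload through an injected transport(path, headers, body)."""
    body = json.dumps(export_otlp(spans))
    return transport("/v1/traces", {"Content-Type": "application/json"}, body)
```

In a test, the transport can be a plain function that appends its arguments to a list, so the payload shape can be asserted without a live collector.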

Progress log (newest first)

  • 2026-07-01 — P4 done. otel/waterfall-rects folds a trace's spans → rect geometry (x ∝ t0 − trace.t0, width ∝ duration, y ∝ depth via otel/-depth parent-link ancestor count; zero-dur spans get a 1px sliver). otel/waterfall folds those into an inline (svg … (g (rect …) (text …)) …) — one rect+label per span — which render-to-html emits as real SVG (verified: nested db span at y=22 below its GET /feed root at y=4). Suite 50/50 (13 new: rect-per-span, depth 0/1/2, increasing-y with nesting, positive widths, svg head + rect/label counts via otel/-tree-count, empty-trace). GOTCHA: this evaluator's quasiquote splice symbol is splice-unquote, NOT unquote-splicing (plain unquote is fine) — the wrong name serialised literally and produced 0 rects.
  • 2026-07-01 — P3 done. otel/instrument-routes wraps each flattened Dream route's handler in a timed span "METHOD /route" with {:http.method :http.route :http.status}; host/make-app applies it (seam) so every matched request is a trace. Refactored with-span onto a shared otel/-timed core with a finalize fn for result-derived attrs (http.status is only known post-handler; bare-string handler results coerced → 200). Suite 37/37; server 13/13 unchanged. NOTE: cold conformance.sh feed|relations|blog currently fail at test-file load with Undefined symbol: parse-safe/render-page — these are bind-registered server-env prims in sx_server.ml not resolving in the current shared binary's epoch context; pre-existing & environmental (reproduces with my P3 changes stashed), NOT caused by this work. otel/server/page suites unaffected.
  • 2026-07-01 — P2 done. Host time prim is clock-milliseconds (OCaml Unix.gettimeofday, epoch ms; no dedicated nano/monotonic prim). otel/now-ns wraps it as epoch NANOSECONDS (×1e6, the OTLP unit) with a high-water clamp so it never steps backwards → durations non-negative across NTP steps. P1 placeholder counter removed. Suite 23/23 (added: non-negative, monotonic non-decreasing, ns-scale, real with-span t1≥t0 + ns-scale t0).
  • 2026-07-01 — P1 done. lib/host/otel.sx: span dict + otel/with-span (dynamic parent stack builds the trace tree), monotonic id/clock placeholders (P2 replaces now-ns), bounded ring buffer (record!/recent/set-cap!, drop-oldest), current-span/current-trace, reset!. Suite lib/host/tests/otel.sx wired into conformance — 18/18 (nested parent links, attrs, ring caps at N drops oldest).
  • (append one dated line per iteration)
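
The rect geometry recorded in the P4 entry above can be sketched as follows. This is a Python illustration, not the code in lib/host/otel.sx. The canvas width of 600 is hypothetical, and the 18px row height and 4px top offset are inferred only from the logged values y=4 (root) and y=22 (depth 1).

```python
def depth(span, by_id):
    """Count ancestors by following parent links up to the root."""
    d = 0
    parent = span["parent"]
    while parent is not None:
        d += 1
        parent = by_id[parent]["parent"]
    return d

def waterfall_rects(spans, width_px=600, row_px=18, top_px=4, min_px=1):
    """One rect per span: x from start offset, w from duration, y from depth."""
    if not spans:
        return []
    by_id = {s["span"]: s for s in spans}
    t_start = min(s["t0"] for s in spans)
    t_end = max(s["t1"] for s in spans)
    # Guard against a zero-length trace to avoid dividing by zero.
    scale = width_px / max(t_end - t_start, 1)
    rects = []
    for s in spans:
        rects.append({
            "name": s["name"],
            "x": (s["t0"] - t_start) * scale,
            # Zero-duration spans still get a visible sliver.
            "w": max((s["t1"] - s["t0"]) * scale, min_px),
            "y": top_px + row_px * depth(s, by_id),
        })
    return rects

# Hypothetical trace: a root, a nested db call, and a zero-duration marker.
rects = waterfall_rects([
    {"span": "a", "parent": None, "name": "GET /feed", "t0": 0, "t1": 100},
    {"span": "b", "parent": "a", "name": "db", "t0": 10, "t1": 60},
    {"span": "c", "parent": "b", "name": "mark", "t0": 60, "t1": 60},
])
```

With these inputs the root sits at y=4, the db span at y=22, and the zero-duration span is clamped to a 1px width.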