OpenTelemetry in SX — loop briefing

Goal: self-hosting observability for the SX host — traces/spans/metrics in pure SX, a live SVG waterfall dashboard (reactive island), and OTLP-JSON export for interop with real backends (Jaeger/Grafana). Reference shape: nektro/zig-tracer src/otel.zig (the OTLP span struct + HTTP emit) — that's just the export step here.

The key insight — a TRACE is a COMPOSITION. A span has {name, start, end, parent, attrs}, so a trace is a tree of spans — the same shape as an object's :body composition. So reuse the existing fold machinery in lib/host/compose.sx (render-fold) and lib/host/execute.sx (execute-fold): a span is a timed effect; a waterfall is a render-fold over the span tree; OTLP export is an export-fold; metrics are an aggregate-fold. Don't reinvent — fold.
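The "one walk, many folds" idea above can be sketched outside SX. The following is a minimal Python illustration, not the actual `render-fold`/`execute-fold` API: `fold_trace`, `render_text`, and `count_spans` are hypothetical names, and the span keys merely mirror the `{name, start, end, parent, attrs}` shape described above.

```python
from typing import Any, Callable, Dict, List, Optional

Span = Dict[str, Any]


def fold_trace(
    spans: List[Span],
    visit: Callable[[Span, int, List[Any]], Any],
    parent: Optional[str] = None,
    depth: int = 0,
) -> List[Any]:
    """Depth-first fold over a span tree reconstructed from parent links.

    `visit` receives the span, its depth, and the already-folded results
    of its children, and returns the folded value for that span.
    """
    children = sorted(
        (s for s in spans if s["parent"] == parent), key=lambda s: s["t0"]
    )
    return [
        visit(s, depth, fold_trace(spans, visit, s["span"], depth + 1))
        for s in children
    ]


def render_text(span: Span, depth: int, kids: List[str]) -> str:
    """A render-style fold: one indented line per span, children below."""
    line = "  " * depth + f'{span["name"]} ({span["t1"] - span["t0"]})'
    return "\n".join([line] + kids)


def count_spans(span: Span, depth: int, kids: List[int]) -> int:
    """An aggregate-style fold: total number of spans in the subtree."""
    return 1 + sum(kids)


trace = [
    {"span": "s1", "parent": None, "name": "GET /feed", "t0": 0, "t1": 100},
    {"span": "s2", "parent": "s1", "name": "db", "t0": 10, "t1": 40},
    {"span": "s3", "parent": "s1", "name": "render", "t0": 50, "t1": 90},
]

print(fold_trace(trace, render_text)[0])
print(fold_trace(trace, count_spans)[0])
```

Only the `visit` function changes between the two folds; the tree walk is shared, which is the property the briefing asks the waterfall, export, and metrics steps to reuse.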

Base: this worktree is branched off loops/host (has the composition machinery + Parts A/C: type-block grammar + type-def editor). You are on branch loops/otel in /root/rose-ash-loops/otel.

Rules

  • Test-first. Write the failing test, then implement to green.
  • Fast tests via the warm server: bash lib/host/warm-conf.sh run <suite> (starts a warm persistent server; run alone = full conformance; eval "<expr>" for a REPL probe). New suite → add it to the runner the same way lib/host/tests/*.sx are wired.
  • Do NOT deploy to the live container. blog.rose-ash.com is bind-mounted from /root/rose-ash-loops/host (a different worktree). Build + test only; integration/deploy happens when this branch is merged. (If you want a live smoke, ask — don't recreate the shared container.)
  • .sx editing: prefer sx_write_file (validates on parse); if the sx-tree WRITE tools raise a yojson-null error in this worktree, fall back to the Write tool + sx_validate.
  • Commit each increment to loops/otel with a short factual message. Never push to main.
  • Cheap by construction: spans go in a bounded in-memory ring buffer, NOT the durable KV (persisting every span would hammer persist like the old relations/relate re-saturation bug). Sample + export on demand.
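The bounded, drop-oldest buffer required by the last rule can be sketched as follows. This is a Python illustration under stated assumptions, not the SX implementation: `SpanRing`, `record`, and `recent` are hypothetical stand-ins for the buffer API, and the default cap of 1000 is taken from the roadmap's "cap ~1000".

```python
from collections import deque
from typing import Any, Deque, Dict, List, Optional

Span = Dict[str, Any]


class SpanRing:
    """In-memory ring buffer: once full, each new span evicts the oldest."""

    def __init__(self, cap: int = 1000) -> None:
        # deque(maxlen=cap) discards from the left when appending to a full deque.
        self._buf: Deque[Span] = deque(maxlen=cap)

    def record(self, span: Span) -> None:
        self._buf.append(span)

    def recent(self, n: Optional[int] = None) -> List[Span]:
        """Return the last n spans, oldest first (all of them if n is None)."""
        items = list(self._buf)
        if n is None:
            return items
        return items[max(0, len(items) - n):]


ring = SpanRing(cap=3)
for i in range(5):
    ring.record({"name": f"s{i}"})

print([s["name"] for s in ring.recent()])
```

Memory use is fixed by the cap regardless of request volume, and nothing touches durable storage until an export is requested.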

Roadmap — do ONE unchecked [ ] per iteration, test, commit, tick the box.

  • P1 — span model + API. lib/host/otel.sx: a span dict {:trace :span :parent :name :t0 :t1 :attrs :events}; otel/with-span name attrs thunk (records t0/t1, pushes/pops a dynamic parent stack so nesting builds the tree); a bounded ring buffer (otel/record!, otel/recent, cap ~1000, drop-oldest); otel/current-span and otel/current-trace. Tests: nested with-span builds parent links; ring caps at N.
  • P2 — monotonic clock. Find/confirm a time prim on the OCaml host (the warm-conf profiler + response cache already measure time; grep lib/host + the OCaml bridge). Wrap as otel/now-ns. Tests: monotonic non-decreasing, non-negative, a with-span has t1 >= t0.
  • P3 — auto-instrument the handlers. Wrap route handlers at the host/make-app / router seam (see lib/host/server.sx) so every HTTP request becomes a trace: a root span per request named by method+route, with {:http.method :http.route :http.status} attrs. Tests: a request through the app produces one trace with the right span name + status attr.
  • P4 — render-fold → SVG waterfall. A trace → an inline <svg> timeline: one <rect> per span, x ∝ (t0 - trace.t0), width ∝ duration, y ∝ depth, a label. Reuse the compose-fold walk shape. Tests: N spans → N rects; nested spans get increasing y.
  • P5 — metrics (aggregate-fold). Fold recent spans → per-route counters (request count) + latency histogram (p50/p95/p99 from durations). Tests: known spans → expected counts + percentiles.
  • P6 — live dashboard. GET /otel — a reactive island (signals + an SSE stream of new traces) that renders the waterfall of the latest trace + the metrics strip, updating live without reload. Reuse the reactive runtime (sx/sx/reactive-runtime.sx, web/) + Dream SSE/streaming already in lib/host. Tests: the island SSRs; the SSE endpoint emits a span event; the page lists recent traces.
  • P7 — OTLP-JSON export. Serialize spans to the OTLP/JSON schema (resourceSpans → scopeSpans → spans with traceId/spanId/parentSpanId/name/startTimeUnixNano/endTimeUnixNano/ attributes). otel/export-otlp traces → the JSON; POST to an OTLP HTTP collector via an injected transport (so it's testable without a live collector). Tests: OTLP shape matches the spec for a known trace; the transport receives the payload.
  • P8 — context propagation + errors. Parse/emit the W3C traceparent header so a trace spans services (fed with the host's inter-service calls); mark error spans (:status :error + an event). Tests: traceparent round-trips; an error thunk yields an error span.

Progress log (newest first)

  • 2026-07-01 — P8 done — ROADMAP COMPLETE (P1-P8, 124/124). otel/format-traceparent and otel/current-traceparent emit W3C 00-<32hex trace>-<16hex span>-01; otel/parse-traceparent → {:version :trace-id :parent-id :flags :sampled}, nil on malformed/bad-width — round-trips. otel/-timed now GUARDS the thunk: success spans get top-level :status "ok" (attrs untouched), a raised error records a span with :status "error" + an {:name "exception" :message} event, pops the stack, and propagates. 20 new tests (traceparent round-trip + current + malformed; error span status/name/event/message + clean stack; success=ok). GOTCHA (saved to memory): an explicit (raise e) inside a guard handler RE-ENTERS the same guard and hangs — propagate instead via a clause whose TEST does the side-effect and returns false, letting R7RS guard auto-reraise to the outer handler.
  • 2026-07-01 — P7 done. otel/export-otlp spans folds → the OTLP/JSON envelope {:resourceSpans [{:resource … :scopeSpans [{:scope … :spans […]}]}]}; each span has hex traceId(32)/spanId(16)/parentSpanId (from otel/-pad-hex of the numeric id suffix via string->number+number->string _ 16), uint64-as-string startTimeUnixNano/endTimeUnixNano, typed attributes (number→intValue, else stringValue), and kind (2 SERVER if http.method, else 1 INTERNAL); root omits parentSpanId. otel/export-otlp-json → dream-json-encode. otel/post-otlp endpoint spans transport POSTs {:method :url :headers :body} through an INJECTED transport (tests pass a recorder; real deploy passes http POST). Suite 104/104 (26 new: nesting depth, hex widths+values, string timestamps, kinds, typed attrs, parentSpanId link, json+transport, empty envelope). All needed prims (string->number, number->string with radix, split, keys, assoc, has-key?, dream-json-encode) are real (not server-env), so conformance-safe.
  • 2026-07-01 — P6 done. GET /otel (otel/dashboard-route) SSRs otel/dashboard: metrics strip (table) + latest-trace waterfall <svg> + recent-traces <ul>, on a root carrying Datastar-style data-on-load="@get('/otel/stream')". GET /otel/stream (otel/stream-route) emits an SSE frame event: otel.span\ndata: <sxtp event>; otel/span-event wraps a span as an SXTP event (the host's Datastar-borrowed wire format), otel/-stream-body frames the latest. Plus otel/recent-traces (newest-first {:trace :name :spans}) + otel/latest-trace. otel/routes mounts via make-app. Suite 78/78 (17 new: recent-traces order, SSR svg+strip+id+sub, SSE event-stream/framing/name, GET /otel via make-app, empty-ring placeholder). DECISION: SSR + declarative reactive attrs + SSE patches IS the reactive-island model here (sxtp = Datastar); SSRs via render-to-html (plain HTML tags, not render-page which is a server-env prim unavailable in conformance). Live client hydration = deploy concern, out of build+test scope.
  • 2026-07-01 — P5 done. otel/metrics spans → {:total-requests N :routes (…)}; each route = {:route :count :p50 :p95 :p99}, route key = :http.route attr (falls back to span name). Nearest-rank percentiles (rank=ceil(p/100·N), 1-based) over per-route durations; needed a hand-rolled otel/-insert/otel/-sort-nums (no sort prim) + order-preserving otel/-distinct. otel/metrics-recent = over the ring. Suite 61/61 (11 new: total, 2 routes, feed count + p50=30/p95=50/p99=50 from [10..50], single-sample p50, sort helper, empty→zeroed). Note: / is float division here so ceil(p/100·N) is exact.
  • 2026-07-01 — P4 done. otel/waterfall-rects folds a trace's spans → rect geometry (x ∝ t0 - trace.t0, width ∝ duration, y ∝ depth via otel/-depth parent-link ancestor count; zero-dur spans get a 1px sliver). otel/waterfall folds those into an inline (svg … (g (rect …) (text …)) …) — one rect+label per span — which render-to-html emits as real SVG (verified: nested db span at y=22 below its GET /feed root at y=4). Suite 50/50 (13 new: rect-per-span, depth 0/1/2, increasing-y with nesting, positive widths, svg head + rect/label counts via otel/-tree-count, empty-trace). GOTCHA: this evaluator's quasiquote splice symbol is splice-unquote, NOT unquote-splicing (plain unquote is fine) — the wrong name serialised literally and produced 0 rects.
  • 2026-07-01 — P3 done. otel/instrument-routes wraps each flattened Dream route's handler in a timed span "METHOD /route" with {:http.method :http.route :http.status}; host/make-app applies it (seam) so every matched request is a trace. Refactored with-span onto a shared otel/-timed core with a finalize fn for result-derived attrs (http.status is only known post-handler; bare-string handler results coerced → 200). Suite 37/37; server 13/13 unchanged. NOTE: cold conformance.sh feed|relations|blog currently fail at test-file load with Undefined symbol: parse-safe/render-page — these are bind-registered server-env prims in sx_server.ml not resolving in the current shared binary's epoch context; pre-existing & environmental (reproduces with my P3 changes stashed), NOT caused by this work. otel/server/page suites unaffected.
  • 2026-07-01 — P2 done. Host time prim is clock-milliseconds (OCaml Unix.gettimeofday, epoch ms; no dedicated nano/monotonic prim). otel/now-ns wraps it as epoch NANOSECONDS (×1e6, the OTLP unit) with a high-water clamp so it never steps backwards → durations non-negative across NTP steps. P1 placeholder counter removed. Suite 23/23 (added: non-negative, monotonic non-decreasing, ns-scale, real with-span t1≥t0 + ns-scale t0).
  • 2026-07-01 — P1 done. lib/host/otel.sx: span dict + otel/with-span (dynamic parent stack builds the trace tree), monotonic id/clock placeholders (P2 replaces now-ns), bounded ring buffer (record!/recent/set-cap!, drop-oldest), current-span/current-trace, reset!. Suite lib/host/tests/otel.sx wired into conformance — 18/18 (nested parent links, attrs, ring caps at N drops oldest).
  • (append one dated line per iteration)
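The nearest-rank rule recorded in the P5 entry (rank = ceil(p/100·N), 1-based) can be checked against the logged figures with a short Python sketch. `nearest_rank` and `summarize` are hypothetical names, and the sketch assumes integer percentiles so that the ceiling can be computed with integer arithmetic rather than float division.

```python
from typing import Dict, List


def nearest_rank(durations: List[int], p: int) -> int:
    """Nearest-rank percentile: the value at 1-based rank ceil(p/100 * N).

    Returns 0 for an empty input, matching the "empty -> zeroed" behaviour
    noted in the log. `p` is an integer percentile in 0..100.
    """
    if not durations:
        return 0
    ordered = sorted(durations)
    # Integer ceiling of p * N / 100; clamped to 1 so p=0 picks the minimum.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(durations: List[int]) -> Dict[str, int]:
    """Per-route summary in the shape of the logged metrics keys."""
    return {
        "count": len(durations),
        "p50": nearest_rank(durations, 50),
        "p95": nearest_rank(durations, 95),
        "p99": nearest_rank(durations, 99),
    }


print(summarize([10, 20, 30, 40, 50]))
```

For the five samples 10 through 50, the ranks are ceil(2.5) = 3, ceil(4.75) = 5, and ceil(4.95) = 5, which gives the p50=30, p95=50, p99=50 values quoted in the P5 entry.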