rose-ash/plans/datalog-on-sx.md
giles 9bc70fd2a9
datalog: db + naive eval + safety analysis (Phase 3, 87/87)
db.sx: facts indexed by relation name, rules list, dl-add-fact!
(rejects non-ground), dl-add-rule! (rejects unsafe — head vars
not in positive body). eval.sx: dl-saturate! fixpoint, dl-query
with deduped projected results. Negation and arithmetic raise
clear errors (Phase 4/7 to follow). 15 eval tests: transitive
closure, sibling, same-gen, grandparent, cyclic reach, safety.
2026-05-07 23:41:27 +00:00


Datalog-on-SX: Datalog on the CEK/VM

Datalog is a declarative query language: a restricted subset of Prolog with no function symbols, only relations. Programs are sets of facts and rules; queries ask what follows. Evaluation is bottom-up (fixpoint iteration) rather than Prolog's top-down DFS. Because there are no function symbols, the set of derivable facts is finite, so the fixpoint is always reached: left-recursive rules cannot loop forever, and the same fixpoint machinery lends itself to incremental updates.

The unique angle: Datalog is a natural companion to the Prolog implementation already in progress (lib/prolog/). The parser and term representation can share infrastructure; the evaluator is an entirely different fixpoint engine rather than a DFS solver.

End-state goal: full core Datalog (facts, rules, stratified negation, aggregation, recursion) with a clean SX query API, and a demonstration of Datalog as a query engine for rose-ash data (e.g. federation graph, content relationships).

Ground rules

  • Scope: only touch lib/datalog/** and plans/datalog-on-sx.md. Do not edit spec/, hosts/, shared/, lib/prolog/**, or other lib/<lang>/.
  • Shared-file issues go under "Blockers" below with a minimal repro; do not fix here.
  • SX files: use sx-tree MCP tools only.
  • Architecture: Datalog source → term AST → fixpoint evaluator. No transpiler to SX AST — the evaluator is written in SX and works directly on term structures.
  • Reference: Ramakrishnan & Ullman "A Survey of Deductive Database Systems"; Dalmau "Datalog and Constraint Satisfaction".
  • Commits: one feature per commit. Keep ## Progress log updated and tick boxes.

Architecture sketch

Datalog source text
    │
    ▼
lib/datalog/tokenizer.sx   — atoms, variables, numbers, strings, punct (?- :- , . ( ) [ ])
    │
    ▼
lib/datalog/parser.sx      — facts: atom(args). rules: head :- body. queries: ?- goal.
    │                        No function symbols (only constants and variables in args).
    ▼
lib/datalog/db.sx          — extensional DB (EDB): ground facts; IDB: derived relations;
    │                        clause index by relation name/arity
    ▼
lib/datalog/eval.sx        — bottom-up fixpoint: semi-naive evaluation with delta sets;
    │                        stratification for negation; incremental update API
    ▼
lib/datalog/query.sx       — query API: (datalog-query db goal) → list of substitutions;
                             SX embedding: define facts/rules as SX data directly

Key differences from Prolog:

  • No function symbols — args are atoms, numbers, strings, or variables only. No f(a,b).
  • No cuts — no procedural control.
  • Bottom-up — derive all consequences of all rules before answering; no search tree.
  • Termination guaranteed — no infinite derivation chains (no function symbols → finite Herbrand base).
  • Stratified negation: not(P) is legal iff P does not recursively depend on its own negation.
  • Aggregation: count, sum, min, max over derived tuples (Datalog+).

Roadmap

Phase 1 — tokenizer + parser

  • Tokenizer: atoms (lowercase/quoted), variables (uppercase/_), numbers, strings, punct (( ), ,, .), operators (:-, ?-, <=, >=, !=, <, >, =, +, -, *, /), comments (%, /* */). Note: Datalog proper has no function symbols, so there is no nested f(...) in argument position. The parser nevertheless permits nested compounds so that arithmetic can be expressed; safety analysis (Phase 3) rejects non-arithmetic nesting.
  • Parser:
      - Facts: parent(tom, bob). → {:head (parent tom bob) :body ()}
      - Rules: ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z). → {:head (ancestor X Z) :body ((parent X Y) (ancestor Y Z))}
      - Queries: ?- ancestor(tom, X). → {:query ((ancestor tom X))} (the :query value is always a list of literals; ?- p, q. → {:query ((p) (q))})
      - Negation: not(parent(X,Y)) in body position → {:neg (parent X Y)}
  • Tests in lib/datalog/tests/parse.sx (18) and lib/datalog/tests/tokenize.sx (26). Conformance harness: bash lib/datalog/conformance.sh → 44 / 44 passing.
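
The token classes listed for Phase 1 can be illustrated with a small regex-driven tokenizer. This is a Python sketch for illustration only, not the SX code in lib/datalog/tokenizer.sx; the names `TOKEN_RE` and `tokenize` are hypothetical, and the dict shape only mirrors the {:type :value :pos} tokens mentioned in the progress log.

```python
import re

# Hypothetical token grammar; order matters (comments before operators,
# so "/*" is not read as the "/" operator).
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<comment>%[^\n]*|/\*.*?\*/)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<string>"(?:[^"\\]|\\.)*")
  | (?P<var>[A-Z_][A-Za-z0-9_]*)
  | (?P<atom>[a-z][A-Za-z0-9_]*|'(?:[^'\\]|\\.)*')
  | (?P<op>:-|\?-|<=|>=|!=|[<>=+\-*/])
  | (?P<punct>[(),.\[\]])
""", re.VERBOSE | re.DOTALL)


def tokenize(src):
    """Return a list of {"type", "value", "pos"} dicts, skipping whitespace and comments."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at {pos}")
        kind = m.lastgroup  # name of the alternative that matched
        if kind not in ("ws", "comment"):
            tokens.append({"type": kind, "value": m.group(), "pos": pos})
        pos = m.end()
    return tokens
```

For example, `tokenize("?- p(tom, _).")` yields the values `?-`, `p`, `(`, `tom`, `,`, `_`, `)`, `.` in that order.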

Phase 2 — unification + substitution

  • Ported (not shared) from lib/prolog/ — term walk, no occurs check.
  • dl-unify t1 t2 subst → extended subst dict, or nil on failure.
  • dl-walk, dl-bind, dl-apply-subst, dl-ground?, dl-vars-of.
  • Substitutions are immutable dicts keyed by variable name (string). Lists/tuples unify element-wise (used for arithmetic compounds too).
  • Tests in lib/datalog/tests/unify.sx (28). 72 / 72 conformance.
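
The walk/bind/unify cycle described above can be sketched as follows. This is a Python stand-in under stated assumptions, not the SX implementation: terms are modelled as strings (uppercase or underscore-initial means variable), numbers, and tuples; `_` is treated as an anonymous variable that never binds; and the function names are hypothetical analogues of dl-walk, dl-unify, and dl-apply-subst.

```python
def is_var(t):
    # Assumption: variables are strings starting with an uppercase letter or "_".
    return isinstance(t, str) and (t[:1].isupper() or t.startswith("_"))


def walk(t, subst):
    """Follow variable bindings until reaching a non-variable or an unbound variable."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t


def unify(t1, t2, subst):
    """Return an extended copy of subst, or None on failure. No occurs check."""
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 == "_" or t2 == "_":
        return subst                      # anonymous variable matches anything
    if t1 == t2:
        return subst
    if is_var(t1):
        return {**subst, t1: t2}          # new dict; the input is left untouched
    if is_var(t2):
        return {**subst, t2: t1}
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        for a, b in zip(t1, t2):          # element-wise, threading the subst through
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None


def apply_subst(t, subst):
    """Replace every bound variable in t by its value."""
    t = walk(t, subst)
    if isinstance(t, tuple):
        return tuple(apply_subst(x, subst) for x in t)
    return t
```

Building a fresh dict on every binding is what keeps failed branches of a join from leaking bindings into sibling branches.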

Phase 3 — extensional DB + naive evaluation + safety analysis

  • EDB+IDB combined: {:facts {<rel-name-string> -> (literal ...)}} — relations indexed by name; tuples stored as full literals so they unify directly. Dedup on insert via dl-tuple-equal?.
  • dl-add-fact! db lit (rejects non-ground) and dl-add-rule! db rule (rejects unsafe). dl-program source parses + loads in one step.
  • Naive evaluation dl-saturate! db: iterate rules until no new tuples. dl-find-bindings recursively joins body literals; dl-match-positive unifies a literal against every tuple in the relation.
  • dl-query db goal → list of substitutions over goal's vars, deduplicated. dl-relation db name for derived tuples.
  • Safety analysis at dl-add-rule! time: every head variable except _ must appear in some positive body literal. Built-ins and negated literals do not satisfy safety. Helpers dl-positive-body-vars, dl-rule-unsafe-head-vars exposed for later phases.
  • Negation and arithmetic built-ins error cleanly at saturate time (Phase 4 / Phase 7 will swap in real semantics).
  • Tests in lib/datalog/tests/eval.sx (15): transitive closure, sibling, same-generation, grandparent, cyclic graph reach, six safety cases. 87 / 87 conformance.
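
As an illustration of the naive loop and the safety check described in this phase, here is a minimal Python sketch. It is not the code in db.sx or eval.sx: facts are modelled as tuples in a set, rules as (head, body) pairs, and the helper names (`match`, `find_bindings`, `unsafe_head_vars`, `saturate`) are hypothetical analogues of the dl-* functions. Negation and built-ins are left out.

```python
def is_var(t):
    # Assumption: variables are strings starting with an uppercase letter.
    return isinstance(t, str) and t[:1].isupper()


def match(lit, fact, subst):
    """Match one body literal against a ground fact; return an extended subst or None."""
    if len(lit) != len(fact) or lit[0] != fact[0]:
        return None
    out = dict(subst)
    for arg, val in zip(lit[1:], fact[1:]):
        if is_var(arg):
            if out.setdefault(arg, val) != val:
                return None
        elif arg != val:
            return None
    return out


def find_bindings(body, facts, subst):
    """Join body literals left to right, yielding every satisfying substitution."""
    if not body:
        yield subst
        return
    for fact in facts:
        s = match(body[0], fact, subst)
        if s is not None:
            yield from find_bindings(body[1:], facts, s)


def unsafe_head_vars(head, body):
    """Head variables that occur in no positive body literal."""
    body_vars = {a for lit in body for a in lit[1:] if is_var(a)}
    return [a for a in head[1:] if is_var(a) and a not in body_vars]


def saturate(facts, rules):
    """Apply every rule to the full fact set until no new tuple appears."""
    for head, body in rules:
        bad = unsafe_head_vars(head, body)
        if bad:
            raise ValueError(f"unsafe rule for {head[0]}: {bad}")
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # list(...) finishes the join before the fact set is mutated
            for s in list(find_bindings(body, facts, {})):
                new = tuple(s.get(a, a) for a in head)
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts
```

Because new tuples are only added when absent, a cyclic graph such as edge(a,b), edge(b,a) still reaches a fixpoint: once every reachable pair is present, a full pass adds nothing and the loop exits.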

Phase 4 — semi-naive evaluation (performance)

  • Delta sets: track newly derived tuples per iteration
  • Semi-naive rule: only join against delta tuples from last iteration, not full relation
  • Significant speedup for recursive rules — avoids re-deriving known tuples
  • dl-stratify db → dependency graph + SCC analysis → stratum ordering
  • Tests: verify semi-naive produces same results as naive; benchmark on large ancestor chain
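
The delta idea can be sketched as below. This is a simple semi-naive variant written in Python for illustration, not the planned SX code: for each rule and each body position, that one literal is matched against the previous round's delta while the others use the full relation. All names are hypothetical.

```python
def is_var(t):
    # Assumption: variables are strings starting with an uppercase letter.
    return isinstance(t, str) and t[:1].isupper()


def match(lit, fact, subst):
    """Match one body literal against a ground fact; return an extended subst or None."""
    if len(lit) != len(fact) or lit[0] != fact[0]:
        return None
    out = dict(subst)
    for arg, val in zip(lit[1:], fact[1:]):
        if is_var(arg):
            if out.setdefault(arg, val) != val:
                return None
        elif arg != val:
            return None
    return out


def join(body, delta_pos, total, delta, subst, k=0):
    """Join the body, reading literal number delta_pos from delta and the rest from total."""
    if k == len(body):
        yield subst
        return
    source = delta if k == delta_pos else total
    for fact in source:
        s = match(body[k], fact, subst)
        if s is not None:
            yield from join(body, delta_pos, total, delta, s, k + 1)


def semi_naive(facts, rules):
    """Fixpoint in which every derivation uses at least one tuple from the last round."""
    total = set(facts)
    delta = set(facts)          # round 0: everything counts as new
    while delta:
        derived = set()
        for head, body in rules:
            for i in range(len(body)):
                for s in join(body, i, total, delta, {}):
                    derived.add(tuple(s.get(a, a) for a in head))
        delta = derived - total  # keep only tuples not seen before
        total |= delta
    return total
```

On a chain of five parent edges this derives the 15 ancestor pairs, and each round only considers joins that involve a tuple produced in the round before.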

Phase 5 — stratified negation

  • Dependency graph analysis: which relations depend on which (positively or negatively)
  • Stratification check: error if negation is in a cycle (non-stratifiable program)
  • Evaluation: process strata in order — lower stratum fully computed before using its complement in a higher stratum
  • not(P) in rule body: at evaluation time, check that P is NOT among the facts already computed (the EDB plus the fully saturated relations of lower strata)
  • Tests: non-member (not(member(X,L))), colored-graph (not(same-color(X,Y))), stratification error detection
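
One way to compute a stratum ordering and detect non-stratifiable programs is sketched below. Note the swap: this uses the textbook iterative-relaxation method rather than the dependency graph plus SCC analysis named in the plan, because it fits in a few lines. The rule encoding (head relation name plus a list of (relation, negated) pairs) and the function name are assumptions made for the example.

```python
def stratify(rules):
    """rules: list of (head_rel, [(body_rel, negated), ...]).

    Returns {relation: stratum}. Raises ValueError when some relation
    depends on itself through negation.
    """
    rels = {h for h, _ in rules} | {b for _, body in rules for b, _ in body}
    stratum = {r: 0 for r in rels}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for rel, negated in body:
                # positive edge: head >= body; negative edge: head > body
                need = stratum[rel] + (1 if negated else 0)
                if stratum[head] < need:
                    # a stratifiable program never needs more strata than relations
                    if need >= len(rels):
                        raise ValueError(f"not stratifiable: {head} depends on its own negation")
                    stratum[head] = need
                    changed = True
    return stratum
```

Evaluation would then saturate the rules of stratum 0 first, then stratum 1, and so on, so that every negated relation is complete before it is consulted.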

Phase 6 — aggregation (Datalog+)

  • count(X, Goal) → number of distinct X satisfying Goal
  • sum(X, Goal) → sum of X values satisfying Goal
  • min(X, Goal) / max(X, Goal) → min/max of X satisfying Goal
  • group-by semantics: count(X, sibling(bob, X)) → count of bob's siblings
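  • Aggregation is non-monotonic, like negation, so it cannot run inside the fixpoint loop: evaluate aggregates in a separate pass after the relations they range over have been fully computed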
  • Tests: social network statistics, grade aggregation, inventory sums
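
A post-fixpoint aggregation pass with group-by can be sketched as follows. This Python sketch assumes the relation has already been saturated, that tuples have the shape (relation, arg1, arg2, ...), and that aggregates range over distinct values per group; whether sum should instead range over distinct tuples is a design decision the plan leaves open. The function name and parameters are hypothetical.

```python
from collections import defaultdict


def aggregate(tuples, group_idx, value_idx, op):
    """Group saturated tuples by one column and aggregate another.

    Assumption: aggregates are taken over the set of distinct values in each group.
    """
    groups = defaultdict(set)
    for t in tuples:
        groups[t[group_idx]].add(t[value_idx])
    fn = {"count": len, "sum": sum, "min": min, "max": max}[op]
    return {key: fn(values) for key, values in groups.items()}
```

For instance, counting column 2 of sibling tuples grouped by column 1 gives the number of distinct siblings per person, which corresponds to count(X, sibling(bob, X)) for the group bob.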

Phase 7 — SX embedding API

  • (dl-program facts rules) → database built from SX data directly (no parsing required):
      (dl-program
        '((parent tom bob) (parent tom liz) (parent bob ann))
        '((ancestor X Z :- (parent X Y) (ancestor Y Z))
          (ancestor X Y :- (parent X Y))))
  • (dl-query db '(ancestor tom ?X)) → ((ann) (bob) (liz)) for the three parent facts above; (pat) joins the result once (parent ann pat) has been asserted
  • (dl-assert! db '(parent ann pat)) → incremental fact addition + re-derive
  • (dl-retract! db '(parent tom bob)) → fact removal + re-derive from scratch
  • Integration demo: federation graph query — (ancestor actor1 actor2) over rose-ash ActivityPub follow relationships
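
The difference between asserting and retracting can be sketched by keeping the asserted facts (EDB) apart from the derived set. In this Python sketch the class, its methods, and the injected `saturate` function are all hypothetical; it assumes a negation-free program, where reusing earlier derivations after an assertion is sound, whereas a retraction discards them and re-derives from the remaining EDB.

```python
class Database:
    """Keeps asserted facts separate from derived ones so retraction can start over."""

    def __init__(self, saturate, rules):
        self.saturate = saturate   # assumed signature: (facts, rules) -> set of all facts
        self.rules = rules
        self.edb = set()           # facts asserted by the user
        self.derived = set()       # EDB plus everything the rules imply

    def assert_fact(self, fact):
        self.edb.add(fact)
        # Monotone case: old derivations stay valid, so they seed the new fixpoint.
        self.derived = self.saturate(self.edb | self.derived, self.rules)

    def retract_fact(self, fact):
        self.edb.discard(fact)
        # Derivations that relied on the removed fact may be stale: re-derive from scratch.
        self.derived = self.saturate(self.edb, self.rules)
```

A truly incremental assert would seed the semi-naive delta with just the new fact instead of re-running the whole fixpoint; the sketch only shows why retraction needs the EDB kept separately.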

Phase 8 — Datalog as a query language for rose-ash

  • Schema: map SQLAlchemy model relationships to Datalog EDB facts (e.g. (follows user1 user2), (authored user post), (tagged post tag))
  • Loader: dl-load-from-db! — query PostgreSQL, populate Datalog EDB
  • Query examples:
      - ?- ancestor(me, X), authored(X, Post), tagged(Post, cooking). → posts about cooking by people I follow (transitively)
      - popular(Post) :- tagged(Post, T), count(L, (liked(L, Post))) >= 10. → posts with 10+ likes (this is a rule, so it is loaded with the program and then queried with ?- popular(Post).)
  • Expose as a rose-ash service endpoint: POST /internal/datalog with program + query

Blockers

(none yet)

Progress log

Newest first.

  • 2026-05-07 — Phase 3 done. lib/datalog/db.sx (~250 LOC) holds facts indexed by relation name plus the rules list, with dl-add-fact! / dl-add-rule! (rejects non-ground facts and unsafe rules); lib/datalog/eval.sx (~150 LOC) implements the naive bottom-up fixpoint via dl-find-bindings/dl-match-positive/dl-saturate! and dl-query (deduped projected substitutions). Safety analysis rejects unsafe head vars at load time. Negation and arithmetic built-ins raise clean errors (lifted in later phases). 15 eval tests cover transitive closure, sibling, same-generation, cyclic graph reach, and six safety violations. Conformance 87 / 87.

  • 2026-05-07 — Phase 2 done. lib/datalog/unify.sx (~140 LOC): dl-var? (case + underscore), dl-walk, dl-bind, dl-unify (returns extended dict subst or nil), dl-apply-subst, dl-ground?, dl-vars-of. Substitutions are immutable dicts; assoc builds extended copies. 28 unify tests; conformance now 72 / 72.

  • 2026-05-07 — Phase 1 done. lib/datalog/tokenizer.sx (~190 LOC) emits {:type :value :pos} tokens; lib/datalog/parser.sx (~150 LOC) produces {:head … :body …} / {:query …} clauses, with nested compounds permitted for arithmetic and not(...) desugared to {:neg …}. 44 / 44 via bash lib/datalog/conformance.sh (26 tokenize + 18 parse). Local helpers namespace-prefixed (dl-emit!, dl-peek) after a host-primitive shadow clash. Test harness uses a custom dl-deep-equal? that handles out-of-order dict keys and number repr (equal? fails on dict key order and on 30 vs 30.0).