# Datalog-on-SX: Datalog on the CEK/VM

Datalog is a declarative query language: a restricted subset of Prolog with no function symbols, only relations. Programs are sets of facts and rules; queries ask what follows. Evaluation is bottom-up (fixpoint iteration) rather than Prolog's top-down DFS. Combined with the absence of function symbols, this rules out infinite loops, guarantees termination, and allows efficient incremental updates.

The unique angle: Datalog is a natural companion to the Prolog implementation already in progress (`lib/prolog/`). The parser and term representation can share infrastructure; the evaluator is an entirely different fixpoint engine rather than a DFS solver.

End-state goal: **full core Datalog** (facts, rules, stratified negation, aggregation, recursion) with a clean SX query API, and a demonstration of Datalog as a query engine for rose-ash data (e.g. federation graph, content relationships).

## Ground rules

- **Scope:** only touch `lib/datalog/**` and `plans/datalog-on-sx.md`. Do **not** edit `spec/`, `hosts/`, `shared/`, `lib/prolog/**`, or other `lib//`.
- **Shared-file issues** go under "Blockers" below with a minimal repro; do not fix here.
- **SX files:** use `sx-tree` MCP tools only.
- **Architecture:** Datalog source → term AST → fixpoint evaluator. No transpiler to SX AST: the evaluator is written in SX and works directly on term structures.
- **Reference:** Ramakrishnan & Ullman "A Survey of Deductive Database Systems"; Dalmau "Datalog and Constraint Satisfaction".
- **Commits:** one feature per commit. Keep `## Progress log` updated and tick boxes.

## Architecture sketch

```
Datalog source text
    │
    ▼
lib/datalog/tokenizer.sx   atoms, variables, numbers, strings, punct (?- :- , . ( ) [ ])
    │
    ▼
lib/datalog/parser.sx      facts: atom(args). rules: head :- body. queries: ?- goal.
    │                      No function symbols (only constants and variables in args).
    ▼
lib/datalog/db.sx          extensional DB (EDB): ground facts; IDB: derived relations;
    │                      clause index by relation name/arity
    ▼
lib/datalog/eval.sx        bottom-up fixpoint: semi-naive evaluation with delta sets;
    │                      stratification for negation; incremental update API
    ▼
lib/datalog/query.sx       query API: (dl-query db goal) → list of substitutions;
                           SX embedding: define facts/rules as SX data directly
```

Key differences from Prolog:

- **No function symbols:** args are atoms, numbers, strings, or variables only. No `f(a,b)`.
- **No cuts:** no procedural control.
- **Bottom-up:** derive all consequences of all rules before answering; no search tree.
- **Termination guaranteed:** no infinite derivation chains (no function symbols → finite Herbrand base).
- **Stratified negation:** `not(P)` is legal iff P does not recursively depend on its own negation.
- **Aggregation:** `count`, `sum`, `min`, `max` over derived tuples (Datalog+).

## Roadmap

### Phase 1: tokenizer + parser

- [x] Tokenizer: atoms (lowercase/quoted), variables (uppercase/`_`), numbers, strings, punct (`( )`, `,`, `.`), operators (`:-`, `?-`, `<=`, `>=`, `!=`, `<`, `>`, `=`, `+`, `-`, `*`, `/`), comments (`%`, `/* */`).
  Note: there is no function symbol syntax (no nested `f(...)` in arg position), but the parser permits nested compounds for arithmetic; safety analysis (Phase 3) rejects non-arithmetic nesting.
- [x] Parser:
  - Facts: `parent(tom, bob).` → `{:head (parent tom bob) :body ()}`
  - Rules: `ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z).` → `{:head (ancestor X Z) :body ((parent X Y) (ancestor Y Z))}`
  - Queries: `?- ancestor(tom, X).` → `{:query ((ancestor tom X))}` (`:query` value is always a list of literals; `?- p, q.` → `{:query ((p) (q))}`)
  - Negation: `not(parent(X,Y))` in body position → `{:neg (parent X Y)}`
- [x] Tests in `lib/datalog/tests/parse.sx` (18) and `lib/datalog/tests/tokenize.sx` (26). Conformance harness: `bash lib/datalog/conformance.sh` → 44 / 44 passing.

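The clause shapes produced by the Phase 1 parser can be pictured with a small stand-alone sketch. This is an illustration in Python, not the SX code in `lib/datalog/parser.sx`: the function names (`tokenize`, `parse_literal`, `parse_body`, `parse_clause`) are hypothetical, Python dicts and tuples stand in for SX dicts and lists, and only flat arguments are handled (no quoted atoms, strings, operators, or comments).

```python
import re

# Tokens: ?-  :-  identifiers  integers  and the punctuation ( ) , .
TOKEN = re.compile(r"\s*(\?-|:-|[A-Za-z_][A-Za-z0-9_]*|\d+|[(),.])")

def tokenize(src):
    src = src.strip()
    tokens, pos = [], 0
    while pos < len(src):
        match = TOKEN.match(src, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def expect(tokens, i, wanted):
    if i >= len(tokens) or tokens[i] != wanted:
        raise SyntaxError(f"expected {wanted!r} at token {i}")
    return i + 1

def parse_literal(tokens, i):
    """Parse `name` or `name(arg, ...)` with flat args; return (tuple, next index)."""
    literal = [tokens[i]]
    i += 1
    if i < len(tokens) and tokens[i] == "(":
        i += 1
        while True:
            literal.append(tokens[i])
            i += 1
            if tokens[i] == ")":
                i += 1
                break
            i = expect(tokens, i, ",")
    return tuple(literal), i

def parse_body(tokens, i):
    """Parse comma-separated literals; `not(lit)` becomes {"neg": lit}."""
    body = []
    while True:
        if tokens[i] == "not" and tokens[i + 1] == "(":
            inner, i = parse_literal(tokens, i + 2)
            i = expect(tokens, i, ")")
            body.append({"neg": inner})
        else:
            literal, i = parse_literal(tokens, i)
            body.append(literal)
        if tokens[i] != ",":
            return body, i
        i += 1

def parse_clause(src):
    """Return {"head": ..., "body": [...]} for facts/rules, {"query": [...]} for queries."""
    tokens = tokenize(src)
    if tokens[0] == "?-":
        body, i = parse_body(tokens, 1)
        expect(tokens, i, ".")
        return {"query": body}
    head, i = parse_literal(tokens, 0)
    body = []
    if tokens[i] == ":-":
        body, i = parse_body(tokens, i + 1)
    expect(tokens, i, ".")
    return {"head": head, "body": body}
```

Calling `parse_clause("parent(tom, bob).")` yields a head tuple with an empty body, mirroring `{:head (parent tom bob) :body ()}`; a fact is simply a rule with no body, which lets later stages treat both uniformly.
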
### Phase 2: unification + substitution

- [x] Ported (not shared) from `lib/prolog/`: term walk, no occurs check.
- [x] `dl-unify t1 t2 subst` → extended subst dict, or `nil` on failure.
- [x] `dl-walk`, `dl-bind`, `dl-apply-subst`, `dl-ground?`, `dl-vars-of`.
- [x] Substitutions are immutable dicts keyed by variable name (string). Lists/tuples unify element-wise (used for arithmetic compounds too).
- [x] Tests in `lib/datalog/tests/unify.sx` (28). 72 / 72 conformance.

### Phase 3: extensional DB + naive evaluation + safety analysis

- [x] EDB+IDB combined: `{:facts { -> (literal ...)}}`. Relations are indexed by name; tuples are stored as full literals so they unify directly. Dedup on insert via `dl-tuple-equal?`.
- [x] `dl-add-fact! db lit` (rejects non-ground) and `dl-add-rule! db rule` (rejects unsafe). `dl-program source` parses + loads in one step.
- [x] Naive evaluation `dl-saturate! db`: iterate rules until no new tuples. `dl-find-bindings` recursively joins body literals; `dl-match-positive` unifies a literal against every tuple in the relation.
- [x] `dl-query db goal` → list of substitutions over `goal`'s vars, deduplicated. `dl-relation db name` for derived tuples.
- [x] Safety analysis at `dl-add-rule!` time: every head variable except `_` must appear in some positive body literal. Built-ins and negated literals do not satisfy safety. Helpers `dl-positive-body-vars`, `dl-rule-unsafe-head-vars` exposed for later phases.
- [x] Negation and arithmetic built-ins error cleanly at saturate time (Phase 4 / Phase 7 will swap in real semantics).
- [x] Tests in `lib/datalog/tests/eval.sx` (15): transitive closure, sibling, same-generation, grandparent, cyclic graph reach, six safety cases. 87 / 87 conformance.

### Phase 4: built-in predicates + body arithmetic

Almost every real query needs `<`, `=`, simple arithmetic, and string comparisons in body position. These are not EDB lookups; they are constraints that filter bindings.

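The idea of built-ins as constraints that filter bindings can be sketched as follows. This is a Python illustration under stated assumptions, not the planned SX code: `apply_builtin`, `eval_expr`, `walk`, and `is_var` are hypothetical names, literals are modeled as tuples such as `("<", "X", 10)`, substitutions as plain dicts, and only numeric operands are considered.

```python
import operator

# Comparison and arithmetic operators map straight onto host primitives.
COMPARE = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
           ">=": operator.ge, "=": operator.eq, "!=": operator.ne}
ARITH = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "/": operator.truediv}

def is_var(term):
    """Variables are strings starting with an uppercase letter or underscore."""
    return isinstance(term, str) and term[:1] != "" and (term[0].isupper() or term[0] == "_")

def walk(term, subst):
    """Follow variable bindings until reaching a constant or an unbound variable."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term

def eval_expr(expr, subst):
    """Evaluate a term or an arithmetic tuple like ('+', 'X', 1); None if unbound."""
    if isinstance(expr, tuple):
        op, left, right = expr
        a, b = eval_expr(left, subst), eval_expr(right, subst)
        if a is None or b is None:
            return None
        return ARITH[op](a, b)
    value = walk(expr, subst)
    return None if is_var(value) else value

def apply_builtin(literal, subst):
    """Return the (possibly extended) substitution, or None to drop the candidate."""
    op, left, right = literal
    if op == "is":
        value = eval_expr(right, subst)
        if value is None:           # right-hand side not fully ground
            return None
        target = walk(left, subst)
        if is_var(target):          # bind the left operand
            return {**subst, target: value}
        return subst if target == value else None
    a, b = eval_expr(left, subst), eval_expr(right, subst)
    if a is None or b is None:      # unbound input: drop the candidate
        return None
    return subst if COMPARE[op](a, b) else None

# Candidate substitutions as they might come out of EDB joins.
candidates = [{"X": 3}, {"X": 7}, {"X": 12}]
kept = [s for s in (apply_builtin(("<", "X", 10), c) for c in candidates) if s is not None]
```

Here `kept` retains only the candidates with `X` below 10. Returning a new dict from `is` rather than mutating the input matches the immutable-substitution convention from Phase 2; a comparison never extends the substitution, it only keeps or drops it.
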
- [ ] Recognise built-in predicates in body: `(< X Y)`, `(<= X Y)`, `(> X Y)`, `(>= X Y)`, `(= X Y)`, `(!= X Y)` and arithmetic forms `(is Z (+ X Y))`, `(is Z (- X Y))`, `(is Z (* X Y))`, `(is Z (/ X Y))`.
- [ ] Built-in evaluation: at the join step, after binding variables from EDB lookups, evaluate built-ins as constraints. If any built-in fails or has unbound inputs, drop the candidate substitution.
- [ ] **Safety extension**: `is` binds its left operand iff right operand is fully ground. `(< X Y)` requires both X and Y bound by some prior body literal; reject unsafe rules at `dl-add-rule!` time.
- [ ] Wire arithmetic operators through to SX numeric primitives (no separate Datalog number tower).
- [ ] Tests: range filters, arithmetic derivations, comparison-based queries, safety violation on `(p X) :- (< X 5).`

### Phase 5: semi-naive evaluation (performance)

- [ ] Delta sets: track newly derived tuples per iteration
- [ ] Semi-naive rule: only join against delta tuples from last iteration, not full relation
- [ ] Significant speedup for recursive rules: avoids re-deriving known tuples
- [ ] Tests: verify semi-naive produces same results as naive; benchmark on large ancestor chain

### Phase 6: magic sets (goal-directed bottom-up, opt-in)

Naive bottom-up derives **all** consequences before answering. Magic sets rewrite the program so the fixpoint only derives tuples relevant to the goal, a major perf win for "what's reachable from node X" queries on large graphs.

- [ ] Adornments: annotate rule predicates with bound (`b`) / free (`f`) patterns based on how they're called.
- [ ] Magic transformation: for each adorned predicate, generate a `magic_` relation and rewrite rule bodies to filter through it.
- [ ] Sideways information passing strategy (SIPS): left-to-right by default; pluggable.
- [ ] Optional pass: `(dl-set-strategy! db :magic)`; default semi-naive.
- [ ] Tests: equivalence vs naive on small inputs; perf win on a 10k-node reachability query from a single root.

### Phase 7: stratified negation

- [ ] Dependency graph analysis: which relations depend on which (positively or negatively)
- [ ] Stratification check: error if negation is in a cycle (non-stratifiable program)
- [ ] `dl-stratify db` → SCC analysis → stratum ordering
- [ ] Evaluation: process strata in order, so a lower stratum is fully computed before its complement is used in a higher stratum
- [ ] `not(P)` in rule body: at evaluation time, check that P is NOT among the facts (EDB plus derived tuples) of the already-completed lower strata
- [ ] Safety extension: variables in negative literals must also appear in some positive body literal of the same rule
- [ ] Tests: non-member (`not(member(X,L))`), colored-graph (`not(same-color(X,Y))`), stratification error detection

### Phase 8: aggregation (Datalog+)

- [ ] `count(X, Goal)` → number of distinct X satisfying Goal
- [ ] `sum(X, Goal)` → sum of X values satisfying Goal
- [ ] `min(X, Goal)` / `max(X, Goal)` → min/max of X satisfying Goal
- [ ] `group-by` semantics: `count(X, sibling(bob, X))` → count of bob's siblings
- [ ] Aggregation is non-monotonic, so it cannot be folded into the fixpoint iteration: evaluate it in a separate post-fixpoint pass
- [ ] Tests: social network statistics, grade aggregation, inventory sums

### Phase 9: SX embedding API

- [ ] `(dl-program facts rules)` → database from SX data directly (no parsing required)
  ```
  (dl-program
    '((parent tom bob) (parent tom liz) (parent bob ann))
    '((ancestor X Z :- (parent X Y) (ancestor Y Z))
      (ancestor X Y :- (parent X Y))))
  ```
- [ ] `(dl-query db '(ancestor tom ?X))` → `((ann) (bob) (liz))` for the program above
- [ ] `(dl-assert! db '(parent ann pat))` → incremental fact addition + re-derive
- [ ] `(dl-retract! db '(parent tom bob))` → fact removal + re-derive from scratch
- [ ] Integration demo: federation graph query, `(ancestor actor1 actor2)` over rose-ash ActivityPub follow relationships

### Phase 10: Datalog as a query language for rose-ash

- [ ] Schema: map SQLAlchemy model relationships to Datalog EDB facts (e.g. `(follows user1 user2)`, `(authored user post)`, `(tagged post tag)`)
- [ ] Loader: `dl-load-from-db!` queries PostgreSQL and populates the Datalog EDB
- [ ] Query examples:
  - `?- ancestor(me, X), authored(X, Post), tagged(Post, cooking).` → posts about cooking by people I follow (transitively)
  - `popular(Post) :- tagged(Post, T), count(L, (liked(L, Post))) >= 10.` with `?- popular(Post).` → posts with 10+ likes
- [ ] Expose as a rose-ash service endpoint: `POST /internal/datalog` with program + query

## Blockers

_(none yet)_

## Progress log

_Newest first._

- 2026-05-07: Phase 3 done. `lib/datalog/db.sx` (~250 LOC) holds facts indexed by relation name plus the rules list, with `dl-add-fact!` / `dl-add-rule!` (rejects non-ground facts and unsafe rules); `lib/datalog/eval.sx` (~150 LOC) implements the naive bottom-up fixpoint via `dl-find-bindings`/`dl-match-positive`/`dl-saturate!` and `dl-query` (deduped projected substitutions). Safety analysis rejects unsafe head vars at load time. Negation and arithmetic built-ins raise clean errors (lifted in later phases). 15 eval tests cover transitive closure, sibling, same-generation, cyclic graph reach, and six safety violations. Conformance 87 / 87.
- 2026-05-07: Phase 2 done. `lib/datalog/unify.sx` (~140 LOC): `dl-var?` (case + underscore), `dl-walk`, `dl-bind`, `dl-unify` (returns extended dict subst or `nil`), `dl-apply-subst`, `dl-ground?`, `dl-vars-of`. Substitutions are immutable dicts; `assoc` builds extended copies. 28 unify tests; conformance now 72 / 72.
- 2026-05-07: Phase 1 done. `lib/datalog/tokenizer.sx` (~190 LOC) emits `{:type :value :pos}` tokens; `lib/datalog/parser.sx` (~150 LOC) produces `{:head … :body …}` / `{:query …}` clauses, with nested compounds permitted for arithmetic and `not(...)` desugared to `{:neg …}`. 44 / 44 via `bash lib/datalog/conformance.sh` (26 tokenize + 18 parse). Local helpers namespace-prefixed (`dl-emit!`, `dl-peek`) after a host-primitive shadow clash. Test harness uses a custom `dl-deep-equal?` that handles out-of-order dict keys and number repr (`equal?` fails on dict key order and on `30` vs `30.0`).