Files
rose-ash/plans/datalog-on-sx.md
giles 8046df7ce5
Some checks failed
Test, Build, and Deploy / test-build-deploy (push) Failing after 53s
datalog: unification + substitution + 28 tests (Phase 2, 72/72)
dl-unify returns immutable extended subst dict (or nil). dl-walk
chases bindings, dl-apply-subst recursively resolves vars. Lists
unify element-wise so arithmetic compounds work too. dl-ground? and
dl-vars-of for safety analysis (Phase 3).
2026-05-07 23:34:35 +00:00

167 lines
8.9 KiB
Markdown

# Datalog-on-SX: Datalog on the CEK/VM
Datalog is a declarative query language: a restricted subset of Prolog with no function
symbols, only relations. Programs are sets of facts and rules; queries ask what follows.
Evaluation is bottom-up (fixpoint iteration) rather than Prolog's top-down DFS — which
means no infinite loops, guaranteed termination, and efficient incremental updates.
The unique angle: Datalog is a natural companion to the Prolog implementation already in
progress (`lib/prolog/`). The parser and term representation can share infrastructure;
the evaluator is an entirely different fixpoint engine rather than a DFS solver.
End-state goal: **full core Datalog** (facts, rules, stratified negation, aggregation,
recursion) with a clean SX query API, and a demonstration of Datalog as a query engine
for rose-ash data (e.g. federation graph, content relationships).
## Ground rules
- **Scope:** only touch `lib/datalog/**` and `plans/datalog-on-sx.md`. Do **not** edit
`spec/`, `hosts/`, `shared/`, `lib/prolog/**`, or other `lib/<lang>/`.
- **Shared-file issues** go under "Blockers" below with a minimal repro; do not fix here.
- **SX files:** use `sx-tree` MCP tools only.
- **Architecture:** Datalog source → term AST → fixpoint evaluator. No transpiler to SX AST —
the evaluator is written in SX and works directly on term structures.
- **Reference:** Ramakrishnan & Ullman "A Survey of Deductive Database Systems";
Dalmau "Datalog and Constraint Satisfaction".
- **Commits:** one feature per commit. Keep `## Progress log` updated and tick boxes.
## Architecture sketch
```
Datalog source text
lib/datalog/tokenizer.sx — atoms, variables, numbers, strings, punct (?- :- , . ( ) [ ])
lib/datalog/parser.sx — facts: atom(args). rules: head :- body. queries: ?- goal.
│ No function symbols (only constants and variables in args).
lib/datalog/db.sx — extensional DB (EDB): ground facts; IDB: derived relations;
│ clause index by relation name/arity
lib/datalog/eval.sx — bottom-up fixpoint: semi-naive evaluation with delta sets;
│ stratification for negation; incremental update API
lib/datalog/query.sx — query API: (datalog-query db goal) → list of substitutions;
SX embedding: define facts/rules as SX data directly
```
Key differences from Prolog:
- **No function symbols** — args are atoms, numbers, strings, or variables only. No `f(a,b)`.
- **No cuts** — no procedural control.
- **Bottom-up** — derive all consequences of all rules before answering; no search tree.
- **Termination guaranteed** — no infinite derivation chains (no function symbols → finite Herbrand base).
- **Stratified negation** — `not(P)` legal iff P does not recursively depend on its own negation.
- **Aggregation** — `count`, `sum`, `min`, `max` over derived tuples (Datalog+).
## Roadmap
### Phase 1 — tokenizer + parser
- [x] Tokenizer: atoms (lowercase/quoted), variables (uppercase/`_`), numbers, strings,
punct (`( )`, `,`, `.`), operators (`:-`, `?-`, `<=`, `>=`, `!=`, `<`, `>`, `=`,
`+`, `-`, `*`, `/`), comments (`%`, `/* */`)
Note: no function symbol syntax (no nested `f(...)` in arg position) — but the
parser permits nested compounds for arithmetic; safety analysis (Phase 3) rejects
non-arithmetic nesting.
- [x] Parser:
- Facts: `parent(tom, bob).``{:head (parent tom bob) :body ()}`
- Rules: `ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z).`
`{:head (ancestor X Z) :body ((parent X Y) (ancestor Y Z))}`
- Queries: `?- ancestor(tom, X).``{:query ((ancestor tom X))}`
(`:query` value is always a list of literals; `?- p, q.``{:query ((p) (q))}`)
- Negation: `not(parent(X,Y))` in body position → `{:neg (parent X Y)}`
- [x] Tests in `lib/datalog/tests/parse.sx` (18) and `lib/datalog/tests/tokenize.sx` (26).
Conformance harness: `bash lib/datalog/conformance.sh` → 44 / 44 passing.
### Phase 2 — unification + substitution
- [x] Ported (not shared) from `lib/prolog/` — term walk, no occurs check.
- [x] `dl-unify t1 t2 subst` → extended subst dict, or `nil` on failure.
- [x] `dl-walk`, `dl-bind`, `dl-apply-subst`, `dl-ground?`, `dl-vars-of`.
- [x] Substitutions are immutable dicts keyed by variable name (string).
Lists/tuples unify element-wise (used for arithmetic compounds too).
- [x] Tests in `lib/datalog/tests/unify.sx` (28). 72 / 72 conformance.
### Phase 3 — extensional DB + naive evaluation
- [ ] EDB: `{:relation-name → set-of-ground-tuples}` using SX sets (Phase 18 of primitives)
- [ ] `dl-add-fact!` `db` `relation` `args` → add ground tuple
- [ ] `dl-add-rule!` `db` `head` `body` → add rule clause
- [ ] Naive evaluation: iterate rules until fixpoint
For each rule, for each combination of body tuples that unify, derive head tuple.
Repeat until no new tuples added.
- [ ] `dl-query` `db` `goal` → list of substitutions satisfying goal against derived DB
- [ ] Tests: transitive closure (ancestor), sibling, same-generation — classic Datalog programs
### Phase 4 — semi-naive evaluation (performance)
- [ ] Delta sets: track newly derived tuples per iteration
- [ ] Semi-naive rule: only join against delta tuples from last iteration, not full relation
- [ ] Significant speedup for recursive rules — avoids re-deriving known tuples
- [ ] `dl-stratify` `db` → dependency graph + SCC analysis → stratum ordering
- [ ] Tests: verify semi-naive produces same results as naive; benchmark on large ancestor chain
### Phase 5 — stratified negation
- [ ] Dependency graph analysis: which relations depend on which (positively or negatively)
- [ ] Stratification check: error if negation is in a cycle (non-stratifiable program)
- [ ] Evaluation: process strata in order — lower stratum fully computed before using its
complement in a higher stratum
- [ ] `not(P)` in rule body: at evaluation time, check P is NOT in the derived EDB
- [ ] Tests: non-member (`not(member(X,L))`), colored-graph (`not(same-color(X,Y))`),
stratification error detection
### Phase 6 — aggregation (Datalog+)
- [ ] `count(X, Goal)` → number of distinct X satisfying Goal
- [ ] `sum(X, Goal)` → sum of X values satisfying Goal
- [ ] `min(X, Goal)` / `max(X, Goal)` → min/max of X satisfying Goal
- [ ] `group-by` semantics: `count(X, sibling(bob, X))` → count of bob's siblings
- [ ] Aggregation breaks stratification — evaluate in a separate post-fixpoint pass
- [ ] Tests: social network statistics, grade aggregation, inventory sums
### Phase 7 — SX embedding API
- [ ] `(dl-program facts rules)` → database from SX data directly (no parsing required)
```
(dl-program
'((parent tom bob) (parent tom liz) (parent bob ann))
'((ancestor X Z :- (parent X Y) (ancestor Y Z))
(ancestor X Y :- (parent X Y))))
```
- [ ] `(dl-query db '(ancestor tom ?X))` → `((ann) (bob) (liz) (pat))`
- [ ] `(dl-assert! db '(parent ann pat))` → incremental fact addition + re-derive
- [ ] `(dl-retract! db '(parent tom bob))` → fact removal + re-derive from scratch
- [ ] Integration demo: federation graph query — `(ancestor actor1 actor2)` over
rose-ash ActivityPub follow relationships
### Phase 8 — Datalog as a query language for rose-ash
- [ ] Schema: map SQLAlchemy model relationships to Datalog EDB facts
(e.g. `(follows user1 user2)`, `(authored user post)`, `(tagged post tag)`)
- [ ] Loader: `dl-load-from-db!` — query PostgreSQL, populate Datalog EDB
- [ ] Query examples:
- `?- ancestor(me, X), authored(X, Post), tagged(Post, cooking).`
→ posts about cooking by people I follow (transitively)
- `?- popular(Post) :- tagged(Post, T), count(L, (liked(L, Post))) >= 10.`
→ posts with 10+ likes
- [ ] Expose as a rose-ash service endpoint: `POST /internal/datalog` with program + query
## Blockers
_(none yet)_
## Progress log
_Newest first._
- 2026-05-07 — Phase 2 done. `lib/datalog/unify.sx` (~140 LOC):
`dl-var?` (case + underscore), `dl-walk`, `dl-bind`, `dl-unify` (returns
extended dict subst or `nil`), `dl-apply-subst`, `dl-ground?`, `dl-vars-of`.
Substitutions are immutable dicts; `assoc` builds extended copies. 28
unify tests; conformance now 72 / 72.
- 2026-05-07 — Phase 1 done. `lib/datalog/tokenizer.sx` (~190 LOC) emits
`{:type :value :pos}` tokens; `lib/datalog/parser.sx` (~150 LOC) produces
`{:head … :body …}` / `{:query …}` clauses, with nested compounds
permitted for arithmetic and `not(...)` desugared to `{:neg …}`. 44 / 44
via `bash lib/datalog/conformance.sh` (26 tokenize + 18 parse). Local
helpers namespace-prefixed (`dl-emit!`, `dl-peek`) after a host-primitive
shadow clash. Test harness uses a custom `dl-deep-equal?` that handles
out-of-order dict keys and number repr (`equal?` fails on dict key order
and on `30` vs `30.0`).