rose-ash/plans/search-on-sx.md
giles c3a0727645
plans: five rose-ash subsystem plans + three loop briefings
Plans for acl-on-sx (Datalog), flow-on-sx (Scheme), feed-on-sx (APL),
mod-on-sx (Prolog), search-on-sx (Haskell). Each is a 4-phase queue
sitting on its respective guest language, targeting rose-ash needs:
access control, durable workflows, activity feeds, moderation, search.
Federation extension in Phase 4 of each (plugs into fed-sx).

Briefings for the three loops we're kicking off now: acl-loop,
flow-loop, feed-loop. mod-sx and search-sx briefings will follow
once the first three have surfaced any shared infrastructure
worth extracting to lib/guest/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-06 15:55:39 +00:00

# search-on-sx: Full-text + structured search on Haskell
rose-ash needs search across pages, posts, threads, and federated content: tokenize,
index, query, rank, and filter by visibility. Typed ADTs make query parsing clean,
lazy lists make posting-list iteration efficient, and Haskell-on-SX is at 1514/1514.
End-state: a Haskell-on-SX layer with inverted index, query AST, boolean +
phrase + ranked queries (TF-IDF, BM25), ACL-aware post-filter, and a federation
extension that merges per-peer indices.
## Status (rolling)
`bash lib/search/conformance.sh` → **0/0** (not yet started)
## Ground rules
- **Scope:** only touch `lib/search/**` and `plans/search-on-sx.md`. Do **not** edit
`spec/`, `hosts/`, `shared/`, `lib/haskell/**`, or other `lib/<lang>/`. You may
**import** from `lib/haskell/` (public API in `lib/haskell/haskell.sx`); do **not**
modify Haskell.
- **Shared-file issues** go under "Blockers" with a minimal repro; do not fix here.
- **SX files:** use `sx-tree` MCP tools only.
- **Architecture:** index = `Map Term [(DocId, [Pos])]`. Query AST = ADT. Eval =
fold of posting lists with set ops + ranking math. Ranking is pure (no IO until
result emission).
- **Commits:** one feature per commit. Keep Progress log updated and tick boxes.
## Architecture sketch
```
Document                              Query
{:id :text :tags}                     "alice AND bob OR phrase \"x y\""
        │                                     │
        ▼                                     ▼
lib/search/tokenize.sx                lib/search/parse.sx
 — tokenize :: Text → [Term]           — parse :: Text → Query
 — normalize (lowercase, strip)        — Query = Term | And | Or
 — (optionally) stem                           | Not | Phrase
        │                                     │
        ▼                                     ▼
lib/search/index.sx                   lib/search/eval.sx
 — Map Term [(DocId, [Pos])]           — eval :: Index → Query → [DocId]
 — insert / delete / lookup            — boolean + phrase positions
 — persistence (optional later)               │
        │                                     ▼
        └────────────────►            lib/search/rank.sx
                                       — TF-IDF / BM25 scoring
                                       — top-N

lib/search/api.sx
 — (search/index doc)
 — (search/query q)
 — (search/top n q)

lib/search/fed.sx
 — federated query (merge peer results)
 — ACL filter post-merge
```
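
The `parse :: Text → Query` box in the sketch could be approached as a small recursive-descent parser. The block below is a minimal sketch in plain Haskell, not the planned `lib/search/parse.sx`. It assumes `String` instead of `Text`, upper-case `AND` / `OR` / `NOT` keywords, `NOT` binding tighter than `AND`, `AND` binding tighter than `OR`, and no parentheses or implicit adjacency. The names `Tok`, `lexQuery`, `parseQuery`, and `chain` are hypothetical.

```haskell
import Data.Char (isSpace, toLower)

-- Query ADT as listed in the plan, with String standing in for Text.
data Query
  = Term String
  | And Query Query
  | Or Query Query
  | Not Query
  | Phrase [String]
  deriving (Show, Eq)

-- Hypothetical token type for the lexer.
data Tok = TAnd | TOr | TNot | TWord String | TPhrase [String]
  deriving (Show, Eq)

-- | Split a query string into tokens. Quoted text becomes one phrase token;
-- an unterminated quote is treated leniently as running to end of input.
lexQuery :: String -> [Tok]
lexQuery s = case dropWhile isSpace s of
  ""         -> []
  '"' : rest -> let (body, after) = break (== '"') rest
                in TPhrase (words (map toLower body)) : lexQuery (drop 1 after)
  s'         -> let (w, after) = break (\c -> isSpace c || c == '"') s'
                in keyword w : lexQuery after
  where
    keyword "AND" = TAnd
    keyword "OR"  = TOr
    keyword "NOT" = TNot
    keyword w     = TWord (map toLower w)

-- | Parse a whole query; Nothing on a syntax error or leftover tokens.
parseQuery :: String -> Maybe Query
parseQuery s = case parseOr (lexQuery s) of
  Just (q, []) -> Just q
  _            -> Nothing

parseOr, parseAnd, parseNot :: [Tok] -> Maybe (Query, [Tok])
parseOr ts  = parseAnd ts >>= uncurry (chain TOr Or parseAnd)
parseAnd ts = parseNot ts >>= uncurry (chain TAnd And parseNot)

parseNot (TNot : ts)       = fmap (\(q, rest) -> (Not q, rest)) (parseNot ts)
parseNot (TWord w : ts)    = Just (Term w, ts)
parseNot (TPhrase ws : ts) = Just (Phrase ws, ts)
parseNot _                 = Nothing

-- | Left-associative chain: keep applying `op` while the next token is `sep`.
chain :: Tok -> (Query -> Query -> Query) -> ([Tok] -> Maybe (Query, [Tok]))
      -> Query -> [Tok] -> Maybe (Query, [Tok])
chain sep op next lhs (t : ts)
  | t == sep = next ts >>= \(rhs, rest) -> chain sep op next (op lhs rhs) rest
chain _ _ _ lhs ts = Just (lhs, ts)
```

Under these assumptions, `parseQuery "alice AND bob OR \"x y\""` groups as `Or (And (Term "alice") (Term "bob")) (Phrase ["x","y"])`. The real precedence and keyword rules are a design decision for Phase 2.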
## Phase 1 — Tokenize + index
- [ ] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on
whitespace, return positions
- [ ] `lib/search/index.sx` — inverted index data structure (typed `Map` from
haskell lib); `insert`, `delete`, `lookup`
- [ ] `lib/search/api.sx` — `(search/index doc)`, `(search/lookup term)`
- [ ] `lib/search/tests/index.sx` — 15+ cases: tokenize, insert + lookup, update,
delete, multi-doc
- [ ] `lib/search/scoreboard.{json,md}`
- [ ] `lib/search/conformance.sh`
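
As a rough illustration of the Phase 1 items, the tokenizer and inverted index might look like the following in plain Haskell. This is a sketch under stated assumptions, not the planned SX implementation: `Term` is a `String` rather than `Text`, `DocId` is an `Int`, the map comes from the `containers` package rather than the Haskell-on-SX library, and `lookupTerm` is a hypothetical name chosen to avoid clashing with `Prelude.lookup`.

```haskell
import           Data.Char       (isAlphaNum, toLower)
import qualified Data.Map.Strict as Map

type Term  = String   -- assumption: the real layer may use Text
type DocId = Int      -- assumption: integer document ids
type Pos   = Int
type Index = Map.Map Term [(DocId, [Pos])]

-- | Split on whitespace, lowercase, drop non-alphanumeric characters, and
-- pair each surviving token with its 0-based position.
tokenize :: String -> [(Term, Pos)]
tokenize txt = zip (filter (not . null) (map normalize (words txt))) [0 ..]
  where normalize = map toLower . filter isAlphaNum

-- | Add one document. Re-inserting an existing id replaces its old postings,
-- which gives "update" for free. Postings are kept in insertion order.
insert :: DocId -> String -> Index -> Index
insert doc txt idx = Map.unionWith (++) (delete doc idx) newPostings
  where
    -- Group positions per term, keeping them in ascending order.
    byTerm      = Map.fromListWith (flip (++)) [(t, [p]) | (t, p) <- tokenize txt]
    newPostings = Map.map (\ps -> [(doc, ps)]) byTerm

-- | Remove every posting for a document; terms left with no postings are dropped.
delete :: DocId -> Index -> Index
delete doc = Map.mapMaybe prune
  where
    prune ps = case filter ((/= doc) . fst) ps of
      []  -> Nothing
      ps' -> Just ps'

-- | Postings for a term, or the empty list when the term is unknown.
lookupTerm :: Term -> Index -> [(DocId, [Pos])]
lookupTerm = Map.findWithDefault []
```

For example, after `insert 1 "Alice, alice and Bob!" Map.empty`, looking up `"alice"` yields one posting for document 1 with positions 0 and 1.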
## Phase 2 — Query AST + boolean evaluation
- [ ] Query ADT: `Term Text | And Query Query | Or Query Query | Not Query |
Phrase [Text]`
- [ ] `lib/search/parse.sx` — query syntax parser (boolean operators, quoted phrases)
- [ ] `lib/search/eval.sx` — boolean eval via set ops on posting lists
- [ ] phrase eval — adjacency check using positions
- [ ] `lib/search/tests/boolean.sx` — 25+ cases: term, and, or, not, phrase,
composition, parser edge cases
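
One way to read "boolean eval via set ops" and "adjacency check using positions" is sketched below in plain Haskell. It is illustrative only. It assumes `String` terms, `Data.Set` for the set operations, and that `Not` is evaluated against the set of documents that appear anywhere in the index. `docsFor`, `allDocs`, and `phraseDocs` are hypothetical helper names.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set        as Set

type Term  = String
type DocId = Int
type Pos   = Int
type Index = Map.Map Term [(DocId, [Pos])]

data Query
  = Term Term
  | And Query Query
  | Or Query Query
  | Not Query
  | Phrase [Term]
  deriving (Show, Eq)

-- | Documents whose posting list contains the term.
docsFor :: Index -> Term -> Set.Set DocId
docsFor idx t = Set.fromList (map fst (Map.findWithDefault [] t idx))

-- | Every document with at least one posting; the universe for Not.
allDocs :: Index -> Set.Set DocId
allDocs idx = Set.fromList [d | ps <- Map.elems idx, (d, _) <- ps]

-- | Evaluate a query to an ascending list of matching document ids.
eval :: Index -> Query -> [DocId]
eval idx = Set.toAscList . go
  where
    go (Term t)    = docsFor idx t
    go (And a b)   = go a `Set.intersection` go b
    go (Or a b)    = go a `Set.union` go b
    go (Not a)     = allDocs idx `Set.difference` go a
    go (Phrase ts) = phraseDocs idx ts

-- | A document matches a phrase when some start position p holds term i at p + i.
phraseDocs :: Index -> [Term] -> Set.Set DocId
phraseDocs _ [] = Set.empty
phraseDocs idx ts =
  Set.fromList [d | (d, starts) <- Map.toList aligned, not (Set.null starts)]
  where
    -- Shift the positions of term i back by i, so aligned terms share a start.
    shifted i t = Map.fromList
      [ (d, Set.fromList (map (subtract i) ps))
      | (d, ps) <- Map.findWithDefault [] t idx ]
    aligned = foldr1 (Map.intersectionWith Set.intersection)
                     (zipWith shifted [0 ..] ts)
```

Shifting positions turns the adjacency check into a plain set intersection per document, which keeps phrase evaluation in the same style as the boolean operators.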
## Phase 3 — Ranking
- [ ] document frequency tracking — extend index with `df` per term
- [ ] TF-IDF scoring
- [ ] BM25 scoring (configurable k1, b)
- [ ] top-N retrieval (heap-based)
- [ ] `lib/search/tests/rank.sx` — 20+ cases: TF-IDF behavior, BM25 vs TF-IDF,
ranking stability, top-N correctness
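
The scoring items above could be prototyped as pure functions over the index, as in this sketch. Several things here are assumptions rather than part of the plan: document lengths live in a hypothetical side table `DocLens`, `df` is derived from posting-list length instead of being stored, IDF for BM25 uses the smoothed form `ln(1 + (N - df + 0.5) / (df + 0.5))`, the defaults `k1 = 1.2` and `b = 0.75` are conventional values, and `topN` uses sort-and-take rather than the heap the plan calls for.

```haskell
import           Data.List       (sortBy)
import qualified Data.Map.Strict as Map
import           Data.Ord        (Down (..), comparing)

type Term    = String
type DocId   = Int
type Pos     = Int
type Index   = Map.Map Term [(DocId, [Pos])]
type DocLens = Map.Map DocId Int   -- hypothetical: token count per document

data BM25 = BM25 { bmK1 :: Double, bmB :: Double }

defaultBM25 :: BM25
defaultBM25 = BM25 { bmK1 = 1.2, bmB = 0.75 }

-- | Sum a per-posting weight (given df, tf, doc) over all query terms.
scoreWith :: (Int -> Int -> DocId -> Double) -> Index -> [Term] -> Map.Map DocId Double
scoreWith weight idx terms = Map.fromListWith (+)
  [ (d, weight (length postings) (length pos) d)
  | t <- terms
  , let postings = Map.findWithDefault [] t idx
  , (d, pos) <- postings ]

-- | Plain TF-IDF: tf * ln(N / df), where n is the total document count.
tfidf :: Int -> Index -> [Term] -> Map.Map DocId Double
tfidf n = scoreWith weight
  where weight df tf _ = fromIntegral tf * log (fromIntegral n / fromIntegral df)

-- | BM25 with configurable k1 and b, normalising by document length.
bm25 :: BM25 -> DocLens -> Index -> [Term] -> Map.Map DocId Double
bm25 cfg lens = scoreWith weight
  where
    n      = Map.size lens
    avgLen = fromIntegral (sum (Map.elems lens)) / fromIntegral (max 1 n) :: Double
    weight df tf d =
      let tf'  = fromIntegral tf
          len  = fromIntegral (Map.findWithDefault 0 d lens)
          idf  = log (1 + (fromIntegral (n - df) + 0.5) / (fromIntegral df + 0.5))
          norm = 1 - bmB cfg + bmB cfg * len / avgLen
      in idf * tf' * (bmK1 cfg + 1) / (tf' + bmK1 cfg * norm)

-- | Highest scores first; ties broken by ascending doc id so ranking is stable.
topN :: Int -> Map.Map DocId Double -> [(DocId, Double)]
topN k = take k . sortBy (comparing (Down . snd) <> comparing fst) . Map.toList
```

Keeping both scorers behind `scoreWith` makes the "BM25 vs TF-IDF" test cases a matter of swapping one function, and the deterministic tie-break gives the ranking-stability cases something concrete to assert.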
## Phase 4 — ACL filter + federation
- [ ] post-filter — each candidate result tested via `(acl/permit? viewer :read doc)`
- [ ] federated query — fan out to peer instances via fed-sx, merge results
- [ ] merge policy — interleave by rank, dedupe by `(peer, doc-id)`
- [ ] `lib/search/tests/integration.sx` — federated search with ACL filter
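
The merge policy and post-filter can be sketched as two small pure functions. This is an illustration under assumptions, not the fed-sx or acl-on-sx interface: `Hit` is a hypothetical result record, "interleave by rank" is read as ordering all hits by score (which presumes peer scores are comparable), and the ACL check is passed in as a plain predicate standing in for `(acl/permit? viewer :read doc)`.

```haskell
import           Data.List (sortBy)
import           Data.Ord  (Down (..), comparing)
import qualified Data.Set  as Set

type Peer  = String
type DocId = Int

-- Hypothetical shape of one search result returned by a peer.
data Hit = Hit { hitPeer :: Peer, hitDoc :: DocId, hitScore :: Double }
  deriving (Show, Eq)

-- | Merge per-peer result lists: highest score first, keeping only the
-- best-scoring occurrence of each (peer, doc-id) pair.
mergeHits :: [[Hit]] -> [Hit]
mergeHits = dedupe Set.empty . sortBy (comparing (Down . hitScore)) . concat
  where
    dedupe _ [] = []
    dedupe seen (h : hs)
      | key `Set.member` seen = dedupe seen hs
      | otherwise             = h : dedupe (Set.insert key seen) hs
      where key = (hitPeer h, hitDoc h)

-- | Post-filter: keep only hits the viewer is permitted to read.
-- `permit` is a stand-in for the ACL check, supplied by the caller.
aclFilter :: (viewer -> Hit -> Bool) -> viewer -> [Hit] -> [Hit]
aclFilter permit viewer = filter (permit viewer)

-- | Merge first, then filter, matching "ACL filter post-merge" in the sketch.
federatedSearch :: (viewer -> Hit -> Bool) -> viewer -> [[Hit]] -> [Hit]
federatedSearch permit viewer = aclFilter permit viewer . mergeHits
```

Filtering after the merge means a top-N cut must be applied after `aclFilter`, otherwise hidden documents could consume result slots.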
## Progress log
(loop fills this in)
## Blockers
(loop fills this in)