# search-on-sx: Full-text + structured search on Haskell

rose-ash needs search across pages, posts, threads, federated content. Tokenize, index, query, rank, filter by visibility. Typed ADTs make query parsing clean, lazy lists make posting-list iteration efficient, and Haskell-on-SX is at 1514/1514.

End-state: a Haskell-on-SX layer with inverted index, query AST, boolean + phrase + ranked queries (TF-IDF, BM25), ACL-aware post-filter, and a federation extension that merges per-peer indices.

## Status (rolling)

`bash lib/search/conformance.sh` → **78/78** (Phases 1–2 complete)

## Ground rules

- **Scope:** only touch `lib/search/**` and `plans/search-on-sx.md`. Do **not** edit `spec/`, `hosts/`, `shared/`, `lib/haskell/**`, or other `lib//`. You may **import** from `lib/haskell/` (public API in `lib/haskell/haskell.sx`); do **not** modify Haskell.
- **Shared-file issues** go under "Blockers" with a minimal repro; do not fix here.
- **SX files:** use `sx-tree` MCP tools only.
- **Architecture:** index = `Map Term [(DocId, [Pos])]`. Query AST = ADT. Eval = fold of posting lists with set ops + ranking math. Ranking is pure (no IO until result emission).
- **Commits:** one feature per commit. Keep Progress log updated and tick boxes.

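
The Architecture rule above (query AST as an ADT, eval as a fold over posting lists with set ops) can be sketched in plain Haskell as follows. This is a minimal illustration, not the code in `lib/search/query.sx`: the names `sortedUnion`, `sortedInter`, `sortedDiff`, `allDocs`, `lookupTerm` and `phraseInDoc` appear in this plan, but their signatures here, the `eval` body, the `Int` doc ids, and the `demo` index are assumptions. It uses the sorted assoc-list index shape rather than `Data.Map`.

```haskell
type Term     = String
type DocId    = Int                      -- assumption: doc ids are Ints
type Pos      = Int
type Postings = [(DocId, [Pos])]         -- sorted by DocId
type Index    = [(Term, Postings)]       -- sorted by Term

data Query
  = Term String
  | And Query Query
  | Or Query Query
  | Not Query
  | Phrase [String]

-- Linear merges over docid-sorted lists.
sortedUnion, sortedInter, sortedDiff :: [DocId] -> [DocId] -> [DocId]
sortedUnion [] ys = ys
sortedUnion xs [] = xs
sortedUnion (x:xs) (y:ys)
  | x < y     = x : sortedUnion xs (y:ys)
  | x > y     = y : sortedUnion (x:xs) ys
  | otherwise = x : sortedUnion xs ys

sortedInter (x:xs) (y:ys)
  | x < y     = sortedInter xs (y:ys)
  | x > y     = sortedInter (x:xs) ys
  | otherwise = x : sortedInter xs ys
sortedInter _ _ = []

sortedDiff [] _  = []
sortedDiff xs [] = xs
sortedDiff (x:xs) (y:ys)
  | x < y     = x : sortedDiff xs (y:ys)
  | x > y     = sortedDiff (x:xs) ys
  | otherwise = sortedDiff xs ys

lookupTerm :: Term -> Index -> Postings
lookupTerm t ix = maybe [] id (lookup t ix)

-- Universe of doc ids, used as the base set for Not.
allDocs :: Index -> [DocId]
allDocs = foldr (sortedUnion . map fst . snd) []

-- Given the position lists of each phrase word (in phrase order) for one
-- document, check whether the words occur at consecutive positions.
phraseInDoc :: [[Pos]] -> Bool
phraseInDoc []          = False
phraseInDoc (ps : rest) = any startsAt ps
  where startsAt p = and [ (p + i) `elem` qs | (i, qs) <- zip [1 ..] rest ]

eval :: Index -> Query -> [DocId]
eval ix (Term t)    = map fst (lookupTerm t ix)
eval ix (And a b)   = sortedInter (eval ix a) (eval ix b)
eval ix (Or a b)    = sortedUnion (eval ix a) (eval ix b)
eval ix (Not q)     = sortedDiff (allDocs ix) (eval ix q)
eval ix (Phrase ws) =
  [ d | d <- candidates, phraseInDoc [ positions w d | w <- ws ] ]
  where
    candidates = case ws of
      [] -> []
      _  -> foldr1 sortedInter [ map fst (lookupTerm w ix) | w <- ws ]
    positions w d = maybe [] id (lookup d (lookupTerm w ix))

-- Hypothetical index: doc 1 = "alice bob", doc 2 = "carol alice", doc 3 = "bob".
demo :: Index
demo =
  [ ("alice", [(1, [0]), (2, [1])])
  , ("bob",   [(1, [1]), (3, [0])])
  , ("carol", [(2, [0])])
  ]
```

With this `demo` index, `eval demo (And (Term "alice") (Term "bob"))` yields `[1]` and `eval demo (Not (Term "alice"))` yields `[3]`. Whether every Prelude function used here is bound in haskell-on-sx has not been checked.
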
## Architecture sketch

```
Document                          Query
{:id :text :tags}                 "alice AND bob OR phrase \"x y\""
     │                                 │
     ▼                                 ▼
lib/search/tokenize.sx            lib/search/parse.sx
— tokenize :: Text → [Term]       — parse :: Text → Query
— normalize (lowercase, strip)    — Query = Term | And | Or
— (optionally) stem                       | Not | Phrase
     │                                 │
     ▼                                 ▼
lib/search/index.sx               lib/search/eval.sx
— Map Term [(DocId, [Pos])]       — eval :: Index → Query → [DocId]
— insert / delete / lookup        — boolean + phrase positions
— persistence (optional later)
     │                                 │
                                       ▼
     └────────────────►           lib/search/rank.sx
                                  — TF-IDF / BM25 scoring
                                  — top-N
                                       │
                                       ▼
                                  lib/search/api.sx
                                  — (search/index doc)
                                  — (search/query q)
                                  — (search/top n q)
                                       │
                                       ▼
                                  lib/search/fed.sx
                                  — federated query (merge peer results)
                                  — ACL filter post-merge
```

## Phase 1 — Tokenize + index

- [x] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on whitespace, return positions
- [x] `lib/search/index.sx` — inverted index data structure; `indexDoc`, `deleteDoc`, `lookupTerm`, `docFreq`, `allTerms`. (Data.Map's public API lacks toList/keys/map/filter, so a sorted assoc-list `[(Term,[(DocId,[Pos])])]` is used — the conceptual `Map Term [(DocId,[Pos])]` with free term iteration.)
- [x] `lib/search/api.sx` — assembles `search/src` (tokenize + index); Haskell entry points `indexDoc` / `lookupTerm`
- [x] `lib/search/tests/index.sx` — 18 cases: tokenize, insert + lookup, update, delete, multi-doc, positions, docFreq, allTerms
- [x] `lib/search/scoreboard.{json,md}`
- [x] `lib/search/conformance.sh`

## Phase 2 — Query AST + boolean evaluation

- [x] Query ADT: `Term String | And Query Query | Or Query Query | Not Query | Phrase [String]` (in `lib/search/query.sx`)
- [x] `lib/search/parse.sx` — query syntax parser: tokenizer + recursive-descent (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens, case-insensitive keywords); `parseQuery`, `searchQuery`, `showQ`
- [x] `lib/search/query.sx` — boolean eval via set ops on docid-sorted posting lists (sortedUnion/Inter/Diff, Not over allDocs universe)
- [x] phrase eval — positional adjacency check (phraseInDoc / phraseStartsAt)
- [x] `lib/search/tests/boolean.sx` — 28 cases: term, and, or, not, phrase, composition (parser edge cases move to the parse.sx suite)

## Phase 3 — Ranking

- [ ] document frequency tracking — extend index with `df` per term
- [ ] TF-IDF scoring
- [ ] BM25 scoring (configurable k1, b)
- [ ] top-N retrieval (heap-based)
- [ ] `lib/search/tests/rank.sx` — 20+ cases: TF-IDF behavior, BM25 vs TF-IDF, ranking stability, top-N correctness

## Phase 4 — ACL filter + federation

- [ ] post-filter — each candidate result tested via `(acl/permit? viewer :read doc)`
- [ ] federated query — fan out to peer instances via fed-sx, merge results
- [ ] merge policy — interleave by rank, dedupe by `(peer, doc-id)`
- [ ] `lib/search/tests/integration.sx` — federated search with ACL filter

## Progress log

- **Phase 2 complete — parser (78/78 total).** Query tokenizer (ord-based delimiters, quoted phrases) + recursive-descent parser with OR […] `… Query -> [DocId]` in query.sx.
  Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx. Refactored both suites to **batch all cases into one program eval** (search-batch in testlib) — under the heavy CPU load on this box (~11 on 2 cores), 18–28 separate hk-eval-program calls timed out; one combined eval per suite is ~20× faster. Parser (parse.sx) is the remaining Phase 2 box.
- **Phase 1 complete (18/18).** Tokenizer (lowercase + strip punctuation + positions), inverted index as sorted assoc-list `[(Term,[(DocId,[Pos])])]`, indexDoc/deleteDoc/lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into `search/src` and evaluated via the haskell-on-sx interpreter; tests reuse `hk-test` counters and a `search-eval` helper that forces HK values to plain SX. conformance.sh models lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).

## Blockers

- **None.** Note: the box is heavily CPU-oversubscribed by sibling loop agents (load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite timeout is set to 600s. Runs are correct, just slow.
- **Data.Map public API gap (informational, not fixing):** the haskell-on-sx `import Data.Map` binds only empty/singleton/insert/lookup/member/size/null/delete/insertWith/adjust/findWithDefault — no toList/keys/elems/map/filter/unionWith. Index uses a pure assoc-list instead so term iteration and federation merge stay simple.
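
For reference, the assoc-list workaround described in the Data.Map note can be sketched as below. It shows one way to keep `[(Term,[(DocId,[Pos])])]` sorted on insert, so `allTerms` is just `map fst`. This is an illustrative sketch, not the contents of `lib/search/index.sx`: the helper `insertWithAL`, the `Int` doc ids, the exact signatures, and the choice to make `indexDoc` replace an already-indexed document are all assumptions.

```haskell
import Data.Char (isAlphaNum, isSpace, toLower)

type Term     = String
type DocId    = Int                      -- assumption: doc ids are Ints
type Pos      = Int
type Postings = [(DocId, [Pos])]         -- sorted by DocId
type Index    = [(Term, Postings)]       -- sorted by Term

-- Lowercase, drop punctuation, split on whitespace, attach positions.
tokenize :: String -> [(Term, Pos)]
tokenize s = zip (words (map toLower (filter keep s))) [0 ..]
  where keep c = isAlphaNum c || isSpace c

-- Hypothetical helper: insert into a key-sorted assoc-list, combining
-- with the existing value (new value first) when the key is present.
insertWithAL :: Ord k => (v -> v -> v) -> k -> v -> [(k, v)] -> [(k, v)]
insertWithAL _ k v [] = [(k, v)]
insertWithAL f k v ((k', v') : rest)
  | k < k'    = (k, v) : (k', v') : rest
  | k == k'   = (k, f v v') : rest
  | otherwise = (k', v') : insertWithAL f k v rest

-- Remove a document everywhere; terms left with no postings are dropped.
deleteDoc :: DocId -> Index -> Index
deleteDoc d ix =
  [ (t, ps') | (t, ps) <- ix
             , let ps' = filter ((/= d) . fst) ps
             , not (null ps') ]

-- Index a document; an existing entry for the same DocId is replaced.
indexDoc :: DocId -> String -> Index -> Index
indexDoc d text ix = foldl add (deleteDoc d ix) (tokenize text)
  where
    add acc (t, p) = insertWithAL mergePostings t [(d, [p])] acc
    -- Fold the new postings into the old ones, appending positions.
    mergePostings new old =
      foldr (\(d', ps) -> insertWithAL (\n o -> o ++ n) d' ps) old new

lookupTerm :: Term -> Index -> Postings
lookupTerm t ix = maybe [] id (lookup t ix)

docFreq :: Term -> Index -> Int
docFreq t ix = length (lookupTerm t ix)

allTerms :: Index -> [Term]
allTerms = map fst
```

For example, indexing `"Hello, world! Hello."` as doc 1 and `"world peace"` as doc 2 gives `allTerms` of `["hello","peace","world"]` and `docFreq "world"` of 2. Inserts are linear in the number of terms, which is the cost paid for iteration without `toList`/`keys`.
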