# search-on-sx: Full-text + structured search on Haskell

rose-ash needs search across pages, posts, threads, federated content. Tokenize, index, query, rank, filter by visibility. Typed ADTs make query parsing clean, lazy lists make posting-list iteration efficient, and Haskell-on-SX is at 1514/1514.

End-state: a Haskell-on-SX layer with inverted index, query AST, boolean + phrase + ranked queries (TF-IDF, BM25), ACL-aware post-filter, and a federation extension that merges per-peer indices.

## Status (rolling)

`bash lib/search/conformance.sh` → **122/122** (Phases 1–4 complete)

## Ground rules

- **Scope:** only touch `lib/search/**` and `plans/search-on-sx.md`. Do **not** edit `spec/`, `hosts/`, `shared/`, `lib/haskell/**`, or other `lib//`. You may **import** from `lib/haskell/` (public API in `lib/haskell/haskell.sx`); do **not** modify Haskell.
- **Shared-file issues** go under "Blockers" with a minimal repro; do not fix here.
- **SX files:** use `sx-tree` MCP tools only.
- **Architecture:** index = `Map Term [(DocId, [Pos])]`. Query AST = ADT. Eval = fold of posting lists with set ops + ranking math. Ranking is pure (no IO until result emission).
- **Commits:** one feature per commit. Keep Progress log updated and tick boxes.
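The architecture rule above (index as term-to-postings map, query as an ADT, eval as a fold with set operations) can be sketched in plain GHC Haskell as follows. This is only an illustration under assumptions: it is not the code in `lib/search/`, it has not been run under haskell-on-sx, and the helper names `postings`, `phraseDocs` and `sampleIndex` are hypothetical. The names `sortedUnion`, `sortedInter`, `sortedDiff`, `allDocs` and `evalQuery` are borrowed from this plan, but their signatures here are guesses.

```haskell
import Data.Maybe (fromMaybe)

type Term  = String
type DocId = Int
type Pos   = Int

-- Term-keyed assoc list; each posting list is sorted by DocId.
type Index = [(Term, [(DocId, [Pos])])]

data Query
  = Term String
  | And Query Query
  | Or Query Query
  | Not Query
  | Phrase [String]
  deriving (Show, Eq)

-- Hypothetical helper: postings for a term, empty if the term is absent.
postings :: Index -> Term -> [(DocId, [Pos])]
postings idx t = fromMaybe [] (lookup t idx)

-- Linear merges over ascending DocId lists.
sortedUnion, sortedInter, sortedDiff :: [DocId] -> [DocId] -> [DocId]
sortedUnion xs [] = xs
sortedUnion [] ys = ys
sortedUnion (x:xs) (y:ys) = case compare x y of
  LT -> x : sortedUnion xs (y:ys)
  GT -> y : sortedUnion (x:xs) ys
  EQ -> x : sortedUnion xs ys

sortedInter (x:xs) (y:ys) = case compare x y of
  LT -> sortedInter xs (y:ys)
  GT -> sortedInter (x:xs) ys
  EQ -> x : sortedInter xs ys
sortedInter _ _ = []

sortedDiff (x:xs) (y:ys) = case compare x y of
  LT -> x : sortedDiff xs (y:ys)
  GT -> sortedDiff (x:xs) ys
  EQ -> sortedDiff xs ys
sortedDiff xs _ = xs

-- Universe of documents, used to evaluate Not.
allDocs :: Index -> [DocId]
allDocs idx = foldr sortedUnion [] (map (map fst . snd) idx)

-- Docs where the phrase terms occur at consecutive positions.
phraseDocs :: Index -> [Term] -> [DocId]
phraseDocs _ [] = []
phraseDocs idx ts@(first : _) =
  [ d | (d, starts) <- postings idx first, any (matchesAt d) starts ]
  where
    positionsIn d t = fromMaybe [] (lookup d (postings idx t))
    matchesAt d p = and [ (p + i) `elem` positionsIn d t | (i, t) <- zip [0 ..] ts ]

-- Structural fold over the query AST; the result is DocId-ascending.
evalQuery :: Index -> Query -> [DocId]
evalQuery idx q = case q of
  Term t    -> map fst (postings idx t)
  And a b   -> sortedInter (evalQuery idx a) (evalQuery idx b)
  Or a b    -> sortedUnion (evalQuery idx a) (evalQuery idx b)
  Not a     -> sortedDiff (allDocs idx) (evalQuery idx a)
  Phrase ts -> phraseDocs idx ts

-- Docs: 1 = "alice bob", 2 = "bob carol", 3 = "alice carol bob".
sampleIndex :: Index
sampleIndex =
  [ ("alice", [(1, [0]), (3, [0])])
  , ("bob",   [(1, [1]), (2, [0]), (3, [2])])
  , ("carol", [(2, [1]), (3, [1])])
  ]
```

With `sampleIndex`, `evalQuery sampleIndex (And (Term "alice") (Term "bob"))` yields `[1,3]`, while `Phrase ["alice","bob"]` yields only `[1]` because doc 3 has `carol` between the two terms. Keeping every intermediate result DocId-sorted is what lets each boolean operator stay a single linear merge.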
## Architecture sketch

```
Document                                Query
{:id :text :tags}                       "alice AND bob OR phrase \"x y\""
        │                                       │
        ▼                                       ▼
lib/search/tokenize.sx                  lib/search/parse.sx
 - tokenize :: Text → [Term]             - parse :: Text → Query
 - normalize (lowercase, strip)          - Query = Term | And | Or
 - (optionally) stem                             | Not | Phrase
        │                                       │
        ▼                                       ▼
lib/search/index.sx                     lib/search/eval.sx
 - Map Term [(DocId, [Pos])]             - eval :: Index → Query → [DocId]
 - insert / delete / lookup              - boolean + phrase positions
 - persistence (optional later)                 │
        │                                       ▼
        └────────────────────────────►  lib/search/rank.sx
                                         - TF-IDF / BM25 scoring
                                         - top-N
                                                │
                                                ▼
                                        lib/search/api.sx
                                         - (search/index doc)
                                         - (search/query q)
                                         - (search/top n q)
                                                │
                                                ▼
                                        lib/search/fed.sx
                                         - federated query (merge peer results)
                                         - ACL filter post-merge
```

## Phase 1: Tokenize + index

- [x] `lib/search/tokenize.sx`: normalize (lowercase, strip punctuation), split on whitespace, return positions
- [x] `lib/search/index.sx`: inverted index data structure; `indexDoc`, `deleteDoc`, `lookupTerm`, `docFreq`, `allTerms`. (Data.Map's public API lacks toList/keys/map/filter, so a sorted assoc-list `[(Term,[(DocId,[Pos])])]` is used: the conceptual `Map Term [(DocId,[Pos])]` with free term iteration.)
- [x] `lib/search/api.sx`: assembles `search/src` (tokenize + index); Haskell entry points `indexDoc` / `lookupTerm`
- [x] `lib/search/tests/index.sx`: 18 cases: tokenize, insert + lookup, update, delete, multi-doc, positions, docFreq, allTerms
- [x] `lib/search/scoreboard.{json,md}`
- [x] `lib/search/conformance.sh`

## Phase 2: Query AST + boolean evaluation

- [x] Query ADT: `Term String | And Query Query | Or Query Query | Not Query | Phrase [String]` (in `lib/search/query.sx`)
- [x] `lib/search/parse.sx`: query syntax parser: tokenizer + recursive-descent (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens, case-insensitive keywords); `parseQuery`, `searchQuery`, `showQ`
- [x] `lib/search/query.sx`: boolean eval via set ops on docid-sorted posting lists (sortedUnion/Inter/Diff, Not over allDocs universe)
- [x] phrase eval: positional adjacency check (phraseInDoc / phraseStartsAt)
- [x] `lib/search/tests/boolean.sx`: 28 cases: term, and, or, not, phrase, composition (parser edge cases move to the parse.sx suite)

## Phase 3: Ranking

- [x] document frequency: `docFreq`/`idf`/`bm25idf` derived from the index (posting-list length); no separate df store needed
- [x] TF-IDF scoring (`rankTfIdf`)
- [x] BM25 scoring, configurable k1/b (`rankBm25 k1 b`)
- [x] top-N retrieval (`topNTfIdf`/`topNBm25`: sortBy + take; stable DocId tiebreak)
- [x] `lib/search/tests/rank.sx`: 23 cases: TF-IDF tf/idf behavior, BM25 length-norm + tf-saturation flips vs TF-IDF, b-parameter effect, tiebreak stability, top-N

## Phase 4: ACL filter + federation

- [x] post-filter: `aclFilter`/`searchTfIdfAcl`/`topNTfIdfAcl`/`searchBm25Acl` take an injected `permit :: DocId -> Bool` predicate, applied post-rank (never in the index)
- [x] federated query: `fedIndex :: [(PeerId, Index)] -> Index` merges per-peer inverted indices (union posting lists per term); rank/search run once over the merge
- [x] merge policy: relabel local DocIds to global `gid = peer*1000 + local` (bijection ⇒ dedupe by (peer,doc-id) is automatic); ranking interleaves peers by score
- [x] `lib/search/tests/integration.sx`: 21 cases: index merge, cross-peer df/lookup, position preservation, boolean/phrase over the merge, ACL filter + top-N + bm25

## Extensions (post-roadmap, search-shaped vocabulary)

- [x] prefix / wildcard queries (`prefixTerms`, `prefixDocs`, `prefixRankTfIdf`), 14 tests
- [x] fuzzy matching: edit distance term expansion (`editDist`, `fuzzyTerms`, `fuzzyDocs`, `fuzzyRankTfIdf`), 18 tests
- [x] result pagination (offset / limit): `paginate`, `pageTfIdf`, `pageBm25`, `resultCount`, 12 tests
- [x] snippet / highlight generation (`highlight`, `snippet`), 12 tests
- [x] stemming (suffix stripping): `stem`, `stemText`, `stemTokens`, `indexStemmed`, 18 tests
- [x] proximity / NEAR: `nearDocs k t1 t2` (unordered, within k positions), 9 tests
- [x] synonym / query expansion: `expandTerm`, `synDocs`, `synRankTfIdf`, 9 tests
- [x] boolean-filtered ranked search: `queryTerms`, `searchRankTfIdf`, `searchRankBm25` (filter by boolean query, rank survivors by relevance), 11 tests

## Progress log

- **Extension: boolean-filtered ranked search (225/225 total).** `searchRankTfIdf`/`searchRankBm25` parse a boolean query, filter docs via evalQuery, then rank the survivors by relevance over the query's leaf terms (`queryTerms`), the real-world filter-then-rank pattern. 11 tests.
- **Extension: synonyms/query expansion (214/214 total).** A synonym map `[(Term,[Term])]` expands a query term to itself + synonyms (`expandTerm`); `synDocs` unions, `synRankTfIdf` ranks the expanded set. 9 tests.
- **Extension: proximity/NEAR (205/205 total).** `nearDocs k t1 t2 idx` returns docs where both terms occur within k positions (unordered), candidates = posting intersection, filtered on the positional postings. 9 tests.
- **Extension: stemming (196/196 total).** Deterministic English suffix stripping (`stem`), `stemText`/`stemTokens`, `indexStemmed`.
  Two haskell-on-sx gotchas: take/drop over a String yield char CODES not char strings (rebuild via `joinChars . map chr`), and isSuffixOf's `reverse` trips `++` on the String repr (manual suffix compare). All five planned extensions now done; the loop can keep adding search vocabulary. 18 tests.
- **Extension: highlight/snippet (178/178 total).** `highlight terms text` marks query-matching (normalized) tokens with [..]; `snippet ctx terms text` extracts a context window around the first match. 12 tests.
- **Extension: fuzzy matching (166/166 total).** Levenshtein `editDist` as an O(m*n) row-based DP (the naive recursive version is exponential and times out under load), `fuzzyTerms`/`fuzzyDocs`/`fuzzyRankTfIdf` expand a term to indexed terms within a max edit distance. 18 tests.
- **Extension: pagination (148/148 total).** `paginate off lim` windows a ranked list (take lim . drop off); `pageTfIdf`/`pageBm25` + `resultCount`. 12 tests. Note the full conformance now runs 8 suites sequentially and needs an overall timeout ~1900s under the heavy box load.
- **Extension: prefix/wildcard queries (136/136 total).** `prefixTerms` matches every indexed term starting with a prefix (via allTerms + isPrefixOf); `prefixDocs` unions their docs; `prefixRankTfIdf` ranks treating the matched terms as the query. 14 tests.
- **Phase 4 complete: federation + ACL (122/122 total). Roadmap done.** `fedIndex` merges per-peer inverted indices (union posting lists per term) after relabelling local DocIds to global `gid = peer*1000 + local`; the bijection makes (peer,doc-id) dedupe automatic and keeps positions, so ranking runs once over the merge and interleaves peers by score (rank-correct). ACL is a post-rank `filter` over an injected `permit :: DocId -> Bool` (viewer baked in by the caller), never in the index; `searchTfIdfAcl`/`topNTfIdfAcl`/`searchBm25Acl`. 21 integration tests.
- **Phase 3 complete: ranking (101/101 total).** TF-IDF (`rankTfIdf`) and BM25 (`rankBm25 k1 b`) over the candidate set (docs containing any query term), scores as floats with deterministic DocId-ascending tiebreak; `topNTfIdf`/`topNBm25` via sortBy+take. df/idf derived from posting-list length (no separate df store). 23 tests incl. a BM25-vs-TF-IDF flip (length-norm + tf-saturation) and the b-parameter effect. Float division/`log`/float literals all work in haskell-on-sx.
- **Phase 2 complete: parser (78/78 total).** Query tokenizer (ord-based delimiters, quoted phrases) + recursive-descent parser with OR < AND < NOT precedence […]
- […] `… Query -> [DocId]` in query.sx. Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx. Refactored both suites to **batch all cases into one program eval** (search-batch in testlib): under the heavy CPU load on this box (~11 on 2 cores), 18–28 separate hk-eval-program calls timed out; one combined eval per suite is ~20× faster. Parser (parse.sx) is the remaining Phase 2 box.
- **Phase 1 complete (18/18).** Tokenizer (lowercase + strip punctuation + positions), inverted index as sorted assoc-list `[(Term,[(DocId,[Pos])])]`, indexDoc/deleteDoc/lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into `search/src` and evaluated via the haskell-on-sx interpreter; tests reuse `hk-test` counters and a `search-eval` helper that forces HK values to plain SX. conformance.sh models lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).

## Blockers

- **None.** Note: the box is heavily CPU-oversubscribed by sibling loop agents (load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite timeout is set to 600s. Runs are correct, just slow.
- **Data.Map public API gap (informational, not fixing):** the haskell-on-sx `import Data.Map` binds only empty/singleton/insert/lookup/member/size/null/delete/insertWith/adjust/findWithDefault; there is no toList/keys/elems/map/filter/unionWith. Index uses a pure assoc-list instead so term iteration and federation merge stay simple.
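
The assoc-list workaround described in the Data.Map note above can be sketched roughly as follows, in plain GHC Haskell. This is a sketch under assumptions, not the contents of `lib/search/index.sx`: the function names come from this plan, but the argument orders, the `upsert` and `addOccurrence` helpers, and the delete-then-reinsert update policy are hypothetical. It also uses `Data.Char` and `Data.Maybe`, whose availability under haskell-on-sx is not confirmed here.

```haskell
import Data.Char (isAlphaNum, isSpace, toLower)
import Data.Maybe (fromMaybe)

type Term  = String
type DocId = Int
type Pos   = Int

-- Sorted assoc list standing in for Map Term [(DocId, [Pos])].
type Index = [(Term, [(DocId, [Pos])])]

-- Lowercase, drop punctuation, split on whitespace, number the tokens.
tokenize :: String -> [(Term, Pos)]
tokenize text = zip (words cleaned) [0 ..]
  where
    cleaned = [toLower c | c <- text, isAlphaNum c || isSpace c]

-- Hypothetical helper: insert a new key, or update an existing one,
-- in a key-sorted assoc list while keeping it sorted.
upsert :: Ord k => k -> v -> (v -> v) -> [(k, v)] -> [(k, v)]
upsert k new _ [] = [(k, new)]
upsert k new f ((k', v) : rest)
  | k < k'    = (k, new) : (k', v) : rest
  | k == k'   = (k', f v) : rest
  | otherwise = (k', v) : upsert k new f rest

-- Record one (term, position) occurrence for a document.
addOccurrence :: DocId -> Index -> (Term, Pos) -> Index
addOccurrence d idx (t, p) =
  upsert t [(d, [p])] (upsert d [p] (++ [p])) idx

-- Remove a document everywhere; drop terms left with no postings.
deleteDoc :: DocId -> Index -> Index
deleteDoc d idx =
  [ (t, ps') | (t, ps) <- idx
             , let ps' = filter ((/= d) . fst) ps
             , not (null ps') ]

-- Re-indexing an existing DocId replaces its old postings.
indexDoc :: DocId -> String -> Index -> Index
indexDoc d text idx = foldl (addOccurrence d) (deleteDoc d idx) (tokenize text)

lookupTerm :: Term -> Index -> [(DocId, [Pos])]
lookupTerm t idx = fromMaybe [] (lookup t idx)

-- Document frequency is just the posting-list length.
docFreq :: Term -> Index -> Int
docFreq t = length . lookupTerm t

-- Term iteration is free: the keys are already in sorted order.
allTerms :: Index -> [Term]
allTerms = map fst
```

For example, `indexDoc 2 "Bob, carol!" (indexDoc 1 "Alice bob alice" [])` gives an index where `lookupTerm "bob"` is `[(1,[1]),(2,[0])]` and `allTerms` is `["alice","bob","carol"]`. Lookups are linear in the number of terms rather than logarithmic, which is the trade-off accepted in exchange for simple term iteration and federation merge.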