rose-ash/plans/search-on-sx.md

# search-on-sx: Full-text + structured search on Haskell

rose-ash needs search across pages, posts, threads, and federated content: tokenize,
index, query, rank, filter by visibility. Typed ADTs make query parsing clean,
lazy lists make posting-list iteration efficient, and Haskell-on-SX is at 1514/1514.

End-state: a Haskell-on-SX layer with inverted index, query AST, boolean +
phrase + ranked queries (TF-IDF, BM25), ACL-aware post-filter, and a federation
extension that merges per-peer indices.

## Status (rolling)

`bash lib/search/conformance.sh` → **78/78** (Phases 1–2 complete)

## Ground rules

- **Scope:** only touch `lib/search/**` and `plans/search-on-sx.md`. Do **not** edit
  `spec/`, `hosts/`, `shared/`, `lib/haskell/**`, or other `lib/<lang>/`. You may
  **import** from `lib/haskell/` (public API in `lib/haskell/haskell.sx`); do **not**
  modify Haskell.
- **Shared-file issues** go under "Blockers" with a minimal repro; do not fix here.
- **SX files:** use `sx-tree` MCP tools only.
- **Architecture:** index = `Map Term [(DocId, [Pos])]`. Query AST = ADT. Eval =
  fold of posting lists with set ops + ranking math. Ranking is pure (no IO until
  result emission).
- **Commits:** one feature per commit. Keep Progress log updated and tick boxes.

## Architecture sketch

```
Document                               Query
  {:id :text :tags}                       "alice AND bob OR phrase \"x y\""
        │                                       │
        ▼                                       ▼
lib/search/tokenize.sx                  lib/search/parse.sx
  — tokenize :: Text → [Term]             — parse :: Text → Query
  — normalize (lowercase, strip)          — Query = Term | And | Or
  — (optionally) stem                              | Not | Phrase
        │                                       │
        ▼                                       ▼
lib/search/index.sx                     lib/search/query.sx
  — Map Term [(DocId, [Pos])]             — eval :: Index → Query → [DocId]
  — insert / delete / lookup              — boolean + phrase positions
  — persistence (optional later)                 │
        │                                       ▼
        └────────────────► lib/search/rank.sx
                            — TF-IDF / BM25 scoring
                            — top-N
                                  │
                                  ▼
                          lib/search/api.sx
                            — (search/index doc)
                            — (search/query q)
                            — (search/top n q)
                                  │
                                  ▼
                          lib/search/fed.sx
                            — federated query (merge peer results)
                            — ACL filter post-merge
```

## Phase 1 — Tokenize + index

- [x] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on
  whitespace, return positions
- [x] `lib/search/index.sx` — inverted index data structure; `indexDoc`, `deleteDoc`,
  `lookupTerm`, `docFreq`, `allTerms`. The haskell-on-sx `Data.Map` public API lacks
  toList/keys/map/filter, so the index is a sorted assoc-list
  `[(Term,[(DocId,[Pos])])]`: conceptually `Map Term [(DocId,[Pos])]`, with term
  iteration for free.
- [x] `lib/search/api.sx` — assembles `search/src` (tokenize + index); Haskell entry
  points `indexDoc` / `lookupTerm`
- [x] `lib/search/tests/index.sx` — 18 cases: tokenize, insert + lookup, update,
  delete, multi-doc, positions, docFreq, allTerms
- [x] `lib/search/scoreboard.{json,md}`
- [x] `lib/search/conformance.sh`
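
For orientation, a minimal sketch of the Phase 1 shape in plain GHC Haskell. It
has not been run through the haskell-on-sx interpreter, and the argument orders,
the 0-based positions, and the `addPos` / `addTerm` helpers are assumptions made
for illustration, not the contents of `tokenize.sx` / `index.sx`.

```haskell
import Data.Char (isAlphaNum, isSpace, toLower)

type Term  = String
type DocId = Int
type Pos   = Int

-- Outer list sorted by Term; each posting list sorted by DocId.
type Index = [(Term, [(DocId, [Pos])])]

-- Lowercase, drop punctuation, split on whitespace, number tokens from 0.
-- (Assumption: punctuation is deleted, not replaced by a space.)
tokenize :: String -> [(Term, Pos)]
tokenize s = zip (words (filter keep (map toLower s))) [0 ..]
  where keep c = isAlphaNum c || isSpace c

-- Record one occurrence in a posting list, keeping it sorted by DocId.
addPos :: DocId -> Pos -> [(DocId, [Pos])] -> [(DocId, [Pos])]
addPos d p [] = [(d, [p])]
addPos d p ((d', ps) : rest)
  | d == d'   = (d', ps ++ [p]) : rest
  | d <  d'   = (d, [p]) : (d', ps) : rest
  | otherwise = (d', ps) : addPos d p rest

-- Record one term occurrence, keeping the outer list sorted by Term.
addTerm :: DocId -> (Term, Pos) -> Index -> Index
addTerm d (t, p) [] = [(t, [(d, [p])])]
addTerm d (t, p) ((t', pl) : rest)
  | t == t'   = (t', addPos d p pl) : rest
  | t <  t'   = (t, [(d, [p])]) : (t', pl) : rest
  | otherwise = (t', pl) : addTerm d (t, p) rest

-- Remove every posting of a document; drop terms left with no postings.
deleteDoc :: DocId -> Index -> Index
deleteDoc d ix =
  [ (t, pl') | (t, pl) <- ix
             , let pl' = filter ((/= d) . fst) pl
             , not (null pl') ]

-- Re-indexing an existing DocId replaces its postings (update = delete + insert).
indexDoc :: DocId -> String -> Index -> Index
indexDoc d text ix =
  foldl (\acc tp -> addTerm d tp acc) (deleteDoc d ix) (tokenize text)

lookupTerm :: Term -> Index -> [(DocId, [Pos])]
lookupTerm t ix = maybe [] id (lookup t ix)

docFreq :: Term -> Index -> Int
docFreq t = length . lookupTerm t

allTerms :: Index -> [Term]
allTerms = map fst
```

Because the outer list is an ordinary list, `allTerms` is just `map fst`, which is
the "term iteration for free" property the assoc-list was chosen for.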

## Phase 2 — Query AST + boolean evaluation

- [x] Query ADT: `Term String | And Query Query | Or Query Query | Not Query |
  Phrase [String]` (in `lib/search/query.sx`)
- [x] `lib/search/parse.sx` — query syntax parser: tokenizer + recursive-descent
  (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens,
  case-insensitive keywords); `parseQuery`, `searchQuery`, `showQ`
- [x] `lib/search/query.sx` — boolean eval via set ops on docid-sorted posting lists
  (sortedUnion/Inter/Diff, Not over allDocs universe)
- [x] phrase eval — positional adjacency check (phraseInDoc / phraseStartsAt)
- [x] `lib/search/tests/boolean.sx` — 28 cases: term, and, or, not, phrase,
  composition (parser edge cases move to the parse.sx suite)
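
The evaluation strategy above (linear merges over docid-sorted posting lists, Not
as a difference against the allDocs universe, phrase as a positional adjacency
check) can be sketched as follows. This is plain GHC Haskell under assumptions:
the `postings` helper and all signatures are hypothetical, query terms are
assumed to be already normalized, and the real `query.sx` may differ.

```haskell
type DocId = Int
type Pos   = Int
type Index = [(String, [(DocId, [Pos])])]

data Query
  = Term String
  | And Query Query
  | Or Query Query
  | Not Query
  | Phrase [String]
  deriving (Show, Eq)

-- Linear merges; both inputs must be sorted ascending without duplicates.
sortedUnion, sortedInter, sortedDiff :: [DocId] -> [DocId] -> [DocId]
sortedUnion [] ys = ys
sortedUnion xs [] = xs
sortedUnion (x : xs) (y : ys)
  | x == y    = x : sortedUnion xs ys
  | x <  y    = x : sortedUnion xs (y : ys)
  | otherwise = y : sortedUnion (x : xs) ys

sortedInter [] _ = []
sortedInter _ [] = []
sortedInter (x : xs) (y : ys)
  | x == y    = x : sortedInter xs ys
  | x <  y    = sortedInter xs (y : ys)
  | otherwise = sortedInter (x : xs) ys

sortedDiff [] _ = []
sortedDiff xs [] = xs
sortedDiff (x : xs) (y : ys)
  | x == y    = sortedDiff xs ys
  | x <  y    = x : sortedDiff xs (y : ys)
  | otherwise = sortedDiff (x : xs) ys

postings :: Index -> String -> [(DocId, [Pos])]
postings ix t = maybe [] id (lookup t ix)

-- Universe for Not: the union of every posting list's doc ids.
allDocs :: Index -> [DocId]
allDocs = foldr (sortedUnion . map fst . snd) []

-- True when word i of ws sits at position s + i in document d.
phraseStartsAt :: Index -> DocId -> [String] -> Pos -> Bool
phraseStartsAt ix d ws s = and (zipWith occursAt ws [s ..])
  where occursAt w p = p `elem` maybe [] id (lookup d (postings ix w))

-- True when the phrase starts at some position of its first word in d.
phraseInDoc :: Index -> [String] -> DocId -> Bool
phraseInDoc _ [] _ = False
phraseInDoc ix (w : ws) d = any (phraseStartsAt ix d (w : ws)) starts
  where starts = maybe [] id (lookup d (postings ix w))

evalQuery :: Index -> Query -> [DocId]
evalQuery ix (Term t)    = map fst (postings ix t)
evalQuery ix (And a b)   = sortedInter (evalQuery ix a) (evalQuery ix b)
evalQuery ix (Or a b)    = sortedUnion (evalQuery ix a) (evalQuery ix b)
evalQuery ix (Not q)     = sortedDiff (allDocs ix) (evalQuery ix q)
evalQuery ix (Phrase ws) = filter (phraseInDoc ix ws) (allDocs ix)
```

Every branch returns a docid-sorted list, so results compose without re-sorting.
Scanning `allDocs` for phrases keeps the sketch short; starting from the first
word's posting list would avoid touching unrelated documents.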

## Phase 3 — Ranking

- [ ] document frequency tracking — extend index with `df` per term
- [ ] TF-IDF scoring
- [ ] BM25 scoring (configurable k1, b)
- [ ] top-N retrieval (heap-based)
- [ ] `lib/search/tests/rank.sx` — 20+ cases: TF-IDF behavior, BM25 vs TF-IDF,
  ranking stability, top-N correctness
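
As a reference point for the unticked boxes, a sketch of the scoring math in plain
GHC Haskell. The formulas are the standard ones: TF-IDF as `tf * ln(N / df)` and
BM25 as `idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgLen))` with
`idf = ln(1 + (N - df + 0.5) / (df + 0.5))`. All names, the record layout, and
the defaults `k1 = 1.2`, `b = 0.75` are assumptions, and `topN` uses a full sort
as a stand-in for the planned heap.

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

type DocId = Int

-- Configurable BM25 parameters (hypothetical record).
data BM25 = BM25 { bm25K1 :: Double, bm25B :: Double }

defaultBM25 :: BM25
defaultBM25 = BM25 { bm25K1 = 1.2, bm25B = 0.75 }

-- Smoothed inverse document frequency; positive for every df in 0..n.
idf :: Int -> Int -> Double
idf n df = log (1 + (fromIntegral (n - df) + 0.5) / (fromIntegral df + 0.5))

-- One term's TF-IDF contribution: raw term frequency times ln(N / df).
tfIdf :: Int -> Int -> Int -> Double
tfIdf n df tf
  | df == 0   = 0
  | otherwise = fromIntegral tf * log (fromIntegral n / fromIntegral df)

-- One term's BM25 contribution to one document.
-- Arguments: config, corpus size, average doc length, this doc's length, df, tf.
bm25Term :: BM25 -> Int -> Double -> Int -> Int -> Int -> Double
bm25Term cfg n avgLen docLen df tf =
  idf n df * (f * (bm25K1 cfg + 1)) / (f + bm25K1 cfg * norm)
  where
    f    = fromIntegral tf
    norm = 1 - bm25B cfg + bm25B cfg * fromIntegral docLen / avgLen

-- Highest score first; ties broken by ascending DocId so output is stable.
-- Sort-based stand-in for the heap-based top-N in the plan.
topN :: Int -> [(DocId, Double)] -> [(DocId, Double)]
topN n = take n . sortBy (comparing (Down . snd) <> comparing fst)
```

A document's score for a query would be the sum of the per-term contributions
over the query's terms. With `b = 0` length normalization is off, and a term
with `tf = 1` then scores exactly `idf n df`, which makes a convenient test case.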

## Phase 4 — ACL filter + federation

- [ ] post-filter — each candidate result tested via `(acl/permit? viewer :read doc)`
- [ ] federated query — fan out to peer instances via fed-sx, merge results
- [ ] merge policy — interleave by rank, dedupe by `(peer, doc-id)`
- [ ] `lib/search/tests/integration.sx` — federated search with ACL filter
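
One possible reading of the merge policy and post-filter, sketched in plain GHC
Haskell. The `Hit` record, `mergeHits`, and `federatedSearch` are hypothetical
names; the `permit` argument stands in for `(acl/permit? viewer :read doc)`; and
the sketch assumes peer scores are directly comparable, which may not hold if
each peer computes IDF over its own corpus.

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

type Peer  = String
type DocId = Int

data Hit = Hit { hitPeer :: Peer, hitDoc :: DocId, hitScore :: Double }
  deriving (Show, Eq)

-- Merge per-peer result lists: order by descending score, then keep only the
-- first (highest-scoring) hit for each (peer, doc-id) key.
mergeHits :: [[Hit]] -> [Hit]
mergeHits = dedupe [] . sortBy (comparing (Down . hitScore)) . concat
  where
    dedupe _ [] = []
    dedupe seen (h : hs)
      | key h `elem` seen = dedupe seen hs
      | otherwise         = h : dedupe (key h : seen) hs
    key h = (hitPeer h, hitDoc h)

-- Merge, apply the ACL predicate, then truncate. Filtering before `take`
-- means documents the viewer cannot read never consume a result slot.
federatedSearch :: (Hit -> Bool) -> Int -> [[Hit]] -> [Hit]
federatedSearch permit n = take n . filter permit . mergeHits
```

The `seen` list makes dedupe quadratic in the number of hits, which is acceptable
for a sketch; a set keyed on `(peer, doc-id)` would be the obvious replacement
once a suitable container is available.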

## Progress log

- **Phase 2 complete — parser (78/78 total).** Query tokenizer (ord-based
  delimiters, quoted phrases) + recursive-descent parser with OR<AND<NOT precedence,
  implicit AND on adjacency, parens, case-insensitive keywords. `parseQuery`,
  `searchQuery`, `showQ` (canonical render for AST tests). 32 tests in parse.sx.
  **haskell-on-sx parser gotchas hit while writing this (see parse.sx header):**
  (1) escaped char literals like `'\"'` break the tokenizer — match delimiters by
  `ord c == 34`; (2) an `[]` *pattern* inside a `case` alt breaks the parser — use
  multi-clause functions instead; (3) `case`/constructor patterns and `let (a,b)=..`
  are fine. Embedded Haskell string literals in a `.sx` source string need single
  `\"`, not `\\\"`.
- **Phase 2 boolean/phrase eval (46/46 total).** Query ADT
  `Term|And|Or|Not|Phrase` + `evalQuery :: Index -> Query -> [DocId]` in query.sx.
  Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from
  the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx.
  Refactored both suites to **batch all cases into one program eval** (search-batch
  in testlib) — under the heavy CPU load on this box (~11 on 2 cores), 18–28 separate
  hk-eval-program calls timed out; one combined eval per suite is ~20× faster.
  Parser (parse.sx) is the remaining Phase 2 box.
- **Phase 1 complete (18/18).** Tokenizer (lowercase + strip punctuation + positions),
  inverted index as sorted assoc-list `[(Term,[(DocId,[Pos])])]`, indexDoc/deleteDoc/
  lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into `search/src`
  and evaluated via the haskell-on-sx interpreter; tests reuse `hk-test` counters and a
  `search-eval` helper that forces HK values to plain SX. conformance.sh is modeled on
  lib/haskell's (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).
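
To make the parser entry concrete, here is a sketch of a query tokenizer plus
recursive-descent parser that follows the two workarounds noted there: delimiters
are matched by `ord` instead of escaped char literals, and empty-list cases use
multi-clause functions instead of `[]` patterns in `case`. It is plain GHC
Haskell, not checked against haskell-on-sx; the `Tok` type, `lexQ`, and the
`Maybe`-based signatures are assumptions, not the contents of `parse.sx`.

```haskell
import Data.Char (isSpace, ord, toLower, toUpper)

data Query = Term String | And Query Query | Or Query Query | Not Query | Phrase [String]
  deriving (Show, Eq)

data Tok = TWord String | TPhrase [String] | TAnd | TOr | TNot | TOpen | TClose
  deriving (Show, Eq)

-- Code points: 34 is the double quote, 40 and 41 are the parentheses.
isQuote, isDelim :: Char -> Bool
isQuote c = ord c == 34
isDelim c = isSpace c || isQuote c || ord c == 40 || ord c == 41

lexQ :: String -> [Tok]
lexQ [] = []
lexQ (c : cs)
  | isSpace c   = lexQ cs
  | ord c == 40 = TOpen : lexQ cs
  | ord c == 41 = TClose : lexQ cs
  | isQuote c   = let (body, rest) = break isQuote cs
                  in TPhrase (words (map toLower body)) : lexQ (drop 1 rest)
  | otherwise   = let (w, rest) = break isDelim (c : cs)
                  in keyword w : lexQ rest

-- Keywords are case-insensitive; every other word is lowercased.
keyword :: String -> Tok
keyword w = classify (map toUpper w)
  where classify "AND" = TAnd
        classify "OR"  = TOr
        classify "NOT" = TNot
        classify _     = TWord (map toLower w)

-- Precedence, loosest to tightest: OR < AND < NOT < atom.
parseOr, parseAnd, parseNot, parseAtom :: [Tok] -> Maybe (Query, [Tok])
parseOr ts  = parseAnd ts >>= orTail
parseAnd ts = parseNot ts >>= andTail

parseNot (TNot : ts) = parseNot ts >>= \(q, ts') -> Just (Not q, ts')
parseNot ts          = parseAtom ts

parseAtom (TWord w : ts)    = Just (Term w, ts)
parseAtom (TPhrase ws : ts) = Just (Phrase ws, ts)
parseAtom (TOpen : ts)      = parseOr ts >>= closeParen
parseAtom _                 = Nothing

orTail, andTail, closeParen :: (Query, [Tok]) -> Maybe (Query, [Tok])
orTail (l, TOr : ts) = parseAnd ts >>= \(r, ts') -> orTail (Or l r, ts')
orTail done          = Just done

andTail (l, TAnd : ts) = parseNot ts >>= \(r, ts') -> andTail (And l r, ts')
andTail (l, ts)
  | startsAtom ts = parseNot ts >>= \(r, ts') -> andTail (And l r, ts')  -- implicit AND
  | otherwise     = Just (l, ts)

closeParen (q, TClose : ts) = Just (q, ts)
closeParen _                = Nothing

startsAtom :: [Tok] -> Bool
startsAtom (TWord _ : _)   = True
startsAtom (TPhrase _ : _) = True
startsAtom (TNot : _)      = True
startsAtom (TOpen : _)     = True
startsAtom _               = False

-- Succeeds only if the whole token stream is consumed.
parseQuery :: String -> Maybe Query
parseQuery s = parseOr (lexQ s) >>= finish
  where finish (q, []) = Just q
        finish _       = Nothing
```

Under this grammar the diagram's example query, `alice AND bob OR phrase "x y"`,
parses as an `Or` whose left side is `And (Term "alice") (Term "bob")` and whose
right side is the implicit-AND pair `And (Term "phrase") (Phrase ["x","y"])`.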

## Blockers

- **None.** Note: the box is heavily CPU-oversubscribed by sibling loop agents
  (load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite
  timeout is set to 600s. Runs are correct, just slow.
- **Data.Map public API gap (informational, not fixing):** the haskell-on-sx
  `import Data.Map` binds only empty/singleton/insert/lookup/member/size/null/delete/
  insertWith/adjust/findWithDefault — no toList/keys/elems/map/filter/unionWith. Index
  uses a pure assoc-list instead so term iteration and federation merge stay simple.