Merge loops/search into architecture: search-on-sx full-text search on Haskell

Tokenizer + inverted index, query AST (boolean/phrase) + parser, TF-IDF/BM25 ranking + top-N, federation merge + ACL post-filter, and 9 extensions (prefix, pagination, fuzzy, highlight, stem, NEAR, synonyms, boolean-ranked search, did-you-mean). lib/search/conformance.sh => 234/234 across 14 suites.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 09:16:57 +00:00
parent c5faf93813 5d62d08e1c
commit 644ea178c2
37 changed files with 1669 additions and 28 deletions
--- a/plans/agent-briefings/search-loop.md
+++ b/plans/agent-briefings/search-loop.md
@@ -0,0 +1,110 @@
+# search-on-sx loop agent (single agent, queue-driven)
+
+Role: iterates `plans/search-on-sx.md` forever. **Full-text + structured search on
+Haskell** — tokenize, inverted index, query AST, boolean + phrase + ranked
+queries (TF-IDF / BM25), ACL-aware post-filter, federated index merge. Typed ADTs
+make query parsing clean; lazy lists make posting-list iteration efficient. Sits on
+`lib/haskell/` (1514/1514 already green); adds a search-shaped vocabulary on top.
+
+```
+description: search-on-sx queue loop
+subagent_type: general-purpose
+run_in_background: true
+isolation: worktree
+```
+
+## Prompt
+
+You are the sole background agent working `plans/search-on-sx.md`. Isolated
+worktree `/root/rose-ash-loops/search` on branch `loops/search`, forever, one
+commit per feature. Push to `origin/loops/search` after every commit. Never touch
+`main` or `architecture`.
+
+## Restart baseline — check before iterating
+
+1. Read `plans/search-on-sx.md` — roadmap + Progress log.
+2. `ls lib/search/` — pick up from the most advanced file.
+3. If `lib/search/tests/*.sx` exist, run them via `bash lib/search/conformance.sh`.
+   Green before new work.
+4. If `lib/search/scoreboard.md` exists, that's your baseline.
+5. Read the `lib/haskell/` public API once — that's your substrate.
+   `lib/haskell/haskell.sx` exists; also study `runtime.sx`, `eval.sx`, `parser.sx`,
+   `infer.sx`, `match.sx`, `map.sx`, `set.sx`, `testlib.sx`. Learn how to declare ADTs,
+   pattern match, and use the `Map`/`Set` helpers before writing index code. Verify the
+   real exported names with sx_find_all / grep — don't assume from the plan's sketch.
+
+## The queue
+
+Phase order per `plans/search-on-sx.md`:
+
+- **Phase 1** — tokenize + inverted index + simple term lookup
+  (`Map Term [(DocId,[Pos])]`, insert/lookup, `(search/index doc)`,
+  `(search/query term)`).
+- **Phase 2** — query AST + boolean/phrase eval (Term | And | Or | Not | Phrase;
+  posting-list set ops; positional phrase match).
+- **Phase 3** — ranking (TF-IDF, BM25), top-N.
+- **Phase 4** — ACL-aware post-filter + federation (merge per-peer indices).
+
+Within a phase, pick the checkbox that unlocks the most tests per effort.
+
+Every iteration: implement → test → commit → tick `[ ]` → Progress log → next.
+
+## Ground rules (hard)
+
+- **Scope:** only `lib/search/**` and `plans/search-on-sx.md`. Do **not** edit
+  `spec/`, `hosts/`, `shared/`, other `lib/<lang>/` dirs, `lib/stdlib.sx`, or
+  `lib/` root. May **import** from `lib/haskell/` only (its public API). Do **not**
+  modify Haskell.
+- **NEVER call `sx_build`.** 600s watchdog. If the sx_server binary is broken →
+  Blockers entry, stop. Run tests by invoking the sx_server binary directly from a
+  conformance.sh (model it on `lib/haskell/conformance.sh`), pointing `SX_SERVER`
+  at `/root/rose-ash/hosts/ocaml/_build/default/bin/sx_server.exe` — fresh
+  worktrees have no `_build/`, so the relative path won't resolve.
+- **Shared-file issues** → plan's Blockers with minimal repro; don't fix here.
+- **SX files:** `sx-tree` MCP tools ONLY. **They take `file:` not `path:`** — a
+  wrong key yields `Yojson Type_error("Expected string, got null")`, which looks
+  like a broken binary but is just a param mismatch. `sx_validate` after edits.
+  Path-based edits (`sx_replace_node`) count comment headers in their indices and
+  can clobber the wrong node — re-read after, or prefer `sx_write_file` for small
+  files.
+- **Unicode in `.sx`:** raw UTF-8 only, never `\uXXXX` escapes.
+- **Commit granularity:** one feature per commit. Short factual messages
+  (`search: phrase query positional match + 7 tests`). Push to `origin/loops/search`.
+- **Plan file:** update Progress log (newest first) + tick boxes every commit.
+
+## search-specific gotchas
+
+- **Posting lists are the hot path.** Keep them sorted by DocId so boolean AND/OR
+  are linear merges, not nested scans. Phrase match needs positions, so store
+  `(DocId, [Pos])` — don't drop positions early to save space; you can't recover them.
+- **Tokenization decides recall.** Normalize consistently (lowercase, strip
+  punctuation) on BOTH index and query side, or queries silently miss. Test the
+  index/query symmetry explicitly.
+- **Ranking must be deterministic on ties.** TF-IDF/BM25 scores collide; always
+  add a stable tiebreak (DocId ascending) or tests flake.
+- **ACL filter is per-viewer and post-ranking.** Filter the result list against the
+  viewer, after scoring — never bake visibility into the index (the same index
+  serves all viewers). Inject the permit predicate; don't hardwire an ACL module
+  that doesn't exist yet.
+- **Federation merges indices, not results.** Merging per-peer inverted indices
+  (union posting lists per term) is cleaner and rank-correct vs merging ranked
+  result lists. Mock peer indices in tests.
+
+## General gotchas (all loops)
+
+- SX `do` = R7RS iteration. Use `begin` for multi-expr sequences.
+- `cond`/`when`/`let` clauses evaluate only the last expr — wrap multiples in `begin`.
+- `let` is parallel, not sequential — nest `let`s when a binding references an earlier one.
+- `env-bind!` creates a binding; `env-set!` mutates an existing one (walks scope chain).
+- `sx_validate` after every structural edit.
+- Namespace-prefix all guest helpers (`search/...`) — short/host-colliding names
+  get silently shadowed or hang the runtime.
+
+## Style
+
+- No comments in `.sx` unless non-obvious.
+- No new planning docs — update `plans/search-on-sx.md` inline.
+- Short, factual commit messages.
+- One feature per iteration. Commit. Log. Push. Next.
+
+Go. Start by reading the plan; find the first unchecked `[ ]`; implement it.
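
The posting-list gotcha in the briefing above (keep lists sorted by DocId, keep positions for phrase match) can be illustrated with a minimal sketch in plain GHC Haskell. It is not the code in `lib/search/query.sx`: the names `andDocs`, `orDocs`, `notDocs`, and `phraseMatches` are hypothetical, and the sketch assumes every DocId list is ascending and duplicate-free.

```haskell
type DocId = Int
type Pos   = Int

-- Boolean AND: one linear pass over two ascending DocId lists.
andDocs :: [DocId] -> [DocId] -> [DocId]
andDocs xs@(x : xt) ys@(y : yt)
  | x < y     = andDocs xt ys
  | x > y     = andDocs xs yt
  | otherwise = x : andDocs xt yt
andDocs _ _ = []

-- Boolean OR: linear merge that emits each DocId once.
orDocs :: [DocId] -> [DocId] -> [DocId]
orDocs xs [] = xs
orDocs [] ys = ys
orDocs xs@(x : xt) ys@(y : yt)
  | x < y     = x : orDocs xt ys
  | x > y     = y : orDocs xs yt
  | otherwise = x : orDocs xt yt

-- Boolean NOT: subtract the matching docs from the universe of all docs.
notDocs :: [DocId] -> [DocId] -> [DocId]
notDocs universe [] = universe
notDocs [] _ = []
notDocs us@(u : ut) ds@(d : dt)
  | u < d     = u : notDocs ut ds
  | u > d     = notDocs us dt
  | otherwise = notDocs ut dt

-- Phrase match inside one document. The argument holds one position list
-- per phrase word, in phrase order. The phrase matches if some start
-- position p of the first word has word k of the phrase at position p + k.
phraseMatches :: [[Pos]] -> Bool
phraseMatches []             = False
phraseMatches (first : rest) = any startsAt first
  where
    startsAt p = and (zipWith (\offset ps -> (p + offset) `elem` ps) [1 ..] rest)
```

The phrase check is only possible because each posting keeps its `[Pos]`; once positions are dropped, `phraseMatches` has nothing to work with.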
--- a/plans/search-on-sx.md
+++ b/plans/search-on-sx.md
@@ -10,7 +10,7 @@ extension that merges per-peer indices.

 ## Status (rolling)

-`bash lib/search/conformance.sh` → **0/0** (not yet started)
+`bash lib/search/conformance.sh` → **234/234** (Phases 1–4 complete, plus 9 extensions)

 ## Ground rules

@@ -61,46 +61,148 @@ lib/search/index.sx                     lib/search/eval.sx

 ## Phase 1 — Tokenize + index

-- [ ] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on
+- [x] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on
   whitespace, return positions
-- [ ] `lib/search/index.sx` — inverted index data structure (typed `Map` from
-  haskell lib); `insert`, `delete`, `lookup`
-- [ ] `lib/search/api.sx` — `(search/index doc)`, `(search/lookup term)`
-- [ ] `lib/search/tests/index.sx` — 15+ cases: tokenize, insert + lookup, update,
-  delete, multi-doc
-- [ ] `lib/search/scoreboard.{json,md}`
-- [ ] `lib/search/conformance.sh`
+- [x] `lib/search/index.sx` — inverted index data structure; `indexDoc`, `deleteDoc`,
+  `lookupTerm`, `docFreq`, `allTerms`. (Data.Map's public API lacks
+  toList/keys/map/filter, so a sorted assoc-list `[(Term,[(DocId,[Pos])])]` is used —
+  the conceptual `Map Term [(DocId,[Pos])]` with free term iteration.)
+- [x] `lib/search/api.sx` — assembles `search/src` (tokenize + index); Haskell entry
+  points `indexDoc` / `lookupTerm`
+- [x] `lib/search/tests/index.sx` — 18 cases: tokenize, insert + lookup, update,
+  delete, multi-doc, positions, docFreq, allTerms
+- [x] `lib/search/scoreboard.{json,md}`
+- [x] `lib/search/conformance.sh`

 ## Phase 2 — Query AST + boolean evaluation

-- [ ] Query ADT: `Term Text | And Query Query | Or Query Query | Not Query |
-  Phrase [Text]`
-- [ ] `lib/search/parse.sx` — query syntax parser (boolean operators, quoted phrases)
-- [ ] `lib/search/eval.sx` — boolean eval via set ops on posting lists
-- [ ] phrase eval — adjacency check using positions
-- [ ] `lib/search/tests/boolean.sx` — 25+ cases: term, and, or, not, phrase,
-  composition, parser edge cases
+- [x] Query ADT: `Term String | And Query Query | Or Query Query | Not Query |
+  Phrase [String]` (in `lib/search/query.sx`)
+- [x] `lib/search/parse.sx` — query syntax parser: tokenizer + recursive-descent
+  (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens,
+  case-insensitive keywords); `parseQuery`, `searchQuery`, `showQ`
+- [x] `lib/search/query.sx` — boolean eval via set ops on docid-sorted posting lists
+  (sortedUnion/Inter/Diff, Not over allDocs universe)
+- [x] phrase eval — positional adjacency check (phraseInDoc / phraseStartsAt)
+- [x] `lib/search/tests/boolean.sx` — 28 cases: term, and, or, not, phrase,
+  composition (parser edge cases move to the parse.sx suite)

 ## Phase 3 — Ranking

-- [ ] document frequency tracking — extend index with `df` per term
-- [ ] TF-IDF scoring
-- [ ] BM25 scoring (configurable k1, b)
-- [ ] top-N retrieval (heap-based)
-- [ ] `lib/search/tests/rank.sx` — 20+ cases: TF-IDF behavior, BM25 vs TF-IDF,
-  ranking stability, top-N correctness
+- [x] document frequency — `docFreq`/`idf`/`bm25idf` derived from the index
+  (posting-list length); no separate df store needed
+- [x] TF-IDF scoring (`rankTfIdf`)
+- [x] BM25 scoring, configurable k1/b (`rankBm25 k1 b`)
+- [x] top-N retrieval (`topNTfIdf`/`topNBm25` — sortBy + take; stable DocId tiebreak)
+- [x] `lib/search/tests/rank.sx` — 23 cases: TF-IDF tf/idf behavior, BM25 length-norm
+  + tf-saturation flips vs TF-IDF, b-parameter effect, tiebreak stability, top-N

 ## Phase 4 — ACL filter + federation

-- [ ] post-filter — each candidate result tested via `(acl/permit? viewer :read doc)`
-- [ ] federated query — fan out to peer instances via fed-sx, merge results
-- [ ] merge policy — interleave by rank, dedupe by `(peer, doc-id)`
-- [ ] `lib/search/tests/integration.sx` — federated search with ACL filter
+- [x] post-filter — `aclFilter`/`searchTfIdfAcl`/`topNTfIdfAcl`/`searchBm25Acl` take an
+  injected `permit :: DocId -> Bool` predicate, applied post-rank (never in the index)
+- [x] federated query — `fedIndex :: [(PeerId, Index)] -> Index` merges per-peer
+  inverted indices (union posting lists per term); rank/search run once over the merge
+- [x] merge policy — relabel local DocIds to global `gid = peer*1000 + local`
+  (bijection ⇒ dedupe by (peer,doc-id) is automatic); ranking interleaves peers by score
+- [x] `lib/search/tests/integration.sx` — 21 cases: index merge, cross-peer df/lookup,
+  position preservation, boolean/phrase over the merge, ACL filter + top-N + bm25
+
+## Extensions (post-roadmap, search-shaped vocabulary)
+
+- [x] prefix / wildcard queries (`prefixTerms`, `prefixDocs`, `prefixRankTfIdf`) — 14 tests
+- [x] fuzzy matching — edit distance term expansion (`editDist`, `fuzzyTerms`,
+  `fuzzyDocs`, `fuzzyRankTfIdf`) — 18 tests
+- [x] result pagination (offset / limit) — `paginate`, `pageTfIdf`, `pageBm25`,
+  `resultCount` — 12 tests
+- [x] snippet / highlight generation (`highlight`, `snippet`) — 12 tests
+- [x] stemming (suffix stripping) — `stem`, `stemText`, `stemTokens`, `indexStemmed`
+  — 18 tests
+- [x] proximity / NEAR — `nearDocs k t1 t2` (unordered, within k positions) — 9 tests
+- [x] synonym / query expansion — `expandTerm`, `synDocs`, `synRankTfIdf` — 9 tests
+- [x] boolean-filtered ranked search — `queryTerms`, `searchRankTfIdf`,
+  `searchRankBm25` (filter by boolean query, rank survivors by relevance) — 11 tests
+- [x] did-you-mean / spelling suggestion — `suggest`, `suggestN` (closest indexed
+  terms by edit distance, alphabetical tiebreak) — 9 tests

 ## Progress log

-(loop fills this in)
+- **Extension: did-you-mean / spelling suggestion (234/234 total).** `suggest`/`suggestN`
+  rank indexed terms by edit distance to a (misspelled) query term, alphabetical
+  tiebreak. 9 tests.
+- **Extension: boolean-filtered ranked search (225/225 total).** `searchRankTfIdf`/
+  `searchRankBm25` parse a boolean query, filter docs via evalQuery, then rank the
+  survivors by relevance over the query's leaf terms (`queryTerms`) — the real-world
+  filter-then-rank pattern. 11 tests.
+- **Extension: synonyms/query expansion (214/214 total).** A synonym map
+  `[(Term,[Term])]` expands a query term to itself + synonyms (`expandTerm`); `synDocs`
+  unions, `synRankTfIdf` ranks the expanded set. 9 tests.
+- **Extension: proximity/NEAR (205/205 total).** `nearDocs k t1 t2 idx` returns docs
+  where both terms occur within k positions (unordered), candidates = posting
+  intersection, filtered on the positional postings. 9 tests.
+- **Extension: stemming (196/196 total).** Deterministic English suffix stripping
+  (`stem`), `stemText`/`stemTokens`, `indexStemmed`. Two haskell-on-sx gotchas: take/drop
+  over a String yield char CODES not char strings (rebuild via `joinChars . map chr`),
+  and isSuffixOf's `reverse` trips `++` on the String repr (manual suffix compare). All
+  five planned extensions now done; the loop can keep adding search vocabulary. 18 tests.
+- **Extension: highlight/snippet (178/178 total).** `highlight terms text` marks
+  query-matching (normalized) tokens with [..]; `snippet ctx terms text` extracts a
+  context window around the first match. 12 tests.
+- **Extension: fuzzy matching (166/166 total).** Levenshtein `editDist` as an O(m*n)
+  row-based DP (the naive recursive version is exponential and times out under load),
+  `fuzzyTerms`/`fuzzyDocs`/`fuzzyRankTfIdf` expand a term to indexed terms within a max
+  edit distance. 18 tests.
+- **Extension: pagination (148/148 total).** `paginate off lim` windows a ranked list
+  (take lim . drop off); `pageTfIdf`/`pageBm25` + `resultCount`. 12 tests. Note the
+  full conformance now runs 8 suites sequentially and needs an overall timeout ~1900s
+  under the heavy box load.
+- **Extension: prefix/wildcard queries (136/136 total).** `prefixTerms` matches every
+  indexed term starting with a prefix (via allTerms + isPrefixOf); `prefixDocs` unions
+  their docs; `prefixRankTfIdf` ranks treating the matched terms as the query. 14 tests.
+- **Phase 4 complete — federation + ACL (122/122 total). Roadmap done.** `fedIndex`
+  merges per-peer inverted indices (union posting lists per term) after relabelling
+  local DocIds to global `gid = peer*1000 + local` — the bijection makes (peer,doc-id)
+  dedupe automatic and keeps positions, so ranking runs once over the merge and
+  interleaves peers by score (rank-correct). ACL is a post-rank `filter` over an
+  injected `permit :: DocId -> Bool` (viewer baked in by the caller) — never in the
+  index; `searchTfIdfAcl`/`topNTfIdfAcl`/`searchBm25Acl`. 21 integration tests.
+- **Phase 3 complete — ranking (101/101 total).** TF-IDF (`rankTfIdf`) and BM25
+  (`rankBm25 k1 b`) over the candidate set (docs containing any query term), scores
+  as floats with deterministic DocId-ascending tiebreak; `topNTfIdf`/`topNBm25` via
+  sortBy+take. df/idf derived from posting-list length (no separate df store). 23
+  tests incl. a BM25-vs-TF-IDF flip (length-norm + tf-saturation) and the b-parameter
+  effect. Float division/`log`/float literals all work in haskell-on-sx.
+- **Phase 2 complete — parser (78/78 total).** Query tokenizer (ord-based
+  delimiters, quoted phrases) + recursive-descent parser with OR<AND<NOT precedence,
+  implicit AND on adjacency, parens, case-insensitive keywords. `parseQuery`,
+  `searchQuery`, `showQ` (canonical render for AST tests). 32 tests in parse.sx.
+  **haskell-on-sx parser gotchas hit while writing this (see parse.sx header):**
+  (1) escaped char literals like `'\"'` break the tokenizer — match delimiters by
+  `ord c == 34`; (2) an `[]` *pattern* inside a `case` alt breaks the parser — use
+  multi-clause functions instead; (3) `case`/constructor patterns and `let (a,b)=..`
+  are fine. Embedded Haskell string literals in a `.sx` source string need single
+  `\"`, not `\\\"`.
+- **Phase 2 boolean/phrase eval (46/46 total).** Query ADT
+  `Term|And|Or|Not|Phrase` + `evalQuery :: Index -> Query -> [DocId]` in query.sx.
+  Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from
+  the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx.
+  Refactored both suites to **batch all cases into one program eval** (search-batch
+  in testlib) — under the heavy CPU load on this box (~11 on 2 cores), 18–28 separate
+  hk-eval-program calls timed out; one combined eval per suite is ~20× faster.
+  Parser (parse.sx) is the remaining Phase 2 box.
+- **Phase 1 complete (18/18).** Tokenizer (lowercase + strip punctuation + positions),
+  inverted index as sorted assoc-list `[(Term,[(DocId,[Pos])])]`, indexDoc/deleteDoc/
+  lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into `search/src`
+  and evaluated via the haskell-on-sx interpreter; tests reuse `hk-test` counters and a
+  `search-eval` helper that forces HK values to plain SX. conformance.sh models
+  lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).

 ## Blockers

-(loop fills this in)
+- **None.** Note: the box is heavily CPU-oversubscribed by sibling loop agents
+  (load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite
+  timeout is set to 600s. Runs are correct, just slow.
+- **Data.Map public API gap (informational, not fixing):** the haskell-on-sx
+  `import Data.Map` binds only empty/singleton/insert/lookup/member/size/null/delete/
+  insertWith/adjust/findWithDefault — no toList/keys/elems/map/filter/unionWith. Index
+  uses a pure assoc-list instead so term iteration and federation merge stay simple.
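
The federation merge and ACL post-filter recorded in the plan above can be sketched as follows, in plain GHC Haskell rather than the haskell-on-sx subset. This is an illustration under stated assumptions, not the code in `lib/search/`: `globalId`, `relabel`, `mergeIndex`, `fedMerge`, and `aclFilterSketch` are hypothetical names, the index is the term-sorted assoc-list shape the plan describes, and local DocIds are assumed to stay below 1000 so that `gid = peer*1000 + local` remains a bijection.

```haskell
import Data.List (sortOn)

type Term   = String
type DocId  = Int
type Pos    = Int
type PeerId = Int
type Index  = [(Term, [(DocId, [Pos])])]

-- Global id from the plan: gid = peer*1000 + local (assumes local < 1000).
globalId :: PeerId -> DocId -> DocId
globalId peer local = peer * 1000 + local

-- Rewrite every DocId in a peer's index to its global id; positions are kept.
relabel :: PeerId -> Index -> Index
relabel peer idx =
  [ (term, [ (globalId peer doc, ps) | (doc, ps) <- posts ])
  | (term, posts) <- idx
  ]

-- Merge two term-sorted indices, unioning posting lists for shared terms
-- and keeping each merged posting list sorted by DocId.
mergeIndex :: Index -> Index -> Index
mergeIndex [] ys = ys
mergeIndex xs [] = xs
mergeIndex xs@((tx, px) : xt) ys@((ty, py) : yt)
  | tx < ty   = (tx, px) : mergeIndex xt ys
  | tx > ty   = (ty, py) : mergeIndex xs yt
  | otherwise = (tx, sortOn fst (px ++ py)) : mergeIndex xt yt

-- Federation: relabel each peer's index, then fold the merges together.
fedMerge :: [(PeerId, Index)] -> Index
fedMerge = foldr (\(peer, idx) acc -> mergeIndex (relabel peer idx) acc) []

-- ACL post-filter: applied to the ranked result list, never to the index.
-- The permit predicate is injected by the caller with the viewer baked in.
aclFilterSketch :: (DocId -> Bool) -> [(DocId, Double)] -> [(DocId, Double)]
aclFilterSketch permit = filter (permit . fst)
```

Because `filter` preserves order, the ranking computed over the merged index survives the ACL step unchanged; only hidden documents drop out.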