rose-ash/plans/search-on-sx.md
giles db2a5dc6ab
search: boolean-filtered ranked search + 11 tests
searchRankTfIdf/searchRankBm25 parse a boolean query, filter docs via evalQuery,
then rank survivors by relevance over the query's leaf terms (queryTerms) — the
filter-then-rank pattern. 225/225.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 23:58:37 +00:00


search-on-sx: Full-text + structured search on Haskell

rose-ash needs search across pages, posts, threads, federated content. Tokenize, index, query, rank, filter by visibility. Typed ADTs make query parsing clean, lazy lists make posting-list iteration efficient, and Haskell-on-SX is at 1514/1514.

End-state: a Haskell-on-SX layer with inverted index, query AST, boolean + phrase + ranked queries (TF-IDF, BM25), ACL-aware post-filter, and a federation extension that merges per-peer indices.

Status (rolling)

bash lib/search/conformance.sh → 122/122 (Phases 1–4 complete); 225/225 including the post-roadmap extensions

Ground rules

  • Scope: only touch lib/search/** and plans/search-on-sx.md. Do not edit spec/, hosts/, shared/, lib/haskell/**, or other lib/<lang>/. You may import from lib/haskell/ (public API in lib/haskell/haskell.sx); do not modify Haskell.
  • Shared-file issues go under "Blockers" with a minimal repro; do not fix here.
  • SX files: use sx-tree MCP tools only.
  • Architecture: index = Map Term [(DocId, [Pos])]. Query AST = ADT. Eval = fold of posting lists with set ops + ranking math. Ranking is pure (no IO until result emission).
  • Commits: one feature per commit. Keep Progress log updated and tick boxes.

Architecture sketch

Document                               Query
  {:id :text :tags}                       "alice AND bob OR phrase \"x y\""
        │                                       │
        ▼                                       ▼
lib/search/tokenize.sx                  lib/search/parse.sx
  — tokenize :: Text → [Term]             — parse :: Text → Query
  — normalize (lowercase, strip)          — Query = Term | And | Or
  — (optionally) stem                              | Not | Phrase
        │                                       │
        ▼                                       ▼
lib/search/index.sx                     lib/search/eval.sx
  — Map Term [(DocId, [Pos])]             — eval :: Index → Query → [DocId]
  — insert / delete / lookup              — boolean + phrase positions
  — persistence (optional later)                 │
        │                                       ▼
        └────────────────► lib/search/rank.sx
                            — TF-IDF / BM25 scoring
                            — top-N
                                  │
                                  ▼
                          lib/search/api.sx
                            — (search/index doc)
                            — (search/query q)
                            — (search/top n q)
                                  │
                                  ▼
                          lib/search/fed.sx
                            — federated query (merge peer results)
                            — ACL filter post-merge

Phase 1 — Tokenize + index

  • lib/search/tokenize.sx — normalize (lowercase, strip punctuation), split on whitespace, return positions
  • lib/search/index.sx — inverted index data structure; indexDoc, deleteDoc, lookupTerm, docFreq, allTerms. (Data.Map's public API lacks toList/keys/map/filter, so a sorted assoc-list [(Term,[(DocId,[Pos])])] is used — the conceptual Map Term [(DocId,[Pos])] with free term iteration.)
  • lib/search/api.sx — assembles search/src (tokenize + index); Haskell entry points indexDoc / lookupTerm
  • lib/search/tests/index.sx — 18 cases: tokenize, insert + lookup, update, delete, multi-doc, positions, docFreq, allTerms
  • lib/search/scoreboard.{json,md}
  • lib/search/conformance.sh
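
The Phase 1 pieces above can be sketched in a few lines. This is a minimal illustration in plain GHC Haskell, not the code in lib/search: the names tokenize, indexDoc, lookupTerm, docFreq and allTerms come from the plan, but their signatures and bodies here are assumptions, as are the helpers addOcc and addPosting. It has not been run under the haskell-on-sx interpreter.

```haskell
import Data.Char (isAlphaNum, isSpace, toLower)
import Data.Maybe (fromMaybe)

type Term  = String
type DocId = Int
type Pos   = Int

-- Sorted assoc-list: terms ascending, postings ascending by DocId.
type Index = [(Term, [(DocId, [Pos])])]

-- Drop punctuation, lowercase, split on whitespace, number tokens from 0.
tokenize :: String -> [(Term, Pos)]
tokenize s = zip (words (map toLower (filter keep s))) [0 ..]
  where keep c = isAlphaNum c || isSpace c

-- Record one (term, position) occurrence for document d (hypothetical helper).
addOcc :: DocId -> Index -> (Term, Pos) -> Index
addOcc d [] (t, p) = [(t, [(d, [p])])]
addOcc d (e@(t', ps) : rest) (t, p)
  | t < t'    = (t, [(d, [p])]) : e : rest
  | t == t'   = (t', addPosting d p ps) : rest
  | otherwise = e : addOcc d rest (t, p)

-- Insert a position into a DocId-sorted posting list (hypothetical helper).
addPosting :: DocId -> Pos -> [(DocId, [Pos])] -> [(DocId, [Pos])]
addPosting d p [] = [(d, [p])]
addPosting d p (e@(d', qs) : rest)
  | d < d'    = (d, [p]) : e : rest
  | d == d'   = (d', qs ++ [p]) : rest
  | otherwise = e : addPosting d p rest

indexDoc :: DocId -> String -> Index -> Index
indexDoc d txt idx = foldl (addOcc d) idx (tokenize txt)

lookupTerm :: Term -> Index -> [(DocId, [Pos])]
lookupTerm t idx = fromMaybe [] (lookup t idx)

-- Document frequency is just the posting-list length.
docFreq :: Term -> Index -> Int
docFreq t = length . lookupTerm t

-- Term iteration is free on an assoc-list.
allTerms :: Index -> [Term]
allTerms = map fst
```

Because the structure is an ordinary list, allTerms needs no toList or keys, which is the point of the assoc-list choice noted above.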

Phase 2 — Query AST + boolean evaluation

  • Query ADT: Term String | And Query Query | Or Query Query | Not Query | Phrase [String] (in lib/search/query.sx)
  • lib/search/parse.sx — query syntax parser: tokenizer + recursive-descent (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens, case-insensitive keywords); parseQuery, searchQuery, showQ
  • lib/search/query.sx — boolean eval via set ops on docid-sorted posting lists (sortedUnion/Inter/Diff, Not over allDocs universe)
  • phrase eval — positional adjacency check (phraseInDoc / phraseStartsAt)
  • lib/search/tests/boolean.sx — 28 cases: term, and, or, not, phrase, composition (parser edge cases move to the parse.sx suite)
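
The boolean and phrase evaluation described above can be sketched as follows. This is an illustration in plain GHC Haskell under stated assumptions: the Query constructors and the names evalQuery, allDocs, phraseInDoc and sortedUnion/Inter/Diff come from the plan, while the bodies, the postings helper, and the choice to test every document for a phrase are assumptions made here for brevity.

```haskell
import Data.Maybe (fromMaybe)

type DocId = Int
type Pos   = Int
type Index = [(String, [(DocId, [Pos])])]

data Query
  = Term String
  | And Query Query
  | Or Query Query
  | Not Query
  | Phrase [String]

-- Linear merges over ascending DocId lists.
sortedUnion, sortedInter, sortedDiff :: [DocId] -> [DocId] -> [DocId]
sortedUnion [] ys = ys
sortedUnion xs [] = xs
sortedUnion (x : xs) (y : ys)
  | x < y     = x : sortedUnion xs (y : ys)
  | x > y     = y : sortedUnion (x : xs) ys
  | otherwise = x : sortedUnion xs ys

sortedInter (x : xs) (y : ys)
  | x < y     = sortedInter xs (y : ys)
  | x > y     = sortedInter (x : xs) ys
  | otherwise = x : sortedInter xs ys
sortedInter _ _ = []

sortedDiff (x : xs) (y : ys)
  | x < y     = x : sortedDiff xs (y : ys)
  | x > y     = sortedDiff (x : xs) ys
  | otherwise = sortedDiff xs ys
sortedDiff xs _ = xs

-- Posting list for a term (hypothetical helper).
postings :: String -> Index -> [(DocId, [Pos])]
postings t idx = fromMaybe [] (lookup t idx)

-- Universe of documents, used by Not.
allDocs :: Index -> [DocId]
allDocs = foldr (sortedUnion . map fst . snd) []

-- True when the words occur at consecutive positions in document d.
phraseInDoc :: Index -> [String] -> DocId -> Bool
phraseInDoc _ [] _ = False
phraseInDoc idx (w : ws) d = any startsAt (positions w)
  where
    positions t = fromMaybe [] (lookup d (postings t idx))
    startsAt p  = and [ (p + i) `elem` positions t | (i, t) <- zip [1 ..] ws ]

evalQuery :: Index -> Query -> [DocId]
evalQuery idx q = case q of
  Term t    -> map fst (postings t idx)
  And a b   -> sortedInter (evalQuery idx a) (evalQuery idx b)
  Or a b    -> sortedUnion (evalQuery idx a) (evalQuery idx b)
  Not a     -> sortedDiff (allDocs idx) (evalQuery idx a)
  -- Simplification: checks every document rather than intersecting candidates first.
  Phrase ws -> filter (phraseInDoc idx ws) (allDocs idx)
```

Every operator returns an ascending DocId list, so results compose without re-sorting.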

Phase 3 — Ranking

  • document frequency — docFreq/idf/bm25idf derived from the index (posting-list length); no separate df store needed
  • TF-IDF scoring (rankTfIdf)
  • BM25 scoring, configurable k1/b (rankBm25 k1 b)
  • top-N retrieval (topNTfIdf/topNBm25 — sortBy + take; stable DocId tiebreak)
  • lib/search/tests/rank.sx — 23 cases: TF-IDF tf/idf behavior, BM25 length-norm + tf-saturation flips vs TF-IDF, b-parameter effect, tiebreak stability, top-N
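
To make the ranking behavior above concrete, here is a hedged sketch of per-term scoring and top-N selection in plain GHC Haskell. The plan does not state which idf variant the library uses; the formulas below are the common textbook forms and should be read as assumptions, as are the helper names tfIdfTerm, bm25Term and topN and their argument order. A document's score would be the sum of the per-term values over the query terms.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

type DocId = Int

-- n = number of documents, df = documents containing the term.
idf :: Int -> Int -> Double
idf n df = log (fromIntegral n / fromIntegral df)

-- Smoothed BM25 idf (assumed variant; stays non-negative).
bm25idf :: Int -> Int -> Double
bm25idf n df =
  log (1 + (fromIntegral (n - df) + 0.5) / (fromIntegral df + 0.5))

-- TF-IDF contribution of one term: linear in term frequency.
tfIdfTerm :: Int -> Int -> Int -> Double
tfIdfTerm n df tf = fromIntegral tf * idf n df

-- BM25 contribution of one term: tf saturates, b controls length normalization.
bm25Term :: Double -> Double -> Int -> Int -> Double -> Int -> Int -> Double
bm25Term k1 b n df avgLen len tf =
  bm25idf n df * (f * (k1 + 1)) / (f + k1 * norm)
  where
    f    = fromIntegral tf
    norm = 1 - b + b * fromIntegral len / avgLen

-- Highest score first; ties broken by ascending DocId.
topN :: Int -> [(DocId, Double)] -> [(DocId, Double)]
topN n = take n . sortBy (comparing (negate . snd) <> comparing fst)
```

With b = 0 the norm term is 1 for every document, so length stops mattering. The tf factor is bounded by k1 + 1, which is what allows a BM25 ranking to flip relative to TF-IDF when one document merely repeats a term many times.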

Phase 4 — ACL filter + federation

  • post-filter — aclFilter/searchTfIdfAcl/topNTfIdfAcl/searchBm25Acl take an injected permit :: DocId -> Bool predicate, applied post-rank (never in the index)
  • federated query — fedIndex :: [(PeerId, Index)] -> Index merges per-peer inverted indices (union posting lists per term); rank/search run once over the merge
  • merge policy — relabel local DocIds to global gid = peer*1000 + local (a bijection as long as every local DocId is below 1000, so dedupe by (peer,doc-id) is automatic); ranking interleaves peers by score
  • lib/search/tests/integration.sx — 21 cases: index merge, cross-peer df/lookup, position preservation, boolean/phrase over the merge, ACL filter + top-N + bm25
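
The federation merge and post-rank ACL filter above can be sketched like this. It is an illustration in plain GHC Haskell: fedIndex, aclFilter and the gid = peer*1000 + local rule come from the plan, while globalId, relabel, mergeIndex and mergePostings are hypothetical helper names, and the sketch assumes every local DocId is below 1000 so that global ids never collide.

```haskell
type Term   = String
type DocId  = Int
type Pos    = Int
type PeerId = Int
type Index  = [(Term, [(DocId, [Pos])])]

-- Assumes local < 1000; otherwise two peers could map to the same gid.
globalId :: PeerId -> DocId -> DocId
globalId peer local = peer * 1000 + local

-- Rewrite every DocId of one peer's index; positions are untouched.
relabel :: PeerId -> Index -> Index
relabel peer = map (\(t, ps) -> (t, [ (globalId peer d, pos) | (d, pos) <- ps ]))

-- Merge two DocId-sorted posting lists.
mergePostings :: [(DocId, [Pos])] -> [(DocId, [Pos])] -> [(DocId, [Pos])]
mergePostings [] ys = ys
mergePostings xs [] = xs
mergePostings (x : xs) (y : ys)
  | fst x <= fst y = x : mergePostings xs (y : ys)
  | otherwise      = y : mergePostings (x : xs) ys

-- Merge two term-sorted indices, unioning posting lists per term.
mergeIndex :: Index -> Index -> Index
mergeIndex [] ys = ys
mergeIndex xs [] = xs
mergeIndex (x@(t, ps) : xs) (y@(u, qs) : ys)
  | t < u     = x : mergeIndex xs (y : ys)
  | t > u     = y : mergeIndex (x : xs) ys
  | otherwise = (t, mergePostings ps qs) : mergeIndex xs ys

fedIndex :: [(PeerId, Index)] -> Index
fedIndex = foldr (mergeIndex . uncurry relabel) []

-- ACL is applied to ranked results, never stored in the index.
aclFilter :: (DocId -> Bool) -> [(DocId, Double)] -> [(DocId, Double)]
aclFilter permit = filter (permit . fst)
```

Since the merged value is an ordinary Index, document frequencies are computed across all peers and ranking runs once over the merge, which is what lets results from different peers interleave by score.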

Extensions (post-roadmap, search-shaped vocabulary)

  • prefix / wildcard queries (prefixTerms, prefixDocs, prefixRankTfIdf) — 14 tests
  • fuzzy matching — edit distance term expansion (editDist, fuzzyTerms, fuzzyDocs, fuzzyRankTfIdf) — 18 tests
  • result pagination (offset / limit) — paginate, pageTfIdf, pageBm25, resultCount — 12 tests
  • snippet / highlight generation (highlight, snippet) — 12 tests
  • stemming (suffix stripping) — stem, stemText, stemTokens, indexStemmed — 18 tests
  • proximity / NEAR — nearDocs k t1 t2 (unordered, within k positions) — 9 tests
  • synonym / query expansion — expandTerm, synDocs, synRankTfIdf — 9 tests
  • boolean-filtered ranked search — queryTerms, searchRankTfIdf, searchRankBm25 (filter by boolean query, rank survivors by relevance) — 11 tests

Progress log

  • Extension: boolean-filtered ranked search (225/225 total). searchRankTfIdf/searchRankBm25 parse a boolean query, filter docs via evalQuery, then rank the survivors by relevance over the query's leaf terms (queryTerms) — the real-world filter-then-rank pattern. 11 tests.
  • Extension: synonyms/query expansion (214/214 total). A synonym map [(Term,[Term])] expands a query term to itself + synonyms (expandTerm); synDocs unions, synRankTfIdf ranks the expanded set. 9 tests.
  • Extension: proximity/NEAR (205/205 total). nearDocs k t1 t2 idx returns docs where both terms occur within k positions (unordered), candidates = posting intersection, filtered on the positional postings. 9 tests.
  • Extension: stemming (196/196 total). Deterministic English suffix stripping (stem), stemText/stemTokens, indexStemmed. Two haskell-on-sx gotchas: take/drop over a String yield char CODES not char strings (rebuild via joinChars . map chr), and isSuffixOf's reverse trips ++ on the String repr (manual suffix compare). All five planned extensions now done; the loop can keep adding search vocabulary. 18 tests.
  • Extension: highlight/snippet (178/178 total). highlight terms text marks query-matching (normalized) tokens with [..]; snippet ctx terms text extracts a context window around the first match. 12 tests.
  • Extension: fuzzy matching (166/166 total). Levenshtein editDist as an O(m*n) row-based DP (the naive recursive version is exponential and times out under load), fuzzyTerms/fuzzyDocs/fuzzyRankTfIdf expand a term to indexed terms within a max edit distance. 18 tests.
  • Extension: pagination (148/148 total). paginate off lim windows a ranked list (take lim . drop off); pageTfIdf/pageBm25 + resultCount. 12 tests. Note: the full conformance run now executes 8 suites sequentially and needs an overall timeout of ~1900s under the heavy box load.
  • Extension: prefix/wildcard queries (136/136 total). prefixTerms matches every indexed term starting with a prefix (via allTerms + isPrefixOf); prefixDocs unions their docs; prefixRankTfIdf ranks treating the matched terms as the query. 14 tests.
  • Phase 4 complete — federation + ACL (122/122 total). Roadmap done. fedIndex merges per-peer inverted indices (union posting lists per term) after relabelling local DocIds to global gid = peer*1000 + local — the bijection makes (peer,doc-id) dedupe automatic and keeps positions, so ranking runs once over the merge and interleaves peers by score (rank-correct). ACL is a post-rank filter over an injected permit :: DocId -> Bool (viewer baked in by the caller) — never in the index; searchTfIdfAcl/topNTfIdfAcl/searchBm25Acl. 21 integration tests.
  • Phase 3 complete — ranking (101/101 total). TF-IDF (rankTfIdf) and BM25 (rankBm25 k1 b) over the candidate set (docs containing any query term), scores as floats with deterministic DocId-ascending tiebreak; topNTfIdf/topNBm25 via sortBy+take. df/idf derived from posting-list length (no separate df store). 23 tests incl. a BM25-vs-TF-IDF flip (length-norm + tf-saturation) and the b-parameter effect. Float division/log/float literals all work in haskell-on-sx.
  • Phase 2 complete — parser (78/78 total). Query tokenizer (ord-based delimiters, quoted phrases) + recursive-descent parser with OR<AND<NOT precedence, implicit AND on adjacency, parens, case-insensitive keywords. parseQuery, searchQuery, showQ (canonical render for AST tests). 32 tests in parse.sx. haskell-on-sx parser gotchas hit while writing this (see parse.sx header): (1) escaped char literals like '\"' break the tokenizer — match delimiters by ord c == 34; (2) an [] pattern inside a case alt breaks the parser — use multi-clause functions instead; (3) case/constructor patterns and let (a,b)=.. are fine. Embedded Haskell string literals in a .sx source string need single \", not \\\".
  • Phase 2 boolean/phrase eval (46/46 total). Query ADT Term|And|Or|Not|Phrase + evalQuery :: Index -> Query -> [DocId] in query.sx. Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx. Refactored both suites to batch all cases into one program eval (search-batch in testlib): under the heavy CPU load on this box (load ~11 on 2 cores), 18–28 separate hk-eval-program calls timed out; one combined eval per suite is ~20× faster. Parser (parse.sx) is the remaining Phase 2 box.
  • Phase 1 complete (18/18). Tokenizer (lowercase + strip punctuation + positions), inverted index as sorted assoc-list [(Term,[(DocId,[Pos])])], indexDoc/deleteDoc/lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into search/src and evaluated via the haskell-on-sx interpreter; tests reuse hk-test counters and a search-eval helper that forces HK values to plain SX. conformance.sh models lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).
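
The parser entry in the log above (OR < AND < NOT precedence, implicit AND on adjacency, parens, quoted phrases, case-insensitive keywords) can be illustrated with a small recursive-descent sketch. This is plain GHC Haskell written for this note, not parse.sx: parseQuery is a name from the plan, but the Maybe-based signature, the Tok type and lexQ are assumptions. It uses the '"' character literal directly, which per the log entry the haskell-on-sx tokenizer does not accept (the real code matches on ord c == 34).

```haskell
import Data.Char (isSpace, toLower)

data Query = Term String | And Query Query | Or Query Query
           | Not Query | Phrase [String]
  deriving (Eq, Show)

data Tok = TWord String | TPhrase [String] | TAnd | TOr | TNot | TOpen | TClose
  deriving (Eq, Show)

-- Split the input into tokens; keywords are recognized case-insensitively.
lexQ :: String -> [Tok]
lexQ [] = []
lexQ (c : cs)
  | isSpace c = lexQ cs
  | c == '('  = TOpen : lexQ cs
  | c == ')'  = TClose : lexQ cs
  | c == '"'  = let (body, rest) = break (== '"') cs
                in TPhrase (words (map toLower body)) : lexQ (drop 1 rest)
  | otherwise = let (w, rest) = break (\x -> isSpace x || x `elem` "()\"") (c : cs)
                in keyword (map toLower w) : lexQ rest
  where
    keyword "and" = TAnd
    keyword "or"  = TOr
    keyword "not" = TNot
    keyword w     = TWord w

parseOr, parseAnd, parseNot, parseAtom :: [Tok] -> Maybe (Query, [Tok])

-- Lowest precedence: a OR b OR c.
parseOr ts = parseAnd ts >>= uncurry goOr
  where
    goOr l (TOr : rest) = parseAnd rest >>= \(r, rest') -> goOr (Or l r) rest'
    goOr l rest         = Just (l, rest)

-- Explicit AND, or implicit AND when another operand simply follows.
parseAnd ts = parseNot ts >>= uncurry goAnd
  where
    goAnd l (TAnd : rest) = more l rest
    goAnd l rest
      | startsOperand rest = more l rest
      | otherwise          = Just (l, rest)
    more l rest = parseNot rest >>= \(r, rest') -> goAnd (And l r) rest'
    startsOperand (t : _) = t `notElem` [TOr, TClose]
    startsOperand []      = False

-- Highest precedence: NOT binds to the operand right after it.
parseNot (TNot : ts) = parseNot ts >>= \(q, rest) -> Just (Not q, rest)
parseNot ts          = parseAtom ts

parseAtom (TWord w : ts)    = Just (Term w, ts)
parseAtom (TPhrase ws : ts) = Just (Phrase ws, ts)
parseAtom (TOpen : ts)      = parseOr ts >>= closeParen
  where
    closeParen (q, TClose : rest) = Just (q, rest)
    closeParen _                  = Nothing
parseAtom _ = Nothing

-- Succeeds only when the whole input is consumed.
parseQuery :: String -> Maybe Query
parseQuery s = case parseOr (lexQ s) of
  Just (q, []) -> Just q
  _            -> Nothing
```

Each precedence level is one function that calls the next tighter level, so the grammar reads straight off the code: parseOr wraps parseAnd, which wraps parseNot, which wraps parseAtom.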

Blockers

  • None. Note: the box is heavily CPU-oversubscribed by sibling loop agents (load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite timeout is set to 600s. Runs are correct, just slow.
  • Data.Map public API gap (informational, not fixing): the haskell-on-sx import Data.Map binds only empty/singleton/insert/lookup/member/size/null/delete/insertWith/adjust/findWithDefault — no toList/keys/elems/map/filter/unionWith. Index uses a pure assoc-list instead so term iteration and federation merge stay simple.