rose-ash/plans/search-on-sx.md
giles 4c84decc01
search: Phase 2 query parser + 32 tests
Query tokenizer + recursive-descent parser: OR<AND<NOT precedence, implicit AND
on adjacency, quoted phrases, parens, case-insensitive keywords. parseQuery,
searchQuery, showQ. Worked around haskell-on-sx parser limits (ord-based
delimiters; multi-clause fns instead of []-pattern case alts). 78/78.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:43:10 +00:00


search-on-sx: Full-text + structured search on Haskell

rose-ash needs search across pages, posts, threads, and federated content: tokenize, index, query, rank, and filter by visibility. Typed ADTs make query parsing clean, lazy lists make posting-list iteration efficient, and Haskell-on-SX is at 1514/1514.

End-state: a Haskell-on-SX layer with inverted index, query AST, boolean + phrase + ranked queries (TF-IDF, BM25), ACL-aware post-filter, and a federation extension that merges per-peer indices.

Status (rolling)

bash lib/search/conformance.sh → 78/78 (Phases 1-2 complete)

Ground rules

  • Scope: only touch lib/search/** and plans/search-on-sx.md. Do not edit spec/, hosts/, shared/, lib/haskell/**, or other lib/<lang>/. You may import from lib/haskell/ (public API in lib/haskell/haskell.sx); do not modify Haskell.
  • Shared-file issues go under "Blockers" with a minimal repro; do not fix here.
  • SX files: use sx-tree MCP tools only.
  • Architecture: index = Map Term [(DocId, [Pos])]. Query AST = ADT. Eval = fold of posting lists with set ops + ranking math. Ranking is pure (no IO until result emission).
  • Commits: one feature per commit. Keep Progress log updated and tick boxes.

Architecture sketch

Document                               Query
  {:id :text :tags}                       "alice AND bob OR phrase \"x y\""
        │                                       │
        ▼                                       ▼
lib/search/tokenize.sx                  lib/search/parse.sx
  — tokenize :: Text → [Term]             — parse :: Text → Query
  — normalize (lowercase, strip)          — Query = Term | And | Or
  — (optionally) stem                              | Not | Phrase
        │                                       │
        ▼                                       ▼
lib/search/index.sx                     lib/search/eval.sx
  — Map Term [(DocId, [Pos])]             — eval :: Index → Query → [DocId]
  — insert / delete / lookup              — boolean + phrase positions
  — persistence (optional later)                 │
        │                                       ▼
        └────────────────► lib/search/rank.sx
                            — TF-IDF / BM25 scoring
                            — top-N
                                  │
                                  ▼
                          lib/search/api.sx
                            — (search/index doc)
                            — (search/query q)
                            — (search/top n q)
                                  │
                                  ▼
                          lib/search/fed.sx
                            — federated query (merge peer results)
                            — ACL filter post-merge

Phase 1 — Tokenize + index

  • lib/search/tokenize.sx — normalize (lowercase, strip punctuation), split on whitespace, return positions
  • lib/search/index.sx — inverted index data structure; indexDoc, deleteDoc, lookupTerm, docFreq, allTerms. (Data.Map's public API lacks toList/keys/map/filter, so a sorted assoc-list [(Term,[(DocId,[Pos])])] is used — the conceptual Map Term [(DocId,[Pos])] with free term iteration.)
  • lib/search/api.sx — assembles search/src (tokenize + index); Haskell entry points indexDoc / lookupTerm
  • lib/search/tests/index.sx — 18 cases: tokenize, insert + lookup, update, delete, multi-doc, positions, docFreq, allTerms
  • lib/search/scoreboard.{json,md}
  • lib/search/conformance.sh
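
The Phase 1 pieces can be sketched in plain Haskell as follows. This is an illustration under stated assumptions, not the contents of tokenize.sx or index.sx: the function names come from the list above, but every signature, the Int doc ids, and the choice to strip punctuation inside words are assumptions, and the code has not been run through the haskell-on-sx interpreter.

```haskell
import Data.Char (isAlphaNum, toLower)

type Term  = String
type DocId = Int   -- assumption: the plan does not fix the DocId type
type Pos   = Int

-- Sorted assoc-list standing in for Map Term [(DocId, [Pos])].
type Index = [(Term, [(DocId, [Pos])])]

-- Lowercase, drop punctuation, split on whitespace, number the surviving tokens.
tokenize :: String -> [(Term, Pos)]
tokenize s = zip (filter (not . null) (map clean (words s))) [0 ..]
  where clean = map toLower . filter isAlphaNum

-- Record one occurrence, keeping terms and doc ids in ascending order.
addPosting :: DocId -> Index -> (Term, Pos) -> Index
addPosting d [] (t, p) = [(t, [(d, [p])])]
addPosting d ((t', ps) : rest) (t, p)
  | t < t'    = (t, [(d, [p])]) : (t', ps) : rest
  | t == t'   = (t', addDoc ps) : rest
  | otherwise = (t', ps) : addPosting d rest (t, p)
  where
    addDoc [] = [(d, [p])]
    addDoc ((d', qs) : more)
      | d < d'    = (d, [p]) : (d', qs) : more
      | d == d'   = (d', qs ++ [p]) : more
      | otherwise = (d', qs) : addDoc more

-- Remove a document everywhere; terms left without postings disappear.
deleteDoc :: DocId -> Index -> Index
deleteDoc d idx =
  [ (t, ps') | (t, ps) <- idx, let ps' = filter ((/= d) . fst) ps, not (null ps') ]

-- Re-indexing replaces the old postings (update = delete + insert).
indexDoc :: DocId -> String -> Index -> Index
indexDoc d text idx = foldl (addPosting d) (deleteDoc d idx) (tokenize text)

lookupTerm :: Term -> Index -> [(DocId, [Pos])]
lookupTerm t idx = maybe [] id (lookup t idx)

docFreq :: Term -> Index -> Int
docFreq t = length . lookupTerm t

allTerms :: Index -> [Term]
allTerms = map fst
```

Because the outer list stays sorted by term and each posting list by doc id, allTerms is a plain map and the boolean merges of Phase 2 can run in linear time without a separate sort step.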

Phase 2 — Query AST + boolean evaluation

  • Query ADT: Term String | And Query Query | Or Query Query | Not Query | Phrase [String] (in lib/search/query.sx)
  • lib/search/parse.sx — query syntax parser: tokenizer + recursive-descent (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens, case-insensitive keywords); parseQuery, searchQuery, showQ
  • lib/search/query.sx — boolean eval via set ops on docid-sorted posting lists (sortedUnion/Inter/Diff, Not over allDocs universe)
  • phrase eval — positional adjacency check (phraseInDoc / phraseStartsAt)
  • lib/search/tests/boolean.sx — 28 cases: term, and, or, not, phrase, composition (parser edge cases move to the parse.sx suite)
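
A minimal sketch of the evaluation described above, in plain Haskell. The Query constructors and the names sortedUnion, sortedInter, sortedDiff, phraseInDoc, phraseStartsAt, allDocs, and evalQuery appear in this plan; their signatures here, the postings helper, and the Int doc ids are assumptions made for illustration and may differ from query.sx.

```haskell
type DocId = Int   -- assumption
type Pos   = Int
type Index = [(String, [(DocId, [Pos])])]

data Query
  = Term String
  | And Query Query
  | Or Query Query
  | Not Query
  | Phrase [String]

-- Linear merges over ascending, duplicate-free doc-id lists.
sortedUnion, sortedInter, sortedDiff :: [DocId] -> [DocId] -> [DocId]
sortedUnion xs [] = xs
sortedUnion [] ys = ys
sortedUnion (x : xs) (y : ys)
  | x < y     = x : sortedUnion xs (y : ys)
  | x > y     = y : sortedUnion (x : xs) ys
  | otherwise = x : sortedUnion xs ys

sortedInter (x : xs) (y : ys)
  | x < y     = sortedInter xs (y : ys)
  | x > y     = sortedInter (x : xs) ys
  | otherwise = x : sortedInter xs ys
sortedInter _ _ = []

sortedDiff (x : xs) (y : ys)
  | x < y     = x : sortedDiff xs (y : ys)
  | x > y     = sortedDiff (x : xs) ys
  | otherwise = sortedDiff xs ys
sortedDiff xs _ = xs

-- Hypothetical helper: posting list for a term, empty if the term is absent.
postings :: Index -> String -> [(DocId, [Pos])]
postings idx t = maybe [] id (lookup t idx)

-- Every document mentioned anywhere in the index (the universe for Not).
allDocs :: Index -> [DocId]
allDocs idx = foldr sortedUnion [] [ map fst ps | (_, ps) <- idx ]

-- True when word i of the phrase sits at position p + i in document d.
phraseStartsAt :: Index -> DocId -> [String] -> Pos -> Bool
phraseStartsAt idx d ws p =
  and [ maybe False (elem (p + i)) (lookup d (postings idx w)) | (w, i) <- zip ws [0 ..] ]

phraseInDoc :: Index -> [String] -> DocId -> Bool
phraseInDoc _ [] _ = False
phraseInDoc idx (w : ws) d =
  any (phraseStartsAt idx d (w : ws)) (maybe [] id (lookup d (postings idx w)))

evalQuery :: Index -> Query -> [DocId]
evalQuery idx (Term t)    = map fst (postings idx t)
evalQuery idx (And a b)   = sortedInter (evalQuery idx a) (evalQuery idx b)
evalQuery idx (Or a b)    = sortedUnion (evalQuery idx a) (evalQuery idx b)
evalQuery idx (Not q)     = sortedDiff (allDocs idx) (evalQuery idx q)
evalQuery _   (Phrase []) = []
evalQuery idx (Phrase ws) =
  -- Only documents containing every word can match, so intersect first.
  filter (phraseInDoc idx ws) (foldr1 sortedInter (map (map fst . postings idx) ws))
```

Intersecting the per-word doc lists before the positional check keeps the phrase test off documents that cannot match; whether query.sx does the same is not stated in the plan.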

Phase 3 — Ranking

  • document frequency tracking — extend index with df per term
  • TF-IDF scoring
  • BM25 scoring (configurable k1, b)
  • top-N retrieval (heap-based)
  • lib/search/tests/rank.sx — 20+ cases: TF-IDF behavior, BM25 vs TF-IDF, ranking stability, top-N correctness
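
The scoring functions planned for this phase could look roughly like the sketch below. It uses the common BM25 form with a smoothed, non-negative IDF and a plain tf * log(N / df) for TF-IDF; the plan does not say which variants will be chosen, so treat the formulas, the Stats record, and all names here as assumptions. The top-N helper sorts the whole candidate list rather than using the heap the plan calls for.

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

type DocId = Int   -- assumption

-- Collection-level statistics; field names are illustrative.
data Stats = Stats
  { nDocs  :: Int     -- number of documents in the collection
  , avgLen :: Double  -- mean document length in tokens
  }

-- Smoothed IDF: ln(1 + (N - df + 0.5) / (df + 0.5)), never negative.
idf :: Stats -> Int -> Double
idf st df = log (1 + (n - d + 0.5) / (d + 0.5))
  where
    n = fromIntegral (nDocs st)
    d = fromIntegral df

-- BM25 contribution of one term.
-- k1 controls tf saturation, b controls length normalisation,
-- tf is the term's frequency in the document, len the document length.
bm25Term :: Double -> Double -> Stats -> Int -> Int -> Int -> Double
bm25Term k1 b st df tf len =
  idf st df * (f * (k1 + 1)) / (f + k1 * (1 - b + b * l / avgLen st))
  where
    f = fromIntegral tf
    l = fromIntegral len

-- Classic TF-IDF for comparison: tf * ln(N / df). Expects df > 0.
tfIdfTerm :: Stats -> Int -> Int -> Double
tfIdfTerm st df tf =
  fromIntegral tf * log (fromIntegral (nDocs st) / fromIntegral df)

-- Top-N by descending score; ties fall back to ascending doc id so the
-- order is deterministic. Sort-based stand-in for a heap.
topN :: Int -> [(DocId, Double)] -> [(DocId, Double)]
topN n = take n . sortBy (comparing (\(d, s) -> (Down s, d)))
```

A document's score for a query would be the sum of the per-term contributions over the query terms it contains. Keeping these functions free of IO matches the ground rule that ranking stays pure until result emission.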

Phase 4 — ACL filter + federation

  • post-filter — each candidate result tested via (acl/permit? viewer :read doc)
  • federated query — fan out to peer instances via fed-sx, merge results
  • merge policy — interleave by rank, dedupe by (peer, doc-id)
  • lib/search/tests/integration.sx — federated search with ACL filter
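
One way the merge policy and post-filter could fit together is sketched below. It reads "interleave by rank" as ordering the combined hits by descending score, which assumes scores are comparable across peers; a round-robin by rank position would be the alternative if they are not. The Hit record, mergeResults, and the permitted predicate (standing in for (acl/permit? viewer :read doc)) are hypothetical names, and the fed-sx fan-out itself is not modelled.

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

type Peer  = String  -- assumption: peers identified by name
type DocId = Int     -- assumption

-- One result as returned by a peer; hypothetical shape.
data Hit = Hit { hitPeer :: Peer, hitDoc :: DocId, hitScore :: Double }
  deriving (Eq, Show)

-- Keep the first occurrence of each (peer, doc-id) key.
dedupe :: [Hit] -> [Hit]
dedupe = go []
  where
    go _ [] = []
    go seen (h : hs)
      | key `elem` seen = go seen hs
      | otherwise       = h : go (key : seen) hs
      where key = (hitPeer h, hitDoc h)

-- Combine per-peer result lists: order by descending score (ties broken by
-- peer, then doc id), drop duplicates, then apply the ACL predicate last.
mergeResults :: (Hit -> Bool) -> [[Hit]] -> [Hit]
mergeResults permitted =
  filter permitted . dedupe . sortBy (comparing rankKey) . concat
  where
    rankKey h = (Down (hitScore h), hitPeer h, hitDoc h)
```

Running the ACL predicate after the merge follows the "ACL filter post-merge" note in the architecture sketch. The linear-time membership test in dedupe is fine for short result pages but would need a set for large merges.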

Progress log

  • Phase 2 complete — parser (78/78 total). Query tokenizer (ord-based delimiters, quoted phrases) + recursive-descent parser with OR<AND<NOT precedence, implicit AND on adjacency, parens, case-insensitive keywords. parseQuery, searchQuery, showQ (canonical render for AST tests). 32 tests in parse.sx. haskell-on-sx parser gotchas hit while writing this (see parse.sx header): (1) escaped char literals like '\"' break the tokenizer — match delimiters by ord c == 34; (2) an [] pattern inside a case alt breaks the parser — use multi-clause functions instead; (3) case/constructor patterns and let (a,b)=.. are fine. Embedded Haskell string literals in a .sx source string need single \", not \\\".
  • Phase 2 boolean/phrase eval (46/46 total). Query ADT Term|And|Or|Not|Phrase + evalQuery :: Index -> Query -> [DocId] in query.sx. Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx. Refactored both suites to batch all cases into one program eval (search-batch in testlib): under the heavy CPU load on this box (~11 on 2 cores), the 18 to 28 separate hk-eval-program calls per suite timed out, and one combined eval per suite is ~20× faster. Parser (parse.sx) is the remaining Phase 2 box.
  • Phase 1 complete (18/18). Tokenizer (lowercase + strip punctuation + positions), inverted index as sorted assoc-list [(Term,[(DocId,[Pos])])], indexDoc/deleteDoc/lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into search/src and evaluated via the haskell-on-sx interpreter; tests reuse hk-test counters and a search-eval helper that forces HK values to plain SX. conformance.sh models lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).
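
The parser shape and the two workarounds recorded in the Phase 2 parser entry can be illustrated with a small plain-Haskell sketch. It matches delimiters by ord rather than by escaped char literals and handles the empty-list case with function clauses instead of a [] pattern in a case alternative. The Tok type, the helper names, and the treatment of malformed input (a placeholder Phrase [] with no error reporting) are assumptions; only parseQuery and the grammar come from this plan, and the code is not the contents of parse.sx.

```haskell
import Data.Char (isSpace, ord, toLower)

data Query = Term String | And Query Query | Or Query Query | Not Query | Phrase [String]
  deriving (Eq, Show)

data Tok = TWord String | TPhrase [String] | TOpen | TClose | TAnd | TOr | TNot
  deriving (Eq, Show)

-- Delimiters by code point: 34 is the double quote, 40 and 41 are parens.
isDelim :: Char -> Bool
isDelim c = isSpace c || ord c == 34 || ord c == 40 || ord c == 41

tokens :: String -> [Tok]
tokens [] = []
tokens (c : cs)
  | isSpace c   = tokens cs
  | ord c == 40 = TOpen : tokens cs
  | ord c == 41 = TClose : tokens cs
  | ord c == 34 = TPhrase (words (map toLower body)) : tokens (drop 1 rest)
  | otherwise   = keyword (c : w) : tokens more
  where
    (body, rest) = break (\x -> ord x == 34) cs  -- text up to the closing quote
    (w, more)    = break isDelim cs              -- remainder of a bare word

-- Keywords are case-insensitive; everything else becomes a lowercased term.
keyword :: String -> Tok
keyword w = classify (map toLower w)
  where
    classify "and" = TAnd
    classify "or"  = TOr
    classify "not" = TNot
    classify lw    = TWord lw

-- OR binds loosest: orExpr ::= andExpr ("OR" andExpr)*
parseOr :: [Tok] -> (Query, [Tok])
parseOr ts = orTail (parseAnd ts)

orTail :: (Query, [Tok]) -> (Query, [Tok])
orTail (l, TOr : ts) = let (r, ts') = parseAnd ts in orTail (Or l r, ts')
orTail done          = done

-- AND is explicit or implied by adjacency. Multi-clause, no case-on-[].
parseAnd :: [Tok] -> (Query, [Tok])
parseAnd ts = andTail (parseNot ts)

andTail :: (Query, [Tok]) -> (Query, [Tok])
andTail (l, [])          = (l, [])
andTail (l, TOr : ts)    = (l, TOr : ts)
andTail (l, TClose : ts) = (l, TClose : ts)
andTail (l, TAnd : ts)   = let (r, ts') = parseNot ts in andTail (And l r, ts')
andTail (l, ts)          = let (r, ts') = parseNot ts in andTail (And l r, ts')

-- NOT binds tightest.
parseNot :: [Tok] -> (Query, [Tok])
parseNot (TNot : ts) = let (q, ts') = parseNot ts in (Not q, ts')
parseNot ts          = parseAtom ts

parseAtom :: [Tok] -> (Query, [Tok])
parseAtom (TWord w : ts)    = (Term w, ts)
parseAtom (TPhrase ws : ts) = (Phrase ws, ts)
parseAtom (TOpen : ts)      = let (q, ts') = parseOr ts in (q, drop 1 ts')
parseAtom ts                = (Phrase [], drop 1 ts)  -- malformed: placeholder

parseQuery :: String -> Query
parseQuery = fst . parseOr . tokens
```

With this grammar the example query from the architecture sketch, alice AND bob OR phrase "x y", groups as Or (And alice bob) (And phrase (Phrase [x, y])): the two AND groups form first and OR joins them.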

Blockers

  • None. Note: the box is heavily CPU-oversubscribed by sibling loop agents (load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite timeout is set to 600s. Runs are correct, just slow.
  • Data.Map public API gap (informational, not fixing): the haskell-on-sx import Data.Map binds only empty/singleton/insert/lookup/member/size/null/delete/insertWith/adjust/findWithDefault, with no toList/keys/elems/map/filter/unionWith. Index uses a pure assoc-list instead so term iteration and federation merge stay simple.