search-on-sx: Full-text + structured search on Haskell
rose-ash needs search across pages, posts, threads, federated content. Tokenize, index, query, rank, filter by visibility. Typed ADTs make query parsing clean, lazy lists make posting-list iteration efficient, and Haskell-on-SX is at 1514/1514.
End-state: a Haskell-on-SX layer with inverted index, query AST, boolean + phrase + ranked queries (TF-IDF, BM25), ACL-aware post-filter, and a federation extension that merges per-peer indices.
Status (rolling)
bash lib/search/conformance.sh → 78/78 (Phases 1–2 complete)
Ground rules
- Scope: only touch lib/search/** and plans/search-on-sx.md. Do not edit spec/, hosts/, shared/, lib/haskell/**, or other lib/<lang>/. You may import from lib/haskell/ (public API in lib/haskell/haskell.sx); do not modify Haskell.
- Shared-file issues go under "Blockers" with a minimal repro; do not fix here.
- SX files: use sx-tree MCP tools only.
- Architecture: index = Map Term [(DocId, [Pos])]. Query AST = ADT. Eval = fold of posting lists with set ops + ranking math. Ranking is pure (no IO until result emission).
- Commits: one feature per commit. Keep Progress log updated and tick boxes.
Architecture sketch
Document                                Query
{:id :text :tags}                       "alice AND bob OR phrase \"x y\""
      │                                       │
      ▼                                       ▼
lib/search/tokenize.sx                  lib/search/parse.sx
— tokenize :: Text → [Term]             — parse :: Text → Query
— normalize (lowercase, strip)          — Query = Term | And | Or
— (optionally) stem                             | Not | Phrase
      │                                       │
      ▼                                       ▼
lib/search/index.sx                     lib/search/eval.sx
— Map Term [(DocId, [Pos])]             — eval :: Index → Query → [DocId]
— insert / delete / lookup              — boolean + phrase positions
— persistence (optional later)                │
      │                                       ▼
      └────────────────►                lib/search/rank.sx
                                        — TF-IDF / BM25 scoring
                                        — top-N
                                              │
                                              ▼
                                        lib/search/api.sx
                                        — (search/index doc)
                                        — (search/query q)
                                        — (search/top n q)
                                              │
                                              ▼
                                        lib/search/fed.sx
                                        — federated query (merge peer results)
                                        — ACL filter post-merge
Phase 1 — Tokenize + index
- lib/search/tokenize.sx — normalize (lowercase, strip punctuation), split on whitespace, return positions
- lib/search/index.sx — inverted index data structure; indexDoc, deleteDoc, lookupTerm, docFreq, allTerms. (Data.Map's public API lacks toList/keys/map/filter, so a sorted assoc-list [(Term,[(DocId,[Pos])])] is used — the conceptual Map Term [(DocId,[Pos])] with free term iteration.)
- lib/search/api.sx — assembles search/src (tokenize + index); Haskell entry points indexDoc / lookupTerm
- lib/search/tests/index.sx — 18 cases: tokenize, insert + lookup, update, delete, multi-doc, positions, docFreq, allTerms
- lib/search/scoreboard.{json,md}
- lib/search/conformance.sh
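As a rough illustration of the Phase 1 pieces, here is a minimal sketch of the tokenizer and the sorted assoc-list index. It is written for stock GHC and has not been checked against haskell-on-sx. The function names come from the list above, but the argument order, the Int types for DocId and Pos, and the helper addPosting are assumptions made for the example, not the contents of tokenize.sx or index.sx.

```haskell
import Data.Char (isAlphaNum, toLower)

type Term  = String
type DocId = Int   -- assumption: the plan does not fix the DocId type
type Pos   = Int

-- Sorted association list standing in for Map Term [(DocId, [Pos])].
type Index = [(Term, [(DocId, [Pos])])]

-- Split on whitespace, lowercase, strip punctuation, drop empty tokens,
-- then number the surviving terms from 0.
tokenize :: String -> [(Term, Pos)]
tokenize s = zip (filter (not . null) (map clean (words s))) [0 ..]
  where clean = map toLower . filter isAlphaNum

-- Hypothetical helper: record one occurrence, keeping terms and docids sorted.
addPosting :: DocId -> Index -> (Term, Pos) -> Index
addPosting d [] (t, p) = [(t, [(d, [p])])]
addPosting d ((t', ps) : rest) (t, p)
  | t < t'    = (t, [(d, [p])]) : (t', ps) : rest
  | t == t'   = (t', addDoc ps) : rest
  | otherwise = (t', ps) : addPosting d rest (t, p)
  where
    addDoc [] = [(d, [p])]
    addDoc ((d', qs) : more)
      | d < d'    = (d, [p]) : (d', qs) : more
      | d == d'   = (d', qs ++ [p]) : more
      | otherwise = (d', qs) : addDoc more

-- Remove a document everywhere; terms left without postings are dropped.
deleteDoc :: DocId -> Index -> Index
deleteDoc d idx =
  [ (t, ps') | (t, ps) <- idx
             , let ps' = filter ((/= d) . fst) ps
             , not (null ps') ]

-- Re-indexing an existing docid replaces its old postings (update = delete + insert).
indexDoc :: DocId -> String -> Index -> Index
indexDoc d text idx = foldl (addPosting d) (deleteDoc d idx) (tokenize text)

lookupTerm :: Term -> Index -> [(DocId, [Pos])]
lookupTerm t idx = maybe [] id (lookup t idx)

docFreq :: Term -> Index -> Int
docFreq t = length . lookupTerm t

-- Term iteration is just a projection of the assoc-list keys.
allTerms :: Index -> [Term]
allTerms = map fst
```

With this shape, indexing "Alice met Bob. Alice left." as document 1 stores alice at positions 0 and 3, and allTerms returns the terms in sorted order without needing a toList on a map.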
Phase 2 — Query AST + boolean evaluation
- Query ADT: Term String | And Query Query | Or Query Query | Not Query | Phrase [String] (in lib/search/query.sx)
- lib/search/parse.sx — query syntax parser: tokenizer + recursive-descent (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens, case-insensitive keywords); parseQuery, searchQuery, showQ
- lib/search/query.sx — boolean eval via set ops on docid-sorted posting lists (sortedUnion/Inter/Diff, Not over allDocs universe)
- phrase eval — positional adjacency check (phraseInDoc / phraseStartsAt)
- lib/search/tests/boolean.sx — 28 cases: term, and, or, not, phrase, composition (parser edge cases move to the parse.sx suite)
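The boolean and phrase evaluation described above could look roughly like the following sketch, written for stock GHC and not checked against haskell-on-sx. The Query ADT and the names sortedUnion, sortedInter, sortedDiff, allDocs, phraseInDoc, phraseStartsAt and evalQuery come from this plan; their signatures, the postings helper, and the choice that an empty phrase matches nothing are assumptions for illustration.

```haskell
type DocId = Int
type Pos   = Int
type Index = [(String, [(DocId, [Pos])])]

data Query
  = Term String
  | And Query Query
  | Or Query Query
  | Not Query
  | Phrase [String]
  deriving (Show, Eq)

-- Hypothetical helper: postings for a term, empty when the term is absent.
postings :: Index -> String -> [(DocId, [Pos])]
postings idx t = maybe [] id (lookup t idx)

-- Linear merges over ascending, duplicate-free docid lists.
sortedUnion, sortedInter, sortedDiff :: [DocId] -> [DocId] -> [DocId]
sortedUnion xs [] = xs
sortedUnion [] ys = ys
sortedUnion (x:xs) (y:ys)
  | x < y     = x : sortedUnion xs (y:ys)
  | x > y     = y : sortedUnion (x:xs) ys
  | otherwise = x : sortedUnion xs ys

sortedInter (x:xs) (y:ys)
  | x < y     = sortedInter xs (y:ys)
  | x > y     = sortedInter (x:xs) ys
  | otherwise = x : sortedInter xs ys
sortedInter _ _ = []

sortedDiff (x:xs) (y:ys)
  | x < y     = x : sortedDiff xs (y:ys)
  | x > y     = sortedDiff (x:xs) ys
  | otherwise = sortedDiff xs ys
sortedDiff xs _ = xs

-- Universe for Not: every docid that appears anywhere in the index.
allDocs :: Index -> [DocId]
allDocs = foldr (sortedUnion . map fst . snd) []

-- True when the words sit at positions start, start+1, ... in document d.
phraseStartsAt :: Index -> DocId -> [String] -> Pos -> Bool
phraseStartsAt idx d ws start =
  and [ maybe False (elem p) (lookup d (postings idx w))
      | (w, p) <- zip ws [start ..] ]

-- Try every position of the first word as a candidate phrase start.
phraseInDoc :: Index -> [String] -> DocId -> Bool
phraseInDoc _ [] _ = False
phraseInDoc idx (w:ws) d =
  any (phraseStartsAt idx d (w:ws)) (maybe [] id (lookup d (postings idx w)))

evalQuery :: Index -> Query -> [DocId]
evalQuery idx (Term t)    = map fst (postings idx t)
evalQuery idx (And a b)   = sortedInter (evalQuery idx a) (evalQuery idx b)
evalQuery idx (Or a b)    = sortedUnion (evalQuery idx a) (evalQuery idx b)
evalQuery idx (Not q)     = sortedDiff (allDocs idx) (evalQuery idx q)
evalQuery _   (Phrase []) = []  -- assumption: an empty phrase matches nothing
evalQuery idx (Phrase ws) =
  filter (phraseInDoc idx ws)
         (foldr1 sortedInter [map fst (postings idx w) | w <- ws])
```

Phrase first intersects the posting lists of all its words, so the positional check only runs on documents that contain every word.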
Phase 3 — Ranking
- document frequency tracking — extend index with df per term
- TF-IDF scoring
- BM25 scoring (configurable k1, b)
- top-N retrieval (heap-based)
- lib/search/tests/rank.sx — 20+ cases: TF-IDF behavior, BM25 vs TF-IDF, ranking stability, top-N correctness
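Phase 3 is not implemented yet, so the following is only a sketch of what the per-term scorers might look like under stated assumptions. The Stats record, the function names and the argument order are hypothetical. TF-IDF here is the plain tf * ln(N / df) form, and BM25 uses the common non-negative IDF variant ln(1 + (N - df + 0.5) / (df + 0.5)); the plan does not say which variants will be chosen. The plan calls for heap-based top-N; this sketch swaps in a full sort to stay short.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing, Down (..))

type DocId = Int

-- Hypothetical corpus statistics record; not part of the project.
data Stats = Stats
  { numDocs   :: Int     -- N: number of documents in the index
  , avgDocLen :: Double  -- mean document length, in terms
  }

-- TF-IDF for one term in one document: tf * ln(N / df).
tfIdf :: Stats -> Int -> Int -> Double
tfIdf st df tf
  | df <= 0 || tf <= 0 = 0
  | otherwise =
      fromIntegral tf * log (fromIntegral (numDocs st) / fromIntegral df)

-- BM25 for one term in one document, with configurable k1 and b.
-- Arguments after Stats: document frequency, document length, term frequency.
bm25 :: Double -> Double -> Stats -> Int -> Int -> Int -> Double
bm25 k1 b st df docLen tf
  | df <= 0 || tf <= 0 = 0
  | otherwise = idf * (f * (k1 + 1)) / (f + k1 * norm)
  where
    n    = fromIntegral (numDocs st)
    d    = fromIntegral df
    f    = fromIntegral tf
    idf  = log (1 + (n - d + 0.5) / (d + 0.5))
    norm = 1 - b + b * fromIntegral docLen / avgDocLen st

-- Top-N by descending score; ties broken by ascending docid so the
-- ordering is deterministic. Full sort, not a heap (see note above).
topN :: Int -> [(DocId, Double)] -> [(DocId, Double)]
topN n = take n . sortBy (comparing (\(d, s) -> (Down s, d)))
```

Both scorers are pure functions of index statistics, which fits the ground rule that ranking does no IO until result emission. A query score would be the sum of the per-term scores over the query's terms.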
Phase 4 — ACL filter + federation
- post-filter — each candidate result tested via (acl/permit? viewer :read doc)
- federated query — fan out to peer instances via fed-sx, merge results
- merge policy — interleave by rank, dedupe by (peer, doc-id)
- lib/search/tests/integration.sx — federated search with ACL filter
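One possible reading of the merge policy above is a round-robin interleave: every peer's rank-1 hit, then every rank-2 hit, and so on, with duplicates removed by (peer, doc-id). The sketch below assumes that reading; the plan does not specify it. The names Peer, Hit, mergeRanked and federatedSearch are hypothetical, and the permit predicate stands in for (acl/permit? viewer :read doc), which lives on the SX side.

```haskell
import Data.List (nub, transpose)

type Peer  = String
type DocId = Int

-- A hit is keyed by (peer, doc-id), so the same doc-id on two peers stays distinct.
type Hit = (Peer, DocId)

-- Round-robin interleave of per-peer ranked lists. transpose tolerates
-- lists of different lengths; nub keeps the first, best-ranked occurrence.
mergeRanked :: [[Hit]] -> [Hit]
mergeRanked = nub . concat . transpose

-- ACL post-filter applied after the merge, matching the architecture sketch.
-- permit is a hypothetical stand-in for the real ACL check.
federatedSearch :: (Hit -> Bool) -> [[Hit]] -> [Hit]
federatedSearch permit = filter permit . mergeRanked
```

nub is quadratic, which is acceptable for short result pages; a merge over larger lists would want a set-based dedupe instead.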
Progress log
- Phase 2 complete — parser (78/78 total). Query tokenizer (ord-based delimiters, quoted phrases) + recursive-descent parser with OR<AND<NOT precedence, implicit AND on adjacency, parens, case-insensitive keywords. parseQuery, searchQuery, showQ (canonical render for AST tests). 32 tests in parse.sx. haskell-on-sx parser gotchas hit while writing this (see parse.sx header): (1) escaped char literals like '\"' break the tokenizer — match delimiters by ord c == 34; (2) an [] pattern inside a case alt breaks the parser — use multi-clause functions instead; (3) case/constructor patterns and let (a,b)=.. are fine. Embedded Haskell string literals in a .sx source string need single \", not \\\".
- Phase 2 boolean/phrase eval (46/46 total). Query ADT Term|And|Or|Not|Phrase + evalQuery :: Index -> Query -> [DocId] in query.sx. Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx. Refactored both suites to batch all cases into one program eval (search-batch in testlib) — under the heavy CPU load on this box (~11 on 2 cores), 18–28 separate hk-eval-program calls timed out; one combined eval per suite is ~20× faster. Parser (parse.sx) is the remaining Phase 2 box.
- Phase 1 complete (18/18). Tokenizer (lowercase + strip punctuation + positions), inverted index as sorted assoc-list [(Term,[(DocId,[Pos])])], indexDoc/deleteDoc/lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into search/src and evaluated via the haskell-on-sx interpreter; tests reuse hk-test counters and a search-eval helper that forces HK values to plain SX. conformance.sh models lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).
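The parser described in the Phase 2 log entry can be pictured with a sketch like the one below. It is not the contents of parse.sx: the Tok type, the helper names and the Maybe-returning signature of parseQuery are assumptions, and lowercasing terms is a guess at how the parser lines up with the tokenizer. It is written for stock GHC, where a quote character literal would work, but it deliberately follows the two workarounds from the log: delimiters are matched by ord code point, and functions use multiple clauses rather than a case with an [] alternative.

```haskell
import Data.Char (isSpace, ord, toLower)

data Query = Term String | And Query Query | Or Query Query
           | Not Query | Phrase [String]
  deriving (Show, Eq)

data Tok = TWord String | TPhrase [String] | TOpen | TClose
  deriving (Show, Eq)

-- Delimiters by code point: 34 is the double quote, 40 and 41 the parens.
isQuote, isOpen, isClose, isWordChar :: Char -> Bool
isQuote c = ord c == 34
isOpen  c = ord c == 40
isClose c = ord c == 41
isWordChar c = not (isSpace c || isQuote c || isOpen c || isClose c)

lexQuery :: String -> [Tok]
lexQuery [] = []
lexQuery (c:cs)
  | isSpace c = lexQuery cs
  | isOpen c  = TOpen : lexQuery cs
  | isClose c = TClose : lexQuery cs
  | isQuote c = let (body, rest) = break isQuote cs
                in TPhrase (words (map toLower body)) : lexQuery (drop 1 rest)
  | otherwise = let (w, rest) = span isWordChar (c:cs)
                in TWord w : lexQuery rest

-- Case-insensitive keyword test; k is given in lowercase.
keyword :: String -> Tok -> Bool
keyword k (TWord w) = map toLower w == k
keyword _ _         = False

-- orExpr := andExpr ("OR" andExpr)*          (loosest binding)
parseOr :: [Tok] -> Maybe (Query, [Tok])
parseOr ts = parseAnd ts >>= orTail

orTail :: (Query, [Tok]) -> Maybe (Query, [Tok])
orTail (l, t:ts)
  | keyword "or" t = parseAnd ts >>= \(r, rest) -> orTail (Or l r, rest)
orTail done = Just done

-- andExpr := notExpr (["AND"] notExpr)*      (adjacency is an implicit AND)
parseAnd :: [Tok] -> Maybe (Query, [Tok])
parseAnd ts = parseNot ts >>= andTail

andTail :: (Query, [Tok]) -> Maybe (Query, [Tok])
andTail (l, t:ts)
  | keyword "and" t = parseNot ts >>= \(r, rest) -> andTail (And l r, rest)
  | t /= TClose && not (keyword "or" t) =
      parseNot (t:ts) >>= \(r, rest) -> andTail (And l r, rest)
andTail done = Just done

-- notExpr := "NOT" notExpr | atom            (tightest binding)
parseNot :: [Tok] -> Maybe (Query, [Tok])
parseNot (t:ts)
  | keyword "not" t = parseNot ts >>= \(q, rest) -> Just (Not q, rest)
parseNot ts = parseAtom ts

parseAtom :: [Tok] -> Maybe (Query, [Tok])
parseAtom (t@(TWord w) : ts)
  | not (any (`keyword` t) ["and", "or", "not"]) = Just (Term (map toLower w), ts)
parseAtom (TPhrase ws : ts) = Just (Phrase ws, ts)
parseAtom (TOpen : ts)      = parseOr ts >>= closeParen
parseAtom _                 = Nothing

closeParen :: (Query, [Tok]) -> Maybe (Query, [Tok])
closeParen (q, TClose : rest) = Just (q, rest)
closeParen _                  = Nothing

-- Nothing on a syntax error or on leftover tokens.
parseQuery :: String -> Maybe Query
parseQuery s = parseOr (lexQuery s) >>= finish
  where
    finish (q, []) = Just q
    finish _       = Nothing
```

Each precedence level gets its own function, so OR<AND<NOT falls out of the call order: parseOr calls parseAnd, which calls parseNot, which calls parseAtom.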
Blockers
- None. Note: the box is heavily CPU-oversubscribed by sibling loop agents (load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite timeout is set to 600s. Runs are correct, just slow.
- Data.Map public API gap (informational, not fixing): the haskell-on-sx import Data.Map binds only empty/singleton/insert/lookup/member/size/null/delete/insertWith/adjust/findWithDefault — no toList/keys/elems/map/filter/unionWith. Index uses a pure assoc-list instead so term iteration and federation merge stay simple.