search-on-sx: Full-text + structured search on Haskell
rose-ash needs search across pages, posts, threads, federated content. Tokenize, index, query, rank, filter by visibility. Typed ADTs make query parsing clean, lazy lists make posting-list iteration efficient, and Haskell-on-SX is at 1514/1514.
End-state: a Haskell-on-SX layer with inverted index, query AST, boolean + phrase + ranked queries (TF-IDF, BM25), ACL-aware post-filter, and a federation extension that merges per-peer indices.
Status (rolling)
bash lib/search/conformance.sh → 205/205 (Phases 1–4 complete, plus six extensions)
Ground rules
- Scope: only touch lib/search/** and plans/search-on-sx.md. Do not edit spec/, hosts/, shared/, lib/haskell/**, or other lib/<lang>/. You may import from lib/haskell/ (public API in lib/haskell/haskell.sx); do not modify Haskell.
- Shared-file issues go under "Blockers" with a minimal repro; do not fix here.
- SX files: use sx-tree MCP tools only.
- Architecture: index = Map Term [(DocId, [Pos])]. Query AST = ADT. Eval = fold of posting lists with set ops + ranking math. Ranking is pure (no IO until result emission).
- Commits: one feature per commit. Keep Progress log updated and tick boxes.
Architecture sketch
    Document                          Query
    {:id :text :tags}                 "alice AND bob OR phrase \"x y\""
            │                                 │
            ▼                                 ▼
    lib/search/tokenize.sx            lib/search/parse.sx
    - tokenize :: Text → [Term]       - parse :: Text → Query
    - normalize (lowercase, strip)    - Query = Term | And | Or
    - (optionally) stem                       | Not | Phrase
            │                                 │
            ▼                                 ▼
    lib/search/index.sx               lib/search/eval.sx
    - Map Term [(DocId, [Pos])]       - eval :: Index → Query → [DocId]
    - insert / delete / lookup        - boolean + phrase positions
    - persistence (optional later)            │
            │                                 ▼
            └───────────────────────► lib/search/rank.sx
                                      - TF-IDF / BM25 scoring
                                      - top-N
                                              │
                                              ▼
                                      lib/search/api.sx
                                      - (search/index doc)
                                      - (search/query q)
                                      - (search/top n q)
                                              │
                                              ▼
                                      lib/search/fed.sx
                                      - federated query (merge peer results)
                                      - ACL filter post-merge
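To make the Query column of the sketch concrete, here is a minimal recursive-descent parser in plain GHC Haskell. It is an illustration only, not the code in lib/search/parse.sx: the Tok type, lexQ, and the Maybe result type are assumptions made for the sketch, and it ignores the haskell-on-sx parser quirks recorded in the progress log. It follows the precedence the plan describes (OR binds loosest, then AND, then NOT), with implicit AND on adjacency, quoted phrases, parentheses, and case-insensitive keywords.

```haskell
import Data.Char (isSpace, toLower, toUpper)

data Query = Term String | And Query Query | Or Query Query | Not Query | Phrase [String]
  deriving (Eq, Show)

-- Hypothetical token type for this sketch.
data Tok = TWord String | TPhrase [String] | TAnd | TOr | TNot | TOpen | TClose
  deriving (Eq, Show)

-- Split the query string into tokens; keywords match case-insensitively.
lexQ :: String -> [Tok]
lexQ [] = []
lexQ (c : cs)
  | isSpace c = lexQ cs
  | c == '('  = TOpen : lexQ cs
  | c == ')'  = TClose : lexQ cs
  | c == '"'  = let (body, rest) = break (== '"') cs
                in TPhrase (words (map toLower body)) : lexQ (drop 1 rest)
  | otherwise = let (w, rest) = break (\x -> isSpace x || x `elem` "()\"") (c : cs)
                in keyword w : lexQ rest
  where
    keyword w = case map toUpper w of
      "AND" -> TAnd
      "OR"  -> TOr
      "NOT" -> TNot
      _     -> TWord (map toLower w)

-- orExpr := andExpr ("OR" andExpr)*
parseOr :: [Tok] -> Maybe (Query, [Tok])
parseOr ts = do
  (l, rest) <- parseAnd ts
  orTail l rest
  where
    orTail l (TOr : rest) = do
      (r, rest') <- parseAnd rest
      orTail (Or l r) rest'
    orTail l rest = Just (l, rest)

-- andExpr := notExpr (["AND"] notExpr)*   (adjacency is an implicit AND)
parseAnd :: [Tok] -> Maybe (Query, [Tok])
parseAnd ts = do
  (l, rest) <- parseNot ts
  andTail l rest
  where
    andTail l (TAnd : rest) = do
      (r, rest') <- parseNot rest
      andTail (And l r) rest'
    andTail l rest
      | startsOperand rest = do
          (r, rest') <- parseNot rest
          andTail (And l r) rest'
      | otherwise = Just (l, rest)
    startsOperand (t : _) = t `notElem` [TOr, TAnd, TClose]
    startsOperand []      = False

-- notExpr := "NOT" notExpr | atom
parseNot :: [Tok] -> Maybe (Query, [Tok])
parseNot (TNot : rest) = do
  (q, rest') <- parseNot rest
  Just (Not q, rest')
parseNot ts = parseAtom ts

-- atom := word | "phrase" | "(" orExpr ")"
parseAtom :: [Tok] -> Maybe (Query, [Tok])
parseAtom (TWord w : rest)    = Just (Term w, rest)
parseAtom (TPhrase ws : rest) = Just (Phrase ws, rest)
parseAtom (TOpen : rest) = do
  (q, rest') <- parseOr rest
  case rest' of
    TClose : rest'' -> Just (q, rest'')
    _               -> Nothing
parseAtom _ = Nothing

-- Succeeds only when the whole input is consumed.
parseQuery :: String -> Maybe Query
parseQuery s = case parseOr (lexQ s) of
  Just (q, []) -> Just q
  _            -> Nothing

main :: IO ()
main = print (parseQuery "alice AND bob OR phrase \"x y\"")
```

On the example query from the sketch, this parser groups `alice AND bob` first and joins it with OR to the implicit AND of the term `phrase` and the phrase `"x y"`.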
Phase 1 — Tokenize + index
- lib/search/tokenize.sx: normalize (lowercase, strip punctuation), split on whitespace, return positions
- lib/search/index.sx: inverted index data structure; indexDoc, deleteDoc, lookupTerm, docFreq, allTerms. (Data.Map's public API lacks toList/keys/map/filter, so a sorted assoc-list [(Term,[(DocId,[Pos])])] is used: the conceptual Map Term [(DocId,[Pos])] with free term iteration.)
- lib/search/api.sx: assembles search/src (tokenize + index); Haskell entry points indexDoc / lookupTerm
- lib/search/tests/index.sx: 18 cases (tokenize, insert + lookup, update, delete, multi-doc, positions, docFreq, allTerms)
- lib/search/scoreboard.{json,md}
- lib/search/conformance.sh
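The tokenizer and the sorted assoc-list index can be sketched as follows. This is a minimal illustration in plain GHC Haskell, not the source of lib/search/index.sx: the argument order of indexDoc, deleteDoc, lookupTerm and docFreq, the helper addOcc, and the choice that re-indexing a document replaces its old postings are all assumptions.

```haskell
import Data.Char (isAlphaNum, isSpace, toLower)
import Data.Maybe (fromMaybe)

type Term  = String
type DocId = Int
type Pos   = Int
-- Sorted association list standing in for Map Term [(DocId, [Pos])].
type Index = [(Term, [(DocId, [Pos])])]

-- Lowercase, drop punctuation, split on whitespace, number tokens from 0.
tokenize :: String -> [(Term, Pos)]
tokenize txt = zip (words (map toLower (filter keep txt))) [0 ..]
  where keep c = isAlphaNum c || isSpace c

-- Record one occurrence, keeping terms and DocIds in ascending order.
-- (addOcc is a hypothetical helper name.)
addOcc :: DocId -> Index -> (Term, Pos) -> Index
addOcc d [] (t, p) = [(t, [(d, [p])])]
addOcc d (e@(t', ps) : rest) (t, p)
  | t < t'    = (t, [(d, [p])]) : e : rest
  | t == t'   = (t', addDoc ps) : rest
  | otherwise = e : addOcc d rest (t, p)
  where
    addDoc [] = [(d, [p])]
    addDoc (x@(d', qs) : xs)
      | d < d'    = (d, [p]) : x : xs
      | d == d'   = (d', qs ++ [p]) : xs
      | otherwise = x : addDoc xs

-- Remove every posting of a document; terms left without postings disappear.
deleteDoc :: DocId -> Index -> Index
deleteDoc d idx =
  [ (t, ps') | (t, ps) <- idx, let ps' = filter ((/= d) . fst) ps, not (null ps') ]

-- Assumption: indexing an existing DocId replaces its previous content.
indexDoc :: DocId -> String -> Index -> Index
indexDoc d txt idx = foldl (addOcc d) (deleteDoc d idx) (tokenize txt)

lookupTerm :: Term -> Index -> [(DocId, [Pos])]
lookupTerm t idx = fromMaybe [] (lookup t idx)

-- Document frequency is just the posting-list length.
docFreq :: Term -> Index -> Int
docFreq t = length . lookupTerm t

-- Term iteration is free with an assoc-list.
allTerms :: Index -> [Term]
allTerms = map fst

main :: IO ()
main = do
  let idx = indexDoc 2 "Bob and Alice" (indexDoc 1 "Alice met Bob. Alice left." [])
  print (lookupTerm "alice" idx)  -- [(1,[0,3]),(2,[2])]
  print (docFreq "bob" idx)       -- 2
  print (allTerms idx)            -- ["alice","and","bob","left","met"]
```

Keeping both levels sorted costs a linear scan per insert, but it lets the boolean and federation layers use linear merges later.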
Phase 2 — Query AST + boolean evaluation
- Query ADT: Term String | And Query Query | Or Query Query | Not Query | Phrase [String] (in lib/search/query.sx)
- lib/search/parse.sx: query syntax parser, built as a tokenizer + recursive descent (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens, case-insensitive keywords); parseQuery, searchQuery, showQ
- lib/search/query.sx: boolean eval via set ops on docid-sorted posting lists (sortedUnion/Inter/Diff, Not over allDocs universe)
- phrase eval: positional adjacency check (phraseInDoc / phraseStartsAt)
- lib/search/tests/boolean.sx: 28 cases (term, and, or, not, phrase, composition; parser edge cases move to the parse.sx suite)
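The boolean and phrase evaluation can be pictured as linear merges over ascending DocId lists. The sketch below is plain GHC Haskell written for illustration, not the contents of lib/search/query.sx: postings and phraseDocs are hypothetical helpers (the source names phraseInDoc / phraseStartsAt, whose signatures are not shown here), and query terms are assumed to be already normalized.

```haskell
import Data.List (nub, sort)
import Data.Maybe (fromMaybe)

type Term  = String
type DocId = Int
type Pos   = Int
type Index = [(Term, [(DocId, [Pos])])]

data Query = Term String | And Query Query | Or Query Query | Not Query | Phrase [String]

-- Linear merges over ascending, duplicate-free DocId lists.
sortedUnion, sortedInter, sortedDiff :: [DocId] -> [DocId] -> [DocId]
sortedUnion xs [] = xs
sortedUnion [] ys = ys
sortedUnion (x : xs) (y : ys)
  | x < y     = x : sortedUnion xs (y : ys)
  | x > y     = y : sortedUnion (x : xs) ys
  | otherwise = x : sortedUnion xs ys
sortedInter (x : xs) (y : ys)
  | x < y     = sortedInter xs (y : ys)
  | x > y     = sortedInter (x : xs) ys
  | otherwise = x : sortedInter xs ys
sortedInter _ _ = []
sortedDiff (x : xs) (y : ys)
  | x < y     = x : sortedDiff xs (y : ys)
  | x > y     = sortedDiff (x : xs) ys
  | otherwise = sortedDiff xs ys
sortedDiff xs _ = xs

postings :: Index -> Term -> [(DocId, [Pos])]
postings idx t = fromMaybe [] (lookup t idx)

-- Universe of documents that Not subtracts from.
allDocs :: Index -> [DocId]
allDocs idx = sort (nub [d | (_, ps) <- idx, (d, _) <- ps])

-- A doc matches when, for some start position p of the first word,
-- word i of the phrase occurs at position p + i.
phraseDocs :: Index -> [String] -> [DocId]
phraseDocs _ [] = []
phraseDocs idx ws@(w : _) =
  [ d
  | (d, starts) <- postings idx w
  , any (\p -> and [ maybe False (elem (p + i)) (lookup d (postings idx wi))
                   | (wi, i) <- zip ws [0 ..] ])
        starts ]

evalQuery :: Index -> Query -> [DocId]
evalQuery idx q = case q of
  Term t    -> map fst (postings idx t)
  And a b   -> sortedInter (evalQuery idx a) (evalQuery idx b)
  Or a b    -> sortedUnion (evalQuery idx a) (evalQuery idx b)
  Not a     -> sortedDiff (allDocs idx) (evalQuery idx a)
  Phrase ws -> phraseDocs idx ws

main :: IO ()
main = do
  let idx = [ ("alice", [(1, [0, 3]), (2, [2])])
            , ("bob",   [(1, [2]), (2, [0])])
            , ("met",   [(1, [1])]) ]
  print (evalQuery idx (And (Term "alice") (Not (Term "met"))))  -- [2]
  print (evalQuery idx (Phrase ["alice", "met"]))                -- [1]
```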
Phase 3 — Ranking
- document frequency: docFreq / idf / bm25idf derived from the index (posting-list length); no separate df store needed
- TF-IDF scoring (rankTfIdf)
- BM25 scoring, configurable k1/b (rankBm25 k1 b)
- top-N retrieval (topNTfIdf / topNBm25: sortBy + take; stable DocId tiebreak)
- lib/search/tests/rank.sx: 23 cases (TF-IDF tf/idf behavior, BM25 length-norm + tf-saturation flips vs TF-IDF, b-parameter effect, tiebreak stability, top-N)
Phase 4 — ACL filter + federation
- post-filter: aclFilter / searchTfIdfAcl / topNTfIdfAcl / searchBm25Acl take an injected permit :: DocId -> Bool predicate, applied post-rank (never in the index)
- federated query: fedIndex :: [(PeerId, Index)] -> Index merges per-peer inverted indices (union posting lists per term); rank/search run once over the merge
- merge policy: relabel local DocIds to global gid = peer*1000 + local (bijection ⇒ dedupe by (peer,doc-id) is automatic); ranking interleaves peers by score
- lib/search/tests/integration.sx: 21 cases (index merge, cross-peer df/lookup, position preservation, boolean/phrase over the merge, ACL filter + top-N + bm25)
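The merge policy and the post-rank ACL filter can be sketched in a few lines of plain GHC Haskell. This is an illustration under assumptions, not the code in lib/search/fed.sx: PeerId is taken to be an Int, and globalId, relabel and mergeIdx are hypothetical helper names. Note that peer*1000 + local only maps distinct (peer, local) pairs to distinct ids while local ids stay in the range 0..999.

```haskell
import Data.List (sortOn)

type Term   = String
type DocId  = Int
type Pos    = Int
type PeerId = Int  -- assumption: peers are numbered
type Index  = [(Term, [(DocId, [Pos])])]

-- gid = peer*1000 + local; collision-free only while 0 <= local < 1000.
globalId :: PeerId -> DocId -> DocId
globalId peer local = peer * 1000 + local

-- Rewrite every DocId of one peer's index; positions are untouched.
relabel :: PeerId -> Index -> Index
relabel p idx = [(t, [(globalId p d, ps) | (d, ps) <- posts]) | (t, posts) <- idx]

-- Merge two term-sorted indices, unioning posting lists of shared terms.
-- Global ids from different peers never coincide, so no dedupe is needed.
mergeIdx :: Index -> Index -> Index
mergeIdx [] ys = ys
mergeIdx xs [] = xs
mergeIdx (x@(t, ps) : xs) (y@(u, qs) : ys)
  | t < u     = x : mergeIdx xs (y : ys)
  | t > u     = y : mergeIdx (x : xs) ys
  | otherwise = (t, sortOn fst (ps ++ qs)) : mergeIdx xs ys

fedIndex :: [(PeerId, Index)] -> Index
fedIndex peers = foldr mergeIdx [] [relabel p idx | (p, idx) <- peers]

-- ACL is applied to ranked results, never stored in the index.
aclFilter :: (DocId -> Bool) -> [(DocId, Double)] -> [(DocId, Double)]
aclFilter permit = filter (permit . fst)

main :: IO ()
main = do
  let peers = [ (1, [("cat", [(1, [0])])])
              , (2, [("cat", [(1, [3])]), ("dog", [(2, [0])])]) ]
  print (fedIndex peers)
  -- [("cat",[(1001,[0]),(2001,[3])]),("dog",[(2002,[0])])]
  print (aclFilter (< 2000) [(2001, 1.5), (1001, 0.9)])
  -- [(1001,0.9)]
```

Because ranking runs once over the merged index, document frequencies reflect all peers, and the caller decides visibility by the predicate it passes in.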
Extensions (post-roadmap, search-shaped vocabulary)
- prefix / wildcard queries (prefixTerms, prefixDocs, prefixRankTfIdf): 14 tests
- fuzzy matching, edit distance term expansion (editDist, fuzzyTerms, fuzzyDocs, fuzzyRankTfIdf): 18 tests
- result pagination (offset / limit) via paginate, pageTfIdf, pageBm25, resultCount: 12 tests
- snippet / highlight generation (highlight, snippet): 12 tests
- stemming (suffix stripping) via stem, stemText, stemTokens, indexStemmed: 18 tests
- proximity / NEAR via nearDocs k t1 t2 (unordered, within k positions): 9 tests
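Three of these extensions are small enough to sketch together. The code below is plain GHC Haskell for illustration, assuming the sorted assoc-list index described earlier; the argument order of prefixTerms and the postings helper are assumptions, while nearDocs k t1 t2 idx and paginate off lim follow the shapes given in the progress log.

```haskell
import Data.List (isPrefixOf)
import Data.Maybe (fromMaybe)

type Term  = String
type DocId = Int
type Pos   = Int
type Index = [(Term, [(DocId, [Pos])])]

postings :: Index -> Term -> [(DocId, [Pos])]
postings idx t = fromMaybe [] (lookup t idx)

-- Docs holding both terms (posting intersection) where some pair of
-- occurrences lies within k positions, in either order.
nearDocs :: Int -> Term -> Term -> Index -> [DocId]
nearDocs k t1 t2 idx =
  [ d
  | (d, ps) <- postings idx t1
  , Just qs <- [lookup d (postings idx t2)]
  , or [abs (p - q) <= k | p <- ps, q <- qs] ]

-- Every indexed term that starts with the given prefix.
prefixTerms :: String -> Index -> [Term]
prefixTerms pre idx = [t | (t, _) <- idx, pre `isPrefixOf` t]

-- Window a ranked list by offset and limit.
paginate :: Int -> Int -> [a] -> [a]
paginate off lim = take lim . drop off

main :: IO ()
main = do
  let idx = [ ("alice", [(1, [0, 3]), (2, [2])])
            , ("alma",  [(3, [0])])
            , ("bob",   [(1, [2]), (2, [0])]) ]
  print (nearDocs 1 "alice" "bob" idx)     -- [1]
  print (nearDocs 2 "alice" "bob" idx)     -- [1,2]
  print (prefixTerms "al" idx)             -- ["alice","alma"]
  print (paginate 2 3 [10, 20 .. 100 :: Int])  -- [30,40,50]
```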
Progress log
- Extension: proximity/NEAR (205/205 total). nearDocs k t1 t2 idx returns docs where both terms occur within k positions (unordered); candidates = posting intersection, filtered on the positional postings. 9 tests.
- Extension: stemming (196/196 total). Deterministic English suffix stripping (stem), stemText / stemTokens, indexStemmed. Two haskell-on-sx gotchas: take/drop over a String yield char CODES not char strings (rebuild via joinChars . map chr), and isSuffixOf's reverse trips ++ on the String repr (manual suffix compare). All five planned extensions now done; the loop can keep adding search vocabulary. 18 tests.
- Extension: highlight/snippet (178/178 total). highlight terms text marks query-matching (normalized) tokens with [..]; snippet ctx terms text extracts a context window around the first match. 12 tests.
- Extension: fuzzy matching (166/166 total). Levenshtein editDist as an O(m*n) row-based DP (the naive recursive version is exponential and times out under load); fuzzyTerms / fuzzyDocs / fuzzyRankTfIdf expand a term to indexed terms within a max edit distance. 18 tests.
- Extension: pagination (148/148 total). paginate off lim windows a ranked list (take lim . drop off); pageTfIdf / pageBm25 + resultCount. 12 tests. Note the full conformance now runs 8 suites sequentially and needs an overall timeout ~1900s under the heavy box load.
- Extension: prefix/wildcard queries (136/136 total). prefixTerms matches every indexed term starting with a prefix (via allTerms + isPrefixOf); prefixDocs unions their docs; prefixRankTfIdf ranks treating the matched terms as the query. 14 tests.
- Phase 4 complete: federation + ACL (122/122 total). Roadmap done. fedIndex merges per-peer inverted indices (union posting lists per term) after relabelling local DocIds to global gid = peer*1000 + local; the bijection makes (peer,doc-id) dedupe automatic and keeps positions, so ranking runs once over the merge and interleaves peers by score (rank-correct). ACL is a post-rank filter over an injected permit :: DocId -> Bool (viewer baked in by the caller), never in the index; searchTfIdfAcl / topNTfIdfAcl / searchBm25Acl. 21 integration tests.
- Phase 3 complete: ranking (101/101 total). TF-IDF (rankTfIdf) and BM25 (rankBm25 k1 b) over the candidate set (docs containing any query term), scores as floats with deterministic DocId-ascending tiebreak; topNTfIdf / topNBm25 via sortBy+take. df/idf derived from posting-list length (no separate df store). 23 tests incl. a BM25-vs-TF-IDF flip (length-norm + tf-saturation) and the b-parameter effect. Float division/log/float literals all work in haskell-on-sx.
- Phase 2 complete: parser (78/78 total). Query tokenizer (ord-based delimiters, quoted phrases) + recursive-descent parser with OR<AND<NOT precedence, implicit AND on adjacency, parens, case-insensitive keywords. parseQuery, searchQuery, showQ (canonical render for AST tests). 32 tests in parse.sx. haskell-on-sx parser gotchas hit while writing this (see parse.sx header): (1) escaped char literals like '\"' break the tokenizer, so match delimiters by ord c == 34; (2) an [] pattern inside a case alt breaks the parser, so use multi-clause functions instead; (3) case/constructor patterns and let (a,b)=.. are fine. Embedded Haskell string literals in a .sx source string need single \", not \\\".
- Phase 2 boolean/phrase eval (46/46 total). Query ADT Term|And|Or|Not|Phrase + evalQuery :: Index -> Query -> [DocId] in query.sx. Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx. Refactored both suites to batch all cases into one program eval (search-batch in testlib): under the heavy CPU load on this box (~11 on 2 cores), 18–28 separate hk-eval-program calls timed out; one combined eval per suite is ~20× faster. Parser (parse.sx) is the remaining Phase 2 box.
- Phase 1 complete (18/18). Tokenizer (lowercase + strip punctuation + positions), inverted index as sorted assoc-list [(Term,[(DocId,[Pos])])], indexDoc / deleteDoc / lookupTerm / docFreq / allTerms. Search lib is Haskell source assembled into search/src and evaluated via the haskell-on-sx interpreter; tests reuse hk-test counters and a search-eval helper that forces HK values to plain SX. conformance.sh models lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).
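The fuzzy-matching entry contrasts a row-based DP with naive recursion; the sketch below shows one way to write the row-based version in plain GHC Haskell. It illustrates the technique only and is not the code in lib/search: the argument order of fuzzyTerms is an assumption, and it relies on ordinary String behaviour rather than the haskell-on-sx String representation noted in the log.

```haskell
type Term  = String
type DocId = Int
type Pos   = Int
type Index = [(Term, [(DocId, [Pos])])]

-- Levenshtein distance, O(m*n) time, one row of the DP table at a time.
-- The row for a prefix of s holds its distance to every prefix of t.
editDist :: String -> String -> Int
editDist s t = last (foldl nextRow [0 .. length t] s)
  where
    nextRow [] _ = []
    nextRow prev@(p : ps) c = scanl step (p + 1) (zip3 t prev ps)
      where
        -- left: cell just computed in this row (insertion)
        -- up:   cell above in the previous row (deletion)
        -- diag: cell above-left (match or substitution)
        step left (tc, diag, up) =
          minimum [left + 1, up + 1, diag + (if tc == c then 0 else 1)]

-- Indexed terms within maxD edits of the query term.
fuzzyTerms :: Int -> Term -> Index -> [Term]
fuzzyTerms maxD t idx = [u | (u, _) <- idx, editDist t u <= maxD]

main :: IO ()
main = do
  print (editDist "kitten" "sitting")  -- 3
  let idx = [("cart", [(1, [0])]), ("cat", [(2, [0])]), ("dog", [(3, [0])])]
  print (fuzzyTerms 1 "cat" idx)       -- ["cart","cat"]
```

Each input character costs one pass over a row of length n + 1, which is where the O(m*n) bound comes from.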
Blockers
- None. Note: the box is heavily CPU-oversubscribed by sibling loop agents (load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite timeout is set to 600s. Runs are correct, just slow.
- Data.Map public API gap (informational, not fixing): the haskell-on-sx import Data.Map binds only empty/singleton/insert/lookup/member/size/null/delete/insertWith/adjust/findWithDefault; there is no toList/keys/elems/map/filter/unionWith. Index uses a pure assoc-list instead so term iteration and federation merge stay simple.