nearDocs k t1 t2 returns docs where both terms occur within k positions
(unordered); candidates from the posting intersection, filtered on positional
postings. 205/205.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-06 23:01:42 +00:00


search-on-sx: Full-text + structured search on Haskell

rose-ash needs search across pages, posts, threads, federated content. Tokenize, index, query, rank, filter by visibility. Typed ADTs make query parsing clean, lazy lists make posting-list iteration efficient, and Haskell-on-SX is at 1514/1514.

End-state: a Haskell-on-SX layer with inverted index, query AST, boolean + phrase + ranked queries (TF-IDF, BM25), ACL-aware post-filter, and a federation extension that merges per-peer indices.

Status (rolling)

bash lib/search/conformance.sh → 205/205 (Phases 1–4 complete, plus six post-roadmap extensions)

Ground rules

Scope: only touch lib/search/** and plans/search-on-sx.md. Do not edit spec/, hosts/, shared/, lib/haskell/**, or other lib/<lang>/. You may import from lib/haskell/ (public API in lib/haskell/haskell.sx); do not modify Haskell.
Shared-file issues go under "Blockers" with a minimal repro; do not fix here.
SX files: use sx-tree MCP tools only.
Architecture: index = Map Term [(DocId, [Pos])]. Query AST = ADT. Eval = fold of posting lists with set ops + ranking math. Ranking is pure (no IO until result emission).
Commits: one feature per commit. Keep Progress log updated and tick boxes.

Architecture sketch

Document                               Query
  {:id :text :tags}                       "alice AND bob OR phrase \"x y\""
        │                                       │
        ▼                                       ▼
lib/search/tokenize.sx                  lib/search/parse.sx
  — tokenize :: Text → [Term]             — parse :: Text → Query
  — normalize (lowercase, strip)          — Query = Term | And | Or
  — (optionally) stem                              | Not | Phrase
        │                                       │
        ▼                                       ▼
lib/search/index.sx                     lib/search/eval.sx
  — Map Term [(DocId, [Pos])]             — eval :: Index → Query → [DocId]
  — insert / delete / lookup              — boolean + phrase positions
  — persistence (optional later)                 │
        │                                       ▼
        └────────────────► lib/search/rank.sx
                            — TF-IDF / BM25 scoring
                            — top-N
                                  │
                                  ▼
                          lib/search/api.sx
                            — (search/index doc)
                            — (search/query q)
                            — (search/top n q)
                                  │
                                  ▼
                          lib/search/fed.sx
                            — federated query (merge peer results)
                            — ACL filter post-merge

Phase 1 — Tokenize + index

lib/search/tokenize.sx — normalize (lowercase, strip punctuation), split on whitespace, return positions
lib/search/index.sx — inverted index data structure; indexDoc, deleteDoc, lookupTerm, docFreq, allTerms. (Data.Map's public API lacks toList/keys/map/filter, so a sorted assoc-list [(Term,[(DocId,[Pos])])] is used — the conceptual Map Term [(DocId,[Pos])] with free term iteration.)
lib/search/api.sx — assembles search/src (tokenize + index); Haskell entry points indexDoc / lookupTerm
lib/search/tests/index.sx — 18 cases: tokenize, insert + lookup, update, delete, multi-doc, positions, docFreq, allTerms
lib/search/scoreboard.{json,md}
lib/search/conformance.sh
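
The Phase 1 items above can be sketched in plain Haskell as follows. This is a minimal illustration, not the code in lib/search/: the function names indexDoc, deleteDoc, lookupTerm, docFreq and allTerms come from the plan, but their argument order, the helper addOcc, the zero-based positions, and the choice to number positions after punctuation-only tokens are dropped are all assumptions made for this sketch.

```haskell
import Data.Char (isAlphaNum, toLower)

type Term  = String
type DocId = Int
type Pos   = Int
-- Sorted association list standing in for Map Term [(DocId, [Pos])].
type Index = [(Term, [(DocId, [Pos])])]

-- Lowercase, strip punctuation, split on whitespace, number tokens from 0
-- (zero-based numbering is an assumption of this sketch).
tokenize :: String -> [(Term, Pos)]
tokenize txt = zip kept [0 ..]
  where
    kept      = filter (not . null) (map normalize (words txt))
    normalize = map toLower . filter isAlphaNum

-- Hypothetical helper: record one occurrence, keeping terms and DocIds sorted.
addOcc :: DocId -> Index -> (Term, Pos) -> Index
addOcc d [] (t, p) = [(t, [(d, [p])])]
addOcc d ((t', pl) : rest) (t, p)
  | t < t'    = (t, [(d, [p])]) : (t', pl) : rest
  | t == t'   = (t', addPos pl) : rest
  | otherwise = (t', pl) : addOcc d rest (t, p)
  where
    addPos [] = [(d, [p])]
    addPos ((d', qs) : more)
      | d < d'    = (d, [p]) : (d', qs) : more
      | d == d'   = (d', qs ++ [p]) : more
      | otherwise = (d', qs) : addPos more

-- Drop every posting of a document; terms left with no postings disappear.
deleteDoc :: DocId -> Index -> Index
deleteDoc d idx =
  [ (t, kept) | (t, pl) <- idx, let kept = filter ((/= d) . fst) pl, not (null kept) ]

-- Re-indexing a document replaces its old postings (update semantics).
indexDoc :: DocId -> String -> Index -> Index
indexDoc d txt idx = foldl (addOcc d) (deleteDoc d idx) (tokenize txt)

lookupTerm :: Term -> Index -> [(DocId, [Pos])]
lookupTerm t idx = maybe [] id (lookup t idx)

-- Document frequency is just the posting-list length.
docFreq :: Term -> Index -> Int
docFreq t = length . lookupTerm t

allTerms :: Index -> [Term]
allTerms = map fst
```

Because the list is kept sorted by term and by DocId, term iteration (allTerms) is a plain map and the boolean operations of Phase 2 can work as linear merges.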

Phase 2 — Query AST + boolean evaluation

Query ADT: Term String | And Query Query | Or Query Query | Not Query | Phrase [String] (in lib/search/query.sx)
lib/search/parse.sx — query syntax parser: tokenizer + recursive-descent (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens, case-insensitive keywords); parseQuery, searchQuery, showQ
lib/search/query.sx — boolean eval via set ops on docid-sorted posting lists (sortedUnion/Inter/Diff, Not over allDocs universe)
phrase eval — positional adjacency check (phraseInDoc / phraseStartsAt)
lib/search/tests/boolean.sx — 28 cases: term, and, or, not, phrase, composition (parser edge cases move to the parse.sx suite)
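
The boolean and phrase evaluation described above can be illustrated with a short sketch. The Query constructors and the names evalQuery, sortedUnion, sortedInter, sortedDiff, allDocs and phraseStartsAt appear in the plan; their exact signatures, and the helpers postings and phraseDocs, are assumptions for illustration only.

```haskell
type DocId = Int
type Pos   = Int
type Index = [(String, [(DocId, [Pos])])]

data Query
  = Term String
  | And Query Query
  | Or Query Query
  | Not Query
  | Phrase [String]
  deriving (Show, Eq)

postings :: String -> Index -> [(DocId, [Pos])]
postings t idx = maybe [] id (lookup t idx)

-- Linear merges over DocId-sorted lists.
sortedUnion, sortedInter, sortedDiff :: [DocId] -> [DocId] -> [DocId]
sortedUnion xs [] = xs
sortedUnion [] ys = ys
sortedUnion (x : xs) (y : ys)
  | x < y     = x : sortedUnion xs (y : ys)
  | x > y     = y : sortedUnion (x : xs) ys
  | otherwise = x : sortedUnion xs ys

sortedInter (x : xs) (y : ys)
  | x < y     = sortedInter xs (y : ys)
  | x > y     = sortedInter (x : xs) ys
  | otherwise = x : sortedInter xs ys
sortedInter _ _ = []

sortedDiff (x : xs) (y : ys)
  | x < y     = x : sortedDiff xs (y : ys)
  | x > y     = sortedDiff (x : xs) ys
  | otherwise = sortedDiff xs ys
sortedDiff xs _ = xs

-- Universe for Not: every DocId that appears anywhere in the index.
allDocs :: Index -> [DocId]
allDocs = foldr (sortedUnion . map fst . snd) []

-- The phrase starts at p if the k-th following word occurs at position p + k.
phraseStartsAt :: [[Pos]] -> Pos -> Bool
phraseStartsAt rest p = and [ (p + k) `elem` ps | (k, ps) <- zip [1 ..] rest ]

-- Hypothetical helper: docs holding all phrase words at adjacent positions.
phraseDocs :: Index -> [String] -> [DocId]
phraseDocs _ [] = []
phraseDocs idx (w : ws) =
  [ d | (d, ps) <- postings w idx
      , Just rest <- [mapM (lookup d . (`postings` idx)) ws]
      , any (phraseStartsAt rest) ps ]

evalQuery :: Index -> Query -> [DocId]
evalQuery idx q = case q of
  Term t    -> map fst (postings t idx)
  And a b   -> sortedInter (evalQuery idx a) (evalQuery idx b)
  Or a b    -> sortedUnion (evalQuery idx a) (evalQuery idx b)
  Not a     -> sortedDiff (allDocs idx) (evalQuery idx a)
  Phrase ws -> phraseDocs idx ws
```

Every branch returns a DocId-sorted list, which is what lets And, Or and Not compose without re-sorting.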

Phase 3 — Ranking

document frequency — docFreq/idf/bm25idf derived from the index (posting-list length); no separate df store needed
TF-IDF scoring (rankTfIdf)
BM25 scoring, configurable k1/b (rankBm25 k1 b)
top-N retrieval (topNTfIdf/topNBm25 — sortBy + take; stable DocId tiebreak)
lib/search/tests/rank.sx — 23 cases: TF-IDF tf/idf behavior, BM25 length-norm + tf-saturation flips vs TF-IDF, b-parameter effect, tiebreak stability, top-N
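
A hedged sketch of the ranking step follows. The plan names rankTfIdf, rankBm25 k1 b, idf and bm25idf but does not spell out the formulas, so the ones used here are assumptions: idf = log(N/df), the common BM25 idf variant log(1 + (N - df + 0.5)/(df + 0.5)), document length taken as the number of indexed positions, and a hypothetical shared helper rankWith. Argument order after k1 and b is also assumed.

```haskell
import Data.List (nub, sort, sortBy)
import Data.Ord (comparing)

type DocId = Int
type Index = [(String, [(DocId, [Int])])]

postings :: String -> Index -> [(DocId, [Int])]
postings t idx = maybe [] id (lookup t idx)

docIds :: Index -> [DocId]
docIds idx = sort (nub [ d | (_, pl) <- idx, (d, _) <- pl ])

-- Term frequency: number of recorded positions of t in d.
tf :: Index -> String -> DocId -> Double
tf idx t d = maybe 0 (fromIntegral . length) (lookup d (postings t idx))

-- Document length: total positions recorded for d across all terms.
docLen :: Index -> DocId -> Double
docLen idx d = fromIntegral (sum [ length ps | (_, pl) <- idx, (d', ps) <- pl, d' == d ])

-- df is the posting-list length, so no separate df store is needed.
idf, bm25idf :: Index -> String -> Double
idf idx t
  | df == 0   = 0
  | otherwise = log (n / df)
  where
    n  = fromIntegral (length (docIds idx))
    df = fromIntegral (length (postings t idx))
bm25idf idx t = log (1 + (n - df + 0.5) / (df + 0.5))
  where
    n  = fromIntegral (length (docIds idx))
    df = fromIntegral (length (postings t idx))

-- Hypothetical helper: score candidates (docs containing any query term),
-- sort by score descending, then DocId ascending as the tiebreak.
rankWith :: (String -> DocId -> Double) -> Index -> [String] -> [(DocId, Double)]
rankWith score idx terms =
  sortBy (comparing (negate . snd) <> comparing fst)
    [ (d, sum [ score t d | t <- terms ]) | d <- candidates ]
  where
    candidates = sort (nub [ d | t <- terms, (d, _) <- postings t idx ])

rankTfIdf :: Index -> [String] -> [(DocId, Double)]
rankTfIdf idx = rankWith (\t d -> tf idx t d * idf idx t) idx

rankBm25 :: Double -> Double -> Index -> [String] -> [(DocId, Double)]
rankBm25 k1 b idx = rankWith score idx
  where
    ds    = docIds idx
    avgdl = sum (map (docLen idx) ds) / fromIntegral (length ds)
    score t d =
      let f    = tf idx t d
          norm = 1 - b + b * docLen idx d / avgdl
      in bm25idf idx t * f * (k1 + 1) / (f + k1 * norm)

topN :: Int -> [(DocId, Double)] -> [(DocId, Double)]
topN = take
```

With a long document that mentions a term three times and a short one that mentions it twice, TF-IDF prefers the long document while BM25 with b = 0.75 prefers the short one; setting b = 0 switches length normalization off and restores the TF-IDF order. That is the kind of flip the rank suite is described as checking.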

Phase 4 — ACL filter + federation

post-filter — aclFilter/searchTfIdfAcl/topNTfIdfAcl/searchBm25Acl take an injected permit :: DocId -> Bool predicate, applied post-rank (never in the index)
federated query — fedIndex :: [(PeerId, Index)] -> Index merges per-peer inverted indices (union posting lists per term); rank/search run once over the merge
merge policy — relabel local DocIds to global gid = peer*1000 + local (injective as long as local ids stay below 1000, so dedupe by (peer,doc-id) is automatic); ranking interleaves peers by score
lib/search/tests/integration.sx — 21 cases: index merge, cross-peer df/lookup, position preservation, boolean/phrase over the merge, ACL filter + top-N + bm25
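
The federation merge and the ACL post-filter can be sketched as below. fedIndex, aclFilter, the permit predicate and the gid = peer*1000 + local rule are from the plan; PeerId as an Int and the helpers globalId, relabel, mergeIndex and mergePostings are hypothetical names introduced for this illustration.

```haskell
type DocId  = Int
type PeerId = Int  -- assumption: peers are identified by small integers
type Index  = [(String, [(DocId, [Int])])]

-- gid = peer*1000 + local; distinct (peer, local) pairs stay distinct
-- only while local ids are in the range 0..999.
globalId :: PeerId -> DocId -> DocId
globalId peer local = peer * 1000 + local

-- Relabelling is monotone in the local id, so posting lists stay sorted
-- and positions are carried over untouched.
relabel :: PeerId -> Index -> Index
relabel peer idx = [ (t, [ (globalId peer d, ps) | (d, ps) <- pl ]) | (t, pl) <- idx ]

mergePostings :: [(DocId, [Int])] -> [(DocId, [Int])] -> [(DocId, [Int])]
mergePostings xs [] = xs
mergePostings [] ys = ys
mergePostings ((d, p) : xs) ((e, q) : ys)
  | d < e     = (d, p) : mergePostings xs ((e, q) : ys)
  | d > e     = (e, q) : mergePostings ((d, p) : xs) ys
  | otherwise = (d, p) : mergePostings xs ys  -- same gid: keep one copy

-- Merge two term-sorted indices, unioning posting lists per term.
mergeIndex :: Index -> Index -> Index
mergeIndex xs [] = xs
mergeIndex [] ys = ys
mergeIndex ((t, p) : xs) ((u, q) : ys)
  | t < u     = (t, p) : mergeIndex xs ((u, q) : ys)
  | t > u     = (u, q) : mergeIndex ((t, p) : xs) ys
  | otherwise = (t, mergePostings p q) : mergeIndex xs ys

fedIndex :: [(PeerId, Index)] -> Index
fedIndex = foldr (\(peer, idx) acc -> mergeIndex (relabel peer idx) acc) []

-- ACL is applied to ranked results, never stored in the index.
aclFilter :: (DocId -> Bool) -> [(DocId, Double)] -> [(DocId, Double)]
aclFilter permit = filter (permit . fst)
```

Since the merged value is an ordinary Index, document frequencies are computed across peers and a single ranking pass interleaves results by score.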

Extensions (post-roadmap, search-shaped vocabulary)

prefix / wildcard queries (prefixTerms, prefixDocs, prefixRankTfIdf) — 14 tests
fuzzy matching — edit distance term expansion (editDist, fuzzyTerms, fuzzyDocs, fuzzyRankTfIdf) — 18 tests
result pagination (offset / limit) — paginate, pageTfIdf, pageBm25, resultCount — 12 tests
snippet / highlight generation (highlight, snippet) — 12 tests
stemming (suffix stripping) — stem, stemText, stemTokens, indexStemmed — 18 tests
proximity / NEAR — nearDocs k t1 t2 (unordered, within k positions) — 9 tests

Progress log

Extension: proximity/NEAR (205/205 total). nearDocs k t1 t2 idx returns docs where both terms occur within k positions (unordered), candidates = posting intersection, filtered on the positional postings. 9 tests.
Extension: stemming (196/196 total). Deterministic English suffix stripping (stem), stemText/stemTokens, indexStemmed. Two haskell-on-sx gotchas: take/drop over a String yield char CODES not char strings (rebuild via joinChars . map chr), and isSuffixOf's reverse trips ++ on the String repr (manual suffix compare). All five planned extensions now done; the loop can keep adding search vocabulary. 18 tests.
Extension: highlight/snippet (178/178 total). highlight terms text marks query-matching (normalized) tokens with [..]; snippet ctx terms text extracts a context window around the first match. 12 tests.
Extension: fuzzy matching (166/166 total). Levenshtein editDist as an O(m*n) row-based DP (the naive recursive version is exponential and times out under load), fuzzyTerms/fuzzyDocs/fuzzyRankTfIdf expand a term to indexed terms within a max edit distance. 18 tests.
Extension: pagination (148/148 total). paginate off lim windows a ranked list (take lim . drop off); pageTfIdf/pageBm25 + resultCount. 12 tests. Note the full conformance now runs 8 suites sequentially and needs an overall timeout ~1900s under the heavy box load.
Extension: prefix/wildcard queries (136/136 total). prefixTerms matches every indexed term starting with a prefix (via allTerms + isPrefixOf); prefixDocs unions their docs; prefixRankTfIdf ranks treating the matched terms as the query. 14 tests.
Phase 4 complete — federation + ACL (122/122 total). Roadmap done. fedIndex merges per-peer inverted indices (union posting lists per term) after relabelling local DocIds to global gid = peer*1000 + local — the bijection makes (peer,doc-id) dedupe automatic and keeps positions, so ranking runs once over the merge and interleaves peers by score (rank-correct). ACL is a post-rank filter over an injected permit :: DocId -> Bool (viewer baked in by the caller) — never in the index; searchTfIdfAcl/topNTfIdfAcl/searchBm25Acl. 21 integration tests.
Phase 3 complete — ranking (101/101 total). TF-IDF (rankTfIdf) and BM25 (rankBm25 k1 b) over the candidate set (docs containing any query term), scores as floats with deterministic DocId-ascending tiebreak; topNTfIdf/topNBm25 via sortBy+take. df/idf derived from posting-list length (no separate df store). 23 tests incl. a BM25-vs-TF-IDF flip (length-norm + tf-saturation) and the b-parameter effect. Float division/log/float literals all work in haskell-on-sx.
Phase 2 complete — parser (78/78 total). Query tokenizer (ord-based delimiters, quoted phrases) + recursive-descent parser with OR<AND<NOT precedence, implicit AND on adjacency, parens, case-insensitive keywords. parseQuery, searchQuery, showQ (canonical render for AST tests). 32 tests in parse.sx. haskell-on-sx parser gotchas hit while writing this (see parse.sx header): (1) escaped char literals like '\"' break the tokenizer — match delimiters by ord c == 34; (2) an [] pattern inside a case alt breaks the parser — use multi-clause functions instead; (3) case/constructor patterns and let (a,b)=.. are fine. Embedded Haskell string literals in a .sx source string need single \", not \\\".
Phase 2 boolean/phrase eval (46/46 total). Query ADT Term|And|Or|Not|Phrase + evalQuery :: Index -> Query -> [DocId] in query.sx. Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx. Refactored both suites to batch all cases into one program eval (search-batch in testlib) — under the heavy CPU load on this box (~11 on 2 cores), 18–28 separate hk-eval-program calls timed out; one combined eval per suite is ~20× faster. Parser (parse.sx) is the remaining Phase 2 box.
Phase 1 complete (18/18). Tokenizer (lowercase + strip punctuation + positions), inverted index as sorted assoc-list [(Term,[(DocId,[Pos])])], indexDoc/deleteDoc/lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into search/src and evaluated via the haskell-on-sx interpreter; tests reuse hk-test counters and a search-eval helper that forces HK values to plain SX. conformance.sh models lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).
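
The parser entry in the log above (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens, case-insensitive keywords) corresponds to a standard recursive-descent layout, sketched here for ordinary GHC. The token type Tok, the function names other than parseQuery, and the Maybe-returning signature are assumptions for illustration; the sketch also uses a plain '"' character literal, which the log notes does not work under haskell-on-sx (there the delimiter is matched with ord c == 34).

```haskell
import Data.Char (isSpace, toLower)

data Query
  = Term String | And Query Query | Or Query Query | Not Query | Phrase [String]
  deriving (Show, Eq)

data Tok = TWord String | TPhrase [String] | TAnd | TOr | TNot | TOpen | TClose
  deriving (Show, Eq)

-- Tokenizer: parens, quoted phrases, case-insensitive keywords, lowercase words.
lexQ :: String -> [Tok]
lexQ [] = []
lexQ (c : cs)
  | isSpace c = lexQ cs
  | c == '('  = TOpen : lexQ cs
  | c == ')'  = TClose : lexQ cs
  | c == '"'  = let (body, rest) = break (== '"') cs
                in TPhrase (words (map toLower body)) : lexQ (drop 1 rest)
  | otherwise = let (w, rest) = break delim (c : cs)
                in keyword (map toLower w) : lexQ rest
  where
    delim x = isSpace x || x `elem` "()\""
    keyword "and" = TAnd
    keyword "or"  = TOr
    keyword "not" = TNot
    keyword w     = TWord w

-- Lowest precedence: OR.
parseOr :: [Tok] -> Maybe (Query, [Tok])
parseOr ts = parseAnd ts >>= uncurry orTail
  where
    orTail l (TOr : rest) = do
      (r, rest') <- parseAnd rest
      orTail (Or l r) rest'
    orTail l rest = Just (l, rest)

-- AND, either explicit or implied by two adjacent operands.
parseAnd :: [Tok] -> Maybe (Query, [Tok])
parseAnd ts = parseNot ts >>= uncurry andTail
  where
    andTail l (TAnd : rest) = step l rest
    andTail l rest@(t : _)
      | startsOperand t = step l rest
    andTail l rest = Just (l, rest)
    step l rest = do
      (r, rest') <- parseNot rest
      andTail (And l r) rest'
    startsOperand t = t `notElem` [TAnd, TOr, TClose]

-- Highest precedence: NOT binds to the operand that follows it.
parseNot :: [Tok] -> Maybe (Query, [Tok])
parseNot (TNot : rest) = do
  (q, rest') <- parseNot rest
  Just (Not q, rest')
parseNot ts = parseAtom ts

parseAtom :: [Tok] -> Maybe (Query, [Tok])
parseAtom (TWord w : rest)    = Just (Term w, rest)
parseAtom (TPhrase ws : rest) = Just (Phrase ws, rest)
parseAtom (TOpen : rest) = do
  (q, rest') <- parseOr rest
  case rest' of
    TClose : rest'' -> Just (q, rest'')
    _               -> Nothing
parseAtom _ = Nothing

-- Succeeds only when the whole input is consumed.
parseQuery :: String -> Maybe Query
parseQuery s = case parseOr (lexQ s) of
  Just (q, []) -> Just q
  _            -> Nothing
```

Each precedence level gets its own function, and a level only calls the next tighter one for its operands, which is how "alice AND bob OR ..." groups the AND before the OR.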

Blockers

None. Note: the box is heavily CPU-oversubscribed by sibling loop agents (load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite timeout is set to 600s. Runs are correct, just slow.
Data.Map public API gap (informational, not fixing): the haskell-on-sx import Data.Map binds only empty/singleton/insert/lookup/member/size/null/delete/insertWith/adjust/findWithDefault; there is no toList/keys/elems/map/filter/unionWith. Index uses a pure assoc-list instead so term iteration and federation merge stay simple.
