# search-on-sx: Full-text + structured search on Haskell

rose-ash needs search across pages, posts, threads, federated content. Tokenize,
index, query, rank, filter by visibility. Typed ADTs make query parsing clean,
lazy lists make posting-list iteration efficient, and Haskell-on-SX is at 1514/1514.

End-state: a Haskell-on-SX layer with inverted index, query AST, boolean +
phrase + ranked queries (TF-IDF, BM25), ACL-aware post-filter, and a federation
extension that merges per-peer indices.

## Status (rolling)

`bash lib/search/conformance.sh` → **136/136** (Phases 1–4 complete, plus the prefix-query extension)

## Ground rules

- **Scope:** only touch `lib/search/**` and `plans/search-on-sx.md`. Do **not** edit
  `spec/`, `hosts/`, `shared/`, `lib/haskell/**`, or other `lib/<lang>/`. You may
  **import** from `lib/haskell/` (public API in `lib/haskell/haskell.sx`); do **not**
  modify Haskell.
- **Shared-file issues** go under "Blockers" with a minimal repro; do not fix here.
- **SX files:** use `sx-tree` MCP tools only.
- **Architecture:** index = `Map Term [(DocId, [Pos])]`. Query AST = ADT. Eval =
  fold of posting lists with set ops + ranking math. Ranking is pure (no IO until
  result emission).
- **Commits:** one feature per commit. Keep Progress log updated and tick boxes.

## Architecture sketch

```
Document                            Query
{:id :text :tags}                   "alice AND bob OR phrase \"x y\""
    │                                   │
    ▼                                   ▼
lib/search/tokenize.sx              lib/search/parse.sx
— tokenize :: Text → [Term]         — parse :: Text → Query
— normalize (lowercase, strip)      — Query = Term | And | Or
— (optionally) stem                         | Not | Phrase
    │                                   │
    ▼                                   ▼
lib/search/index.sx                 lib/search/eval.sx
— Map Term [(DocId, [Pos])]         — eval :: Index → Query → [DocId]
— insert / delete / lookup          — boolean + phrase positions
— persistence (optional later)          │
    │                                   ▼
    └────────────────►              lib/search/rank.sx
                                    — TF-IDF / BM25 scoring
                                    — top-N
                                        │
                                        ▼
                                    lib/search/api.sx
                                    — (search/index doc)
                                    — (search/query q)
                                    — (search/top n q)
                                        │
                                        ▼
                                    lib/search/fed.sx
                                    — federated query (merge peer results)
                                    — ACL filter post-merge
```

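To make the `parse :: Text → Query` box in the sketch concrete, here is a minimal recursive-descent sketch in plain GHC Haskell. It is an illustration under assumptions, not the code in `lib/search/parse.sx`: the `Tok` type, `lexQuery`, the helper names and the `Maybe` return type are all hypothetical, and it uses an ordinary `'"'` literal that has not been checked against the haskell-on-sx interpreter. It follows the precedence the plan states (OR binds loosest, then AND, then NOT), with implicit AND on adjacency, quoted phrases, parentheses and case-insensitive keywords.

```haskell
import Data.Char (isSpace, toLower)

data Query = Term String | And Query Query | Or Query Query | Not Query | Phrase [String]
  deriving (Eq, Show)

-- Hypothetical token type for this sketch.
data Tok = TWord String | TPhrase [String] | TAnd | TOr | TNot | TOpen | TClose
  deriving (Eq, Show)

-- Lexer: lowercases everything, so keywords are case-insensitive.
lexQuery :: String -> [Tok]
lexQuery [] = []
lexQuery (c : cs)
  | isSpace c = lexQuery cs
  | c == '('  = TOpen  : lexQuery cs
  | c == ')'  = TClose : lexQuery cs
  | c == '"'  = let (body, rest) = break (== '"') cs
                in TPhrase (words (map toLower body)) : lexQuery (drop 1 rest)
  | otherwise = let (w, rest) = break (\x -> isSpace x || x `elem` "()\"") (c : cs)
                in keyword (map toLower w) : lexQuery rest
  where
    keyword "and" = TAnd
    keyword "or"  = TOr
    keyword "not" = TNot
    keyword w     = TWord w

-- orExpr := andExpr ("OR" andExpr)*
parseOr :: [Tok] -> Maybe (Query, [Tok])
parseOr ts = parseAnd ts >>= \(l, rest) -> orTail l rest

orTail :: Query -> [Tok] -> Maybe (Query, [Tok])
orTail l (TOr : ts) = parseAnd ts >>= \(r, rest) -> orTail (Or l r) rest
orTail l ts         = Just (l, ts)

-- andExpr := notExpr (["AND"] notExpr)*   (adjacency is an implicit AND)
parseAnd :: [Tok] -> Maybe (Query, [Tok])
parseAnd ts = parseNot ts >>= \(l, rest) -> andTail l rest

andTail :: Query -> [Tok] -> Maybe (Query, [Tok])
andTail l (TAnd : ts) = parseNot ts >>= \(r, rest) -> andTail (And l r) rest
andTail l ts
  | startsOperand ts = parseNot ts >>= \(r, rest) -> andTail (And l r) rest
  | otherwise        = Just (l, ts)

-- True when the next token can begin an operand (word, phrase, NOT, or "(").
startsOperand :: [Tok] -> Bool
startsOperand []           = False
startsOperand (TOr : _)    = False
startsOperand (TAnd : _)   = False
startsOperand (TClose : _) = False
startsOperand _            = True

-- notExpr := "NOT" notExpr | atom
parseNot :: [Tok] -> Maybe (Query, [Tok])
parseNot (TNot : ts) = parseNot ts >>= \(q, rest) -> Just (Not q, rest)
parseNot ts          = parseAtom ts

-- atom := word | phrase | "(" orExpr ")"
parseAtom :: [Tok] -> Maybe (Query, [Tok])
parseAtom (TWord w : ts)    = Just (Term w, ts)
parseAtom (TPhrase ws : ts) = Just (Phrase ws, ts)
parseAtom (TOpen : ts)      = parseOr ts >>= closeParen
  where
    closeParen (q, TClose : rest) = Just (q, rest)
    closeParen _                  = Nothing
parseAtom _                 = Nothing

-- Succeeds only if the whole input is consumed.
parseQuery :: String -> Maybe Query
parseQuery s = case parseOr (lexQuery s) of
  Just (q, []) -> Just q
  _            -> Nothing
```

In this sketch the diagram's sample query `alice AND bob OR phrase "x y"` parses as `Or (And alice bob) (And phrase "x y")`, because `phrase` is treated as an ordinary word and the adjacent quoted phrase is joined by implicit AND.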
## Phase 1 — Tokenize + index

- [x] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on
  whitespace, return positions
- [x] `lib/search/index.sx` — inverted index data structure; `indexDoc`, `deleteDoc`,
  `lookupTerm`, `docFreq`, `allTerms`. (Data.Map's public API lacks
  toList/keys/map/filter, so a sorted assoc-list `[(Term,[(DocId,[Pos])])]` is used —
  the conceptual `Map Term [(DocId,[Pos])]` with free term iteration.)
- [x] `lib/search/api.sx` — assembles `search/src` (tokenize + index); Haskell entry
  points `indexDoc` / `lookupTerm`
- [x] `lib/search/tests/index.sx` — 18 cases: tokenize, insert + lookup, update,
  delete, multi-doc, positions, docFreq, allTerms
- [x] `lib/search/scoreboard.{json,md}`
- [x] `lib/search/conformance.sh`

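The Phase 1 pieces can be sketched as follows. This is a minimal illustration of a tokenizer with positions and a sorted assoc-list index, written for plain GHC. The function names match the checklist, but the argument order, the helpers `addPosting` and `addOccurrence`, and the choice of `Int` for `DocId` and `Pos` are assumptions rather than the actual `index.sx` source.

```haskell
import Data.Char (isAlphaNum, toLower)

type Term  = String
type DocId = Int   -- assumption: numeric document ids
type Pos   = Int
-- Sorted assoc-list standing in for Map Term [(DocId, [Pos])].
type Index = [(Term, [(DocId, [Pos])])]

-- Lowercase, strip non-alphanumerics, split on whitespace, number the tokens.
tokenize :: String -> [(Term, Pos)]
tokenize s = zip (filter (not . null) (map clean (words s))) [0 ..]
  where clean = map toLower . filter isAlphaNum

-- Add one position to a DocId-sorted posting list.
addPosting :: DocId -> Pos -> [(DocId, [Pos])] -> [(DocId, [Pos])]
addPosting d p [] = [(d, [p])]
addPosting d p ((d', ps) : rest)
  | d == d'   = (d', ps ++ [p]) : rest
  | d <  d'   = (d, [p]) : (d', ps) : rest
  | otherwise = (d', ps) : addPosting d p rest

-- Add one (term, position) occurrence, keeping terms sorted.
addOccurrence :: DocId -> Index -> (Term, Pos) -> Index
addOccurrence d [] (t, p) = [(t, [(d, [p])])]
addOccurrence d ((t', pl) : rest) (t, p)
  | t == t'   = (t', addPosting d p pl) : rest
  | t <  t'   = (t, [(d, [p])]) : (t', pl) : rest
  | otherwise = (t', pl) : addOccurrence d rest (t, p)

indexDoc :: DocId -> String -> Index -> Index
indexDoc d text ix = foldl (addOccurrence d) ix (tokenize text)

-- Drop a document everywhere; terms left with no postings disappear.
deleteDoc :: DocId -> Index -> Index
deleteDoc d ix =
  [ (t, pl') | (t, pl) <- ix, let pl' = filter ((/= d) . fst) pl, not (null pl') ]

lookupTerm :: Term -> Index -> [(DocId, [Pos])]
lookupTerm t ix = maybe [] id (lookup t ix)

-- Document frequency is just the posting-list length.
docFreq :: Term -> Index -> Int
docFreq t = length . lookupTerm t

allTerms :: Index -> [Term]
allTerms = map fst
```

Because the index is an ordinary list, `allTerms` is a plain `map fst`, which is the "free term iteration" the checklist mentions.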
## Phase 2 — Query AST + boolean evaluation

- [x] Query ADT: `Term String | And Query Query | Or Query Query | Not Query |
  Phrase [String]` (in `lib/search/query.sx`)
- [x] `lib/search/parse.sx` — query syntax parser: tokenizer + recursive-descent
  (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens,
  case-insensitive keywords); `parseQuery`, `searchQuery`, `showQ`
- [x] `lib/search/query.sx` — boolean eval via set ops on docid-sorted posting lists
  (sortedUnion/Inter/Diff, Not over allDocs universe)
- [x] phrase eval — positional adjacency check (phraseInDoc / phraseStartsAt)
- [x] `lib/search/tests/boolean.sx` — 28 cases: term, and, or, not, phrase,
  composition (parser edge cases move to the parse.sx suite)

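The boolean and phrase evaluation described above can be sketched like this. It is one plausible shape under assumptions, not the code in `query.sx`: `sortedUnion`, `sortedInter`, `sortedDiff`, `allDocs` and `evalQuery` are named after the plan, while `postings`, `phraseDocs`, `sampleIndex` and the integer `DocId` are hypothetical. Each set operation is a single linear merge over ascending, duplicate-free DocId lists.

```haskell
type DocId = Int
type Pos   = Int
type Index = [(String, [(DocId, [Pos])])]

data Query
  = Term String
  | And Query Query
  | Or Query Query
  | Not Query
  | Phrase [String]

-- Linear merges over ascending, duplicate-free DocId lists.
sortedUnion, sortedInter, sortedDiff :: [DocId] -> [DocId] -> [DocId]
sortedUnion xs [] = xs
sortedUnion [] ys = ys
sortedUnion (x : xs) (y : ys)
  | x < y     = x : sortedUnion xs (y : ys)
  | x > y     = y : sortedUnion (x : xs) ys
  | otherwise = x : sortedUnion xs ys

sortedInter (x : xs) (y : ys)
  | x < y     = sortedInter xs (y : ys)
  | x > y     = sortedInter (x : xs) ys
  | otherwise = x : sortedInter xs ys
sortedInter _ _ = []

sortedDiff xs [] = xs
sortedDiff [] _  = []
sortedDiff (x : xs) (y : ys)
  | x < y     = x : sortedDiff xs (y : ys)
  | x > y     = sortedDiff (x : xs) ys
  | otherwise = sortedDiff xs ys

postings :: Index -> String -> [(DocId, [Pos])]
postings ix t = maybe [] id (lookup t ix)

-- Universe of documents, used as the left operand of Not.
allDocs :: Index -> [DocId]
allDocs = foldr (sortedUnion . map fst . snd) []

-- A doc matches if some start position p has word i at position p + i.
phraseDocs :: Index -> [String] -> [DocId]
phraseDocs _ [] = []
phraseDocs ix ws@(w : _) = [ d | (d, starts) <- postings ix w, any (startsAt d) starts ]
  where
    startsAt d p  = and [ (p + i) `elem` positions d t | (i, t) <- zip [0 ..] ws ]
    positions d t = maybe [] id (lookup d (postings ix t))

evalQuery :: Index -> Query -> [DocId]
evalQuery ix (Term t)    = map fst (postings ix t)
evalQuery ix (And a b)   = sortedInter (evalQuery ix a) (evalQuery ix b)
evalQuery ix (Or a b)    = sortedUnion (evalQuery ix a) (evalQuery ix b)
evalQuery ix (Not q)     = sortedDiff (allDocs ix) (evalQuery ix q)
evalQuery ix (Phrase ws) = phraseDocs ix ws

-- Hypothetical data: doc 1 = "alice", doc 2 = "bob alice", doc 3 = "bob carol".
sampleIndex :: Index
sampleIndex =
  [ ("alice", [(1, [0]), (2, [1])])
  , ("bob",   [(2, [0]), (3, [0])])
  , ("carol", [(3, [1])])
  ]
```

With `sampleIndex`, `Phrase ["bob", "carol"]` matches only doc 3, while `Phrase ["alice", "bob"]` matches nothing because doc 2 has the two words in the opposite order.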
## Phase 3 — Ranking

- [x] document frequency — `docFreq`/`idf`/`bm25idf` derived from the index
  (posting-list length); no separate df store needed
- [x] TF-IDF scoring (`rankTfIdf`)
- [x] BM25 scoring, configurable k1/b (`rankBm25 k1 b`)
- [x] top-N retrieval (`topNTfIdf`/`topNBm25` — sortBy + take; stable DocId tiebreak)
- [x] `lib/search/tests/rank.sx` — 23 cases: TF-IDF tf/idf behavior, BM25 length-norm
  + tf-saturation flips vs TF-IDF, b-parameter effect, tiebreak stability, top-N

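The plan does not spell out the scoring formulas, so the sketch below uses the common textbook variants as an assumption: TF-IDF as `tf * log (N / df)`, and BM25 with `idf = log (1 + (N - df + 0.5) / (df + 0.5))`. The names `rankTfIdf`, `rankBm25`, `idf`, `bm25idf` and `docFreq` come from the checklist; `candidates`, `rankBy`, `docLen`, `topN`, `sampleIndex` and all signatures are hypothetical, and the real `rank.sx` may differ. The sketch assumes `k1 > 0`.

```haskell
import Data.List (nub, sort, sortBy)
import Data.Ord (Down (..), comparing)

type DocId = Int
type Index = [(String, [(DocId, [Int])])]

postings :: Index -> String -> [(DocId, [Int])]
postings ix t = maybe [] id (lookup t ix)

allDocs :: Index -> [DocId]
allDocs ix = sort (nub [ d | (_, pl) <- ix, (d, _) <- pl ])

numDocs :: Index -> Double
numDocs = fromIntegral . length . allDocs

-- Term frequency = number of recorded positions of t in d.
tf :: Index -> String -> DocId -> Double
tf ix t d = fromIntegral (maybe 0 length (lookup d (postings ix t)))

-- Document frequency = posting-list length (no separate df store).
docFreq :: Index -> String -> Double
docFreq ix t = fromIntegral (length (postings ix t))

-- Document length recovered from the index: total positions across all terms.
docLen :: Index -> DocId -> Double
docLen ix d = fromIntegral (sum [ length ps | (_, pl) <- ix, (d', ps) <- pl, d' == d ])

idf, bm25idf :: Index -> String -> Double
idf ix t
  | df == 0   = 0
  | otherwise = log (numDocs ix / df)
  where df = docFreq ix t
bm25idf ix t = log (1 + (numDocs ix - df + 0.5) / (df + 0.5))
  where df = docFreq ix t

-- Candidate set: docs containing any query term.
candidates :: Index -> [String] -> [DocId]
candidates ix ts = sort (nub [ d | t <- ts, (d, _) <- postings ix t ])

-- Score descending, then DocId ascending as a deterministic tiebreak.
rankBy :: (DocId -> Double) -> [DocId] -> [(DocId, Double)]
rankBy score ds = sortBy (comparing key) [ (d, score d) | d <- ds ]
  where key (d, s) = (Down s, d)

rankTfIdf :: Index -> [String] -> [(DocId, Double)]
rankTfIdf ix ts = rankBy score (candidates ix ts)
  where score d = sum [ tf ix t d * idf ix t | t <- ts ]

rankBm25 :: Double -> Double -> Index -> [String] -> [(DocId, Double)]
rankBm25 k1 b ix ts = rankBy score (candidates ix ts)
  where
    avgdl   = sum (map (docLen ix) (allDocs ix)) / numDocs ix
    score d = sum [ bm25idf ix t * f * (k1 + 1) / (f + k1 * (1 - b + b * docLen ix d / avgdl))
                  | t <- ts, let f = tf ix t d ]

topN :: Int -> [(DocId, Double)] -> [DocId]
topN n = map fst . take n

-- Hypothetical data: doc 1 = "cat cat dog", doc 2 = "cat", doc 3 = "bird".
sampleIndex :: Index
sampleIndex =
  [ ("bird", [(3, [0])])
  , ("cat",  [(1, [0, 1]), (2, [0])])
  , ("dog",  [(1, [2])])
  ]
```

On `sampleIndex`, the query `["cat"]` shows the kind of flip the test suite checks: `rankTfIdf` puts doc 1 first (higher tf), whereas `rankBm25 1.2 0.75` puts the shorter doc 2 first. Setting `b` to 0 turns off length normalization and restores doc 1 to the top.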
## Phase 4 — ACL filter + federation

- [x] post-filter — `aclFilter`/`searchTfIdfAcl`/`topNTfIdfAcl`/`searchBm25Acl` take an
  injected `permit :: DocId -> Bool` predicate, applied post-rank (never in the index)
- [x] federated query — `fedIndex :: [(PeerId, Index)] -> Index` merges per-peer
  inverted indices (union posting lists per term); rank/search run once over the merge
- [x] merge policy — relabel local DocIds to global `gid = peer*1000 + local`
  (bijection ⇒ dedupe by (peer,doc-id) is automatic); ranking interleaves peers by score
- [x] `lib/search/tests/integration.sx` — 21 cases: index merge, cross-peer df/lookup,
  position preservation, boolean/phrase over the merge, ACL filter + top-N + bm25

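A sketch of the merge policy and the post-rank ACL filter, under assumptions: `fedIndex` and `aclFilter` follow the signatures and names given above, while `globalId`, `relabel`, `mergeIndex` and `mergePostings` are hypothetical helpers. Note that `gid = peer*1000 + local` maps distinct (peer, doc-id) pairs to distinct ids only while local ids stay in the range 0 to 999; the sketch takes that as given.

```haskell
type PeerId = Int
type DocId  = Int
type Pos    = Int
type Index  = [(String, [(DocId, [Pos])])]

-- Global id from (peer, local id). Assumes 0 <= local < 1000.
globalId :: PeerId -> DocId -> DocId
globalId peer local = peer * 1000 + local

-- Rewrite every DocId in a peer's index; positions are kept as-is.
relabel :: PeerId -> Index -> Index
relabel peer ix = [ (t, [ (globalId peer d, ps) | (d, ps) <- pl ]) | (t, pl) <- ix ]

-- Union two DocId-sorted posting lists; on equal ids the left entry wins.
mergePostings :: [(DocId, [Pos])] -> [(DocId, [Pos])] -> [(DocId, [Pos])]
mergePostings xs [] = xs
mergePostings [] ys = ys
mergePostings ((d, ps) : xs) ((e, qs) : ys)
  | d < e     = (d, ps) : mergePostings xs ((e, qs) : ys)
  | d > e     = (e, qs) : mergePostings ((d, ps) : xs) ys
  | otherwise = (d, ps) : mergePostings xs ys

-- Merge two term-sorted indices, unioning posting lists per term.
mergeIndex :: Index -> Index -> Index
mergeIndex xs [] = xs
mergeIndex [] ys = ys
mergeIndex ((t, pl) : xs) ((u, ql) : ys)
  | t < u     = (t, pl) : mergeIndex xs ((u, ql) : ys)
  | t > u     = (u, ql) : mergeIndex ((t, pl) : xs) ys
  | otherwise = (t, mergePostings pl ql) : mergeIndex xs ys

fedIndex :: [(PeerId, Index)] -> Index
fedIndex peers = foldr mergeIndex [] [ relabel p ix | (p, ix) <- peers ]

-- ACL is applied to ranked results, never to the index itself.
aclFilter :: (DocId -> Bool) -> [(DocId, Double)] -> [(DocId, Double)]
aclFilter permit = filter (permit . fst)
```

Since filtering happens after ranking, document frequencies and scores are computed over the full merged index regardless of who is asking; the caller closes over the viewer when building `permit`, for example `aclFilter (\d -> d `div` 1000 /= 2)` to hide everything from peer 2.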
## Extensions (post-roadmap, search-shaped vocabulary)

- [x] prefix / wildcard queries (`prefixTerms`, `prefixDocs`, `prefixRankTfIdf`) — 14 tests
- [ ] fuzzy matching — edit distance term expansion
- [ ] result pagination (offset / limit)
- [ ] snippet / highlight generation
- [ ] stemming (suffix stripping) — recall-improving normalizer

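The completed prefix extension is small enough to sketch directly. The names `prefixTerms`, `prefixDocs` and `allTerms` come from the plan; the argument order, the integer `DocId` and `sampleIndex` are assumptions, and `isPrefixOf` here is the standard `Data.List` function in GHC rather than whatever haskell-on-sx exposes.

```haskell
import Data.List (isPrefixOf, nub, sort)

type DocId = Int
type Index = [(String, [(DocId, [Int])])]

allTerms :: Index -> [String]
allTerms = map fst

-- Every indexed term that starts with the given prefix.
prefixTerms :: String -> Index -> [String]
prefixTerms p = filter (p `isPrefixOf`) . allTerms

-- Union of the docs of all matching terms, ascending and duplicate-free.
prefixDocs :: String -> Index -> [DocId]
prefixDocs p ix =
  sort (nub [ d | t <- prefixTerms p ix, Just pl <- [lookup t ix], (d, _) <- pl ])

-- Hypothetical data for illustration.
sampleIndex :: Index
sampleIndex =
  [ ("car",     [(3, [0])])
  , ("cat",     [(1, [0])])
  , ("catalog", [(1, [1]), (2, [0])])
  , ("dog",     [(2, [1])])
  ]
```

Ranking over a prefix then amounts to passing `prefixTerms p ix` as the query terms to the TF-IDF ranker, which is how the plan describes `prefixRankTfIdf`.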
## Progress log

- **Extension: prefix/wildcard queries (136/136 total).** `prefixTerms` matches every
  indexed term starting with a prefix (via allTerms + isPrefixOf); `prefixDocs` unions
  their docs; `prefixRankTfIdf` ranks treating the matched terms as the query. 14 tests.
- **Phase 4 complete — federation + ACL (122/122 total). Roadmap done.** `fedIndex`
  merges per-peer inverted indices (union posting lists per term) after relabelling
  local DocIds to global `gid = peer*1000 + local` — the bijection makes (peer,doc-id)
  dedupe automatic and keeps positions, so ranking runs once over the merge and
  interleaves peers by score (rank-correct). ACL is a post-rank `filter` over an
  injected `permit :: DocId -> Bool` (viewer baked in by the caller) — never in the
  index; `searchTfIdfAcl`/`topNTfIdfAcl`/`searchBm25Acl`. 21 integration tests.
- **Phase 3 complete — ranking (101/101 total).** TF-IDF (`rankTfIdf`) and BM25
  (`rankBm25 k1 b`) over the candidate set (docs containing any query term), scores
  as floats with deterministic DocId-ascending tiebreak; `topNTfIdf`/`topNBm25` via
  sortBy+take. df/idf derived from posting-list length (no separate df store). 23
  tests incl. a BM25-vs-TF-IDF flip (length-norm + tf-saturation) and the b-parameter
  effect. Float division/`log`/float literals all work in haskell-on-sx.
- **Phase 2 complete — parser (78/78 total).** Query tokenizer (ord-based
  delimiters, quoted phrases) + recursive-descent parser with OR<AND<NOT precedence,
  implicit AND on adjacency, parens, case-insensitive keywords. `parseQuery`,
  `searchQuery`, `showQ` (canonical render for AST tests). 32 tests in parse.sx.
  **haskell-on-sx parser gotchas hit while writing this (see parse.sx header):**
  (1) escaped char literals like `'\"'` break the tokenizer — match delimiters by
  `ord c == 34`; (2) an `[]` *pattern* inside a `case` alt breaks the parser — use
  multi-clause functions instead; (3) `case`/constructor patterns and `let (a,b)=..`
  are fine. Embedded Haskell string literals in a `.sx` source string need single
  `\"`, not `\\\"`.
- **Phase 2 boolean/phrase eval (46/46 total).** Query ADT
  `Term|And|Or|Not|Phrase` + `evalQuery :: Index -> Query -> [DocId]` in query.sx.
  Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from
  the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx.
  Refactored both suites to **batch all cases into one program eval** (search-batch
  in testlib) — under the heavy CPU load on this box (~11 on 2 cores), 18–28 separate
  hk-eval-program calls timed out; one combined eval per suite is ~20× faster.
  Parser (parse.sx) is the remaining Phase 2 box.
- **Phase 1 complete (18/18).** Tokenizer (lowercase + strip punctuation + positions),
  inverted index as sorted assoc-list `[(Term,[(DocId,[Pos])])]`, indexDoc/deleteDoc/
  lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into `search/src`
  and evaluated via the haskell-on-sx interpreter; tests reuse `hk-test` counters and a
  `search-eval` helper that forces HK values to plain SX. conformance.sh models
  lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).

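The two parser gotchas recorded in the Phase 2 parser entry suggest a particular coding style, illustrated below. This is ordinary Haskell that compiles under GHC; it has not been run on the haskell-on-sx interpreter, and `isQuote`, `countQuotes` and `takePhrase` are hypothetical helpers, not functions from `parse.sx`.

```haskell
import Data.Char (ord)

-- Gotcha (1): compare code points instead of writing an escaped char literal.
-- 34 is the code point of the double quote.
isQuote :: Char -> Bool
isQuote c = ord c == 34

-- Gotcha (2): handle the empty list with a separate function clause
-- rather than an `[]` pattern inside a `case` alternative.
countQuotes :: String -> Int
countQuotes []       = 0
countQuotes (c : cs) = (if isQuote c then 1 else 0) + countQuotes cs

-- Split off the text up to the next double quote, returning (phrase, rest).
-- Uses a `let (a, b) = ..` binding, which the log notes is fine.
takePhrase :: String -> (String, String)
takePhrase [] = ([], [])
takePhrase (c : cs)
  | isQuote c = ([], cs)
  | otherwise = let (body, rest) = takePhrase cs in (c : body, rest)
```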
## Blockers

- **None.** Note: the box is heavily CPU-oversubscribed by sibling loop agents
  (load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite
  timeout is set to 600s. Runs are correct, just slow.
- **Data.Map public API gap (informational, not fixing):** the haskell-on-sx
  `import Data.Map` binds only empty/singleton/insert/lookup/member/size/null/delete/
  insertWith/adjust/findWithDefault — no toList/keys/elems/map/filter/unionWith. Index
  uses a pure assoc-list instead so term iteration and federation merge stay simple.