Merge loops/search into architecture: search-on-sx full-text search on Haskell
Tokenizer + inverted index, query AST (boolean/phrase) + parser, TF-IDF/BM25 ranking + top-N, federation merge + ACL post-filter, and 9 extensions (prefix, pagination, fuzzy, highlight, stem, NEAR, synonyms, boolean-ranked search, did-you-mean). lib/search/conformance.sh => 234/234 across 14 suites. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
110
plans/agent-briefings/search-loop.md
Normal file
110
plans/agent-briefings/search-loop.md
Normal file
@@ -0,0 +1,110 @@
|
||||
# search-on-sx loop agent (single agent, queue-driven)
|
||||
|
||||
Role: iterates `plans/search-on-sx.md` forever. **Full-text + structured search on
|
||||
Haskell** — tokenize, inverted index, query AST, boolean + phrase + ranked
|
||||
queries (TF-IDF / BM25), ACL-aware post-filter, federated index merge. Typed ADTs
|
||||
make query parsing clean; lazy lists make posting-list iteration efficient. Sits on
|
||||
`lib/haskell/` (1514/1514 already green); adds a search-shaped vocabulary on top.
|
||||
|
||||
```
|
||||
description: search-on-sx queue loop
|
||||
subagent_type: general-purpose
|
||||
run_in_background: true
|
||||
isolation: worktree
|
||||
```
|
||||
|
||||
## Prompt
|
||||
|
||||
You are the sole background agent working `plans/search-on-sx.md`. Isolated
|
||||
worktree `/root/rose-ash-loops/search` on branch `loops/search`, forever, one
|
||||
commit per feature. Push to `origin/loops/search` after every commit. Never touch
|
||||
`main` or `architecture`.
|
||||
|
||||
## Restart baseline — check before iterating
|
||||
|
||||
1. Read `plans/search-on-sx.md` — roadmap + Progress log.
|
||||
2. `ls lib/search/` — pick up from the most advanced file.
|
||||
3. If `lib/search/tests/*.sx` exist, run them via `bash lib/search/conformance.sh`.
|
||||
Green before new work.
|
||||
4. If `lib/search/scoreboard.md` exists, that's your baseline.
|
||||
5. Read the `lib/haskell/` public API once — that's your substrate. `lib/haskell/
|
||||
haskell.sx` exists; also study `runtime.sx`, `eval.sx`, `parser.sx`, `infer.sx`,
|
||||
`match.sx`, `map.sx`, `set.sx`, `testlib.sx`. Learn how to declare ADTs, pattern
|
||||
match, and use the `Map`/`Set` helpers before writing index code. Verify the real
|
||||
exported names with sx_find_all / grep — don't assume from the plan's sketch.
|
||||
|
||||
## The queue
|
||||
|
||||
Phase order per `plans/search-on-sx.md`:
|
||||
|
||||
- **Phase 1** — tokenize + inverted index + simple term lookup
|
||||
(`Map Term [(DocId,[Pos])]`, insert/lookup, `(search/index doc)`,
|
||||
`(search/query term)`).
|
||||
- **Phase 2** — query AST + boolean/phrase eval (Term | And | Or | Not | Phrase;
|
||||
posting-list set ops; positional phrase match).
|
||||
- **Phase 3** — ranking (TF-IDF, BM25), top-N.
|
||||
- **Phase 4** — ACL-aware post-filter + federation (merge per-peer indices).
|
||||
|
||||
Within a phase, pick the checkbox that unlocks the most tests per effort.
|
||||
|
||||
Every iteration: implement → test → commit → tick `[ ]` → Progress log → next.
|
||||
|
||||
## Ground rules (hard)
|
||||
|
||||
- **Scope:** only `lib/search/**` and `plans/search-on-sx.md`. Do **not** edit
|
||||
`spec/`, `hosts/`, `shared/`, other `lib/<lang>/` dirs, `lib/stdlib.sx`, or
|
||||
`lib/` root. May **import** from `lib/haskell/` only (its public API). Do **not**
|
||||
modify Haskell.
|
||||
- **NEVER call `sx_build`.** 600s watchdog. If the sx_server binary is broken →
|
||||
Blockers entry, stop. Run tests by invoking the sx_server binary directly from a
|
||||
conformance.sh (model it on `lib/haskell/conformance.sh`), pointing `SX_SERVER`
|
||||
at `/root/rose-ash/hosts/ocaml/_build/default/bin/sx_server.exe` — fresh
|
||||
worktrees have no `_build/`, so the relative path won't resolve.
|
||||
- **Shared-file issues** → plan's Blockers with minimal repro; don't fix here.
|
||||
- **SX files:** `sx-tree` MCP tools ONLY. **They take `file:` not `path:`** — a
|
||||
wrong key yields `Yojson Type_error("Expected string, got null")`, which looks
|
||||
like a broken binary but is just a param mismatch. `sx_validate` after edits.
|
||||
Path-based edits (`sx_replace_node`) count comment headers in their indices and
|
||||
can clobber the wrong node — re-read after, or prefer `sx_write_file` for small
|
||||
files.
|
||||
- **Unicode in `.sx`:** raw UTF-8 only, never `\uXXXX` escapes.
|
||||
- **Commit granularity:** one feature per commit. Short factual messages
|
||||
(`search: phrase query positional match + 7 tests`). Push to `origin/loops/search`.
|
||||
- **Plan file:** update Progress log (newest first) + tick boxes every commit.
|
||||
|
||||
## search-specific gotchas
|
||||
|
||||
- **Posting lists are the hot path.** Keep them sorted by DocId so boolean AND/OR
|
||||
are linear merges, not nested scans. Phrase match needs positions, so store
|
||||
`(DocId, [Pos])` — don't drop positions early to save space; you can't recover them.
|
||||
- **Tokenization decides recall.** Normalize consistently (lowercase, strip
|
||||
punctuation) on BOTH index and query side, or queries silently miss. Test the
|
||||
index/query symmetry explicitly.
|
||||
- **Ranking must be deterministic on ties.** TF-IDF/BM25 scores collide; always
|
||||
add a stable tiebreak (DocId ascending) or tests flake.
|
||||
- **ACL filter is per-viewer and post-ranking.** Filter the result list against the
|
||||
viewer, after scoring — never bake visibility into the index (the same index
|
||||
serves all viewers). Inject the permit predicate; don't hardwire an ACL module
|
||||
that doesn't exist yet.
|
||||
- **Federation merges indices, not results.** Merging per-peer inverted indices
|
||||
(union posting lists per term) is cleaner and rank-correct vs merging ranked
|
||||
result lists. Mock peer indices in tests.
|
||||
|
||||
## General gotchas (all loops)
|
||||
|
||||
- SX `do` = R7RS iteration. Use `begin` for multi-expr sequences.
|
||||
- `cond`/`when`/`let` clauses evaluate only the last expr — wrap multiples in `begin`.
|
||||
- `let` is parallel, not sequential — nest `let`s when a binding references an earlier one.
|
||||
- `env-bind!` creates a binding; `env-set!` mutates an existing one (walks scope chain).
|
||||
- `sx_validate` after every structural edit.
|
||||
- Namespace-prefix all guest helpers (`search/...`) — short/host-colliding names
|
||||
get silently shadowed or hang the runtime.
|
||||
|
||||
## Style
|
||||
|
||||
- No comments in `.sx` unless non-obvious.
|
||||
- No new planning docs — update `plans/search-on-sx.md` inline.
|
||||
- Short, factual commit messages.
|
||||
- One feature per iteration. Commit. Log. Push. Next.
|
||||
|
||||
Go. Start by reading the plan; find the first unchecked `[ ]`; implement it.
|
||||
@@ -10,7 +10,7 @@ extension that merges per-peer indices.
|
||||
|
||||
## Status (rolling)
|
||||
|
||||
`bash lib/search/conformance.sh` → **0/0** (not yet started)
|
||||
`bash lib/search/conformance.sh` → **122/122** (Phases 1–4 complete)
|
||||
|
||||
## Ground rules
|
||||
|
||||
@@ -61,46 +61,148 @@ lib/search/index.sx lib/search/eval.sx
|
||||
|
||||
## Phase 1 — Tokenize + index
|
||||
|
||||
- [ ] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on
|
||||
- [x] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on
|
||||
whitespace, return positions
|
||||
- [ ] `lib/search/index.sx` — inverted index data structure (typed `Map` from
|
||||
haskell lib); `insert`, `delete`, `lookup`
|
||||
- [ ] `lib/search/api.sx` — `(search/index doc)`, `(search/lookup term)`
|
||||
- [ ] `lib/search/tests/index.sx` — 15+ cases: tokenize, insert + lookup, update,
|
||||
delete, multi-doc
|
||||
- [ ] `lib/search/scoreboard.{json,md}`
|
||||
- [ ] `lib/search/conformance.sh`
|
||||
- [x] `lib/search/index.sx` — inverted index data structure; `indexDoc`, `deleteDoc`,
|
||||
`lookupTerm`, `docFreq`, `allTerms`. (Data.Map's public API lacks
|
||||
toList/keys/map/filter, so a sorted assoc-list `[(Term,[(DocId,[Pos])])]` is used —
|
||||
the conceptual `Map Term [(DocId,[Pos])]` with free term iteration.)
|
||||
- [x] `lib/search/api.sx` — assembles `search/src` (tokenize + index); Haskell entry
|
||||
points `indexDoc` / `lookupTerm`
|
||||
- [x] `lib/search/tests/index.sx` — 18 cases: tokenize, insert + lookup, update,
|
||||
delete, multi-doc, positions, docFreq, allTerms
|
||||
- [x] `lib/search/scoreboard.{json,md}`
|
||||
- [x] `lib/search/conformance.sh`
|
||||
|
||||
## Phase 2 — Query AST + boolean evaluation
|
||||
|
||||
- [ ] Query ADT: `Term Text | And Query Query | Or Query Query | Not Query |
|
||||
Phrase [Text]`
|
||||
- [ ] `lib/search/parse.sx` — query syntax parser (boolean operators, quoted phrases)
|
||||
- [ ] `lib/search/eval.sx` — boolean eval via set ops on posting lists
|
||||
- [ ] phrase eval — adjacency check using positions
|
||||
- [ ] `lib/search/tests/boolean.sx` — 25+ cases: term, and, or, not, phrase,
|
||||
composition, parser edge cases
|
||||
- [x] Query ADT: `Term String | And Query Query | Or Query Query | Not Query |
|
||||
Phrase [String]` (in `lib/search/query.sx`)
|
||||
- [x] `lib/search/parse.sx` — query syntax parser: tokenizer + recursive-descent
|
||||
(OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens,
|
||||
case-insensitive keywords); `parseQuery`, `searchQuery`, `showQ`
|
||||
- [x] `lib/search/query.sx` — boolean eval via set ops on docid-sorted posting lists
|
||||
(sortedUnion/Inter/Diff, Not over allDocs universe)
|
||||
- [x] phrase eval — positional adjacency check (phraseInDoc / phraseStartsAt)
|
||||
- [x] `lib/search/tests/boolean.sx` — 28 cases: term, and, or, not, phrase,
|
||||
composition (parser edge cases move to the parse.sx suite)
|
||||
|
||||
## Phase 3 — Ranking
|
||||
|
||||
- [ ] document frequency tracking — extend index with `df` per term
|
||||
- [ ] TF-IDF scoring
|
||||
- [ ] BM25 scoring (configurable k1, b)
|
||||
- [ ] top-N retrieval (heap-based)
|
||||
- [ ] `lib/search/tests/rank.sx` — 20+ cases: TF-IDF behavior, BM25 vs TF-IDF,
|
||||
ranking stability, top-N correctness
|
||||
- [x] document frequency — `docFreq`/`idf`/`bm25idf` derived from the index
|
||||
(posting-list length); no separate df store needed
|
||||
- [x] TF-IDF scoring (`rankTfIdf`)
|
||||
- [x] BM25 scoring, configurable k1/b (`rankBm25 k1 b`)
|
||||
- [x] top-N retrieval (`topNTfIdf`/`topNBm25` — sortBy + take; stable DocId tiebreak)
|
||||
- [x] `lib/search/tests/rank.sx` — 23 cases: TF-IDF tf/idf behavior, BM25 length-norm
|
||||
+ tf-saturation flips vs TF-IDF, b-parameter effect, tiebreak stability, top-N
|
||||
|
||||
## Phase 4 — ACL filter + federation
|
||||
|
||||
- [ ] post-filter — each candidate result tested via `(acl/permit? viewer :read doc)`
|
||||
- [ ] federated query — fan out to peer instances via fed-sx, merge results
|
||||
- [ ] merge policy — interleave by rank, dedupe by `(peer, doc-id)`
|
||||
- [ ] `lib/search/tests/integration.sx` — federated search with ACL filter
|
||||
- [x] post-filter — `aclFilter`/`searchTfIdfAcl`/`topNTfIdfAcl`/`searchBm25Acl` take an
|
||||
injected `permit :: DocId -> Bool` predicate, applied post-rank (never in the index)
|
||||
- [x] federated query — `fedIndex :: [(PeerId, Index)] -> Index` merges per-peer
|
||||
inverted indices (union posting lists per term); rank/search run once over the merge
|
||||
- [x] merge policy — relabel local DocIds to global `gid = peer*1000 + local`
|
||||
(bijection ⇒ dedupe by (peer,doc-id) is automatic); ranking interleaves peers by score
|
||||
- [x] `lib/search/tests/integration.sx` — 21 cases: index merge, cross-peer df/lookup,
|
||||
position preservation, boolean/phrase over the merge, ACL filter + top-N + bm25
|
||||
|
||||
## Extensions (post-roadmap, search-shaped vocabulary)
|
||||
|
||||
- [x] prefix / wildcard queries (`prefixTerms`, `prefixDocs`, `prefixRankTfIdf`) — 14 tests
|
||||
- [x] fuzzy matching — edit distance term expansion (`editDist`, `fuzzyTerms`,
|
||||
`fuzzyDocs`, `fuzzyRankTfIdf`) — 18 tests
|
||||
- [x] result pagination (offset / limit) — `paginate`, `pageTfIdf`, `pageBm25`,
|
||||
`resultCount` — 12 tests
|
||||
- [x] snippet / highlight generation (`highlight`, `snippet`) — 12 tests
|
||||
- [x] stemming (suffix stripping) — `stem`, `stemText`, `stemTokens`, `indexStemmed`
|
||||
— 18 tests
|
||||
- [x] proximity / NEAR — `nearDocs k t1 t2` (unordered, within k positions) — 9 tests
|
||||
- [x] synonym / query expansion — `expandTerm`, `synDocs`, `synRankTfIdf` — 9 tests
|
||||
- [x] boolean-filtered ranked search — `queryTerms`, `searchRankTfIdf`,
|
||||
`searchRankBm25` (filter by boolean query, rank survivors by relevance) — 11 tests
|
||||
- [x] did-you-mean / spelling suggestion — `suggest`, `suggestN` (closest indexed
|
||||
terms by edit distance, alphabetical tiebreak) — 9 tests
|
||||
|
||||
## Progress log
|
||||
|
||||
(loop fills this in)
|
||||
- **Extension: did-you-mean / spelling suggestion (234/234 total).** `suggest`/`suggestN`
|
||||
rank indexed terms by edit distance to a (misspelled) query term, alphabetical
|
||||
tiebreak. 9 tests.
|
||||
- **Extension: boolean-filtered ranked search (225/225 total).** `searchRankTfIdf`/
|
||||
`searchRankBm25` parse a boolean query, filter docs via evalQuery, then rank the
|
||||
survivors by relevance over the query's leaf terms (`queryTerms`) — the real-world
|
||||
filter-then-rank pattern. 11 tests.
|
||||
- **Extension: synonyms/query expansion (214/214 total).** A synonym map
|
||||
`[(Term,[Term])]` expands a query term to itself + synonyms (`expandTerm`); `synDocs`
|
||||
unions, `synRankTfIdf` ranks the expanded set. 9 tests.
|
||||
- **Extension: proximity/NEAR (205/205 total).** `nearDocs k t1 t2 idx` returns docs
|
||||
where both terms occur within k positions (unordered), candidates = posting
|
||||
intersection, filtered on the positional postings. 9 tests.
|
||||
- **Extension: stemming (196/196 total).** Deterministic English suffix stripping
|
||||
(`stem`), `stemText`/`stemTokens`, `indexStemmed`. Two haskell-on-sx gotchas: take/drop
|
||||
over a String yield char CODES not char strings (rebuild via `joinChars . map chr`),
|
||||
and isSuffixOf's `reverse` trips `++` on the String repr (manual suffix compare). All
|
||||
five planned extensions now done; the loop can keep adding search vocabulary. 18 tests.
|
||||
- **Extension: highlight/snippet (178/178 total).** `highlight terms text` marks
|
||||
query-matching (normalized) tokens with [..]; `snippet ctx terms text` extracts a
|
||||
context window around the first match. 12 tests.
|
||||
- **Extension: fuzzy matching (166/166 total).** Levenshtein `editDist` as an O(m*n)
|
||||
row-based DP (the naive recursive version is exponential and times out under load),
|
||||
`fuzzyTerms`/`fuzzyDocs`/`fuzzyRankTfIdf` expand a term to indexed terms within a max
|
||||
edit distance. 18 tests.
|
||||
- **Extension: pagination (148/148 total).** `paginate off lim` windows a ranked list
|
||||
(take lim . drop off); `pageTfIdf`/`pageBm25` + `resultCount`. 12 tests. Note the
|
||||
full conformance now runs 8 suites sequentially and needs an overall timeout ~1900s
|
||||
under the heavy box load.
|
||||
- **Extension: prefix/wildcard queries (136/136 total).** `prefixTerms` matches every
|
||||
indexed term starting with a prefix (via allTerms + isPrefixOf); `prefixDocs` unions
|
||||
their docs; `prefixRankTfIdf` ranks treating the matched terms as the query. 14 tests.
|
||||
- **Phase 4 complete — federation + ACL (122/122 total). Roadmap done.** `fedIndex`
|
||||
merges per-peer inverted indices (union posting lists per term) after relabelling
|
||||
local DocIds to global `gid = peer*1000 + local` — the bijection makes (peer,doc-id)
|
||||
dedupe automatic and keeps positions, so ranking runs once over the merge and
|
||||
interleaves peers by score (rank-correct). ACL is a post-rank `filter` over an
|
||||
injected `permit :: DocId -> Bool` (viewer baked in by the caller) — never in the
|
||||
index; `searchTfIdfAcl`/`topNTfIdfAcl`/`searchBm25Acl`. 21 integration tests.
|
||||
- **Phase 3 complete — ranking (101/101 total).** TF-IDF (`rankTfIdf`) and BM25
|
||||
(`rankBm25 k1 b`) over the candidate set (docs containing any query term), scores
|
||||
as floats with deterministic DocId-ascending tiebreak; `topNTfIdf`/`topNBm25` via
|
||||
sortBy+take. df/idf derived from posting-list length (no separate df store). 23
|
||||
tests incl. a BM25-vs-TF-IDF flip (length-norm + tf-saturation) and the b-parameter
|
||||
effect. Float division/`log`/float literals all work in haskell-on-sx.
|
||||
- **Phase 2 complete — parser (78/78 total).** Query tokenizer (ord-based
|
||||
delimiters, quoted phrases) + recursive-descent parser with OR<AND<NOT precedence,
|
||||
implicit AND on adjacency, parens, case-insensitive keywords. `parseQuery`,
|
||||
`searchQuery`, `showQ` (canonical render for AST tests). 32 tests in parse.sx.
|
||||
**haskell-on-sx parser gotchas hit while writing this (see parse.sx header):**
|
||||
(1) escaped char literals like `'\"'` break the tokenizer — match delimiters by
|
||||
`ord c == 34`; (2) an `[]` *pattern* inside a `case` alt breaks the parser — use
|
||||
multi-clause functions instead; (3) `case`/constructor patterns and `let (a,b)=..`
|
||||
are fine. Embedded Haskell string literals in a `.sx` source string need single
|
||||
`\"`, not `\\\"`.
|
||||
- **Phase 2 boolean/phrase eval (46/46 total).** Query ADT
|
||||
`Term|And|Or|Not|Phrase` + `evalQuery :: Index -> Query -> [DocId]` in query.sx.
|
||||
Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from
|
||||
the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx.
|
||||
Refactored both suites to **batch all cases into one program eval** (search-batch
|
||||
in testlib) — under the heavy CPU load on this box (~11 on 2 cores), 18–28 separate
|
||||
hk-eval-program calls timed out; one combined eval per suite is ~20× faster.
|
||||
Parser (parse.sx) is the remaining Phase 2 box.
|
||||
- **Phase 1 complete (18/18).** Tokenizer (lowercase + strip punctuation + positions),
|
||||
inverted index as sorted assoc-list `[(Term,[(DocId,[Pos])])]`, indexDoc/deleteDoc/
|
||||
lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into `search/src`
|
||||
and evaluated via the haskell-on-sx interpreter; tests reuse `hk-test` counters and a
|
||||
`search-eval` helper that forces HK values to plain SX. conformance.sh models
|
||||
lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).
|
||||
|
||||
## Blockers
|
||||
|
||||
(loop fills this in)
|
||||
- **None.** Note: the box is heavily CPU-oversubscribed by sibling loop agents
|
||||
(load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite
|
||||
timeout is set to 600s. Runs are correct, just slow.
|
||||
- **Data.Map public API gap (informational, not fixing):** the haskell-on-sx
|
||||
`import Data.Map` binds only empty/singleton/insert/lookup/member/size/null/delete/
|
||||
insertWith/adjust/findWithDefault — no toList/keys/elems/map/filter/unionWith. Index
|
||||
uses a pure assoc-list instead so term iteration and federation merge stay simple.
|
||||
|
||||
Reference in New Issue
Block a user