From e2de5a4675bc6c37610c5886a9c9ec91360d7c66 Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 17:27:20 +0000 Subject: [PATCH 01/15] briefings: add search-on-sx loop briefing Co-Authored-By: Claude Opus 4.8 (1M context) --- plans/agent-briefings/search-loop.md | 110 +++++++++++++++++++++++++++ 1 file changed, 110 insertions(+) create mode 100644 plans/agent-briefings/search-loop.md diff --git a/plans/agent-briefings/search-loop.md b/plans/agent-briefings/search-loop.md new file mode 100644 index 00000000..ee2346fa --- /dev/null +++ b/plans/agent-briefings/search-loop.md @@ -0,0 +1,110 @@ +# search-on-sx loop agent (single agent, queue-driven) + +Role: iterates `plans/search-on-sx.md` forever. **Full-text + structured search on +Haskell** — tokenize, inverted index, query AST, boolean + phrase + ranked +queries (TF-IDF / BM25), ACL-aware post-filter, federated index merge. Typed ADTs +make query parsing clean; lazy lists make posting-list iteration efficient. Sits on +`lib/haskell/` (1514/1514 already green); adds a search-shaped vocabulary on top. + +``` +description: search-on-sx queue loop +subagent_type: general-purpose +run_in_background: true +isolation: worktree +``` + +## Prompt + +You are the sole background agent working `plans/search-on-sx.md`. Isolated +worktree `/root/rose-ash-loops/search` on branch `loops/search`, forever, one +commit per feature. Push to `origin/loops/search` after every commit. Never touch +`main` or `architecture`. + +## Restart baseline — check before iterating + +1. Read `plans/search-on-sx.md` — roadmap + Progress log. +2. `ls lib/search/` — pick up from the most advanced file. +3. If `lib/search/tests/*.sx` exist, run them via `bash lib/search/conformance.sh`. + Green before new work. +4. If `lib/search/scoreboard.md` exists, that's your baseline. +5. Read the `lib/haskell/` public API once — that's your substrate. `lib/haskell/ + haskell.sx` exists; also study `runtime.sx`, `eval.sx`, `parser.sx`, `infer.sx`, + `match.sx`, `map.sx`, `set.sx`, `testlib.sx`. Learn how to declare ADTs, pattern + match, and use the `Map`/`Set` helpers before writing index code. Verify the real + exported names with sx_find_all / grep — don't assume from the plan's sketch. + +## The queue + +Phase order per `plans/search-on-sx.md`: + +- **Phase 1** — tokenize + inverted index + simple term lookup + (`Map Term [(DocId,[Pos])]`, insert/lookup, `(search/index doc)`, + `(search/query term)`). +- **Phase 2** — query AST + boolean/phrase eval (Term | And | Or | Not | Phrase; + posting-list set ops; positional phrase match). +- **Phase 3** — ranking (TF-IDF, BM25), top-N. +- **Phase 4** — ACL-aware post-filter + federation (merge per-peer indices). + +Within a phase, pick the checkbox that unlocks the most tests per effort. + +Every iteration: implement → test → commit → tick `[ ]` → Progress log → next. + +## Ground rules (hard) + +- **Scope:** only `lib/search/**` and `plans/search-on-sx.md`. Do **not** edit + `spec/`, `hosts/`, `shared/`, other `lib//` dirs, `lib/stdlib.sx`, or + `lib/` root. May **import** from `lib/haskell/` only (its public API). Do **not** + modify Haskell. +- **NEVER call `sx_build`.** 600s watchdog. If the sx_server binary is broken → + Blockers entry, stop. Run tests by invoking the sx_server binary directly from a + conformance.sh (model it on `lib/haskell/conformance.sh`), pointing `SX_SERVER` + at `/root/rose-ash/hosts/ocaml/_build/default/bin/sx_server.exe` — fresh + worktrees have no `_build/`, so the relative path won't resolve. +- **Shared-file issues** → plan's Blockers with minimal repro; don't fix here. +- **SX files:** `sx-tree` MCP tools ONLY. **They take `file:` not `path:`** — a + wrong key yields `Yojson Type_error("Expected string, got null")`, which looks + like a broken binary but is just a param mismatch. `sx_validate` after edits. + Path-based edits (`sx_replace_node`) count comment headers in their indices and + can clobber the wrong node — re-read after, or prefer `sx_write_file` for small + files. +- **Unicode in `.sx`:** raw UTF-8 only, never `\uXXXX` escapes. +- **Commit granularity:** one feature per commit. Short factual messages + (`search: phrase query positional match + 7 tests`). Push to `origin/loops/search`. +- **Plan file:** update Progress log (newest first) + tick boxes every commit. + +## search-specific gotchas + +- **Posting lists are the hot path.** Keep them sorted by DocId so boolean AND/OR + are linear merges, not nested scans. Phrase match needs positions, so store + `(DocId, [Pos])` — don't drop positions early to save space; you can't recover them. +- **Tokenization decides recall.** Normalize consistently (lowercase, strip + punctuation) on BOTH index and query side, or queries silently miss. Test the + index/query symmetry explicitly. +- **Ranking must be deterministic on ties.** TF-IDF/BM25 scores collide; always + add a stable tiebreak (DocId ascending) or tests flake. +- **ACL filter is per-viewer and post-ranking.** Filter the result list against the + viewer, after scoring — never bake visibility into the index (the same index + serves all viewers). Inject the permit predicate; don't hardwire an ACL module + that doesn't exist yet. +- **Federation merges indices, not results.** Merging per-peer inverted indices + (union posting lists per term) is cleaner and rank-correct vs merging ranked + result lists. Mock peer indices in tests. + +## General gotchas (all loops) + +- SX `do` = R7RS iteration. Use `begin` for multi-expr sequences. +- `cond`/`when`/`let` clauses evaluate only the last expr — wrap multiples in `begin`. +- `let` is parallel, not sequential — nest `let`s when a binding references an earlier one. +- `env-bind!` creates a binding; `env-set!` mutates an existing one (walks scope chain). +- `sx_validate` after every structural edit. +- Namespace-prefix all guest helpers (`search/...`) — short/host-colliding names + get silently shadowed or hang the runtime. + +## Style + +- No comments in `.sx` unless non-obvious. +- No new planning docs — update `plans/search-on-sx.md` inline. +- Short, factual commit messages. +- One feature per iteration. Commit. Log. Push. Next. + +Go. Start by reading the plan; find the first unchecked `[ ]`; implement it. From b8cf3eb1b86516176f310ccac640cfeee77356f8 Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 18:21:49 +0000 Subject: [PATCH 02/15] search: Phase 1 tokenizer + inverted index + 18 tests Tokenizer (lowercase, strip punctuation, positions) and a sorted assoc-list inverted index [(Term,[(DocId,[Pos])])] with indexDoc/deleteDoc/lookupTerm/ docFreq/allTerms. Search lib is haskell-on-sx source assembled into search/src; tests reuse hk-test counters via a search-eval helper. conformance.sh models lib/haskell. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 7 +++ lib/search/conformance.conf | 29 +++++++++ lib/search/conformance.sh | 3 + lib/search/index.sx | 15 +++++ lib/search/scoreboard.json | 10 +++ lib/search/scoreboard.md | 7 +++ lib/search/testlib.sx | 29 +++++++++ lib/search/tests/index.sx | 119 ++++++++++++++++++++++++++++++++++++ lib/search/tokenize.sx | 8 +++ plans/search-on-sx.md | 36 +++++++---- 10 files changed, 252 insertions(+), 11 deletions(-) create mode 100644 lib/search/api.sx create mode 100644 lib/search/conformance.conf create mode 100755 lib/search/conformance.sh create mode 100644 lib/search/index.sx create mode 100644 lib/search/scoreboard.json create mode 100644 lib/search/scoreboard.md create mode 100644 lib/search/testlib.sx create mode 100644 lib/search/tests/index.sx create mode 100644 lib/search/tokenize.sx diff --git a/lib/search/api.sx b/lib/search/api.sx new file mode 100644 index 00000000..8a06d444 --- /dev/null +++ b/lib/search/api.sx @@ -0,0 +1,7 @@ +;; search public API — assembles the canonical Haskell source from all layers. +;; Tests and callers concatenate `search/src` with their own top-level bindings +;; (e.g. "result = lookupTerm \"cat\" idx\n") and evaluate via the haskell-on-sx +;; interpreter. Public Haskell entry points: indexDoc, lookupTerm, deleteDoc, +;; docFreq, allTerms, tokens, positioned. + +(define search/src (str search/tokenize-src "\n" search/index-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf new file mode 100644 index 00000000..cc75c6e0 --- /dev/null +++ b/lib/search/conformance.conf @@ -0,0 +1,29 @@ +# search-on-sx conformance config — sourced by lib/guest/conformance.sh. + +LANG_NAME=search +SCOREBOARD_DIR=lib/search +MODE=counters +COUNTERS_PASS=hk-test-pass +COUNTERS_FAIL=hk-test-fail +TIMEOUT_PER_SUITE=600 + +PRELOADS=( + lib/haskell/tokenizer.sx + lib/haskell/layout.sx + lib/haskell/parser.sx + lib/haskell/desugar.sx + lib/haskell/runtime.sx + lib/haskell/match.sx + lib/haskell/eval.sx + lib/haskell/map.sx + lib/haskell/set.sx + lib/haskell/testlib.sx + lib/search/tokenize.sx + lib/search/index.sx + lib/search/api.sx + lib/search/testlib.sx +) + +SUITES=( + "index:lib/search/tests/index.sx" +) diff --git a/lib/search/conformance.sh b/lib/search/conformance.sh new file mode 100755 index 00000000..e50befa3 --- /dev/null +++ b/lib/search/conformance.sh @@ -0,0 +1,3 @@ +#!/usr/bin/env bash +# Thin wrapper — see lib/guest/conformance.sh and lib/search/conformance.conf. +exec bash "$(dirname "$0")/../guest/conformance.sh" "$(dirname "$0")/conformance.conf" "$@" diff --git a/lib/search/index.sx b/lib/search/index.sx new file mode 100644 index 00000000..3d285ec9 --- /dev/null +++ b/lib/search/index.sx @@ -0,0 +1,15 @@ +;; search inverted index — Haskell source fragment (depends on tokenize). +;; Index = [(Term, [(DocId, [Pos])])], sorted by Term; postings sorted by DocId. +;; Data.Map's public API lacks toList/keys/map/filter, so a sorted assoc-list +;; index is used — it is the conceptual `Map Term [(DocId,[Pos])]` and exposes +;; term iteration (allTerms) and df naturally for ranking. +;; emptyIndex :: Index +;; indexDoc :: DocId -> String -> Index -> Index (re-index replaces) +;; lookupTerm :: Term -> Index -> [(DocId, [Pos])] +;; deleteDoc :: DocId -> Index -> Index +;; docFreq :: Term -> Index -> Int +;; allTerms :: Index -> [Term] + +(define + search/index-src + "emptyIndex = []\ngroupBump [] t p = [(t, [p])]\ngroupBump (g:gs) t p = if fst g == t then (t, snd g ++ [p]) : gs else g : groupBump gs t p\ngroupStep acc tp = groupBump acc (fst tp) (snd tp)\ngroupTok pairs = foldl groupStep [] pairs\ninsPosting d ps [] = [(d, ps)]\ninsPosting d ps (q:qs) = if d < fst q then (d, ps) : q : qs else if d == fst q then (d, ps) : qs else q : insPosting d ps qs\ninsTerm t d ps [] = [(t, [(d, ps)])]\ninsTerm t d ps (e:es) = if t < fst e then (t, [(d, ps)]) : e : es else if t == fst e then (fst e, insPosting d ps (snd e)) : es else e : insTerm t d ps es\nindexStep d ix tp = insTerm (fst tp) d (snd tp) ix\nindexDoc d text idx = foldl (indexStep d) idx (groupTok (positioned text))\nlookupTerm t idx = case lookup t idx of { Nothing -> []; Just pl -> pl }\ndocFreq t idx = length (lookupTerm t idx)\nallTerms idx = map fst idx\npostingKeep d q = fst q /= d\ndropTermDoc d e = (fst e, filter (postingKeep d) (snd e))\nplKeep e = not (null (snd e))\ndeleteDoc d idx = filter plKeep (map (dropTermDoc d) idx)\n") diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json new file mode 100644 index 00000000..4c5202b0 --- /dev/null +++ b/lib/search/scoreboard.json @@ -0,0 +1,10 @@ +{ + "lang": "search", + "total_passed": 18, + "total_failed": 0, + "total": 18, + "suites": [ + {"name":"index","passed":18,"failed":0,"total":18} + ], + "generated": "2026-06-06T18:12:50+00:00" +} diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md new file mode 100644 index 00000000..cf9cabce --- /dev/null +++ b/lib/search/scoreboard.md @@ -0,0 +1,7 @@ +# search scoreboard + +**18 / 18 passing** (0 failure(s)). + +| Suite | Passed | Total | Status | +|-------|--------|-------|--------| +| index | 18 | 18 | ok | diff --git a/lib/search/testlib.sx b/lib/search/testlib.sx new file mode 100644 index 00000000..9c965b05 --- /dev/null +++ b/lib/search/testlib.sx @@ -0,0 +1,29 @@ +;; search test helpers — convert forced haskell values to plain SX and run +;; programs built on top of search/src. Reuses hk-test / counters from +;; lib/haskell/testlib.sx (preloaded by the conformance config). + +;; Recursively turn a forced HK value into plain SX: +;; cons-list -> SX list, Tuple -> SX list, leaves unchanged. +(define + search-hk->sx + (fn + (v) + (cond + ((and (list? v) (not (empty? v)) (= (first v) "[]")) (list)) + ((and (list? v) (not (empty? v)) (= (first v) ":")) + (cons + (search-hk->sx (nth v 1)) + (search-hk->sx (nth v 2)))) + ((and (list? v) (not (empty? v)) (= (first v) "Tuple")) + (map search-hk->sx (rest v))) + (:else v)))) + +;; Evaluate `extra` (extra top-level Haskell bindings) on top of search/src +;; and return binding `name` as plain SX. +(define + search-eval + (fn + (extra name) + (search-hk->sx + (hk-deep-force + (get (hk-eval-program (hk-core (str search/src extra))) name))))) diff --git a/lib/search/tests/index.sx b/lib/search/tests/index.sx new file mode 100644 index 00000000..2e9cb700 --- /dev/null +++ b/lib/search/tests/index.sx @@ -0,0 +1,119 @@ +;; Phase 1 — tokenize + inverted index. + +(hk-test + "tokens basic lowercases" + (search-eval "\nresult = tokens \"The Cat sat\"\n" "result") + (list "the" "cat" "sat")) + +(hk-test + "tokens strips punctuation" + (search-eval "\nresult = tokens \"Hello, World!\"\n" "result") + (list "hello" "world")) + +(hk-test + "tokens collapses whitespace" + (search-eval "\nresult = tokens \" a b \"\n" "result") + (list "a" "b")) + +(hk-test + "tokens empty is empty" + (search-eval "\nresult = tokens \"\"\n" "result") + (list)) + +(hk-test + "tokens keeps digits" + (search-eval "\nresult = tokens \"abc123 x9\"\n" "result") + (list "abc123" "x9")) + +(hk-test + "positioned attaches ordinals" + (search-eval "\nresult = positioned \"a b a\"\n" "result") + (list (list "a" 0) (list "b" 1) (list "a" 2))) + +(hk-test + "index + lookup single doc" + (search-eval + "\nresult = lookupTerm \"cat\" (indexDoc 1 \"the cat sat\" emptyIndex)\n" + "result") + (list (list 1 (list 1)))) + +(hk-test + "lookup missing term is empty" + (search-eval + "\nresult = lookupTerm \"dog\" (indexDoc 1 \"the cat sat\" emptyIndex)\n" + "result") + (list)) + +(hk-test + "lookup records all positions" + (search-eval + "\nresult = lookupTerm \"the\" (indexDoc 1 \"the cat the dog the\" emptyIndex)\n" + "result") + (list (list 1 (list 0 2 4)))) + +(hk-test + "multi-doc posting list sorted by docid" + (search-eval + "\nresult = lookupTerm \"x\" (indexDoc 1 \"x y\" (indexDoc 2 \"x z\" emptyIndex))\n" + "result") + (list + (list 1 (list 0)) + (list 2 (list 0)))) + +(hk-test + "index/query case symmetry" + (search-eval + "\nresult = lookupTerm \"cat\" (indexDoc 1 \"CAT Cat cat\" emptyIndex)\n" + "result") + (list (list 1 (list 0 1 2)))) + +(hk-test + "re-index replaces a doc" + (search-eval + "\nresult = lookupTerm \"a\" (indexDoc 1 \"a a a\" (indexDoc 1 \"a\" emptyIndex))\n" + "result") + (list (list 1 (list 0 1 2)))) + +(hk-test + "delete removes a doc" + (search-eval + "\nresult = lookupTerm \"cat\" (deleteDoc 1 (indexDoc 1 \"the cat\" emptyIndex))\n" + "result") + (list)) + +(hk-test + "delete leaves other docs" + (search-eval + "\nresult = lookupTerm \"cat\" (deleteDoc 2 (indexDoc 2 \"big cat\" (indexDoc 1 \"the cat\" emptyIndex)))\n" + "result") + (list (list 1 (list 1)))) + +(hk-test + "docFreq counts docs" + (search-eval + "\nresult = docFreq \"cat\" (indexDoc 2 \"a cat\" (indexDoc 1 \"the cat\" emptyIndex))\n" + "result") + 2) + +(hk-test + "docFreq zero for missing" + (search-eval + "\nresult = docFreq \"zzz\" (indexDoc 1 \"a b\" emptyIndex)\n" + "result") + 0) + +(hk-test + "allTerms sorted and unique" + (search-eval + "\nresult = allTerms (indexDoc 1 \"banana apple cherry apple\" emptyIndex)\n" + "result") + (list "apple" "banana" "cherry")) + +(hk-test + "allTerms merged across docs" + (search-eval + "\nresult = allTerms (indexDoc 2 \"d a\" (indexDoc 1 \"c b\" emptyIndex))\n" + "result") + (list "a" "b" "c" "d")) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/lib/search/tokenize.sx b/lib/search/tokenize.sx new file mode 100644 index 00000000..7c1d74d3 --- /dev/null +++ b/lib/search/tokenize.sx @@ -0,0 +1,8 @@ +;; search tokenizer — Haskell source fragment. +;; normalize (lowercase + strip punctuation), split on whitespace, attach positions. +;; tokens :: String -> [String] +;; positioned :: String -> [(String, Int)] -- 0-based ordinal positions + +(define + search/tokenize-src + "lowerChar c = chr (toLower (ord c))\nnormChar c = if isAlphaNum c then lowerChar c else ' '\nisBlankCh c = c == ' '\ndropBlanks [] = []\ndropBlanks (c:cs) = if isBlankCh c then dropBlanks cs else c:cs\ntakeWord [] = []\ntakeWord (c:cs) = if isBlankCh c then [] else c : takeWord cs\nafterWord [] = []\nafterWord (c:cs) = if isBlankCh c then c:cs else afterWord cs\nsplitWords s = let s2 = dropBlanks s in if null s2 then [] else takeWord s2 : splitWords (afterWord s2)\nappendStr a b = a ++ b\njoinChars cs = foldr appendStr \"\" cs\ntokens s = map joinChars (splitWords (map normChar s))\nposFrom i [] = []\nposFrom i (x:xs) = (x, i) : posFrom (i + 1) xs\npositioned s = posFrom 0 (tokens s)\n") diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index 9e0045d4..1baf6e9a 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -10,7 +10,7 @@ extension that merges per-peer indices. ## Status (rolling) -`bash lib/search/conformance.sh` → **0/0** (not yet started) +`bash lib/search/conformance.sh` → **18/18** (Phase 1 complete) ## Ground rules @@ -61,15 +61,18 @@ lib/search/index.sx lib/search/eval.sx ## Phase 1 — Tokenize + index -- [ ] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on +- [x] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on whitespace, return positions -- [ ] `lib/search/index.sx` — inverted index data structure (typed `Map` from - haskell lib); `insert`, `delete`, `lookup` -- [ ] `lib/search/api.sx` — `(search/index doc)`, `(search/lookup term)` -- [ ] `lib/search/tests/index.sx` — 15+ cases: tokenize, insert + lookup, update, - delete, multi-doc -- [ ] `lib/search/scoreboard.{json,md}` -- [ ] `lib/search/conformance.sh` +- [x] `lib/search/index.sx` — inverted index data structure; `indexDoc`, `deleteDoc`, + `lookupTerm`, `docFreq`, `allTerms`. (Data.Map's public API lacks + toList/keys/map/filter, so a sorted assoc-list `[(Term,[(DocId,[Pos])])]` is used — + the conceptual `Map Term [(DocId,[Pos])]` with free term iteration.) +- [x] `lib/search/api.sx` — assembles `search/src` (tokenize + index); Haskell entry + points `indexDoc` / `lookupTerm` +- [x] `lib/search/tests/index.sx` — 18 cases: tokenize, insert + lookup, update, + delete, multi-doc, positions, docFreq, allTerms +- [x] `lib/search/scoreboard.{json,md}` +- [x] `lib/search/conformance.sh` ## Phase 2 — Query AST + boolean evaluation @@ -99,8 +102,19 @@ lib/search/index.sx lib/search/eval.sx ## Progress log -(loop fills this in) +- **Phase 1 complete (18/18).** Tokenizer (lowercase + strip punctuation + positions), + inverted index as sorted assoc-list `[(Term,[(DocId,[Pos])])]`, indexDoc/deleteDoc/ + lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into `search/src` + and evaluated via the haskell-on-sx interpreter; tests reuse `hk-test` counters and a + `search-eval` helper that forces HK values to plain SX. conformance.sh models + lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail). ## Blockers -(loop fills this in) +- **None.** Note: the box is heavily CPU-oversubscribed by sibling loop agents + (load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite + timeout is set to 600s. Runs are correct, just slow. +- **Data.Map public API gap (informational, not fixing):** the haskell-on-sx + `import Data.Map` binds only empty/singleton/insert/lookup/member/size/null/delete/ + insertWith/adjust/findWithDefault — no toList/keys/elems/map/filter/unionWith. Index + uses a pure assoc-list instead so term iteration and federation merge stay simple. From 0f0da0319c83285644f5b8299b8c09b3816a93c5 Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 18:47:42 +0000 Subject: [PATCH 03/15] search: Phase 2 query AST + boolean/phrase eval + 28 tests Query ADT (Term|And|Or|Not|Phrase) and evalQuery over docid-sorted posting lists: boolean ops as linear merges, Not over the allDocs universe, Phrase via positional adjacency. Batched both test suites into one program eval each (search-batch) so they finish under heavy CPU load. 46/46. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 6 +- lib/search/conformance.conf | 2 + lib/search/query.sx | 11 ++ lib/search/scoreboard.json | 9 +- lib/search/scoreboard.md | 3 +- lib/search/testlib.sx | 21 ++++ lib/search/tests/boolean.sx | 123 +++++++++++++++++++++++ lib/search/tests/index.sx | 193 +++++++++++++++--------------------- plans/search-on-sx.md | 21 ++-- 9 files changed, 264 insertions(+), 125 deletions(-) create mode 100644 lib/search/query.sx create mode 100644 lib/search/tests/boolean.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index 8a06d444..e2da2bb6 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -2,6 +2,8 @@ ;; Tests and callers concatenate `search/src` with their own top-level bindings ;; (e.g. "result = lookupTerm \"cat\" idx\n") and evaluate via the haskell-on-sx ;; interpreter. Public Haskell entry points: indexDoc, lookupTerm, deleteDoc, -;; docFreq, allTerms, tokens, positioned. +;; docFreq, allTerms, tokens, positioned, evalQuery, parseQuery. -(define search/src (str search/tokenize-src "\n" search/index-src)) +(define + search/src + (str search/tokenize-src "\n" search/index-src "\n" search/query-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index cc75c6e0..4e418e9f 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -20,10 +20,12 @@ PRELOADS=( lib/haskell/testlib.sx lib/search/tokenize.sx lib/search/index.sx + lib/search/query.sx lib/search/api.sx lib/search/testlib.sx ) SUITES=( "index:lib/search/tests/index.sx" + "boolean:lib/search/tests/boolean.sx" ) diff --git a/lib/search/query.sx b/lib/search/query.sx new file mode 100644 index 00000000..23025908 --- /dev/null +++ b/lib/search/query.sx @@ -0,0 +1,11 @@ +;; search query AST + boolean/phrase evaluation — Haskell source fragment. +;; Depends on tokenize + index. +;; data Query = Term String | And Query Query | Or Query Query +;; | Not Query | Phrase [String] +;; evalQuery :: Index -> Query -> [DocId] (sorted, unique) +;; Boolean ops are linear merges over docid-sorted posting lists; Not uses +;; allDocs as the universe; Phrase checks positional adjacency. + +(define + search/query-src + "data Query = Term String | And Query Query | Or Query Query | Not Query | Phrase [String]\ndocsWith t idx = map fst (lookupTerm t idx)\nsortedUnion [] ys = ys\nsortedUnion xs [] = xs\nsortedUnion (x:xs) (y:ys) = if x < y then x : sortedUnion xs (y:ys) else if x > y then y : sortedUnion (x:xs) ys else x : sortedUnion xs ys\nsortedInter [] ys = []\nsortedInter xs [] = []\nsortedInter (x:xs) (y:ys) = if x < y then sortedInter xs (y:ys) else if x > y then sortedInter (x:xs) ys else x : sortedInter xs ys\nsortedDiff [] ys = []\nsortedDiff xs [] = xs\nsortedDiff (x:xs) (y:ys) = if x < y then x : sortedDiff xs (y:ys) else if x > y then sortedDiff (x:xs) ys else sortedDiff xs ys\nmergeDocs acc e = sortedUnion acc (map fst (snd e))\nallDocs idx = foldl mergeDocs [] idx\nposIn t d idx = case lookup d (lookupTerm t idx) of { Nothing -> []; Just ps -> ps }\nelemSorted x [] = False\nelemSorted x (y:ys) = if x == y then True else if x < y then False else elemSorted x ys\nphraseAtAll [] d idx p i = True\nphraseAtAll (t:ts) d idx p i = if elemSorted (p + i) (posIn t d idx) then phraseAtAll ts d idx p (i + 1) else False\nphraseStartsAt ts d idx p = phraseAtAll ts d idx p 0\nphraseInDoc [] d idx = True\nphraseInDoc (t0:rest) d idx = any (phraseStartsAt (t0:rest) d idx) (posIn t0 d idx)\nphraseHere ts idx d = phraseInDoc ts d idx\ninterStep idx acc tt = sortedInter acc (docsWith tt idx)\nphraseCands [] idx = allDocs idx\nphraseCands (t:ts) idx = foldl (interStep idx) (docsWith t idx) ts\nphraseDocs ts idx = filter (phraseHere ts idx) (phraseCands ts idx)\nevalQuery idx q = case q of { Term t -> docsWith t idx ; And a b -> sortedInter (evalQuery idx a) (evalQuery idx b) ; Or a b -> sortedUnion (evalQuery idx a) (evalQuery idx b) ; Not a -> sortedDiff (allDocs idx) (evalQuery idx a) ; Phrase ts -> phraseDocs ts idx }\n") diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index 4c5202b0..51e8a2ec 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,10 +1,11 @@ { "lang": "search", - "total_passed": 18, + "total_passed": 46, "total_failed": 0, - "total": 18, + "total": 46, "suites": [ - {"name":"index","passed":18,"failed":0,"total":18} + {"name":"index","passed":18,"failed":0,"total":18}, + {"name":"boolean","passed":28,"failed":0,"total":28} ], - "generated": "2026-06-06T18:12:50+00:00" + "generated": "2026-06-06T18:46:54+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index cf9cabce..a214ce29 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,7 +1,8 @@ # search scoreboard -**18 / 18 passing** (0 failure(s)). +**46 / 46 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| | index | 18 | 18 | ok | +| boolean | 28 | 28 | ok | diff --git a/lib/search/testlib.sx b/lib/search/testlib.sx index 9c965b05..1e2212d0 100644 --- a/lib/search/testlib.sx +++ b/lib/search/testlib.sx @@ -27,3 +27,24 @@ (search-hk->sx (hk-deep-force (get (hk-eval-program (hk-core (str search/src extra))) name))))) + +(define + search-join + (fn + (sep xs) + (cond + ((empty? xs) "") + ((empty? (rest xs)) (first xs)) + (:else (str (first xs) sep (search-join sep (rest xs))))))) + +;; Batch many haskell expressions into ONE program evaluation (amortizes the +;; cost of parsing/binding search/src — important under heavy CPU load). +;; `setup` is extra top-level Haskell; `exprs` is a list of expression strings +;; whose results form a single haskell list. Returns the SX list of results. +(define + search-batch + (fn + (setup exprs) + (search-eval + (str setup "\nresult = [" (search-join ", " exprs) "]\n") + "result"))) diff --git a/lib/search/tests/boolean.sx b/lib/search/tests/boolean.sx new file mode 100644 index 00000000..f6e48ea8 --- /dev/null +++ b/lib/search/tests/boolean.sx @@ -0,0 +1,123 @@ +;; Phase 2 — query AST + boolean/phrase evaluation (hand-built Query values). +;; Corpus: +;; doc 1 "the quick brown dog" -> the quick brown dog +;; doc 2 "a quick brown fox" -> a quick brown fox +;; doc 3 "the dog barks loudly" -> the dog barks loudly +;; All queries run in ONE program evaluation (search-batch) to stay fast. + +(define + search-corpus + "idx = indexDoc 3 \"the dog barks loudly\" (indexDoc 2 \"a quick brown fox\" (indexDoc 1 \"the quick brown dog\" emptyIndex))\n") + +(define + bool-cases + (list + (list + "term in two docs" + "evalQuery idx (Term \"quick\")" + (list 1 2)) + (list + "term in two docs (the)" + "evalQuery idx (Term \"the\")" + (list 1 3)) + (list "term in one doc" "evalQuery idx (Term \"fox\")" (list 2)) + (list "term absent" "evalQuery idx (Term \"zzz\")" (list)) + (list + "term case-sensitive at AST level" + "evalQuery idx (Term \"QUICK\")" + (list)) + (list "term on empty index" "evalQuery emptyIndex (Term \"cat\")" (list)) + (list + "and both terms" + "evalQuery idx (And (Term \"quick\") (Term \"brown\"))" + (list 1 2)) + (list + "and overlap subset" + "evalQuery idx (And (Term \"the\") (Term \"dog\"))" + (list 1 3)) + (list + "and disjoint is empty" + "evalQuery idx (And (Term \"the\") (Term \"fox\"))" + (list)) + (list + "and right-nested" + "evalQuery idx (And (Term \"the\") (And (Term \"dog\") (Term \"barks\")))" + (list 3)) + (list + "or two singletons" + "evalQuery idx (Or (Term \"fox\") (Term \"barks\"))" + (list 2 3)) + (list + "or all docs" + "evalQuery idx (Or (Term \"quick\") (Term \"the\"))" + (list 1 2 3)) + (list + "or with absent term" + "evalQuery idx (Or (Term \"fox\") (Term \"zzz\"))" + (list 2)) + (list "not term" "evalQuery idx (Not (Term \"the\"))" (list 2)) + (list "not term 2" "evalQuery idx (Not (Term \"quick\"))" (list 3)) + (list + "and with not" + "evalQuery idx (And (Term \"quick\") (Not (Term \"the\")))" + (list 2)) + (list + "double negation" + "evalQuery idx (Not (Not (Term \"fox\")))" + (list 2)) + (list + "or of and with term" + "evalQuery idx (Or (And (Term \"the\") (Term \"dog\")) (Term \"fox\"))" + (list 1 2 3)) + (list + "phrase adjacent both docs" + "evalQuery idx (Phrase [\"quick\", \"brown\"])" + (list 1 2)) + (list + "phrase adjacent one doc" + "evalQuery idx (Phrase [\"brown\", \"dog\"])" + (list 1)) + (list + "phrase the quick" + "evalQuery idx (Phrase [\"the\", \"quick\"])" + (list 1)) + (list + "phrase dog barks" + "evalQuery idx (Phrase [\"dog\", \"barks\"])" + (list 3)) + (list + "phrase non-adjacent empty" + "evalQuery idx (Phrase [\"quick\", \"dog\"])" + (list)) + (list + "phrase order matters" + "evalQuery idx (Phrase [\"brown\", \"quick\"])" + (list)) + (list + "phrase single term" + "evalQuery idx (Phrase [\"dog\"])" + (list 1 3)) + (list + "phrase three terms" + "evalQuery idx (Phrase [\"the\", \"dog\", \"barks\"])" + (list 3)) + (list + "and of phrase and term" + "evalQuery idx (And (Phrase [\"quick\", \"brown\"]) (Term \"dog\"))" + (list 1)) + (list + "not of phrase" + "evalQuery idx (Not (Phrase [\"quick\", \"brown\"]))" + (list 3)))) + +(define + bool-results + (search-batch search-corpus (map (fn (c) (nth c 1)) bool-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth bool-results i) (nth c 2))) + bool-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/lib/search/tests/index.sx b/lib/search/tests/index.sx index 2e9cb700..9415866f 100644 --- a/lib/search/tests/index.sx +++ b/lib/search/tests/index.sx @@ -1,119 +1,88 @@ ;; Phase 1 — tokenize + inverted index. +;; All cases run in ONE program evaluation (search-batch) to stay fast under load. +;; Scalar results (docFreq) are wrapped as singleton lists so the batch is a list +;; of lists. -(hk-test - "tokens basic lowercases" - (search-eval "\nresult = tokens \"The Cat sat\"\n" "result") - (list "the" "cat" "sat")) - -(hk-test - "tokens strips punctuation" - (search-eval "\nresult = tokens \"Hello, World!\"\n" "result") - (list "hello" "world")) - -(hk-test - "tokens collapses whitespace" - (search-eval "\nresult = tokens \" a b \"\n" "result") - (list "a" "b")) - -(hk-test - "tokens empty is empty" - (search-eval "\nresult = tokens \"\"\n" "result") - (list)) - -(hk-test - "tokens keeps digits" - (search-eval "\nresult = tokens \"abc123 x9\"\n" "result") - (list "abc123" "x9")) - -(hk-test - "positioned attaches ordinals" - (search-eval "\nresult = positioned \"a b a\"\n" "result") - (list (list "a" 0) (list "b" 1) (list "a" 2))) - -(hk-test - "index + lookup single doc" - (search-eval - "\nresult = lookupTerm \"cat\" (indexDoc 1 \"the cat sat\" emptyIndex)\n" - "result") - (list (list 1 (list 1)))) - -(hk-test - "lookup missing term is empty" - (search-eval - "\nresult = lookupTerm \"dog\" (indexDoc 1 \"the cat sat\" emptyIndex)\n" - "result") - (list)) - -(hk-test - "lookup records all positions" - (search-eval - "\nresult = lookupTerm \"the\" (indexDoc 1 \"the cat the dog the\" emptyIndex)\n" - "result") - (list (list 1 (list 0 2 4)))) - -(hk-test - "multi-doc posting list sorted by docid" - (search-eval - "\nresult = lookupTerm \"x\" (indexDoc 1 \"x y\" (indexDoc 2 \"x z\" emptyIndex))\n" - "result") +(define + index-cases (list - (list 1 (list 0)) - (list 2 (list 0)))) + (list + "tokens basic lowercases" + "tokens \"The Cat sat\"" + (list "the" "cat" "sat")) + (list + "tokens strips punctuation" + "tokens \"Hello, World!\"" + (list "hello" "world")) + (list "tokens collapses whitespace" "tokens \" a b \"" (list "a" "b")) + (list "tokens empty is empty" "tokens \"\"" (list)) + (list "tokens keeps digits" "tokens \"abc123 x9\"" (list "abc123" "x9")) + (list + "positioned attaches ordinals" + "positioned \"a b a\"" + (list + (list "a" 0) + (list "b" 1) + (list "a" 2))) + (list + "index + lookup single doc" + "lookupTerm \"cat\" (indexDoc 1 \"the cat sat\" emptyIndex)" + (list (list 1 (list 1)))) + (list + "lookup missing term is empty" + "lookupTerm \"dog\" (indexDoc 1 \"the cat sat\" emptyIndex)" + (list)) + (list + "lookup records all positions" + "lookupTerm \"the\" (indexDoc 1 \"the cat the dog the\" emptyIndex)" + (list (list 1 (list 0 2 4)))) + (list + "multi-doc posting list sorted by docid" + "lookupTerm \"x\" (indexDoc 1 \"x y\" (indexDoc 2 \"x z\" emptyIndex))" + (list + (list 1 (list 0)) + (list 2 (list 0)))) + (list + "index/query case symmetry" + "lookupTerm \"cat\" (indexDoc 1 \"CAT Cat cat\" emptyIndex)" + (list (list 1 (list 0 1 2)))) + (list + "re-index replaces a doc" + "lookupTerm \"a\" (indexDoc 1 \"a a a\" (indexDoc 1 \"a\" emptyIndex))" + (list (list 1 (list 0 1 2)))) + (list + "delete removes a doc" + "lookupTerm \"cat\" (deleteDoc 1 (indexDoc 1 \"the cat\" emptyIndex))" + (list)) + (list + "delete leaves other docs" + "lookupTerm \"cat\" (deleteDoc 2 (indexDoc 2 \"big cat\" (indexDoc 1 \"the cat\" emptyIndex)))" + (list (list 1 (list 1)))) + (list + "docFreq counts docs" + "[docFreq \"cat\" (indexDoc 2 \"a cat\" (indexDoc 1 \"the cat\" emptyIndex))]" + (list 2)) + (list + "docFreq zero for missing" + "[docFreq \"zzz\" (indexDoc 1 \"a b\" emptyIndex)]" + (list 0)) + (list + "allTerms sorted and unique" + "allTerms (indexDoc 1 \"banana apple cherry apple\" emptyIndex)" + (list "apple" "banana" "cherry")) + (list + "allTerms merged across docs" + "allTerms (indexDoc 2 \"d a\" (indexDoc 1 \"c b\" emptyIndex))" + (list "a" "b" "c" "d")))) -(hk-test - "index/query case symmetry" - (search-eval - "\nresult = lookupTerm \"cat\" (indexDoc 1 \"CAT Cat cat\" emptyIndex)\n" - "result") - (list (list 1 (list 0 1 2)))) +(define + index-results + (search-batch "" (map (fn (c) (nth c 1)) index-cases))) -(hk-test - "re-index replaces a doc" - (search-eval - "\nresult = lookupTerm \"a\" (indexDoc 1 \"a a a\" (indexDoc 1 \"a\" emptyIndex))\n" - "result") - (list (list 1 (list 0 1 2)))) - -(hk-test - "delete removes a doc" - (search-eval - "\nresult = lookupTerm \"cat\" (deleteDoc 1 (indexDoc 1 \"the cat\" emptyIndex))\n" - "result") - (list)) - -(hk-test - "delete leaves other docs" - (search-eval - "\nresult = lookupTerm \"cat\" (deleteDoc 2 (indexDoc 2 \"big cat\" (indexDoc 1 \"the cat\" emptyIndex)))\n" - "result") - (list (list 1 (list 1)))) - -(hk-test - "docFreq counts docs" - (search-eval - "\nresult = docFreq \"cat\" (indexDoc 2 \"a cat\" (indexDoc 1 \"the cat\" emptyIndex))\n" - "result") - 2) - -(hk-test - "docFreq zero for missing" - (search-eval - "\nresult = docFreq \"zzz\" (indexDoc 1 \"a b\" emptyIndex)\n" - "result") - 0) - -(hk-test - "allTerms sorted and unique" - (search-eval - "\nresult = allTerms (indexDoc 1 \"banana apple cherry apple\" emptyIndex)\n" - "result") - (list "apple" "banana" "cherry")) - -(hk-test - "allTerms merged across docs" - (search-eval - "\nresult = allTerms (indexDoc 2 \"d a\" (indexDoc 1 \"c b\" emptyIndex))\n" - "result") - (list "a" "b" "c" "d")) +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth index-results i) (nth c 2))) + index-cases) {:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index 1baf6e9a..43c60240 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -76,13 +76,14 @@ lib/search/index.sx lib/search/eval.sx ## Phase 2 — Query AST + boolean evaluation -- [ ] Query ADT: `Term Text | And Query Query | Or Query Query | Not Query | - Phrase [Text]` +- [x] Query ADT: `Term String | And Query Query | Or Query Query | Not Query | + Phrase [String]` (in `lib/search/query.sx`) - [ ] `lib/search/parse.sx` — query syntax parser (boolean operators, quoted phrases) -- [ ] `lib/search/eval.sx` — boolean eval via set ops on posting lists -- [ ] phrase eval — adjacency check using positions -- [ ] `lib/search/tests/boolean.sx` — 25+ cases: term, and, or, not, phrase, - composition, parser edge cases +- [x] `lib/search/query.sx` — boolean eval via set ops on docid-sorted posting lists + (sortedUnion/Inter/Diff, Not over allDocs universe) +- [x] phrase eval — positional adjacency check (phraseInDoc / phraseStartsAt) +- [x] `lib/search/tests/boolean.sx` — 28 cases: term, and, or, not, phrase, + composition (parser edge cases move to the parse.sx suite) ## Phase 3 — Ranking @@ -102,6 +103,14 @@ lib/search/index.sx lib/search/eval.sx ## Progress log +- **Phase 2 boolean/phrase eval (46/46 total).** Query ADT + `Term|And|Or|Not|Phrase` + `evalQuery :: Index -> Query -> [DocId]` in query.sx. + Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from + the allDocs universe; Phrase checks positional adjacency. 28 tests in boolean.sx. + Refactored both suites to **batch all cases into one program eval** (search-batch + in testlib) — under the heavy CPU load on this box (~11 on 2 cores), 18–28 separate + hk-eval-program calls timed out; one combined eval per suite is ~20× faster. + Parser (parse.sx) is the remaining Phase 2 box. - **Phase 1 complete (18/18).** Tokenizer (lowercase + strip punctuation + positions), inverted index as sorted assoc-list `[(Term,[(DocId,[Pos])])]`, indexDoc/deleteDoc/ lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into `search/src` From 4c84decc016eaa896a532e5a95712a8bf304a22a Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 19:43:10 +0000 Subject: [PATCH 04/15] search: Phase 2 query parser + 32 tests Query tokenizer + recursive-descent parser: OR --- lib/search/api.sx | 11 ++- lib/search/conformance.conf | 2 + lib/search/parse.sx | 18 +++++ lib/search/scoreboard.json | 9 +-- lib/search/scoreboard.md | 3 +- lib/search/tests/parse.sx | 139 ++++++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 16 ++++- 7 files changed, 189 insertions(+), 9 deletions(-) create mode 100644 lib/search/parse.sx create mode 100644 lib/search/tests/parse.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index e2da2bb6..5a275f4d 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -2,8 +2,15 @@ ;; Tests and callers concatenate `search/src` with their own top-level bindings ;; (e.g. "result = lookupTerm \"cat\" idx\n") and evaluate via the haskell-on-sx ;; interpreter. Public Haskell entry points: indexDoc, lookupTerm, deleteDoc, -;; docFreq, allTerms, tokens, positioned, evalQuery, parseQuery. +;; docFreq, allTerms, tokens, positioned, evalQuery, parseQuery, searchQuery. (define search/src - (str search/tokenize-src "\n" search/index-src "\n" search/query-src)) + (str + search/tokenize-src + "\n" + search/index-src + "\n" + search/query-src + "\n" + search/parse-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index 4e418e9f..6e9e8309 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -21,6 +21,7 @@ PRELOADS=( lib/search/tokenize.sx lib/search/index.sx lib/search/query.sx + lib/search/parse.sx lib/search/api.sx lib/search/testlib.sx ) @@ -28,4 +29,5 @@ PRELOADS=( SUITES=( "index:lib/search/tests/index.sx" "boolean:lib/search/tests/boolean.sx" + "parse:lib/search/tests/parse.sx" ) diff --git a/lib/search/parse.sx b/lib/search/parse.sx new file mode 100644 index 00000000..a1dc4c8b --- /dev/null +++ b/lib/search/parse.sx @@ -0,0 +1,18 @@ +;; search query parser — Haskell source fragment. Depends on tokenize + query. +;; Grammar (precedence OR < AND < NOT): +;; expr = orExpr +;; orExpr = andExpr (OR andExpr)* +;; andExpr= notExpr ((AND | ) notExpr)* -- adjacency means AND +;; notExpr= NOT notExpr | atom +;; atom = '(' expr ')' | '"' word+ '"' | word +;; Keywords AND/OR/NOT are case-insensitive; bare words are normalized via tokens. +;; Gotchas: delimiters matched by ord (escaped char literals like '\"' break the +;; haskell-on-sx tokenizer); an [] *pattern* inside a `case` alt also breaks the +;; parser, so qNormTerm/qDropRP/showQ are written as multi-clause functions. +;; parseQuery :: String -> Query +;; searchQuery :: String -> Index -> [DocId] +;; showQ :: Query -> String -- canonical render for tests/debug + +(define + search/parse-src + "data QTok = TAnd | TOr | TNot | TLP | TRP | TWord String | TPhrase [String]\nqIsSpace c = ord c == 32\nqIsLP c = ord c == 40\nqIsRP c = ord c == 41\nqIsQuote c = ord c == 34\nqDelim c = qIsSpace c || qIsLP c || qIsRP c || qIsQuote c\nqReadWord [] = ([], [])\nqReadWord (c:cs) = if qDelim c then ([], c:cs) else let (w, rest) = qReadWord cs in (c:w, rest)\nqReadPhrase [] = ([], [])\nqReadPhrase (c:cs) = if qIsQuote c then ([], cs) else let (w, rest) = qReadPhrase cs in (c:w, rest)\ntoUpperCh c = chr (toUpper (ord c))\nqUpper w = joinChars (map toUpperCh w)\nqFirstTok [] = \"\"\nqFirstTok (x:xs) = x\nqNormTerm w = qFirstTok (tokens w)\nqClassify w = if qUpper w == \"AND\" then TAnd else if qUpper w == \"OR\" then TOr else if qUpper w == \"NOT\" then TNot else TWord (qNormTerm w)\nqPhraseTok cs = let (p, rest) = qReadPhrase cs in TPhrase (tokens p) : qtokens rest\nqWordTok cs = let (w, rest) = qReadWord cs in qClassify w : qtokens rest\nqtokens [] = []\nqtokens (c:cs) = if qIsSpace c then qtokens cs else if qIsLP c then TLP : qtokens cs else if qIsRP c then TRP : qtokens cs else if qIsQuote c then qPhraseTok cs else qWordTok (c:cs)\nqDropRP (q, (TRP:rest)) = (q, rest)\nqDropRP (q, ts) = (q, ts)\nparseAtom [] = (Term \"\", [])\nparseAtom (TLP:ts) = qDropRP (parseExpr ts)\nparseAtom (TPhrase ps : ts) = (Phrase ps, ts)\nparseAtom (TWord w : ts) = (Term w, ts)\nparseAtom ts = (Term \"\", ts)\nqWrapNot (q, ts) = (Not q, ts)\nparseNot (TNot:ts) = qWrapNot (parseNot ts)\nparseNot ts = parseAtom ts\nqStartsAtom (TWord w : ts) = True\nqStartsAtom (TPhrase p : ts) = True\nqStartsAtom (TLP : ts) = True\nqStartsAtom (TNot : ts) = True\nqStartsAtom ts = False\nqAndStep left ts = let (r, rest) = parseNot ts in parseAndR (And left r) rest\nparseAndR left (TAnd:ts) = qAndStep left ts\nparseAndR left ts = if qStartsAtom ts then qAndStep left ts else (left, ts)\nparseAnd ts = let (l, rest) = parseNot ts in parseAndR l rest\nparseOrR left (TOr:ts) = let (r, rest) = parseAnd ts in parseOrR (Or left r) rest\nparseOrR left ts = (left, ts)\nparseExpr ts = let (l, rest) = parseAnd ts in parseOrR l rest\nparseQuery s = fst (parseExpr (qtokens s))\nsearchQuery s idx = evalQuery idx (parseQuery s)\njoinSp [] = \"\"\njoinSp [x] = x\njoinSp (x:xs) = x ++ \"-\" ++ joinSp xs\nshowQ (Term t) = \"T:\" ++ t\nshowQ (And a b) = \"(\" ++ showQ a ++ \" & \" ++ showQ b ++ \")\"\nshowQ (Or a b) = \"(\" ++ showQ a ++ \" | \" ++ showQ b ++ \")\"\nshowQ (Not a) = \"!\" ++ showQ a\nshowQ (Phrase ts) = \"P:\" ++ joinSp ts\n") diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index 51e8a2ec..4aab2a38 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,11 +1,12 @@ { "lang": "search", - "total_passed": 46, + "total_passed": 78, "total_failed": 0, - "total": 46, + "total": 78, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, - {"name":"boolean","passed":28,"failed":0,"total":28} + {"name":"boolean","passed":28,"failed":0,"total":28}, + {"name":"parse","passed":32,"failed":0,"total":32} ], - "generated": "2026-06-06T18:46:54+00:00" + "generated": "2026-06-06T19:42:39+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index a214ce29..0a71fd42 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,8 +1,9 @@ # search scoreboard -**46 / 46 passing** (0 failure(s)). +**78 / 78 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| | index | 18 | 18 | ok | | boolean | 28 | 28 | ok | +| parse | 32 | 32 | ok | diff --git a/lib/search/tests/parse.sx b/lib/search/tests/parse.sx new file mode 100644 index 00000000..8f7f0ebd --- /dev/null +++ b/lib/search/tests/parse.sx @@ -0,0 +1,139 @@ +;; Phase 2 — query parser (parseQuery / searchQuery). +;; AST cases assert showQ (parseQuery s); search cases assert searchQuery s idx +;; against the standard corpus. Each group runs in one batched program eval. +;; doc 1 "the quick brown dog" doc 2 "a quick brown fox" doc 3 "the dog barks loudly" + +(define + parse-corpus + "idx = indexDoc 3 \"the dog barks loudly\" (indexDoc 2 \"a quick brown fox\" (indexDoc 1 \"the quick brown dog\" emptyIndex))\n") + +(define + ast-cases + (list + (list "single term" "showQ (parseQuery \"cat\")" "T:cat") + (list "term normalized" "showQ (parseQuery \"CAT\")" "T:cat") + (list "explicit and" "showQ (parseQuery \"cat AND dog\")" "(T:cat & T:dog)") + (list + "lowercase and keyword" + "showQ (parseQuery \"cat and dog\")" + "(T:cat & T:dog)") + (list "implicit and" "showQ (parseQuery \"cat dog\")" "(T:cat & T:dog)") + (list "or" "showQ (parseQuery \"cat OR dog\")" "(T:cat | T:dog)") + (list "not" "showQ (parseQuery \"NOT cat\")" "!T:cat") + (list + "and binds tighter than or" + "showQ (parseQuery \"cat AND dog OR bird\")" + "((T:cat & T:dog) | T:bird)") + (list + "or then and" + "showQ (parseQuery \"cat OR dog AND bird\")" + "(T:cat | (T:dog & T:bird))") + (list + "parens override precedence" + "showQ (parseQuery \"(cat OR dog) AND bird\")" + "((T:cat | T:dog) & T:bird)") + (list + "and with not" + "showQ (parseQuery \"cat AND NOT dog\")" + "(T:cat & !T:dog)") + (list + "two-word phrase" + "showQ (parseQuery \"\\\"quick brown\\\"\")" + "P:quick-brown") + (list + "three-word phrase" + "showQ (parseQuery \"\\\"quick brown fox\\\"\")" + "P:quick-brown-fox") + (list + "and left-assoc" + "showQ (parseQuery \"a AND b AND c\")" + "((T:a & T:b) & T:c)") + (list + "or left-assoc" + "showQ (parseQuery \"a OR b OR c\")" + "((T:a | T:b) | T:c)") + (list + "punctuation stripped" + "showQ (parseQuery \"cat, dog!\")" + "(T:cat & T:dog)"))) + +(define + search-cases + (list + (list "term" "searchQuery \"quick\" idx" (list 1 2)) + (list + "term normalized" + "searchQuery \"QUICK\" idx" + (list 1 2)) + (list + "explicit and" + "searchQuery \"quick AND brown\" idx" + (list 1 2)) + (list + "implicit and" + "searchQuery \"quick brown\" idx" + (list 1 2)) + (list "and disjoint" "searchQuery \"the AND fox\" idx" (list)) + (list "or" "searchQuery \"fox OR barks\" idx" (list 2 3)) + (list "not" "searchQuery \"NOT the\" idx" (list 2)) + (list "and not" "searchQuery \"quick AND NOT the\" idx" (list 2)) + (list + "precedence and-or" + "searchQuery \"the AND dog OR fox\" idx" + (list 1 2 3)) + (list + "precedence or-and" + "searchQuery \"fox OR the AND dog\" idx" + (list 1 2 3)) + (list + "parens" + "searchQuery \"the AND (dog OR fox)\" idx" + (list 1 3)) + (list + "phrase" + "searchQuery \"\\\"quick brown\\\"\" idx" + (list 1 2)) + (list + "phrase one doc" + "searchQuery \"\\\"brown dog\\\"\" idx" + (list 1)) + (list + "phrase and term" + "searchQuery \"\\\"quick brown\\\" AND dog\" idx" + (list 1)) + (list + "not phrase" + "searchQuery \"NOT \\\"quick brown\\\"\" idx" + (list 3)) + (list + "implicit and terms" + "searchQuery \"dog barks\" idx" + (list 3)))) + +(define + ast-results + (search-batch "" (map (fn (c) (nth c 1)) ast-cases))) +(define + search-results + (search-batch + parse-corpus + (map (fn (c) (nth c 1)) search-cases))) + +(map-indexed + (fn + (i c) + (hk-test + (str "ast: " (nth c 0)) + (nth ast-results i) + (nth c 2))) + ast-cases) +(map-indexed + (fn + (i c) + (hk-test + (str "search: " (nth c 0)) + (nth search-results i) + (nth c 2))) + search-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index 43c60240..1ebb57b6 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -10,7 +10,7 @@ extension that merges per-peer indices. ## Status (rolling) -`bash lib/search/conformance.sh` → **18/18** (Phase 1 complete) +`bash lib/search/conformance.sh` → **78/78** (Phases 1–2 complete) ## Ground rules @@ -78,7 +78,9 @@ lib/search/index.sx lib/search/eval.sx - [x] Query ADT: `Term String | And Query Query | Or Query Query | Not Query | Phrase [String]` (in `lib/search/query.sx`) -- [ ] `lib/search/parse.sx` — query syntax parser (boolean operators, quoted phrases) +- [x] `lib/search/parse.sx` — query syntax parser: tokenizer + recursive-descent + (OR < AND < NOT precedence, implicit AND on adjacency, quoted phrases, parens, + case-insensitive keywords); `parseQuery`, `searchQuery`, `showQ` - [x] `lib/search/query.sx` — boolean eval via set ops on docid-sorted posting lists (sortedUnion/Inter/Diff, Not over allDocs universe) - [x] phrase eval — positional adjacency check (phraseInDoc / phraseStartsAt) @@ -103,6 +105,16 @@ lib/search/index.sx lib/search/eval.sx ## Progress log +- **Phase 2 complete — parser (78/78 total).** Query tokenizer (ord-based + delimiters, quoted phrases) + recursive-descent parser with OR Query -> [DocId]` in query.sx. Boolean ops are linear merges over docid-sorted posting lists; Not subtracts from From a3f9d4f6c90e66d6efd282e726eb5297d6328e44 Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 19:56:50 +0000 Subject: [PATCH 05/15] search: Phase 3 ranking TF-IDF + BM25 + top-N + 23 tests rankTfIdf and rankBm25 (configurable k1/b) over the candidate set, float scores with deterministic DocId tiebreak; topNTfIdf/topNBm25. df/idf derived from posting-list length. Tests cover tf/idf behavior, a BM25-vs-TF-IDF flip from length-norm + tf-saturation, the b-parameter effect, tiebreak stability. 101/101. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 7 ++- lib/search/conformance.conf | 2 + lib/search/rank.sx | 14 ++++++ lib/search/scoreboard.json | 9 ++-- lib/search/scoreboard.md | 3 +- lib/search/tests/rank.sx | 90 +++++++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 21 ++++++--- 7 files changed, 132 insertions(+), 14 deletions(-) create mode 100644 lib/search/rank.sx create mode 100644 lib/search/tests/rank.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index 5a275f4d..2eaeac96 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -2,7 +2,8 @@ ;; Tests and callers concatenate `search/src` with their own top-level bindings ;; (e.g. "result = lookupTerm \"cat\" idx\n") and evaluate via the haskell-on-sx ;; interpreter. Public Haskell entry points: indexDoc, lookupTerm, deleteDoc, -;; docFreq, allTerms, tokens, positioned, evalQuery, parseQuery, searchQuery. +;; docFreq, allTerms, tokens, positioned, evalQuery, parseQuery, searchQuery, +;; rankTfIdf, rankBm25, topNTfIdf, topNBm25. (define search/src @@ -13,4 +14,6 @@ "\n" search/query-src "\n" - search/parse-src)) + search/parse-src + "\n" + search/rank-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index 6e9e8309..9793c9cc 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -22,6 +22,7 @@ PRELOADS=( lib/search/index.sx lib/search/query.sx lib/search/parse.sx + lib/search/rank.sx lib/search/api.sx lib/search/testlib.sx ) @@ -30,4 +31,5 @@ SUITES=( "index:lib/search/tests/index.sx" "boolean:lib/search/tests/boolean.sx" "parse:lib/search/tests/parse.sx" + "rank:lib/search/tests/rank.sx" ) diff --git a/lib/search/rank.sx b/lib/search/rank.sx new file mode 100644 index 00000000..efe40bb5 --- /dev/null +++ b/lib/search/rank.sx @@ -0,0 +1,14 @@ +;; search ranking — Haskell source fragment. Depends on tokenize + index + query. +;; Ranked retrieval over the candidate set (docs containing any query term). +;; Scores are floats; ties broken by DocId ascending (deterministic). +;; numDocs :: Index -> Int +;; docFreq :: Term -> Index -> Int (from index) +;; docLen :: DocId -> Index -> Int +;; rankTfIdf :: [Term] -> Index -> [DocId] +;; topNTfIdf :: Int -> [Term] -> Index -> [DocId] +;; rankBm25 :: Float -> Float -> [Term] -> Index -> [DocId] (k1, b) +;; topNBm25 :: Int -> Float -> Float -> [Term] -> Index -> [DocId] + +(define + search/rank-src + "numDocs idx = length (allDocs idx)\ntfIn t d idx = length (posIn t d idx)\nqIdf n df = if df == 0 then 0 else log (n / df)\nidf t idx = qIdf (numDocs idx) (docFreq t idx)\ntermScoreTf idx d t = tfIn t d idx * idf t idx\ntfidfDoc ts idx d = sum (map (termScoreTf idx d) ts)\ncandStep idx acc t = sortedUnion acc (docsWith t idx)\ncandDocs ts idx = foldl (candStep idx) [] ts\ncmpScore p1 p2 = if fst p1 > fst p2 then LT else if fst p1 < fst p2 then GT else compare (snd p1) (snd p2)\nmkPair f ts idx d = (f ts idx d, d)\nrankWith f ts idx = map snd (sortBy cmpScore (map (mkPair f ts idx) (candDocs ts idx)))\nrankTfIdf ts idx = rankWith tfidfDoc ts idx\ntopNTfIdf n ts idx = take n (rankTfIdf ts idx)\ntfAt d idx t = tfIn t d idx\ndocLen d idx = sum (map (tfAt d idx) (allTerms idx))\nlenAt idx d = docLen d idx\navgDocLen idx = sum (map (lenAt idx) (allDocs idx)) / numDocs idx\nbm25idf t idx = log ((numDocs idx - docFreq t idx + 0.5) / (docFreq t idx + 0.5) + 1)\nbm25Term k1 b avgdl idx d t = bm25idf t idx * (tfIn t d idx * (k1 + 1)) / (tfIn t d idx + k1 * (1 - b + b * docLen d idx / avgdl))\nbm25Doc k1 b ts idx d = sum (map (bm25Term k1 b (avgDocLen idx) idx d) ts)\nrankBm25 k1 b ts idx = rankWith (bm25Doc k1 b) ts idx\ntopNBm25 n k1 b ts idx = take n (rankBm25 k1 b ts idx)\n") diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index 4aab2a38..eb9509f9 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,12 +1,13 @@ { "lang": "search", - "total_passed": 78, + "total_passed": 101, "total_failed": 0, - "total": 78, + "total": 101, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, {"name":"boolean","passed":28,"failed":0,"total":28}, - {"name":"parse","passed":32,"failed":0,"total":32} + {"name":"parse","passed":32,"failed":0,"total":32}, + {"name":"rank","passed":23,"failed":0,"total":23} ], - "generated": "2026-06-06T19:42:39+00:00" + "generated": "2026-06-06T19:56:08+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index 0a71fd42..747a4d04 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,9 +1,10 @@ # search scoreboard -**78 / 78 passing** (0 failure(s)). +**101 / 101 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| | index | 18 | 18 | ok | | boolean | 28 | 28 | ok | | parse | 32 | 32 | ok | +| rank | 23 | 23 | ok | diff --git a/lib/search/tests/rank.sx b/lib/search/tests/rank.sx new file mode 100644 index 00000000..6200106f --- /dev/null +++ b/lib/search/tests/rank.sx @@ -0,0 +1,90 @@ +;; Phase 3 — ranking (TF-IDF, BM25, top-N). Deterministic: ties broken by DocId. +;; Corpora: +;; idx1: 1 "alpha alpha alpha gamma" 2 "alpha" 3 "beta" +;; idx2: 1 "cat" 2 "cat cat dog elephant frog grape" 3 "zzz" +;; idx3: 1 "kite" 2 "kite" (identical docs -> tiebreak) + +(define + rank-setup + "idx1 = indexDoc 3 \"beta\" (indexDoc 2 \"alpha\" (indexDoc 1 \"alpha alpha alpha gamma\" emptyIndex))\nidx2 = indexDoc 3 \"zzz\" (indexDoc 2 \"cat cat dog elephant frog grape\" (indexDoc 1 \"cat\" emptyIndex))\nidx3 = indexDoc 2 \"kite\" (indexDoc 1 \"kite\" emptyIndex)\n") + +(define + rank-cases + (list + (list + "tfidf tf ordering" + "rankTfIdf [\"alpha\"] idx1" + (list 1 2)) + (list + "tfidf rare term boosts" + "rankTfIdf [\"alpha\", \"beta\"] idx1" + (list 1 3 2)) + (list + "tfidf single-doc term" + "rankTfIdf [\"gamma\"] idx1" + (list 1)) + (list "tfidf absent term empty" "rankTfIdf [\"nope\"] idx1" (list)) + (list "tfidf empty query empty" "rankTfIdf [] idx1" (list)) + (list + "tfidf candidate union tie by docid" + "rankTfIdf [\"beta\", \"gamma\"] idx1" + (list 1 3)) + (list + "tfidf tf ordering idx2" + "rankTfIdf [\"cat\"] idx2" + (list 2 1)) + (list "topN tfidf 1" "topNTfIdf 1 [\"alpha\"] idx1" (list 1)) + (list + "topN tfidf 2" + "topNTfIdf 2 [\"alpha\", \"beta\"] idx1" + (list 1 3)) + (list + "topN exceeds results" + "topNTfIdf 10 [\"gamma\"] idx1" + (list 1)) + (list "topN zero" "topNTfIdf 0 [\"alpha\"] idx1" (list)) + (list + "bm25 tf+length flips tfidf" + "rankBm25 1.5 0.75 [\"cat\"] idx2" + (list 1 2)) + (list + "bm25 b=0 ignores length" + "rankBm25 1.5 0.0 [\"cat\"] idx2" + (list 2 1)) + (list + "bm25 alpha idx1" + "rankBm25 1.5 0.75 [\"alpha\"] idx1" + (list 1 2)) + (list "bm25 absent empty" "rankBm25 1.5 0.75 [\"nope\"] idx1" (list)) + (list + "bm25 single-doc term" + "rankBm25 1.5 0.75 [\"gamma\"] idx1" + (list 1)) + (list "bm25 topN 1" "topNBm25 1 1.5 0.75 [\"cat\"] idx2" (list 1)) + (list + "bm25 same candidate set" + "sort (rankBm25 1.5 0.75 [\"alpha\", \"beta\"] idx1)" + (list 1 2 3)) + (list + "tfidf stable tiebreak" + "rankTfIdf [\"kite\"] idx3" + (list 1 2)) + (list + "bm25 stable tiebreak" + "rankBm25 1.5 0.75 [\"kite\"] idx3" + (list 1 2)) + (list "numDocs" "[numDocs idx1]" (list 3)) + (list "docLen counts tokens" "[docLen 1 idx1]" (list 4)) + (list "docFreq via index" "[docFreq \"alpha\" idx1]" (list 2)))) + +(define + rank-results + (search-batch rank-setup (map (fn (c) (nth c 1)) rank-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth rank-results i) (nth c 2))) + rank-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index 1ebb57b6..d1c0689b 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -10,7 +10,7 @@ extension that merges per-peer indices. ## Status (rolling) -`bash lib/search/conformance.sh` → **78/78** (Phases 1–2 complete) +`bash lib/search/conformance.sh` → **101/101** (Phases 1–3 complete) ## Ground rules @@ -89,12 +89,13 @@ lib/search/index.sx lib/search/eval.sx ## Phase 3 — Ranking -- [ ] document frequency tracking — extend index with `df` per term -- [ ] TF-IDF scoring -- [ ] BM25 scoring (configurable k1, b) -- [ ] top-N retrieval (heap-based) -- [ ] `lib/search/tests/rank.sx` — 20+ cases: TF-IDF behavior, BM25 vs TF-IDF, - ranking stability, top-N correctness +- [x] document frequency — `docFreq`/`idf`/`bm25idf` derived from the index + (posting-list length); no separate df store needed +- [x] TF-IDF scoring (`rankTfIdf`) +- [x] BM25 scoring, configurable k1/b (`rankBm25 k1 b`) +- [x] top-N retrieval (`topNTfIdf`/`topNBm25` — sortBy + take; stable DocId tiebreak) +- [x] `lib/search/tests/rank.sx` — 23 cases: TF-IDF tf/idf behavior, BM25 length-norm + + tf-saturation flips vs TF-IDF, b-parameter effect, tiebreak stability, top-N ## Phase 4 — ACL filter + federation @@ -105,6 +106,12 @@ lib/search/index.sx lib/search/eval.sx ## Progress log +- **Phase 3 complete — ranking (101/101 total).** TF-IDF (`rankTfIdf`) and BM25 + (`rankBm25 k1 b`) over the candidate set (docs containing any query term), scores + as floats with deterministic DocId-ascending tiebreak; `topNTfIdf`/`topNBm25` via + sortBy+take. df/idf derived from posting-list length (no separate df store). 23 + tests incl. a BM25-vs-TF-IDF flip (length-norm + tf-saturation) and the b-parameter + effect. Float division/`log`/float literals all work in haskell-on-sx. - **Phase 2 complete — parser (78/78 total).** Query tokenizer (ord-based delimiters, quoted phrases) + recursive-descent parser with OR Date: Sat, 6 Jun 2026 20:08:08 +0000 Subject: [PATCH 06/15] search: Phase 4 federation merge + ACL post-filter + 21 tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit fedIndex merges per-peer inverted indices (union posting lists per term) after relabelling local DocIds to global gid = peer*1000 + local — dedupe by (peer,doc-id) is automatic and positions survive, so ranking runs once over the merge and interleaves peers by score. ACL is a post-rank filter over an injected permit predicate (searchTfIdfAcl/topNTfIdfAcl/searchBm25Acl). Roadmap complete, 122/122. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 7 ++- lib/search/conformance.conf | 2 + lib/search/fed.sx | 16 +++++ lib/search/scoreboard.json | 9 +-- lib/search/scoreboard.md | 3 +- lib/search/tests/integration.sx | 102 ++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 21 +++++-- 7 files changed, 148 insertions(+), 12 deletions(-) create mode 100644 lib/search/fed.sx create mode 100644 lib/search/tests/integration.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index 2eaeac96..a9a3fe12 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -3,7 +3,8 @@ ;; (e.g. "result = lookupTerm \"cat\" idx\n") and evaluate via the haskell-on-sx ;; interpreter. Public Haskell entry points: indexDoc, lookupTerm, deleteDoc, ;; docFreq, allTerms, tokens, positioned, evalQuery, parseQuery, searchQuery, -;; rankTfIdf, rankBm25, topNTfIdf, topNBm25. +;; rankTfIdf, rankBm25, topNTfIdf, topNBm25, fedIndex, aclFilter, searchTfIdfAcl, +;; topNTfIdfAcl, searchBm25Acl. (define search/src @@ -16,4 +17,6 @@ "\n" search/parse-src "\n" - search/rank-src)) + search/rank-src + "\n" + search/fed-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index 9793c9cc..b2ef2f74 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -23,6 +23,7 @@ PRELOADS=( lib/search/query.sx lib/search/parse.sx lib/search/rank.sx + lib/search/fed.sx lib/search/api.sx lib/search/testlib.sx ) @@ -32,4 +33,5 @@ SUITES=( "boolean:lib/search/tests/boolean.sx" "parse:lib/search/tests/parse.sx" "rank:lib/search/tests/rank.sx" + "integration:lib/search/tests/integration.sx" ) diff --git a/lib/search/fed.sx b/lib/search/fed.sx new file mode 100644 index 00000000..36b59462 --- /dev/null +++ b/lib/search/fed.sx @@ -0,0 +1,16 @@ +;; search federation + ACL — Haskell source fragment. Depends on index + rank. +;; Federation merges per-peer INDICES (not ranked results): each peer's local +;; DocIds are relabelled to global ids `gid peer local = peer*1000 + local` +;; (dedupe by (peer,doc-id) is automatic via the bijection), then posting lists +;; are unioned per term. Ranking then runs once over the merged index, which is +;; rank-correct. ACL is a post-rank filter: an injected `permit :: DocId -> Bool` +;; predicate (viewer baked in by the caller) — never baked into the index. +;; fedIndex :: [(PeerId, Index)] -> Index +;; aclFilter :: (DocId -> Bool) -> [DocId] -> [DocId] +;; searchTfIdfAcl :: (DocId -> Bool) -> [Term] -> Index -> [DocId] +;; topNTfIdfAcl :: Int -> (DocId -> Bool) -> [Term] -> Index -> [DocId] +;; searchBm25Acl :: (DocId -> Bool) -> Float -> Float -> [Term] -> Index -> [DocId] + +(define + search/fed-src + "gid peer local = peer * 1000 + local\nfedRelabelPosting peer p = (gid peer (fst p), snd p)\nfedRelabelEntry peer e = (fst e, map (fedRelabelPosting peer) (snd e))\nfedRelabelIndex peer idx = map (fedRelabelEntry peer) idx\nfedInsP p [] = [p]\nfedInsP p (q:qs) = if fst p < fst q then p : q : qs else if fst p == fst q then p : qs else q : fedInsP p qs\nfedMergePL a b = foldr fedInsP b a\nfedInsTerm t pl [] = [(t, pl)]\nfedInsTerm t pl (x:xs) = if t < fst x then (t, pl) : x : xs else if t == fst x then (fst x, fedMergePL pl (snd x)) : xs else x : fedInsTerm t pl xs\nfedMergeEntry idx e = fedInsTerm (fst e) (snd e) idx\nfedMergeTwo a b = foldl fedMergeEntry a b\nfedAddPeer acc pair = fedMergeTwo acc (fedRelabelIndex (fst pair) (snd pair))\nfedIndex pairs = foldl fedAddPeer emptyIndex pairs\naclFilter permit docs = filter permit docs\nsearchTfIdfAcl permit ts idx = aclFilter permit (rankTfIdf ts idx)\ntopNTfIdfAcl n permit ts idx = take n (aclFilter permit (rankTfIdf ts idx))\nsearchBm25Acl permit k1 b ts idx = aclFilter permit (rankBm25 k1 b ts idx)\n") diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index eb9509f9..d1cb07da 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,13 +1,14 @@ { "lang": "search", - "total_passed": 101, + "total_passed": 122, "total_failed": 0, - "total": 101, + "total": 122, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, {"name":"boolean","passed":28,"failed":0,"total":28}, {"name":"parse","passed":32,"failed":0,"total":32}, - {"name":"rank","passed":23,"failed":0,"total":23} + {"name":"rank","passed":23,"failed":0,"total":23}, + {"name":"integration","passed":21,"failed":0,"total":21} ], - "generated": "2026-06-06T19:56:08+00:00" + "generated": "2026-06-06T20:07:30+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index 747a4d04..03a1d66c 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,6 +1,6 @@ # search scoreboard -**101 / 101 passing** (0 failure(s)). +**122 / 122 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| @@ -8,3 +8,4 @@ | boolean | 28 | 28 | ok | | parse | 32 | 32 | ok | | rank | 23 | 23 | ok | +| integration | 21 | 21 | ok | diff --git a/lib/search/tests/integration.sx b/lib/search/tests/integration.sx new file mode 100644 index 00000000..8c10685e --- /dev/null +++ b/lib/search/tests/integration.sx @@ -0,0 +1,102 @@ +;; Phase 4 — federation (merge per-peer indices) + ACL post-filter. +;; Peers (global id = peer*1000 + local): +;; peer 1: 1 "alpha beta" 2 "alpha gamma" -> 1001 1002 +;; peer 2: 1 "alpha delta" 2 "beta gamma" -> 2001 2002 +;; ACL predicates are injected (viewer baked in by the caller), applied post-rank. + +(define + fed-setup + "p1 = indexDoc 2 \"alpha gamma\" (indexDoc 1 \"alpha beta\" emptyIndex)\np2 = indexDoc 2 \"beta gamma\" (indexDoc 1 \"alpha delta\" emptyIndex)\nfed = fedIndex [(1, p1), (2, p2)]\npermitP1 g = g < 2000\npermitNone g = False\npermitList g = elem g [1002, 2001]\n") + +(define + fed-cases + (list + (list + "fed merges all docs" + "sort (allDocs fed)" + (list 1001 1002 2001 2002)) + (list + "fed docFreq across peers" + "[docFreq \"alpha\" fed]" + (list 3)) + (list "fed docFreq beta" "[docFreq \"beta\" fed]" (list 2)) + (list "fed numDocs" "[numDocs fed]" (list 4)) + (list + "fed term lookup spans peers" + "map fst (lookupTerm \"gamma\" fed)" + (list 1002 2002)) + (list + "fed preserves positions" + "lookupTerm \"beta\" fed" + (list + (list 1001 (list 1)) + (list 2002 (list 0)))) + (list + "fed rank alpha tie by gid" + "rankTfIdf [\"alpha\"] fed" + (list 1001 1002 2001)) + (list + "fed rank beta" + "rankTfIdf [\"beta\"] fed" + (list 1001 2002)) + (list + "fed boolean and" + "searchQuery \"alpha AND beta\" fed" + (list 1001)) + (list + "fed boolean or" + "searchQuery \"delta OR barks\" fed" + (list 2001)) + (list + "fed phrase within peer1" + "searchQuery \"\\\"alpha beta\\\"\" fed" + (list 1001)) + (list + "fed phrase within peer2" + "searchQuery \"\\\"beta gamma\\\"\" fed" + (list 2002)) + (list + "fed phrase peer2 alpha delta" + "searchQuery \"\\\"alpha delta\\\"\" fed" + (list 2001)) + (list "fed empty peer list" "allDocs (fedIndex [])" (list)) + (list + "fed single relabelled peer" + "rankTfIdf [\"alpha\"] (fedIndex [(5, p1)])" + (list 5001 5002)) + (list + "acl peer1 only" + "aclFilter permitP1 (rankTfIdf [\"alpha\"] fed)" + (list 1001 1002)) + (list + "acl allowlist preserves rank order" + "aclFilter permitList (rankTfIdf [\"alpha\"] fed)" + (list 1002 2001)) + (list + "acl topN after filter" + "topNTfIdfAcl 1 permitP1 [\"alpha\"] fed" + (list 1001)) + (list + "acl denies all" + "aclFilter permitNone (rankTfIdf [\"alpha\"] fed)" + (list)) + (list + "acl on bm25" + "searchBm25Acl permitP1 1.5 0.75 [\"alpha\"] fed" + (list 1001 1002)) + (list + "acl end-to-end tfidf" + "searchTfIdfAcl permitP1 [\"alpha\"] fed" + (list 1001 1002)))) + +(define + fed-results + (search-batch fed-setup (map (fn (c) (nth c 1)) fed-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth fed-results i) (nth c 2))) + fed-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index d1c0689b..7324db74 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -10,7 +10,7 @@ extension that merges per-peer indices. ## Status (rolling) -`bash lib/search/conformance.sh` → **101/101** (Phases 1–3 complete) +`bash lib/search/conformance.sh` → **122/122** (Phases 1–4 complete) ## Ground rules @@ -99,13 +99,24 @@ lib/search/index.sx lib/search/eval.sx ## Phase 4 — ACL filter + federation -- [ ] post-filter — each candidate result tested via `(acl/permit? viewer :read doc)` -- [ ] federated query — fan out to peer instances via fed-sx, merge results -- [ ] merge policy — interleave by rank, dedupe by `(peer, doc-id)` -- [ ] `lib/search/tests/integration.sx` — federated search with ACL filter +- [x] post-filter — `aclFilter`/`searchTfIdfAcl`/`topNTfIdfAcl`/`searchBm25Acl` take an + injected `permit :: DocId -> Bool` predicate, applied post-rank (never in the index) +- [x] federated query — `fedIndex :: [(PeerId, Index)] -> Index` merges per-peer + inverted indices (union posting lists per term); rank/search run once over the merge +- [x] merge policy — relabel local DocIds to global `gid = peer*1000 + local` + (bijection ⇒ dedupe by (peer,doc-id) is automatic); ranking interleaves peers by score +- [x] `lib/search/tests/integration.sx` — 21 cases: index merge, cross-peer df/lookup, + position preservation, boolean/phrase over the merge, ACL filter + top-N + bm25 ## Progress log +- **Phase 4 complete — federation + ACL (122/122 total). Roadmap done.** `fedIndex` + merges per-peer inverted indices (union posting lists per term) after relabelling + local DocIds to global `gid = peer*1000 + local` — the bijection makes (peer,doc-id) + dedupe automatic and keeps positions, so ranking runs once over the merge and + interleaves peers by score (rank-correct). ACL is a post-rank `filter` over an + injected `permit :: DocId -> Bool` (viewer baked in by the caller) — never in the + index; `searchTfIdfAcl`/`topNTfIdfAcl`/`searchBm25Acl`. 21 integration tests. - **Phase 3 complete — ranking (101/101 total).** TF-IDF (`rankTfIdf`) and BM25 (`rankBm25 k1 b`) over the candidate set (docs containing any query term), scores as floats with deterministic DocId-ascending tiebreak; `topNTfIdf`/`topNBm25` via From 9d3b775b2583ea1fe9d8e4f0ef54675794e24bbd Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 20:22:23 +0000 Subject: [PATCH 07/15] search: prefix/wildcard queries + 14 tests prefixTerms matches indexed terms by prefix (allTerms + isPrefixOf); prefixDocs unions their docs; prefixRankTfIdf ranks via the matched terms. 136/136. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 6 ++-- lib/search/conformance.conf | 2 ++ lib/search/prefix.sx | 10 ++++++ lib/search/scoreboard.json | 9 +++--- lib/search/scoreboard.md | 3 +- lib/search/tests/prefix.sx | 63 +++++++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 11 +++++++ 7 files changed, 97 insertions(+), 7 deletions(-) create mode 100644 lib/search/prefix.sx create mode 100644 lib/search/tests/prefix.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index a9a3fe12..84918b5e 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -4,7 +4,7 @@ ;; interpreter. Public Haskell entry points: indexDoc, lookupTerm, deleteDoc, ;; docFreq, allTerms, tokens, positioned, evalQuery, parseQuery, searchQuery, ;; rankTfIdf, rankBm25, topNTfIdf, topNBm25, fedIndex, aclFilter, searchTfIdfAcl, -;; topNTfIdfAcl, searchBm25Acl. +;; topNTfIdfAcl, searchBm25Acl, prefixTerms, prefixDocs, prefixRankTfIdf. (define search/src @@ -19,4 +19,6 @@ "\n" search/rank-src "\n" - search/fed-src)) + search/fed-src + "\n" + search/prefix-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index b2ef2f74..c5d09b5c 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -24,6 +24,7 @@ PRELOADS=( lib/search/parse.sx lib/search/rank.sx lib/search/fed.sx + lib/search/prefix.sx lib/search/api.sx lib/search/testlib.sx ) @@ -34,4 +35,5 @@ SUITES=( "parse:lib/search/tests/parse.sx" "rank:lib/search/tests/rank.sx" "integration:lib/search/tests/integration.sx" + "prefix:lib/search/tests/prefix.sx" ) diff --git a/lib/search/prefix.sx b/lib/search/prefix.sx new file mode 100644 index 00000000..d50a5b1b --- /dev/null +++ b/lib/search/prefix.sx @@ -0,0 +1,10 @@ +;; search prefix / wildcard queries — Haskell source fragment. Depends on index + +;; rank (reuses candStep / rankTfIdf). A prefix matches every indexed term that +;; starts with it; the matching terms are unioned (OR) into a docid set. +;; prefixTerms :: String -> Index -> [Term] (sorted, from allTerms) +;; prefixDocs :: String -> Index -> [DocId] (sorted union) +;; prefixRankTfIdf :: String -> Index -> [DocId] (ranked by the matched terms) + +(define + search/prefix-src + "prefixTerms pre idx = filter (isPrefixOf pre) (allTerms idx)\nprefixDocs pre idx = foldl (candStep idx) [] (prefixTerms pre idx)\nprefixRankTfIdf pre idx = rankTfIdf (prefixTerms pre idx) idx\n") diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index d1cb07da..df5e60d7 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,14 +1,15 @@ { "lang": "search", - "total_passed": 122, + "total_passed": 136, "total_failed": 0, - "total": 122, + "total": 136, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, {"name":"boolean","passed":28,"failed":0,"total":28}, {"name":"parse","passed":32,"failed":0,"total":32}, {"name":"rank","passed":23,"failed":0,"total":23}, - {"name":"integration","passed":21,"failed":0,"total":21} + {"name":"integration","passed":21,"failed":0,"total":21}, + {"name":"prefix","passed":14,"failed":0,"total":14} ], - "generated": "2026-06-06T20:07:30+00:00" + "generated": "2026-06-06T20:21:41+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index 03a1d66c..0578f296 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,6 +1,6 @@ # search scoreboard -**122 / 122 passing** (0 failure(s)). +**136 / 136 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| @@ -9,3 +9,4 @@ | parse | 32 | 32 | ok | | rank | 23 | 23 | ok | | integration | 21 | 21 | ok | +| prefix | 14 | 14 | ok | diff --git a/lib/search/tests/prefix.sx b/lib/search/tests/prefix.sx new file mode 100644 index 00000000..97776491 --- /dev/null +++ b/lib/search/tests/prefix.sx @@ -0,0 +1,63 @@ +;; Extension — prefix / wildcard queries. +;; Corpus: 1 "alpha alpine" 2 "beta apple" 3 "banana alpha" +;; allTerms sorted: alpha alpine apple banana beta + +(define + prefix-setup + "idx = indexDoc 3 \"banana alpha\" (indexDoc 2 \"beta apple\" (indexDoc 1 \"alpha alpine\" emptyIndex))\n") + +(define + prefix-cases + (list + (list + "prefix terms two matches" + "prefixTerms \"al\" idx" + (list "alpha" "alpine")) + (list + "prefix terms narrower" + "prefixTerms \"alp\" idx" + (list "alpha" "alpine")) + (list + "prefix terms wide" + "prefixTerms \"a\" idx" + (list "alpha" "alpine" "apple")) + (list "prefix terms single" "prefixTerms \"ban\" idx" (list "banana")) + (list "prefix terms exact term" "prefixTerms \"beta\" idx" (list "beta")) + (list "prefix terms none" "prefixTerms \"z\" idx" (list)) + (list + "prefix docs union" + "prefixDocs \"al\" idx" + (list 1 3)) + (list "prefix docs single term" "prefixDocs \"ban\" idx" (list 3)) + (list + "prefix docs wide" + "prefixDocs \"a\" idx" + (list 1 2 3)) + (list "prefix docs none" "prefixDocs \"z\" idx" (list)) + (list + "prefix docs exact" + "prefixDocs \"alpha\" idx" + (list 1 3)) + (list + "prefix rank ranks by matched terms" + "prefixRankTfIdf \"al\" idx" + (list 1 3)) + (list + "prefix rank single doc" + "prefixRankTfIdf \"ban\" idx" + (list 3)) + (list "prefix rank empty" "prefixRankTfIdf \"z\" idx" (list)))) + +(define + prefix-results + (search-batch + prefix-setup + (map (fn (c) (nth c 1)) prefix-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth prefix-results i) (nth c 2))) + prefix-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index 7324db74..2f440bd0 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -108,8 +108,19 @@ lib/search/index.sx lib/search/eval.sx - [x] `lib/search/tests/integration.sx` — 21 cases: index merge, cross-peer df/lookup, position preservation, boolean/phrase over the merge, ACL filter + top-N + bm25 +## Extensions (post-roadmap, search-shaped vocabulary) + +- [x] prefix / wildcard queries (`prefixTerms`, `prefixDocs`, `prefixRankTfIdf`) — 14 tests +- [ ] fuzzy matching — edit distance term expansion +- [ ] result pagination (offset / limit) +- [ ] snippet / highlight generation +- [ ] stemming (suffix stripping) — recall-improving normalizer + ## Progress log +- **Extension: prefix/wildcard queries (136/136 total).** `prefixTerms` matches every + indexed term starting with a prefix (via allTerms + isPrefixOf); `prefixDocs` unions + their docs; `prefixRankTfIdf` ranks treating the matched terms as the query. 14 tests. - **Phase 4 complete — federation + ACL (122/122 total). Roadmap done.** `fedIndex` merges per-peer inverted indices (union posting lists per term) after relabelling local DocIds to global `gid = peer*1000 + local` — the bijection makes (peer,doc-id) From 3ab8270a584ef20b5ed9d63819bae72e8dd28ea4 Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 20:55:25 +0000 Subject: [PATCH 08/15] search: result pagination (offset/limit) + 12 tests paginate windows a ranked list (take lim . drop off); pageTfIdf/pageBm25 and resultCount. 148/148. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 7 +++-- lib/search/conformance.conf | 2 ++ lib/search/page.sx | 11 ++++++++ lib/search/scoreboard.json | 9 ++++--- lib/search/scoreboard.md | 3 ++- lib/search/tests/page.sx | 53 +++++++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 7 ++++- 7 files changed, 84 insertions(+), 8 deletions(-) create mode 100644 lib/search/page.sx create mode 100644 lib/search/tests/page.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index 84918b5e..cef49db4 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -4,7 +4,8 @@ ;; interpreter. Public Haskell entry points: indexDoc, lookupTerm, deleteDoc, ;; docFreq, allTerms, tokens, positioned, evalQuery, parseQuery, searchQuery, ;; rankTfIdf, rankBm25, topNTfIdf, topNBm25, fedIndex, aclFilter, searchTfIdfAcl, -;; topNTfIdfAcl, searchBm25Acl, prefixTerms, prefixDocs, prefixRankTfIdf. +;; topNTfIdfAcl, searchBm25Acl, prefixTerms, prefixDocs, prefixRankTfIdf, +;; paginate, pageTfIdf, pageBm25, resultCount. (define search/src @@ -21,4 +22,6 @@ "\n" search/fed-src "\n" - search/prefix-src)) + search/prefix-src + "\n" + search/page-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index c5d09b5c..79b14819 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -25,6 +25,7 @@ PRELOADS=( lib/search/rank.sx lib/search/fed.sx lib/search/prefix.sx + lib/search/page.sx lib/search/api.sx lib/search/testlib.sx ) @@ -36,4 +37,5 @@ SUITES=( "rank:lib/search/tests/rank.sx" "integration:lib/search/tests/integration.sx" "prefix:lib/search/tests/prefix.sx" + "page:lib/search/tests/page.sx" ) diff --git a/lib/search/page.sx b/lib/search/page.sx new file mode 100644 index 00000000..93b57dd4 --- /dev/null +++ b/lib/search/page.sx @@ -0,0 +1,11 @@ +;; search pagination — Haskell source fragment. Depends on rank. +;; Windows a ranked result list by offset/limit (offset >= length -> empty; +;; limit clamps to what remains). +;; paginate :: Int -> Int -> [DocId] -> [DocId] (offset, limit) +;; pageTfIdf :: Int -> Int -> [Term] -> Index -> [DocId] +;; pageBm25 :: Int -> Int -> Float -> Float -> [Term] -> Index -> [DocId] +;; resultCount :: [Term] -> Index -> Int + +(define + search/page-src + "paginate off lim docs = take lim (drop off docs)\npageTfIdf off lim ts idx = paginate off lim (rankTfIdf ts idx)\npageBm25 off lim k1 b ts idx = paginate off lim (rankBm25 k1 b ts idx)\nresultCount ts idx = length (rankTfIdf ts idx)\n") diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index df5e60d7..16472224 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,15 +1,16 @@ { "lang": "search", - "total_passed": 136, + "total_passed": 148, "total_failed": 0, - "total": 136, + "total": 148, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, {"name":"boolean","passed":28,"failed":0,"total":28}, {"name":"parse","passed":32,"failed":0,"total":32}, {"name":"rank","passed":23,"failed":0,"total":23}, {"name":"integration","passed":21,"failed":0,"total":21}, - {"name":"prefix","passed":14,"failed":0,"total":14} + {"name":"prefix","passed":14,"failed":0,"total":14}, + {"name":"page","passed":12,"failed":0,"total":12} ], - "generated": "2026-06-06T20:21:41+00:00" + "generated": "2026-06-06T20:54:50+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index 0578f296..9cdc93b3 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,6 +1,6 @@ # search scoreboard -**136 / 136 passing** (0 failure(s)). +**148 / 148 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| @@ -10,3 +10,4 @@ | rank | 23 | 23 | ok | | integration | 21 | 21 | ok | | prefix | 14 | 14 | ok | +| page | 12 | 12 | ok | diff --git a/lib/search/tests/page.sx b/lib/search/tests/page.sx new file mode 100644 index 00000000..6ad77310 --- /dev/null +++ b/lib/search/tests/page.sx @@ -0,0 +1,53 @@ +;; Extension — result pagination (offset / limit) over ranked results. +;; Corpus (tf of "x" descending): 1 x4 2 x3 3 x2 4 x1 5 y(no x) +;; rankTfIdf ["x"] -> [1,2,3,4] + +(define + page-setup + "idx = indexDoc 5 \"y\" (indexDoc 4 \"x\" (indexDoc 3 \"x x\" (indexDoc 2 \"x x x\" (indexDoc 1 \"x x x x other\" emptyIndex))))\n") + +(define + page-cases + (list + (list "first page" "pageTfIdf 0 2 [\"x\"] idx" (list 1 2)) + (list + "second page" + "pageTfIdf 2 2 [\"x\"] idx" + (list 3 4)) + (list + "sliding window" + "pageTfIdf 1 2 [\"x\"] idx" + (list 2 3)) + (list + "limit exceeds remaining" + "pageTfIdf 3 10 [\"x\"] idx" + (list 4)) + (list "offset past end" "pageTfIdf 4 2 [\"x\"] idx" (list)) + (list "limit zero" "pageTfIdf 0 0 [\"x\"] idx" (list)) + (list + "whole result" + "pageTfIdf 0 10 [\"x\"] idx" + (list 1 2 3 4)) + (list + "paginate raw list" + "paginate 1 2 [10, 20, 30, 40]" + (list 20 30)) + (list "paginate raw past end" "paginate 9 2 [10, 20]" (list)) + (list + "bm25 page window size" + "[length (pageBm25 0 2 1.5 0.75 [\"x\"] idx)]" + (list 2)) + (list "result count" "[resultCount [\"x\"] idx]" (list 4)) + (list "result count zero" "[resultCount [\"zzz\"] idx]" (list 0)))) + +(define + page-results + (search-batch page-setup (map (fn (c) (nth c 1)) page-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth page-results i) (nth c 2))) + page-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index 2f440bd0..32444f20 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -112,12 +112,17 @@ lib/search/index.sx lib/search/eval.sx - [x] prefix / wildcard queries (`prefixTerms`, `prefixDocs`, `prefixRankTfIdf`) — 14 tests - [ ] fuzzy matching — edit distance term expansion -- [ ] result pagination (offset / limit) +- [x] result pagination (offset / limit) — `paginate`, `pageTfIdf`, `pageBm25`, + `resultCount` — 12 tests - [ ] snippet / highlight generation - [ ] stemming (suffix stripping) — recall-improving normalizer ## Progress log +- **Extension: pagination (148/148 total).** `paginate off lim` windows a ranked list + (take lim . drop off); `pageTfIdf`/`pageBm25` + `resultCount`. 12 tests. Note the + full conformance now runs 8 suites sequentially and needs an overall timeout ~1900s + under the heavy box load. - **Extension: prefix/wildcard queries (136/136 total).** `prefixTerms` matches every indexed term starting with a prefix (via allTerms + isPrefixOf); `prefixDocs` unions their docs; `prefixRankTfIdf` ranks treating the matched terms as the query. 14 tests. From 5945b51cfd5a271b761b9aa34765b1fcad622050 Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 21:47:56 +0000 Subject: [PATCH 09/15] search: fuzzy matching via edit distance + 18 tests editDist as an O(m*n) row-based Levenshtein DP (naive recursion is exponential and times out under load); fuzzyTerms/fuzzyDocs/fuzzyRankTfIdf expand a term to indexed terms within a max edit distance. 166/166. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 7 +++- lib/search/conformance.conf | 2 + lib/search/fuzzy.sx | 12 ++++++ lib/search/scoreboard.json | 9 +++-- lib/search/scoreboard.md | 3 +- lib/search/tests/fuzzy.sx | 74 +++++++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 7 +++- 7 files changed, 106 insertions(+), 8 deletions(-) create mode 100644 lib/search/fuzzy.sx create mode 100644 lib/search/tests/fuzzy.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index cef49db4..c55c7f31 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -5,7 +5,8 @@ ;; docFreq, allTerms, tokens, positioned, evalQuery, parseQuery, searchQuery, ;; rankTfIdf, rankBm25, topNTfIdf, topNBm25, fedIndex, aclFilter, searchTfIdfAcl, ;; topNTfIdfAcl, searchBm25Acl, prefixTerms, prefixDocs, prefixRankTfIdf, -;; paginate, pageTfIdf, pageBm25, resultCount. +;; paginate, pageTfIdf, pageBm25, resultCount, editDist, fuzzyTerms, fuzzyDocs, +;; fuzzyRankTfIdf. (define search/src @@ -24,4 +25,6 @@ "\n" search/prefix-src "\n" - search/page-src)) + search/page-src + "\n" + search/fuzzy-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index 79b14819..0fef2c39 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -26,6 +26,7 @@ PRELOADS=( lib/search/fed.sx lib/search/prefix.sx lib/search/page.sx + lib/search/fuzzy.sx lib/search/api.sx lib/search/testlib.sx ) @@ -38,4 +39,5 @@ SUITES=( "integration:lib/search/tests/integration.sx" "prefix:lib/search/tests/prefix.sx" "page:lib/search/tests/page.sx" + "fuzzy:lib/search/tests/fuzzy.sx" ) diff --git a/lib/search/fuzzy.sx b/lib/search/fuzzy.sx new file mode 100644 index 00000000..9a757abb --- /dev/null +++ b/lib/search/fuzzy.sx @@ -0,0 +1,12 @@ +;; search fuzzy matching — Haskell source fragment. Depends on index + rank. +;; Levenshtein edit distance (O(m*n) row-based DP — the naive recursive version is +;; exponential and far too slow under load) expands a query term to all indexed +;; terms within a max distance, then unions / ranks their docs. +;; editDist :: String -> String -> Int +;; fuzzyTerms :: Int -> String -> Index -> [Term] (sorted) +;; fuzzyDocs :: Int -> String -> Index -> [DocId] (sorted union) +;; fuzzyRankTfIdf :: Int -> String -> Index -> [DocId] + +(define + search/fuzzy-src + "edMin3 a b c = min a (min b c)\nedCost x y = if x == y then 0 else 1\nedUpto i n = if i > n then [] else i : edUpto (i + 1) n\nedLast [x] = x\nedLast (x:xs) = edLast xs\nedNrow x [] prev left = []\nedNrow x (y:ys) prev left = let v = edMin3 (head (tail prev) + 1) (left + 1) (head prev + edCost x y) in v : edNrow x ys (tail prev) v\nedRow x ys prev = let f = head prev + 1 in f : edNrow x ys prev f\nedRows [] ys prev = prev\nedRows (x:xs) ys prev = edRows xs ys (edRow x ys prev)\neditDist xs ys = edLast (edRows xs ys (edUpto 0 (length ys)))\nqWithinDist maxd term t = editDist term t <= maxd\nfuzzyTerms maxd term idx = filter (qWithinDist maxd term) (allTerms idx)\nfuzzyDocs maxd term idx = foldl (candStep idx) [] (fuzzyTerms maxd term idx)\nfuzzyRankTfIdf maxd term idx = rankTfIdf (fuzzyTerms maxd term idx) idx\n") diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index 16472224..b0baf95a 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,8 +1,8 @@ { "lang": "search", - "total_passed": 148, + "total_passed": 166, "total_failed": 0, - "total": 148, + "total": 166, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, {"name":"boolean","passed":28,"failed":0,"total":28}, @@ -10,7 +10,8 @@ {"name":"rank","passed":23,"failed":0,"total":23}, {"name":"integration","passed":21,"failed":0,"total":21}, {"name":"prefix","passed":14,"failed":0,"total":14}, - {"name":"page","passed":12,"failed":0,"total":12} + {"name":"page","passed":12,"failed":0,"total":12}, + {"name":"fuzzy","passed":18,"failed":0,"total":18} ], - "generated": "2026-06-06T20:54:50+00:00" + "generated": "2026-06-06T21:47:28+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index 9cdc93b3..74440558 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,6 +1,6 @@ # search scoreboard -**148 / 148 passing** (0 failure(s)). +**166 / 166 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| @@ -11,3 +11,4 @@ | integration | 21 | 21 | ok | | prefix | 14 | 14 | ok | | page | 12 | 12 | ok | +| fuzzy | 18 | 18 | ok | diff --git a/lib/search/tests/fuzzy.sx b/lib/search/tests/fuzzy.sx new file mode 100644 index 00000000..0b5c3fbd --- /dev/null +++ b/lib/search/tests/fuzzy.sx @@ -0,0 +1,74 @@ +;; Extension — fuzzy matching via Levenshtein edit distance. +;; Corpus: 1 "color flavor" 2 "colour kitten" 3 "colored" +;; allTerms: color colored colour flavor kitten + +(define + fuzzy-setup + "idx = indexDoc 3 \"colored\" (indexDoc 2 \"colour kitten\" (indexDoc 1 \"color flavor\" emptyIndex))\n") + +(define + fuzzy-cases + (list + (list + "editDist substitution" + "[editDist \"kitten\" \"sitten\"]" + (list 1)) + (list "editDist equal" "[editDist \"abc\" \"abc\"]" (list 0)) + (list "editDist deletion" "[editDist \"abc\" \"ab\"]" (list 1)) + (list "editDist insertion" "[editDist \"ab\" \"abc\"]" (list 1)) + (list "editDist from empty" "[editDist \"\" \"abc\"]" (list 3)) + (list "editDist both empty" "[editDist \"\" \"\"]" (list 0)) + (list + "editDist classic" + "[editDist \"kitten\" \"sitting\"]" + (list 3)) + (list + "editDist color colour" + "[editDist \"color\" \"colour\"]" + (list 1)) + (list + "editDist color colored" + "[editDist \"color\" \"colored\"]" + (list 2)) + (list + "fuzzy terms dist 1" + "fuzzyTerms 1 \"color\" idx" + (list "color" "colour")) + (list + "fuzzy terms dist 2" + "fuzzyTerms 2 \"color\" idx" + (list "color" "colored" "colour")) + (list "fuzzy terms exact" "fuzzyTerms 0 \"color\" idx" (list "color")) + (list + "fuzzy terms other word" + "fuzzyTerms 1 \"flavour\" idx" + (list "flavor")) + (list + "fuzzy docs dist 1" + "fuzzyDocs 1 \"color\" idx" + (list 1 2)) + (list + "fuzzy docs dist 2" + "fuzzyDocs 2 \"color\" idx" + (list 1 2 3)) + (list "fuzzy docs none" "fuzzyDocs 1 \"zzzzz\" idx" (list)) + (list + "fuzzy rank dist 1" + "fuzzyRankTfIdf 1 \"color\" idx" + (list 1 2)) + (list + "fuzzy rank dist 2" + "fuzzyRankTfIdf 2 \"color\" idx" + (list 1 2 3)))) + +(define + fuzzy-results + (search-batch fuzzy-setup (map (fn (c) (nth c 1)) fuzzy-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth fuzzy-results i) (nth c 2))) + fuzzy-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index 32444f20..791c04dc 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -111,7 +111,8 @@ lib/search/index.sx lib/search/eval.sx ## Extensions (post-roadmap, search-shaped vocabulary) - [x] prefix / wildcard queries (`prefixTerms`, `prefixDocs`, `prefixRankTfIdf`) — 14 tests -- [ ] fuzzy matching — edit distance term expansion +- [x] fuzzy matching — edit distance term expansion (`editDist`, `fuzzyTerms`, + `fuzzyDocs`, `fuzzyRankTfIdf`) — 18 tests - [x] result pagination (offset / limit) — `paginate`, `pageTfIdf`, `pageBm25`, `resultCount` — 12 tests - [ ] snippet / highlight generation @@ -119,6 +120,10 @@ lib/search/index.sx lib/search/eval.sx ## Progress log +- **Extension: fuzzy matching (166/166 total).** Levenshtein `editDist` as an O(m*n) + row-based DP (the naive recursive version is exponential and times out under load), + `fuzzyTerms`/`fuzzyDocs`/`fuzzyRankTfIdf` expand a term to indexed terms within a max + edit distance. 18 tests. - **Extension: pagination (148/148 total).** `paginate off lim` windows a ranked list (take lim . drop off); `pageTfIdf`/`pageBm25` + `resultCount`. 12 tests. Note the full conformance now runs 8 suites sequentially and needs an overall timeout ~1900s From 7231cb651f1d7be649adfe70ade87485352ed255 Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 22:08:00 +0000 Subject: [PATCH 10/15] search: highlight + snippet generation + 12 tests highlight marks query-matching (normalized) tokens with [..]; snippet extracts a context window around the first match. 178/178. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 6 ++-- lib/search/conformance.conf | 2 ++ lib/search/highlight.sx | 10 ++++++ lib/search/scoreboard.json | 9 ++--- lib/search/scoreboard.md | 3 +- lib/search/tests/highlight.sx | 66 +++++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 5 ++- 7 files changed, 93 insertions(+), 8 deletions(-) create mode 100644 lib/search/highlight.sx create mode 100644 lib/search/tests/highlight.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index c55c7f31..7abbe781 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -6,7 +6,7 @@ ;; rankTfIdf, rankBm25, topNTfIdf, topNBm25, fedIndex, aclFilter, searchTfIdfAcl, ;; topNTfIdfAcl, searchBm25Acl, prefixTerms, prefixDocs, prefixRankTfIdf, ;; paginate, pageTfIdf, pageBm25, resultCount, editDist, fuzzyTerms, fuzzyDocs, -;; fuzzyRankTfIdf. +;; fuzzyRankTfIdf, highlight, snippet. (define search/src @@ -27,4 +27,6 @@ "\n" search/page-src "\n" - search/fuzzy-src)) + search/fuzzy-src + "\n" + search/highlight-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index 0fef2c39..28c7ddf6 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -27,6 +27,7 @@ PRELOADS=( lib/search/prefix.sx lib/search/page.sx lib/search/fuzzy.sx + lib/search/highlight.sx lib/search/api.sx lib/search/testlib.sx ) @@ -40,4 +41,5 @@ SUITES=( "prefix:lib/search/tests/prefix.sx" "page:lib/search/tests/page.sx" "fuzzy:lib/search/tests/fuzzy.sx" + "highlight:lib/search/tests/highlight.sx" ) diff --git a/lib/search/highlight.sx b/lib/search/highlight.sx new file mode 100644 index 00000000..4c5def99 --- /dev/null +++ b/lib/search/highlight.sx @@ -0,0 +1,10 @@ +;; search highlight / snippet — Haskell source fragment. Depends on tokenize. +;; Operates on document text (not the index): marks query-matching tokens with +;; [..] and extracts a context window around the first match. Tokens are +;; normalized (lowercase, punctuation-stripped) by `tokens`, matching index side. +;; highlight :: [Term] -> String -> String +;; snippet :: Int -> [Term] -> String -> String (ctx tokens each side of 1st match) + +(define + search/highlight-src + "hlMark terms t = if elem t terms then \"[\" ++ t ++ \"]\" else t\nhighlight terms text = unwords (map (hlMark terms) (tokens text))\nhlIdxFrom terms [] i = 0 - 1\nhlIdxFrom terms (t:ts) i = if elem t terms then i else hlIdxFrom terms ts (i + 1)\nhlIdx terms toks = hlIdxFrom terms toks 0\nhlMax0 x = if x < 0 then 0 else x\nsnipStart ctx i = if i < 0 then 0 else hlMax0 (i - ctx)\nsnipToks ctx terms toks = unwords (map (hlMark terms) (take (2 * ctx + 1) (drop (snipStart ctx (hlIdx terms toks)) toks)))\nsnippet ctx terms text = snipToks ctx terms (tokens text)\n") diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index b0baf95a..a3ebb24c 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,8 +1,8 @@ { "lang": "search", - "total_passed": 166, + "total_passed": 178, "total_failed": 0, - "total": 166, + "total": 178, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, {"name":"boolean","passed":28,"failed":0,"total":28}, @@ -11,7 +11,8 @@ {"name":"integration","passed":21,"failed":0,"total":21}, {"name":"prefix","passed":14,"failed":0,"total":14}, {"name":"page","passed":12,"failed":0,"total":12}, - {"name":"fuzzy","passed":18,"failed":0,"total":18} + {"name":"fuzzy","passed":18,"failed":0,"total":18}, + {"name":"highlight","passed":12,"failed":0,"total":12} ], - "generated": "2026-06-06T21:47:28+00:00" + "generated": "2026-06-06T22:07:05+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index 74440558..767c5fc2 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,6 +1,6 @@ # search scoreboard -**166 / 166 passing** (0 failure(s)). +**178 / 178 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| @@ -12,3 +12,4 @@ | prefix | 14 | 14 | ok | | page | 12 | 12 | ok | | fuzzy | 18 | 18 | ok | +| highlight | 12 | 12 | ok | diff --git a/lib/search/tests/highlight.sx b/lib/search/tests/highlight.sx new file mode 100644 index 00000000..3a5001d5 --- /dev/null +++ b/lib/search/tests/highlight.sx @@ -0,0 +1,66 @@ +;; Extension — highlight + snippet over document text. +;; Text: "the quick brown fox jumps" + +(define + hl-cases + (list + (list + "highlight two terms" + "highlight [\"quick\", \"fox\"] \"the quick brown fox jumps\"" + "the [quick] brown [fox] jumps") + (list + "highlight none" + "highlight [] \"the quick brown fox jumps\"" + "the quick brown fox jumps") + (list + "highlight absent term" + "highlight [\"zzz\"] \"the quick brown fox jumps\"" + "the quick brown fox jumps") + (list + "highlight first token" + "highlight [\"the\"] \"the quick brown fox jumps\"" + "[the] quick brown fox jumps") + (list + "highlight normalizes text" + "highlight [\"quick\"] \"The Quick, brown!\"" + "the [quick] brown") + (list + "snippet around middle" + "snippet 1 [\"brown\"] \"the quick brown fox jumps\"" + "quick [brown] fox") + (list + "snippet at start" + "snippet 1 [\"the\"] \"the quick brown fox jumps\"" + "[the] quick brown") + (list + "snippet near end" + "snippet 1 [\"fox\"] \"the quick brown fox jumps\"" + "brown [fox] jumps") + (list + "snippet ctx zero" + "snippet 0 [\"brown\"] \"the quick brown fox jumps\"" + "[brown]") + (list + "snippet clamps at end" + "snippet 2 [\"jumps\"] \"the quick brown fox jumps\"" + "brown fox [jumps]") + (list + "snippet no match shows head" + "snippet 1 [\"zzz\"] \"the quick brown fox jumps\"" + "the quick brown") + (list + "snippet wide window" + "snippet 5 [\"brown\"] \"the quick brown fox jumps\"" + "the quick [brown] fox jumps"))) + +(define + hl-results + (search-batch "" (map (fn (c) (nth c 1)) hl-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth hl-results i) (nth c 2))) + hl-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index 791c04dc..b2702653 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -115,11 +115,14 @@ lib/search/index.sx lib/search/eval.sx `fuzzyDocs`, `fuzzyRankTfIdf`) — 18 tests - [x] result pagination (offset / limit) — `paginate`, `pageTfIdf`, `pageBm25`, `resultCount` — 12 tests -- [ ] snippet / highlight generation +- [x] snippet / highlight generation (`highlight`, `snippet`) — 12 tests - [ ] stemming (suffix stripping) — recall-improving normalizer ## Progress log +- **Extension: highlight/snippet (178/178 total).** `highlight terms text` marks + query-matching (normalized) tokens with [..]; `snippet ctx terms text` extracts a + context window around the first match. 12 tests. - **Extension: fuzzy matching (166/166 total).** Levenshtein `editDist` as an O(m*n) row-based DP (the naive recursive version is exponential and times out under load), `fuzzyTerms`/`fuzzyDocs`/`fuzzyRankTfIdf` expand a term to indexed terms within a max From 911a2f57c07ff99bef4e986a682b31af220cb6ea Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 22:50:19 +0000 Subject: [PATCH 11/15] search: stemming (suffix stripping) + 18 tests Deterministic English suffix stripping (stem), stemText/stemTokens, indexStemmed. Worked around two haskell-on-sx string gotchas: take/drop over a String yield char codes (rebuild via joinChars . map chr), and isSuffixOf's reverse trips ++ (manual suffix compare). 196/196. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 6 +++-- lib/search/conformance.conf | 2 ++ lib/search/scoreboard.json | 9 +++---- lib/search/scoreboard.md | 3 ++- lib/search/stem.sx | 15 ++++++++++++ lib/search/tests/stem.sx | 47 +++++++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 8 ++++++- 7 files changed, 82 insertions(+), 8 deletions(-) create mode 100644 lib/search/stem.sx create mode 100644 lib/search/tests/stem.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index 7abbe781..5ac85924 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -6,7 +6,7 @@ ;; rankTfIdf, rankBm25, topNTfIdf, topNBm25, fedIndex, aclFilter, searchTfIdfAcl, ;; topNTfIdfAcl, searchBm25Acl, prefixTerms, prefixDocs, prefixRankTfIdf, ;; paginate, pageTfIdf, pageBm25, resultCount, editDist, fuzzyTerms, fuzzyDocs, -;; fuzzyRankTfIdf, highlight, snippet. +;; fuzzyRankTfIdf, highlight, snippet, stem, stemText, stemTokens, indexStemmed. (define search/src @@ -29,4 +29,6 @@ "\n" search/fuzzy-src "\n" - search/highlight-src)) + search/highlight-src + "\n" + search/stem-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index 28c7ddf6..8c5375b7 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -28,6 +28,7 @@ PRELOADS=( lib/search/page.sx lib/search/fuzzy.sx lib/search/highlight.sx + lib/search/stem.sx lib/search/api.sx lib/search/testlib.sx ) @@ -42,4 +43,5 @@ SUITES=( "page:lib/search/tests/page.sx" "fuzzy:lib/search/tests/fuzzy.sx" "highlight:lib/search/tests/highlight.sx" + "stem:lib/search/tests/stem.sx" ) diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index a3ebb24c..4c88e5e3 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,8 +1,8 @@ { "lang": "search", - "total_passed": 178, + "total_passed": 196, "total_failed": 0, - "total": 178, + "total": 196, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, {"name":"boolean","passed":28,"failed":0,"total":28}, @@ -12,7 +12,8 @@ {"name":"prefix","passed":14,"failed":0,"total":14}, {"name":"page","passed":12,"failed":0,"total":12}, {"name":"fuzzy","passed":18,"failed":0,"total":18}, - {"name":"highlight","passed":12,"failed":0,"total":12} + {"name":"highlight","passed":12,"failed":0,"total":12}, + {"name":"stem","passed":18,"failed":0,"total":18} ], - "generated": "2026-06-06T22:07:05+00:00" + "generated": "2026-06-06T22:49:33+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index 767c5fc2..7e20b449 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,6 +1,6 @@ # search scoreboard -**178 / 178 passing** (0 failure(s)). +**196 / 196 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| @@ -13,3 +13,4 @@ | page | 12 | 12 | ok | | fuzzy | 18 | 18 | ok | | highlight | 12 | 12 | ok | +| stem | 18 | 18 | ok | diff --git a/lib/search/stem.sx b/lib/search/stem.sx new file mode 100644 index 00000000..816c3269 --- /dev/null +++ b/lib/search/stem.sx @@ -0,0 +1,15 @@ +;; search stemming — Haskell source fragment. Depends on tokenize + index. +;; Lightweight, deterministic English suffix stripping (recall-improving +;; normalizer). Rules are checked most-specific first; conservative length guards +;; avoid mangling short words. Not a full Porter stemmer. +;; Gotcha: take/drop over a String yield char CODES (ints), not char strings, so +;; rebuild strings with `stStr = joinChars . map chr`. (isSuffixOf's reverse also +;; trips `++` on the String representation, hence the manual stEnds.) +;; stem :: String -> String +;; stemText :: String -> String (tokenize + stem + rejoin) +;; stemTokens :: String -> [String] +;; indexStemmed:: DocId -> String -> Index -> Index (index the stemmed text) + +(define + search/stem-src + "stStr cs = joinChars (map chr cs)\nstEnds suf w = let n = length w in let m = length suf in if m > n then False else stStr (drop (n - m) w) == suf\nstDropEnd k w = stStr (take (length w - k) w)\nstem w = if stEnds \"ies\" w && length w >= 5 then stDropEnd 3 w ++ \"y\" else if stEnds \"ss\" w then w else if stEnds \"es\" w && length w >= 5 then stDropEnd 2 w else if stEnds \"s\" w && length w >= 4 then stDropEnd 1 w else if stEnds \"ing\" w && length w >= 6 then stDropEnd 3 w else if stEnds \"ed\" w && length w >= 5 then stDropEnd 2 w else w\nstemTokens s = map stem (tokens s)\nstemText s = unwords (stemTokens s)\nindexStemmed d text idx = indexDoc d (stemText text) idx\n") diff --git a/lib/search/tests/stem.sx b/lib/search/tests/stem.sx new file mode 100644 index 00000000..cffd6c36 --- /dev/null +++ b/lib/search/tests/stem.sx @@ -0,0 +1,47 @@ +;; Extension — stemming (suffix stripping). Scalar string results wrapped in []. + +(define + stem-cases + (list + (list "stem plural s" "[stem \"cats\"]" (list "cat")) + (list "stem plural dogs" "[stem \"dogs\"]" (list "dog")) + (list "stem keeps ss" "[stem \"pass\"]" (list "pass")) + (list "stem short s unchanged" "[stem \"is\"]" (list "is")) + (list "stem es boxes" "[stem \"boxes\"]" (list "box")) + (list "stem es wishes" "[stem \"wishes\"]" (list "wish")) + (list "stem ies cities" "[stem \"cities\"]" (list "city")) + (list "stem ies parties" "[stem \"parties\"]" (list "party")) + (list "stem ing jumping" "[stem \"jumping\"]" (list "jump")) + (list "stem ing running literal" "[stem \"running\"]" (list "runn")) + (list "stem ed jumped" "[stem \"jumped\"]" (list "jump")) + (list "stem ed wanted" "[stem \"wanted\"]" (list "want")) + (list "stem short ed unchanged" "[stem \"red\"]" (list "red")) + (list "stem no suffix" "[stem \"cat\"]" (list "cat")) + (list + "stemText normalizes and stems" + "[stemText \"Cats Running!\"]" + (list "cat runn")) + (list + "stemTokens list" + "stemTokens \"boxes and cats\"" + (list "box" "and" "cat")) + (list + "indexStemmed unifies plural" + "map fst (lookupTerm \"cat\" (indexStemmed 2 \"a cat\" (indexStemmed 1 \"the cats\" emptyIndex)))" + (list 1 2)) + (list + "indexStemmed stem query" + "map fst (lookupTerm (stem \"boxes\") (indexStemmed 1 \"many boxes\" emptyIndex))" + (list 1)))) + +(define + stem-results + (search-batch "" (map (fn (c) (nth c 1)) stem-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth stem-results i) (nth c 2))) + stem-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index b2702653..c2c71b7b 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -116,10 +116,16 @@ lib/search/index.sx lib/search/eval.sx - [x] result pagination (offset / limit) — `paginate`, `pageTfIdf`, `pageBm25`, `resultCount` — 12 tests - [x] snippet / highlight generation (`highlight`, `snippet`) — 12 tests -- [ ] stemming (suffix stripping) — recall-improving normalizer +- [x] stemming (suffix stripping) — `stem`, `stemText`, `stemTokens`, `indexStemmed` + — 18 tests ## Progress log +- **Extension: stemming (196/196 total).** Deterministic English suffix stripping + (`stem`), `stemText`/`stemTokens`, `indexStemmed`. Two haskell-on-sx gotchas: take/drop + over a String yield char CODES not char strings (rebuild via `joinChars . map chr`), + and isSuffixOf's `reverse` trips `++` on the String repr (manual suffix compare). All + five planned extensions now done; the loop can keep adding search vocabulary. 18 tests. - **Extension: highlight/snippet (178/178 total).** `highlight terms text` marks query-matching (normalized) tokens with [..]; `snippet ctx terms text` extracts a context window around the first match. 12 tests. From cf4e613e433c70ace0cfa4f9858bb09383478ae0 Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 23:01:42 +0000 Subject: [PATCH 12/15] search: proximity/NEAR search + 9 tests nearDocs k t1 t2 returns docs where both terms occur within k positions (unordered); candidates from the posting intersection, filtered on positional postings. 205/205. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 7 ++++-- lib/search/conformance.conf | 2 ++ lib/search/near.sx | 8 ++++++ lib/search/scoreboard.json | 9 ++++--- lib/search/scoreboard.md | 3 ++- lib/search/tests/near.sx | 49 +++++++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 4 +++ 7 files changed, 75 insertions(+), 7 deletions(-) create mode 100644 lib/search/near.sx create mode 100644 lib/search/tests/near.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index 5ac85924..84f5e943 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -6,7 +6,8 @@ ;; rankTfIdf, rankBm25, topNTfIdf, topNBm25, fedIndex, aclFilter, searchTfIdfAcl, ;; topNTfIdfAcl, searchBm25Acl, prefixTerms, prefixDocs, prefixRankTfIdf, ;; paginate, pageTfIdf, pageBm25, resultCount, editDist, fuzzyTerms, fuzzyDocs, -;; fuzzyRankTfIdf, highlight, snippet, stem, stemText, stemTokens, indexStemmed. +;; fuzzyRankTfIdf, highlight, snippet, stem, stemText, stemTokens, indexStemmed, +;; nearDocs. (define search/src @@ -31,4 +32,6 @@ "\n" search/highlight-src "\n" - search/stem-src)) + search/stem-src + "\n" + search/near-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index 8c5375b7..f92d61f5 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -29,6 +29,7 @@ PRELOADS=( lib/search/fuzzy.sx lib/search/highlight.sx lib/search/stem.sx + lib/search/near.sx lib/search/api.sx lib/search/testlib.sx ) @@ -44,4 +45,5 @@ SUITES=( "fuzzy:lib/search/tests/fuzzy.sx" "highlight:lib/search/tests/highlight.sx" "stem:lib/search/tests/stem.sx" + "near:lib/search/tests/near.sx" ) diff --git a/lib/search/near.sx b/lib/search/near.sx new file mode 100644 index 00000000..93893abc --- /dev/null +++ b/lib/search/near.sx @@ -0,0 +1,8 @@ +;; search proximity (NEAR) — Haskell source fragment. Depends on query (posIn, +;; docsWith, sortedInter). Finds docs where two terms occur within k positions of +;; each other (unordered), using the positional postings. +;; nearDocs :: Int -> Term -> Term -> Index -> [DocId] (sorted) + +(define + search/near-src + "nrAbsDiff a b = if a > b then a - b else b - a\nnrCloseTo k x [] = False\nnrCloseTo k x (y:ys) = if nrAbsDiff x y <= k then True else nrCloseTo k x ys\nnrAnyClose k [] ys = False\nnrAnyClose k (x:xs) ys = if nrCloseTo k x ys then True else nrAnyClose k xs ys\nnearInDoc k t1 t2 d idx = nrAnyClose k (posIn t1 d idx) (posIn t2 d idx)\nnearHere k t1 t2 idx d = nearInDoc k t1 t2 d idx\nnearDocs k t1 t2 idx = filter (nearHere k t1 t2 idx) (sortedInter (docsWith t1 idx) (docsWith t2 idx))\n") diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index 4c88e5e3..a7c01f7d 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,8 +1,8 @@ { "lang": "search", - "total_passed": 196, + "total_passed": 205, "total_failed": 0, - "total": 196, + "total": 205, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, {"name":"boolean","passed":28,"failed":0,"total":28}, @@ -13,7 +13,8 @@ {"name":"page","passed":12,"failed":0,"total":12}, {"name":"fuzzy","passed":18,"failed":0,"total":18}, {"name":"highlight","passed":12,"failed":0,"total":12}, - {"name":"stem","passed":18,"failed":0,"total":18} + {"name":"stem","passed":18,"failed":0,"total":18}, + {"name":"near","passed":9,"failed":0,"total":9} ], - "generated": "2026-06-06T22:49:33+00:00" + "generated": "2026-06-06T23:01:07+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index 7e20b449..985b7b97 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,6 +1,6 @@ # search scoreboard -**196 / 196 passing** (0 failure(s)). +**205 / 205 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| @@ -14,3 +14,4 @@ | fuzzy | 18 | 18 | ok | | highlight | 12 | 12 | ok | | stem | 18 | 18 | ok | +| near | 9 | 9 | ok | diff --git a/lib/search/tests/near.sx b/lib/search/tests/near.sx new file mode 100644 index 00000000..0caa32a8 --- /dev/null +++ b/lib/search/tests/near.sx @@ -0,0 +1,49 @@ +;; Extension — proximity (NEAR) search: terms within k positions, unordered. +;; Corpus: +;; 1 "the quick brown fox" the0 quick1 brown2 fox3 +;; 2 "quick the lazy fox dog" quick0 the1 lazy2 fox3 dog4 +;; 3 "fox runs quick" fox0 runs1 quick2 + +(define + near-setup + "idx = indexDoc 3 \"fox runs quick\" (indexDoc 2 \"quick the lazy fox dog\" (indexDoc 1 \"the quick brown fox\" emptyIndex))\n") + +(define + near-cases + (list + (list + "near adjacent one doc" + "nearDocs 1 \"quick\" \"brown\" idx" + (list 1)) + (list + "near adjacent both docs" + "nearDocs 1 \"quick\" \"the\" idx" + (list 1 2)) + (list + "near within 2" + "nearDocs 2 \"quick\" \"fox\" idx" + (list 1 3)) + (list "near too far at k1" "nearDocs 1 \"quick\" \"fox\" idx" (list)) + (list + "near unordered symmetric" + "nearDocs 2 \"fox\" \"quick\" idx" + (list 1 3)) + (list "near wider window" "nearDocs 5 \"the\" \"dog\" idx" (list 2)) + (list "near absent term" "nearDocs 1 \"quick\" \"zzz\" idx" (list)) + (list "near needs both terms" "nearDocs 3 \"brown\" \"dog\" idx" (list)) + (list + "near same docs only" + "nearDocs 3 \"fox\" \"runs\" idx" + (list 3)))) + +(define + near-results + (search-batch near-setup (map (fn (c) (nth c 1)) near-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth near-results i) (nth c 2))) + near-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index c2c71b7b..775aa82a 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -118,9 +118,13 @@ lib/search/index.sx lib/search/eval.sx - [x] snippet / highlight generation (`highlight`, `snippet`) — 12 tests - [x] stemming (suffix stripping) — `stem`, `stemText`, `stemTokens`, `indexStemmed` — 18 tests +- [x] proximity / NEAR — `nearDocs k t1 t2` (unordered, within k positions) — 9 tests ## Progress log +- **Extension: proximity/NEAR (205/205 total).** `nearDocs k t1 t2 idx` returns docs + where both terms occur within k positions (unordered), candidates = posting + intersection, filtered on the positional postings. 9 tests. - **Extension: stemming (196/196 total).** Deterministic English suffix stripping (`stem`), `stemText`/`stemTokens`, `indexStemmed`. Two haskell-on-sx gotchas: take/drop over a String yield char CODES not char strings (rebuild via `joinChars . map chr`), From cfa68c3db34c2d1405ebaaadb3369d50493d98f0 Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 23:27:03 +0000 Subject: [PATCH 13/15] search: synonym / query expansion + 9 tests A synonym map [(Term,[Term])] expands a query term to itself + synonyms (expandTerm); synDocs unions and synRankTfIdf ranks the expanded set. 214/214. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 6 +++-- lib/search/conformance.conf | 2 ++ lib/search/scoreboard.json | 9 ++++--- lib/search/scoreboard.md | 3 ++- lib/search/syn.sx | 10 +++++++ lib/search/tests/syn.sx | 53 +++++++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 4 +++ 7 files changed, 80 insertions(+), 7 deletions(-) create mode 100644 lib/search/syn.sx create mode 100644 lib/search/tests/syn.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index 84f5e943..14ba1774 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -7,7 +7,7 @@ ;; topNTfIdfAcl, searchBm25Acl, prefixTerms, prefixDocs, prefixRankTfIdf, ;; paginate, pageTfIdf, pageBm25, resultCount, editDist, fuzzyTerms, fuzzyDocs, ;; fuzzyRankTfIdf, highlight, snippet, stem, stemText, stemTokens, indexStemmed, -;; nearDocs. +;; nearDocs, expandTerm, synDocs, synRankTfIdf. (define search/src @@ -34,4 +34,6 @@ "\n" search/stem-src "\n" - search/near-src)) + search/near-src + "\n" + search/syn-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index f92d61f5..96d38540 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -30,6 +30,7 @@ PRELOADS=( lib/search/highlight.sx lib/search/stem.sx lib/search/near.sx + lib/search/syn.sx lib/search/api.sx lib/search/testlib.sx ) @@ -46,4 +47,5 @@ SUITES=( "highlight:lib/search/tests/highlight.sx" "stem:lib/search/tests/stem.sx" "near:lib/search/tests/near.sx" + "syn:lib/search/tests/syn.sx" ) diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index a7c01f7d..6f965309 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,8 +1,8 @@ { "lang": "search", - "total_passed": 205, + "total_passed": 214, "total_failed": 0, - "total": 205, + "total": 214, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, {"name":"boolean","passed":28,"failed":0,"total":28}, @@ -14,7 +14,8 @@ {"name":"fuzzy","passed":18,"failed":0,"total":18}, {"name":"highlight","passed":12,"failed":0,"total":12}, {"name":"stem","passed":18,"failed":0,"total":18}, - {"name":"near","passed":9,"failed":0,"total":9} + {"name":"near","passed":9,"failed":0,"total":9}, + {"name":"syn","passed":9,"failed":0,"total":9} ], - "generated": "2026-06-06T23:01:07+00:00" + "generated": "2026-06-06T23:25:35+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index 985b7b97..0f54edbb 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,6 +1,6 @@ # search scoreboard -**205 / 205 passing** (0 failure(s)). +**214 / 214 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| @@ -15,3 +15,4 @@ | highlight | 12 | 12 | ok | | stem | 18 | 18 | ok | | near | 9 | 9 | ok | +| syn | 9 | 9 | ok | diff --git a/lib/search/syn.sx b/lib/search/syn.sx new file mode 100644 index 00000000..6072cd65 --- /dev/null +++ b/lib/search/syn.sx @@ -0,0 +1,10 @@ +;; search synonym / query expansion — Haskell source fragment. Depends on index + +;; rank. A synonym map is an assoc list [(Term, [Term])]; a query term is expanded +;; to itself plus its synonyms, then the expanded set is unioned / ranked. +;; expandTerm :: [(Term,[Term])] -> Term -> [Term] +;; synDocs :: [(Term,[Term])] -> Term -> Index -> [DocId] +;; synRankTfIdf :: [(Term,[Term])] -> Term -> Index -> [DocId] + +(define + search/syn-src + "synLookup synmap t = case lookup t synmap of { Nothing -> [] ; Just ss -> ss }\nexpandTerm synmap t = t : synLookup synmap t\nsynDocs synmap t idx = foldl (candStep idx) [] (expandTerm synmap t)\nsynRankTfIdf synmap t idx = rankTfIdf (expandTerm synmap t) idx\n") diff --git a/lib/search/tests/syn.sx b/lib/search/tests/syn.sx new file mode 100644 index 00000000..aaeea7bd --- /dev/null +++ b/lib/search/tests/syn.sx @@ -0,0 +1,53 @@ +;; Extension — synonym / query expansion. +;; synmap: car -> automobile, vehicle ; big -> large +;; Corpus: 1 "fast car" 2 "shiny automobile" 3 "big truck" 4 "large house" 5 "vehicle review" + +(define + syn-setup + "synmap = [(\"car\", [\"automobile\", \"vehicle\"]), (\"big\", [\"large\"])]\nidx = indexDoc 5 \"vehicle review\" (indexDoc 4 \"large house\" (indexDoc 3 \"big truck\" (indexDoc 2 \"shiny automobile\" (indexDoc 1 \"fast car\" emptyIndex))))\n") + +(define + syn-cases + (list + (list + "expand term with synonyms" + "expandTerm synmap \"car\"" + (list "car" "automobile" "vehicle")) + (list + "expand single synonym" + "expandTerm synmap \"big\"" + (list "big" "large")) + (list "expand unknown term" "expandTerm synmap \"banana\"" (list "banana")) + (list + "syn docs union" + "synDocs synmap \"car\" idx" + (list 1 2 5)) + (list + "syn docs single synonym" + "synDocs synmap \"big\" idx" + (list 3 4)) + (list + "syn docs no synonyms" + "synDocs synmap \"house\" idx" + (list 4)) + (list "syn docs absent" "synDocs synmap \"plane\" idx" (list)) + (list + "syn rank expanded" + "synRankTfIdf synmap \"car\" idx" + (list 1 2 5)) + (list + "syn rank single" + "synRankTfIdf synmap \"big\" idx" + (list 3 4)))) + +(define + syn-results + (search-batch syn-setup (map (fn (c) (nth c 1)) syn-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth syn-results i) (nth c 2))) + syn-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index 775aa82a..2e62c53b 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -119,9 +119,13 @@ lib/search/index.sx lib/search/eval.sx - [x] stemming (suffix stripping) — `stem`, `stemText`, `stemTokens`, `indexStemmed` — 18 tests - [x] proximity / NEAR — `nearDocs k t1 t2` (unordered, within k positions) — 9 tests +- [x] synonym / query expansion — `expandTerm`, `synDocs`, `synRankTfIdf` — 9 tests ## Progress log +- **Extension: synonyms/query expansion (214/214 total).** A synonym map + `[(Term,[Term])]` expands a query term to itself + synonyms (`expandTerm`); `synDocs` + unions, `synRankTfIdf` ranks the expanded set. 9 tests. - **Extension: proximity/NEAR (205/205 total).** `nearDocs k t1 t2 idx` returns docs where both terms occur within k positions (unordered), candidates = posting intersection, filtered on the positional postings. 9 tests. From db2a5dc6ab255cceeca37b058332c8224924dc31 Mon Sep 17 00:00:00 2001 From: giles Date: Sat, 6 Jun 2026 23:58:37 +0000 Subject: [PATCH 14/15] search: boolean-filtered ranked search + 11 tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit searchRankTfIdf/searchRankBm25 parse a boolean query, filter docs via evalQuery, then rank survivors by relevance over the query's leaf terms (queryTerms) — the filter-then-rank pattern. 225/225. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 7 ++-- lib/search/conformance.conf | 2 ++ lib/search/rankq.sx | 11 ++++++ lib/search/scoreboard.json | 9 ++--- lib/search/scoreboard.md | 3 +- lib/search/tests/rankq.sx | 67 +++++++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 6 ++++ 7 files changed, 98 insertions(+), 7 deletions(-) create mode 100644 lib/search/rankq.sx create mode 100644 lib/search/tests/rankq.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index 14ba1774..29f445af 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -7,7 +7,8 @@ ;; topNTfIdfAcl, searchBm25Acl, prefixTerms, prefixDocs, prefixRankTfIdf, ;; paginate, pageTfIdf, pageBm25, resultCount, editDist, fuzzyTerms, fuzzyDocs, ;; fuzzyRankTfIdf, highlight, snippet, stem, stemText, stemTokens, indexStemmed, -;; nearDocs, expandTerm, synDocs, synRankTfIdf. +;; nearDocs, expandTerm, synDocs, synRankTfIdf, queryTerms, searchRankTfIdf, +;; searchRankBm25. (define search/src @@ -36,4 +37,6 @@ "\n" search/near-src "\n" - search/syn-src)) + search/syn-src + "\n" + search/rankq-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index 96d38540..9c7b006e 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -31,6 +31,7 @@ PRELOADS=( lib/search/stem.sx lib/search/near.sx lib/search/syn.sx + lib/search/rankq.sx lib/search/api.sx lib/search/testlib.sx ) @@ -48,4 +49,5 @@ SUITES=( "stem:lib/search/tests/stem.sx" "near:lib/search/tests/near.sx" "syn:lib/search/tests/syn.sx" + "rankq:lib/search/tests/rankq.sx" ) diff --git a/lib/search/rankq.sx b/lib/search/rankq.sx new file mode 100644 index 00000000..77b70468 --- /dev/null +++ b/lib/search/rankq.sx @@ -0,0 +1,11 @@ +;; search boolean-filtered ranked search — Haskell source fragment. +;; Depends on parse (parseQuery/Query), query (evalQuery), rank (tfidfDoc/bm25Doc/ +;; cmpScore). Filters by the boolean query, then ranks the surviving docs by +;; relevance over the query's leaf terms — the real-world filter-then-rank pattern. +;; queryTerms :: Query -> [Term] +;; searchRankTfIdf :: String -> Index -> [DocId] +;; searchRankBm25 :: Float -> Float -> String -> Index -> [DocId] + +(define + search/rankq-src + "queryTerms (Term t) = [t]\nqueryTerms (And a b) = queryTerms a ++ queryTerms b\nqueryTerms (Or a b) = queryTerms a ++ queryTerms b\nqueryTerms (Not a) = queryTerms a\nqueryTerms (Phrase ts) = ts\nmkSubPair f terms idx d = (f terms idx d, d)\nrankSubsetWith f terms docs idx = map snd (sortBy cmpScore (map (mkSubPair f terms idx) docs))\nsearchRankTfIdf s idx = let q = parseQuery s in rankSubsetWith tfidfDoc (queryTerms q) (evalQuery idx q) idx\nsearchRankBm25 k1 b s idx = let q = parseQuery s in rankSubsetWith (bm25Doc k1 b) (queryTerms q) (evalQuery idx q) idx\n") diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index 6f965309..3ea5b5ee 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,8 +1,8 @@ { "lang": "search", - "total_passed": 214, + "total_passed": 225, "total_failed": 0, - "total": 214, + "total": 225, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, {"name":"boolean","passed":28,"failed":0,"total":28}, @@ -15,7 +15,8 @@ {"name":"highlight","passed":12,"failed":0,"total":12}, {"name":"stem","passed":18,"failed":0,"total":18}, {"name":"near","passed":9,"failed":0,"total":9}, - {"name":"syn","passed":9,"failed":0,"total":9} + {"name":"syn","passed":9,"failed":0,"total":9}, + {"name":"rankq","passed":11,"failed":0,"total":11} ], - "generated": "2026-06-06T23:25:35+00:00" + "generated": "2026-06-06T23:58:05+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index 0f54edbb..2cc7fd9c 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,6 +1,6 @@ # search scoreboard -**214 / 214 passing** (0 failure(s)). +**225 / 225 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| @@ -16,3 +16,4 @@ | stem | 18 | 18 | ok | | near | 9 | 9 | ok | | syn | 9 | 9 | ok | +| rankq | 11 | 11 | ok | diff --git a/lib/search/tests/rankq.sx b/lib/search/tests/rankq.sx new file mode 100644 index 00000000..dd360310 --- /dev/null +++ b/lib/search/tests/rankq.sx @@ -0,0 +1,67 @@ +;; Extension — boolean-filtered ranked search (filter then rank by relevance). +;; Corpus: +;; 1 "apple apple banana" apple2 banana1 +;; 2 "apple cherry" apple1 cherry1 +;; 3 "banana cherry" banana1 cherry1 +;; 4 "apple banana cherry" apple1 banana1 cherry1 + +(define + rankq-setup + "idx = indexDoc 4 \"apple banana cherry\" (indexDoc 3 \"banana cherry\" (indexDoc 2 \"apple cherry\" (indexDoc 1 \"apple apple banana\" emptyIndex)))\n") + +(define + rankq-cases + (list + (list + "queryTerms and" + "queryTerms (parseQuery \"apple AND banana\")" + (list "apple" "banana")) + (list + "queryTerms or not" + "queryTerms (parseQuery \"a OR NOT b\")" + (list "a" "b")) + (list + "queryTerms phrase" + "queryTerms (parseQuery \"\\\"x y\\\" OR z\")" + (list "x" "y" "z")) + (list + "and filter ranked by tf" + "searchRankTfIdf \"apple AND banana\" idx" + (list 1 4)) + (list + "single term ranked tie" + "searchRankTfIdf \"cherry\" idx" + (list 2 3 4)) + (list + "or filter ranked" + "searchRankTfIdf \"apple OR banana\" idx" + (list 1 4 2 3)) + (list + "and-not narrows then ranks" + "searchRankTfIdf \"apple AND NOT banana\" idx" + (list 2)) + (list + "phrase filter ranked" + "searchRankTfIdf \"\\\"apple banana\\\"\" idx" + (list 1 4)) + (list "no matches" "searchRankTfIdf \"zzz\" idx" (list)) + (list + "bm25 boolean ranked subset" + "sort (searchRankBm25 1.5 0.75 \"apple OR banana\" idx)" + (list 1 2 3 4)) + (list + "bm25 and filter" + "searchRankBm25 1.5 0.75 \"apple AND NOT banana\" idx" + (list 2)))) + +(define + rankq-results + (search-batch rankq-setup (map (fn (c) (nth c 1)) rankq-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth rankq-results i) (nth c 2))) + rankq-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index 2e62c53b..cf8c530a 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -120,9 +120,15 @@ lib/search/index.sx lib/search/eval.sx — 18 tests - [x] proximity / NEAR — `nearDocs k t1 t2` (unordered, within k positions) — 9 tests - [x] synonym / query expansion — `expandTerm`, `synDocs`, `synRankTfIdf` — 9 tests +- [x] boolean-filtered ranked search — `queryTerms`, `searchRankTfIdf`, + `searchRankBm25` (filter by boolean query, rank survivors by relevance) — 11 tests ## Progress log +- **Extension: boolean-filtered ranked search (225/225 total).** `searchRankTfIdf`/ + `searchRankBm25` parse a boolean query, filter docs via evalQuery, then rank the + survivors by relevance over the query's leaf terms (`queryTerms`) — the real-world + filter-then-rank pattern. 11 tests. - **Extension: synonyms/query expansion (214/214 total).** A synonym map `[(Term,[Term])]` expands a query term to itself + synonyms (`expandTerm`); `synDocs` unions, `synRankTfIdf` ranks the expanded set. 9 tests. From 5d62d08e1c5786b25863bd4f19539631fc5fb568 Mon Sep 17 00:00:00 2001 From: giles Date: Sun, 7 Jun 2026 00:46:22 +0000 Subject: [PATCH 15/15] search: did-you-mean spelling suggestion + 9 tests suggest/suggestN rank indexed terms by edit distance to a (misspelled) query term, alphabetical tiebreak. 234/234. Co-Authored-By: Claude Opus 4.8 (1M context) --- lib/search/api.sx | 6 ++++-- lib/search/conformance.conf | 2 ++ lib/search/scoreboard.json | 9 ++++---- lib/search/scoreboard.md | 3 ++- lib/search/suggest.sx | 9 ++++++++ lib/search/tests/suggest.sx | 42 +++++++++++++++++++++++++++++++++++++ plans/search-on-sx.md | 5 +++++ 7 files changed, 69 insertions(+), 7 deletions(-) create mode 100644 lib/search/suggest.sx create mode 100644 lib/search/tests/suggest.sx diff --git a/lib/search/api.sx b/lib/search/api.sx index 29f445af..dd66031b 100644 --- a/lib/search/api.sx +++ b/lib/search/api.sx @@ -8,7 +8,7 @@ ;; paginate, pageTfIdf, pageBm25, resultCount, editDist, fuzzyTerms, fuzzyDocs, ;; fuzzyRankTfIdf, highlight, snippet, stem, stemText, stemTokens, indexStemmed, ;; nearDocs, expandTerm, synDocs, synRankTfIdf, queryTerms, searchRankTfIdf, -;; searchRankBm25. +;; searchRankBm25, suggestN, suggest. (define search/src @@ -39,4 +39,6 @@ "\n" search/syn-src "\n" - search/rankq-src)) + search/rankq-src + "\n" + search/suggest-src)) diff --git a/lib/search/conformance.conf b/lib/search/conformance.conf index 9c7b006e..ec0fa631 100644 --- a/lib/search/conformance.conf +++ b/lib/search/conformance.conf @@ -32,6 +32,7 @@ PRELOADS=( lib/search/near.sx lib/search/syn.sx lib/search/rankq.sx + lib/search/suggest.sx lib/search/api.sx lib/search/testlib.sx ) @@ -50,4 +51,5 @@ SUITES=( "near:lib/search/tests/near.sx" "syn:lib/search/tests/syn.sx" "rankq:lib/search/tests/rankq.sx" + "suggest:lib/search/tests/suggest.sx" ) diff --git a/lib/search/scoreboard.json b/lib/search/scoreboard.json index 3ea5b5ee..d548e4b3 100644 --- a/lib/search/scoreboard.json +++ b/lib/search/scoreboard.json @@ -1,8 +1,8 @@ { "lang": "search", - "total_passed": 225, + "total_passed": 234, "total_failed": 0, - "total": 225, + "total": 234, "suites": [ {"name":"index","passed":18,"failed":0,"total":18}, {"name":"boolean","passed":28,"failed":0,"total":28}, @@ -16,7 +16,8 @@ {"name":"stem","passed":18,"failed":0,"total":18}, {"name":"near","passed":9,"failed":0,"total":9}, {"name":"syn","passed":9,"failed":0,"total":9}, - {"name":"rankq","passed":11,"failed":0,"total":11} + {"name":"rankq","passed":11,"failed":0,"total":11}, + {"name":"suggest","passed":9,"failed":0,"total":9} ], - "generated": "2026-06-06T23:58:05+00:00" + "generated": "2026-06-07T00:44:05+00:00" } diff --git a/lib/search/scoreboard.md b/lib/search/scoreboard.md index 2cc7fd9c..4a59608e 100644 --- a/lib/search/scoreboard.md +++ b/lib/search/scoreboard.md @@ -1,6 +1,6 @@ # search scoreboard -**225 / 225 passing** (0 failure(s)). +**234 / 234 passing** (0 failure(s)). | Suite | Passed | Total | Status | |-------|--------|-------|--------| @@ -17,3 +17,4 @@ | near | 9 | 9 | ok | | syn | 9 | 9 | ok | | rankq | 11 | 11 | ok | +| suggest | 9 | 9 | ok | diff --git a/lib/search/suggest.sx b/lib/search/suggest.sx new file mode 100644 index 00000000..7b06b1fb --- /dev/null +++ b/lib/search/suggest.sx @@ -0,0 +1,9 @@ +;; search did-you-mean / spelling suggestion — Haskell source fragment. +;; Depends on fuzzy (editDist) + index (allTerms). Ranks indexed terms by edit +;; distance to a (possibly misspelled) query term; ties broken alphabetically. +;; suggestN :: Int -> String -> Index -> [Term] +;; suggest :: String -> Index -> Term ("" if the index has no terms) + +(define + search/suggest-src + "sgMk term t = (editDist term t, t)\nsgPairs term idx = map (sgMk term) (allTerms idx)\nsgCmp p1 p2 = if fst p1 < fst p2 then LT else if fst p1 > fst p2 then GT else compare (snd p1) (snd p2)\nsuggestN n term idx = take n (map snd (sortBy sgCmp (sgPairs term idx)))\nsgHead [] = \"\"\nsgHead (x:xs) = x\nsuggest term idx = sgHead (suggestN 1 term idx)\n") diff --git a/lib/search/tests/suggest.sx b/lib/search/tests/suggest.sx new file mode 100644 index 00000000..164b43ec --- /dev/null +++ b/lib/search/tests/suggest.sx @@ -0,0 +1,42 @@ +;; Extension — did-you-mean / spelling suggestion. +;; Corpus terms (sorted): ample apple apply banana orange + +(define + suggest-setup + "idx = indexDoc 1 \"apple apply ample banana orange\" emptyIndex\n") + +(define + suggest-cases + (list + (list "suggest exact term" "[suggest \"apple\" idx]" (list "apple")) + (list + "suggest misspelled banana" + "[suggest \"bananna\" idx]" + (list "banana")) + (list + "suggest missing letter orange" + "[suggest \"orang\" idx]" + (list "orange")) + (list "suggest closest apply" "[suggest \"aply\" idx]" (list "apply")) + (list "suggestN 1 banana" "suggestN 1 \"bananna\" idx" (list "banana")) + (list + "suggestN 2 ties alpha" + "suggestN 2 \"aple\" idx" + (list "ample" "apple")) + (list "suggest empty term shortest" "[suggest \"\" idx]" (list "ample")) + (list "suggest empty index" "[suggest \"apple\" emptyIndex]" (list "")) + (list "suggestN empty index" "suggestN 1 \"apple\" emptyIndex" (list)))) + +(define + suggest-results + (search-batch + suggest-setup + (map (fn (c) (nth c 1)) suggest-cases))) + +(map-indexed + (fn + (i c) + (hk-test (nth c 0) (nth suggest-results i) (nth c 2))) + suggest-cases) + +{:fail hk-test-fail :pass hk-test-pass :fails hk-test-fails} diff --git a/plans/search-on-sx.md b/plans/search-on-sx.md index cf8c530a..4cd93e8f 100644 --- a/plans/search-on-sx.md +++ b/plans/search-on-sx.md @@ -122,9 +122,14 @@ lib/search/index.sx lib/search/eval.sx - [x] synonym / query expansion — `expandTerm`, `synDocs`, `synRankTfIdf` — 9 tests - [x] boolean-filtered ranked search — `queryTerms`, `searchRankTfIdf`, `searchRankBm25` (filter by boolean query, rank survivors by relevance) — 11 tests +- [x] did-you-mean / spelling suggestion — `suggest`, `suggestN` (closest indexed + terms by edit distance, alphabetical tiebreak) — 9 tests ## Progress log +- **Extension: did-you-mean / spelling suggestion (234/234 total).** `suggest`/`suggestN` + rank indexed terms by edit distance to a (misspelled) query term, alphabetical + tiebreak. 9 tests. - **Extension: boolean-filtered ranked search (225/225 total).** `searchRankTfIdf`/ `searchRankBm25` parse a boolean query, filter docs via evalQuery, then rank the survivors by relevance over the query's leaf terms (`queryTerms`) — the real-world