search: Phase 3 ranking TF-IDF + BM25 + top-N + 23 tests

Adds rankTfIdf and rankBm25 (configurable k1/b) over the candidate set (docs containing any query term), with float scores and a deterministic DocId-ascending tiebreak, plus topNTfIdf/topNBm25. df/idf are derived from posting-list length. Tests cover tf/idf behavior, a BM25-vs-TF-IDF flip from length-norm + tf-saturation, the b-parameter effect, and tiebreak stability. Conformance: 101/101.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:56:50 +00:00
parent 4c84decc01
commit a3f9d4f6c9
7 changed files with 132 additions and 14 deletions
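
For readers without the source at hand, the ranking described in the commit message can be sketched in plain Haskell as below. This is an illustrative sketch only, not the code in `lib/search`: the `Index` record, the `tf` and `rankWith` helpers, the `demo` corpus, and all signatures are assumptions, and the BM25 idf uses the common `log (1 + (N - df + 0.5) / (df + 0.5))` variant, which may differ from what the repository implements. Only the names `docFreq`, `idf`, `bm25idf`, `rankTfIdf`, `rankBm25` and `topNBm25` are taken from the commit.

```haskell
module Main where

import Data.List (nub, sortBy)
import Data.Ord (Down (..), comparing)
import qualified Data.Map.Strict as M

type DocId = Int
type Term  = String

-- Hypothetical index shape: postings of (DocId, term frequency) plus doc lengths.
data Index = Index
  { postings :: M.Map Term [(DocId, Int)]
  , docLens  :: M.Map DocId Int
  }

postingsOf :: Index -> Term -> [(DocId, Int)]
postingsOf ix t = M.findWithDefault [] t (postings ix)

-- Document frequency is just the posting-list length; no separate df store.
docFreq :: Index -> Term -> Int
docFreq ix = length . postingsOf ix

tf :: Index -> Term -> DocId -> Double
tf ix t d = maybe 0 fromIntegral (lookup d (postingsOf ix t))

numDocs :: Index -> Double
numDocs = fromIntegral . M.size . docLens

idf :: Index -> Term -> Double
idf ix t
  | df == 0   = 0
  | otherwise = log (numDocs ix / fromIntegral df)
  where df = docFreq ix t

-- Assumed idf variant for BM25 (always non-negative).
bm25idf :: Index -> Term -> Double
bm25idf ix t = log (1 + (numDocs ix - df + 0.5) / (df + 0.5))
  where df = fromIntegral (docFreq ix t)

-- Candidate set: every document containing at least one query term.
candidates :: Index -> [Term] -> [DocId]
candidates ix ts = nub [d | t <- ts, (d, _) <- postingsOf ix t]

-- Sort by score descending, then DocId ascending as a deterministic tiebreak.
rankWith :: (DocId -> Double) -> [DocId] -> [(DocId, Double)]
rankWith score ds =
  sortBy (comparing (Down . snd) <> comparing fst) [(d, score d) | d <- ds]

rankTfIdf :: Index -> [Term] -> [(DocId, Double)]
rankTfIdf ix ts = rankWith score (candidates ix ts)
  where score d = sum [tf ix t d * idf ix t | t <- ts]

rankBm25 :: Double -> Double -> Index -> [Term] -> [(DocId, Double)]
rankBm25 k1 b ix ts = rankWith score (candidates ix ts)
  where
    avgLen  = fromIntegral (sum (M.elems (docLens ix))) / numDocs ix
    len d   = fromIntegral (M.findWithDefault 0 d (docLens ix))
    norm d  = 1 - b + b * len d / avgLen
    score d = sum [ bm25idf ix t * f * (k1 + 1) / (f + k1 * norm d)
                  | t <- ts, let f = tf ix t d, f > 0 ]

-- Top-N as sort + take, mirroring the sortBy+take approach in the plan.
topNBm25 :: Int -> Double -> Double -> Index -> [Term] -> [(DocId, Double)]
topNBm25 n k1 b ix = take n . rankBm25 k1 b ix

-- Made-up corpus: doc 2 is long and repeats "cat"; doc 4 matches nothing.
demo :: Index
demo = Index
  { postings = M.fromList [("cat", [(1, 1), (2, 3)]), ("dog", [(2, 1), (3, 1)])]
  , docLens  = M.fromList [(1, 2), (2, 30), (3, 2), (4, 2)]
  }

main :: IO ()
main = do
  print (map fst (rankTfIdf demo ["cat"]))          -- [2,1]: raw tf favors doc 2
  print (map fst (rankBm25 1.2 0.75 demo ["cat"]))  -- [1,2]: length-norm flips it
  print (map fst (rankBm25 1.2 0 demo ["cat"]))     -- [2,1]: b = 0 disables length-norm
  print (map fst (rankTfIdf demo ["dog"]))          -- [2,3]: equal scores, DocId ascending
```

On this toy corpus the long document wins under TF-IDF because its raw term frequency is higher, while BM25 with `b = 0.75` penalizes its length and saturates the repeated term, so the short document moves ahead. Setting `b = 0` removes the length penalty and restores the TF-IDF order.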
--- a/plans/search-on-sx.md
+++ b/plans/search-on-sx.md
@@ -10,7 +10,7 @@ extension that merges per-peer indices.

 ## Status (rolling)

-`bash lib/search/conformance.sh` → **78/78** (Phases 1–2 complete)
+`bash lib/search/conformance.sh` → **101/101** (Phases 1–3 complete)

 ## Ground rules

@@ -89,12 +89,13 @@ lib/search/index.sx                     lib/search/eval.sx

 ## Phase 3 — Ranking

- [ ] document frequency tracking — extend index with `df` per term
- [ ] TF-IDF scoring
- [ ] BM25 scoring (configurable k1, b)
- [ ] top-N retrieval (heap-based)
- [ ] `lib/search/tests/rank.sx` — 20+ cases: TF-IDF behavior, BM25 vs TF-IDF,
-  ranking stability, top-N correctness
+- [x] document frequency — `docFreq`/`idf`/`bm25idf` derived from the index
+  (posting-list length); no separate df store needed
+- [x] TF-IDF scoring (`rankTfIdf`)
+- [x] BM25 scoring, configurable k1/b (`rankBm25 k1 b`)
+- [x] top-N retrieval (`topNTfIdf`/`topNBm25` — sortBy + take; stable DocId tiebreak)
+- [x] `lib/search/tests/rank.sx` — 23 cases: TF-IDF tf/idf behavior, BM25 length-norm
+  + tf-saturation flips vs TF-IDF, b-parameter effect, tiebreak stability, top-N

 ## Phase 4 — ACL filter + federation

@@ -105,6 +106,12 @@ lib/search/index.sx                     lib/search/eval.sx

 ## Progress log

+- **Phase 3 complete — ranking (101/101 total).** TF-IDF (`rankTfIdf`) and BM25
+  (`rankBm25 k1 b`) over the candidate set (docs containing any query term), scores
+  as floats with deterministic DocId-ascending tiebreak; `topNTfIdf`/`topNBm25` via
+  sortBy+take. df/idf derived from posting-list length (no separate df store). 23
+  tests incl. a BM25-vs-TF-IDF flip (length-norm + tf-saturation) and the b-parameter
+  effect. Float division/`log`/float literals all work in haskell-on-sx.
 - **Phase 2 complete — parser (78/78 total).** Query tokenizer (ord-based
  delimiters, quoted phrases) + recursive-descent parser with OR<AND<NOT precedence,
  implicit AND on adjacency, parens, case-insensitive keywords. `parseQuery`,