search: Phase 1 tokenizer + inverted index + 18 tests
Some checks failed
Test, Build, and Deploy / test-build-deploy (push) Failing after 53s

Tokenizer (lowercase, strip punctuation, positions) and a sorted assoc-list
inverted index [(Term,[(DocId,[Pos])])] with indexDoc/deleteDoc/lookupTerm/
docFreq/allTerms. Search lib is haskell-on-sx source assembled into search/src;
tests reuse hk-test counters via a search-eval helper. conformance.sh models
lib/haskell.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-06 18:21:49 +00:00
parent e2de5a4675
commit b8cf3eb1b8
10 changed files with 252 additions and 11 deletions

View File

@@ -10,7 +10,7 @@ extension that merges per-peer indices.
## Status (rolling)
`bash lib/search/conformance.sh`**0/0** (not yet started)
`bash lib/search/conformance.sh`**18/18** (Phase 1 complete)
## Ground rules
@@ -61,15 +61,18 @@ lib/search/index.sx lib/search/eval.sx
## Phase 1 — Tokenize + index
- [ ] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on
- [x] `lib/search/tokenize.sx` — normalize (lowercase, strip punctuation), split on
whitespace, return positions
- [ ] `lib/search/index.sx` — inverted index data structure (typed `Map` from
haskell lib); `insert`, `delete`, `lookup`
- [ ] `lib/search/api.sx``(search/index doc)`, `(search/lookup term)`
- [ ] `lib/search/tests/index.sx` — 15+ cases: tokenize, insert + lookup, update,
delete, multi-doc
- [ ] `lib/search/scoreboard.{json,md}`
- [ ] `lib/search/conformance.sh`
- [x] `lib/search/index.sx` — inverted index data structure; `indexDoc`, `deleteDoc`,
`lookupTerm`, `docFreq`, `allTerms`. (Data.Map's public API lacks
toList/keys/map/filter, so a sorted assoc-list `[(Term,[(DocId,[Pos])])]` is used —
the conceptual `Map Term [(DocId,[Pos])]` with free term iteration.)
- [x] `lib/search/api.sx` — assembles `search/src` (tokenize + index); Haskell entry
points `indexDoc` / `lookupTerm`
- [x] `lib/search/tests/index.sx` — 18 cases: tokenize, insert + lookup, update,
delete, multi-doc, positions, docFreq, allTerms
- [x] `lib/search/scoreboard.{json,md}`
- [x] `lib/search/conformance.sh`
## Phase 2 — Query AST + boolean evaluation
@@ -99,8 +102,19 @@ lib/search/index.sx lib/search/eval.sx
## Progress log
(loop fills this in)
- **Phase 1 complete (18/18).** Tokenizer (lowercase + strip punctuation + positions),
inverted index as sorted assoc-list `[(Term,[(DocId,[Pos])])]`, indexDoc/deleteDoc/
lookupTerm/docFreq/allTerms. Search lib is Haskell source assembled into `search/src`
and evaluated via the haskell-on-sx interpreter; tests reuse `hk-test` counters and a
`search-eval` helper that forces HK values to plain SX. conformance.sh models
lib/haskell (MODE=counters, COUNTERS_PASS/FAIL=hk-test-pass/fail).
## Blockers
(loop fills this in)
- **None.** Note: the box is heavily CPU-oversubscribed by sibling loop agents
(load ~11 on 2 cores); each program eval is ~10× slower than nominal, so suite
timeout is set to 600s. Runs are correct, just slow.
- **Data.Map public API gap (informational, not fixing):** the haskell-on-sx
`import Data.Map` binds only empty/singleton/insert/lookup/member/size/null/delete/
insertWith/adjust/findWithDefault — no toList/keys/elems/map/filter/unionWith. Index
uses a pure assoc-list instead so term iteration and federation merge stay simple.