diff --git a/plans/designs/e37-tokenizer-api.md b/plans/designs/e37-tokenizer-api.md new file mode 100644 index 00000000..4dbce10d --- /dev/null +++ b/plans/designs/e37-tokenizer-api.md @@ -0,0 +1,392 @@ +# E37 — Tokenizer-as-API + +Cluster 37 of `plans/hs-conformance-to-100.md`. 17 tests in +`hs-upstream-core/tokenizer`. All 17 are emitted as `SKIP +(untranslated)` by `tests/playwright/generate-sx-tests.py`: the JS +bodies call `_hyperscript.internals.tokenizer.tokenize(...)` and +inspect a token-stream surface the SX port does not expose. + +Work breaks into: (1) an SX API over the existing `hs-tokenize` +mimicking the upstream stream object; (2) a compatibility shim over +token fields; (3) a generator pattern recognising +`_hyperscript.internals.tokenizer.tokenize(src[, templateMode])`. No +tokenizer-grammar rewrite is required. Position tracking +(`start/end/line/column`) is scoped to E38 (SourceInfo API). + +## 1. Failing tests + +Every eval-only test calls `_hyperscript.internals.tokenizer.tokenize` +plus one or more of `.token(i)`, `.consumeToken()`, `.hasMore()`, +`.list`, `.type`, `.value`, `.op`. + +1. **handles $ in template properly** — `tokenize('"', true).token(0).value` → `'"'`. templateMode + `token(i)`. +2. **handles all special escapes** — 6 × `tokenize('"\\X"').consumeToken().value` for `\b \f \n \r \t \v`. +3. **handles basic token types** — 15 asserts for `IDENTIFIER NUMBER CLASS_REF ID_REF STRING`; includes `1e6`, `1e-6`, `1.1e6`, `1.1e-6`; plus `.hasMore()`. +4. **handles class identifiers** — 9 `.a`-style; uses `.consumeToken()` and `.list[3]/.list[4]`. +5. **handles comments properly** — 13 asserts on `tokenize(src).list.length`; `--` / `//` to EOL emit nothing. +6. **handles hex escapes** — 3 `\\xNN` decodes + 4 error-path asserts matching `/Invalid hexadecimal escape/`. +7. **handles id references** — mirror of 4 for `#a` → `ID_REF`. +8. 
**handles identifiers properly** — whitespace + comment skipping between multiple `consumeToken()` calls. +9. **handles identifiers with numbers** — `f1oo / fo1o / foo1` → `IDENTIFIER`. +10. **handles look ahead property** — `tokenize("a 1 + 1").token(0..4)` → `["a" "1" "+" "1" "<<>>"]`. +11. **handles numbers properly** — 8 asserts incl. `1.1.1` → `NUMBER PERIOD NUMBER`. +12. **handles operators properly** — iterates 27 ops (`+ - * . \\ : % | ! ? # & ; , ( ) < > { } [ ] = <= >= == ===`) asserting `token.op === true` and `token.value === key`. +13. **handles strings properly** — single/double quotes, embedded other-quote, escaped same-quote, + two unterminated throws matching `/Unterminated string/`. +14. **handles strings properly 2** — subset of 13. +15. **handles template bootstrap** — 5 `tokenize(src, true)` cases asserting the lexical char-level stream (`"`, `$`, `{`, inner, `}`, `"`). +16. **handles whitespace properly** — 16 asserts on `.list.length` for space / `\n` / `\r` / `\t`. +17. **string interpolation isnt surprising** — DOM-shaped (not eval-only); asserts `\$`/`\${` escapes in templates. Touches `read-template`, not the stream API. + +## 2. Upstream API shape + +From `https://hyperscript.org/docs/#api` and +`node_modules/hyperscript.org/src/_hyperscript.js`: + +```js +const tokens = _hyperscript.internals.tokenizer.tokenize(src, templateMode?) +// → { list, source, hasMore, matchTokenType, token, consumeToken, +// requireTokenType, ... } +tokens.list // Array — lookahead window +tokens.source // original src string +tokens.token(i) // i-th un-consumed token (0 = current); returns + // { type: "EOF", value: "<<>>" } past end +tokens.consumeToken() // shift + return; throws on empty for required +tokens.hasMore() // true if a non-EOF token remains +tokens.matchTokenType(type) / requireTokenType(type) / etc. +``` + +Each `Token` is: + +```js +{ + type: "IDENTIFIER" | "NUMBER" | "STRING" | "CLASS_REF" + | "ID_REF" | "EOF" | "PLUS" | "MINUS" | ... 
/* op names */, + value: string, + op: boolean, // true for punctuation/operator tokens + start: number, // char offset + end: number, + line: number, + column: number, + source: string, // reference to full src +} +``` + +The conformance tests only read `type`, `value`, `op`, and occasionally +random-index into `.list`. They never read `start/end/line/column`, so +position tracking is **not** required for cluster E37. + +## 3. Proposed SX surface + +Add three things to `lib/hyperscript/runtime.sx` (exposed by name, so +SX test bodies can call them directly through `eval-hs` or `assert=`): + +``` +(hs-tokens-of src) ; => dict — new token-stream object +(hs-tokens-of src :template) ; templateMode variant +(hs-token-type tok) ; upstream-style type name +(hs-token-value tok) ; string value +(hs-token-op? tok) ; bool +``` + +A token stream is a mutable dict: + +``` +{ :source src + :list (list-of-tokens) ; upstream-shaped, :type :value :op + :pos 0 } ; cursor into :list +``` + +With three pure-SX consumer helpers: + +``` +(hs-stream-token stream i) ; lookahead; returns EOF sentinel past end +(hs-stream-consume stream) ; returns current token, advances :pos +(hs-stream-has-more stream) ; not EOF and pos < len +``` + +### Worked example + +``` +(let ((s (hs-tokens-of "1.1"))) + (hs-token-type (hs-stream-consume s))) ; => "NUMBER" + +(let ((s (hs-tokens-of "a 1 + 1"))) + (list (hs-token-value (hs-stream-token s 0)) ; "a" + (hs-token-value (hs-stream-token s 4)))) ; "<<>>" +``` + +All helpers are ordinary `define`s — no platform primitives, no FFI. +The generator emits them as bare calls inside `deftest` bodies. + +## 4. Runtime architecture + +The existing `hs-tokenize` emits tokens with: + +``` +{ :type "keyword" | "ident" | "number" | "string" | "class" | "id" + | "op" | "paren-open" | ... | "eof" + :value V + :pos P } +``` + +The upstream contract uses `SCREAMING_SNAKE_CASE` and a dedicated +boolean `.op` flag rather than a merged type/punctuation taxonomy. 
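To make the mismatch concrete, here is a hypothetical pair of token values for `(` in both shapes (field names follow the structures shown above; the helper name is illustrative, not part of the plan):

```python
# Hypothetical token values illustrating the two shapes a shim must
# bridge (field names follow the structures shown above).

# SX-native shape: merged type/punctuation taxonomy, no .op flag.
native = {"type": "paren-open", "value": "(", "pos": 0}

# Upstream shape: SCREAMING_SNAKE_CASE type plus a dedicated op flag.
upstream = {"type": "L_PAREN", "value": "(", "op": True}

# The op flag is derivable from the native type alone, so the
# translation can be a pure per-token function.
def is_punctuation(native_type: str) -> bool:
    return native_type in {
        "op", "paren-open", "paren-close", "bracket-open",
        "bracket-close", "brace-open", "brace-close", "comma", "dot",
    }

assert is_punctuation(native["type"]) == upstream["op"]
```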
+Rather than rewrite the tokenizer, add a translation layer. + +### Type map (SX-native → upstream) + +``` +"ident" → "IDENTIFIER" (keywords too: see note) +"keyword" → "IDENTIFIER" (upstream tokenizes keywords as idents) +"number" → "NUMBER" +"string" → "STRING" +"class" → "CLASS_REF" (:value becomes ".a" with leading dot) +"id" → "ID_REF" (:value becomes "#a" with leading hash) +"attr" → "ATTRIBUTE_REF" +"style" → "STYLE_REF" +"selector" → "QUERY_REF" (used by tests? upstream calls it QUERY_REF) +"template" → one-shot: see templateMode below +"eof" → "EOF" with :value "<<>>" +"paren-open" → "L_PAREN" + :op true +"paren-close" → "R_PAREN" + :op true +"bracket-open" → "L_BRACKET" + :op true +"bracket-close" → "R_BRACKET" + :op true +"brace-open" → "L_BRACE" + :op true +"brace-close" → "R_BRACE" + :op true +"comma" → "COMMA" + :op true +"dot" → "PERIOD" + :op true +"op" → name-by-value lookup (see below) + :op true +``` + +A tiny op-name table (15–25 entries) maps `:value` strings to the +upstream token type name: + +``` +"+" → "PLUS" +"-" → "MINUS" +"*" → "MULTIPLY" +"/" → "SLASH" ; current code uses "op"/"/" +":" → "COLON" ; not yet emitted as own token — fix below +"%" → "PERCENT" +"|" → "PIPE" +"!" → "EXCLAMATION" +"?" → "QUESTION" +"#" → "POUND" +"&" → "AMPERSAND" +";" → "SEMI" +"=" → "EQUALS" +"<" → "L_ANG" +">" → "R_ANG" +"<=" → "LTE_ANG" +">=" → "GTE_ANG" +"==" → "EQ" +"===" → "EQQ" +"\\" → "BACKSLASH" +"'s" → "APOSTROPHE_S" ; not a true operator — elided from test 12 +``` + +### Conversion entry point + +``` +(define (hs-api-tokens src template-mode?) + (let ((raw (if template-mode? + (hs-tokenize-template src) ; new variant + (hs-tokenize src)))) + {:source src + :list (map hs-raw->api-token raw) + :pos 0})) +``` + +`hs-raw->api-token` is a pure mapping function using the tables above. +An EOF token is always present at the end (the current tokenizer +already emits one). 
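Before porting the map to SX, it can be prototyped in Python (the generator's language). A minimal sketch: the tables restate the maps above, while `raw_to_api_token` and the assumption that native `class`/`id` tokens carry the bare name (sigil restored here) are illustrative and should be verified against `tokenizer.sx`:

```python
# Illustrative Python prototype of hs-raw->api-token; the real
# implementation is SX. Assumes native class/id tokens carry the bare
# name (the sigil is restored here) -- verify against tokenizer.sx.
TYPE_MAP = {
    "ident": "IDENTIFIER", "keyword": "IDENTIFIER",  # both -> IDENTIFIER
    "number": "NUMBER", "string": "STRING",
    "class": "CLASS_REF", "id": "ID_REF",
    "attr": "ATTRIBUTE_REF", "style": "STYLE_REF",
    "paren-open": "L_PAREN", "paren-close": "R_PAREN",
    "bracket-open": "L_BRACKET", "bracket-close": "R_BRACKET",
    "brace-open": "L_BRACE", "brace-close": "R_BRACE",
    "comma": "COMMA", "dot": "PERIOD", "eof": "EOF",
}
OP_NAMES = {  # native "op" tokens: type name looked up by value
    "+": "PLUS", "-": "MINUS", "*": "MULTIPLY", "/": "SLASH",
    ":": "COLON", "%": "PERCENT", "|": "PIPE", "!": "EXCLAMATION",
    "?": "QUESTION", "#": "POUND", "&": "AMPERSAND", ";": "SEMI",
    "=": "EQUALS", "<": "L_ANG", ">": "R_ANG", "<=": "LTE_ANG",
    ">=": "GTE_ANG", "==": "EQ", "===": "EQQ", "\\": "BACKSLASH",
}
PUNCTUATION = {  # native types whose API tokens get op: true
    "op", "paren-open", "paren-close", "bracket-open", "bracket-close",
    "brace-open", "brace-close", "comma", "dot",
}
SIGILS = {"class": ".", "id": "#"}

def raw_to_api_token(raw: dict) -> dict:
    kind, value = raw["type"], raw["value"]
    api_type = OP_NAMES[value] if kind == "op" else TYPE_MAP[kind]
    if kind in SIGILS:
        value = SIGILS[kind] + value   # ".a" / "#a"
    elif kind == "eof":
        value = "<<>>"
    return {"type": api_type, "value": value, "op": kind in PUNCTUATION}
```

A one-line spike such as `raw_to_api_token({"type": "op", "value": "+", "pos": 0})` returning `{"type": "PLUS", "value": "+", "op": True}` is enough to lock the tables down before the SX port.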
### Token gaps to fix

Three issues turn up while writing the map. The second is a one-site
fix in `tokenizer.sx`, the third adds a scanner variant, and the
first needs no tokenizer change at all:

- **Bare `:`**. `:` normally combines with a following `ident-start`
  to form a local reference (`:name`); upstream emits a standalone
  `COLON` token otherwise. The bare form is exercised only via the
  operator table in test 12. The current code already emits
  `"op" ":"` when the next char is not an ident start, so the
  op-name map above is all that is needed.
- **`===` and `==`** — current tokenizer emits `"op" "="` plus another
  `"="`, not `"=="`. Extend the `=`/`!`/`<`/`>` lookahead clause to
  also match a third `=` after `==`.
- **Template mode** — upstream `tokenize(src, true)` splits
  backtick-templates into their lexical parts rather than the single
  `"template"` token the current code emits. Add a second top-level
  scanner `hs-tokenize-template` used only for the API wrapper; the
  primary parser continues to call `hs-tokenize` unchanged. The
  template-mode tests (1, 15) only require character-level emission
  of the `" $ { inner } "` sequence — no semantic re-use by the
  parser.

### Stream consumer helpers

```
(define (hs-stream-token s i)
  (let ((list (dict-get s :list))
        (pos (dict-get s :pos)))
    (or (nth list (+ pos i))
        (hs-eof-sentinel))))

(define (hs-stream-consume s)
  (let ((tok (hs-stream-token s 0)))
    (when (not (= (hs-token-type tok) "EOF"))
      (dict-set! s :pos (+ (dict-get s :pos) 1)))
    tok))

(define (hs-stream-has-more s)
  (not (= (hs-token-type (hs-stream-token s 0)) "EOF")))
```

## 5. Test mock strategy

All 17 tests are `complexity: eval-only` with empty `html`. They do
not need the DOM runner — they only need SX expressions that resolve
to the same values the JS asserts check.
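The cursor semantics those expressions rely on (lookahead, consume, EOF sentinel) can be cross-checked against a small Python model of the stream dict; the `Stream` class and its hand-written token dicts are stand-ins, not real tokenizer output:

```python
# Python stand-in for the SX stream dict {:source :list :pos};
# token dicts below are hand-written, not real tokenizer output.
EOF = {"type": "EOF", "value": "<<>>", "op": False}

class Stream:
    def __init__(self, tokens):
        self.list, self.pos = tokens, 0

    def token(self, i):
        # Lookahead relative to the cursor; EOF sentinel past the end.
        j = self.pos + i
        return self.list[j] if j < len(self.list) else EOF

    def consume(self):
        tok = self.token(0)
        if tok["type"] != "EOF":
            self.pos += 1          # cursor only advances on real tokens
        return tok

    def has_more(self):
        return self.token(0)["type"] != "EOF"

# Hand-written tokens for "a 1 + 1" (mirrors the worked example above).
s = Stream([
    {"type": "IDENTIFIER", "value": "a", "op": False},
    {"type": "NUMBER", "value": "1", "op": False},
    {"type": "PLUS", "value": "+", "op": True},
    {"type": "NUMBER", "value": "1", "op": False},
])
assert s.token(4)["value"] == "<<>>"   # lookahead past end -> sentinel
assert s.consume()["value"] == "a"
assert s.token(0)["value"] == "1"      # cursor advanced
```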
Add a generator pattern to `generate-sx-tests.py`, slotted into
`generate_eval_only_test` or as a new pre-pass ahead of it, that
matches bodies containing `_hyperscript.internals.tokenizer.tokenize`.
The pattern tree, by precedence (`<stream>` and `<tok>` stand for the
SX names bound for the matched JS expressions):

1. `tokenize(SRC[, true])` → emit an SX `let` that binds a
   fresh stream name to `(hs-tokens-of SRC [:template])`.
2. `.consumeToken()` → `(hs-stream-consume <stream>)`.
3. `.token(N)` → `(hs-stream-token <stream> N)`.
4. `.list` → `(dict-get <stream> :list)`.
5. `.list.length` → `(len (dict-get <stream> :list))`.
6. `.list[N]` → `(nth (dict-get <stream> :list) N)`.
7. `.hasMore()` → `(hs-stream-has-more <stream>)`.
8. `.type` / `.value` / `.op` → `(hs-token-type/value/op? <tok>)`.
9. `expect(X).toBe(V)` and `expect(X).toEqual({...})` → `assert=`.
10. `try { ... } catch (e) { errors.push(e.message) }` plus
    `expect(msg).toMatch(/pat/)` → `(assert (regex-match? pat (guard-msg (hs-stream-consume s))))`.
    A tiny `guard-msg` helper runs the expr under `guard` and returns
    the caught error's message.

The generator should emit a new deftest prologue (placeholders in
angle brackets):

```
(deftest "<test name>"
  (let ((s1 (hs-tokens-of "<src1>"))
        (s2 (hs-tokens-of "<src2>" :template)))
    (assert= (hs-token-type (hs-stream-consume s1)) "NUMBER")
    ...))
```

When the test builds a `results` object/array of `{type, value}`
dicts, emit one `assert=` per field instead of materialising a dict —
simpler to debug when it fails. `toEqual({type: "X", value: "Y"})`
becomes two `assert=` lines.

The generator continues to bail (`return None` / emit
`SKIP (untranslated)`) if any unrecognised JS shape appears; the 17
bodies all fit the grammar above.

## 6. Test delta estimate

| # | Test | Feasible? | Blockers |
|---|------|-----------|----------|
| 1 | handles $ in template properly | yes | templateMode impl |
| 2 | handles all special escapes | yes | extend `read-string` escapes (+4 cases) |
| 3 | handles basic token types | yes | type-map + scientific-notation float (already in `read-number`?
verify) |
| 4 | handles class identifiers | yes | type-map + `.list[i]` access |
| 5 | handles comments properly | yes | type-map; `//` comments already handled, `--` not — add |
| 6 | handles hex escapes | yes | new `\xNN` escape + structured error |
| 7 | handles id references | yes | mirror of 4 |
| 8 | handles identifiers properly | yes | type-map only |
| 9 | handles identifiers with numbers | yes | type-map only |
| 10 | handles look ahead property | yes | EOF sentinel with `"<<>>"` value |
| 11 | handles numbers properly | yes | fix `1.1.1` scan (stop at second dot); already appears OK |
| 12 | handles operators properly | yes | op-name map, `==`/`===`/`<=`/`>=` lookahead |
| 13 | handles strings properly | yes | structured unterminated-string error |
| 14 | handles strings properly 2 | yes | subset of 13 |
| 15 | handles template bootstrap | yes | templateMode lexical emission |
| 16 | handles whitespace properly | yes | type-map only |
| 17 | string interpolation isnt surprising | already translatable | `read-template` `\$`/`\${` escape |

Expected: **+16 to +17**. Test 17 is already runnable (it is the one
non-eval-only case) but depends on template-escape handling that lives
in the same commit.

## 7. Risks / open questions

- **Position tracking.** The tokenizer currently stores `:pos P`. Tests
  do not read it, so we leave it alone. E38 (SourceInfo API) will add
  `start/end/line/column`; when that lands, `hs-raw->api-token` should
  copy those through.
- **Template mode churn.** Introducing `hs-tokenize-template` risks
  divergence from the main tokenizer. Mitigation: factor shared scan
  helpers (whitespace, identifier, operator dispatch) into named
  functions both variants call; keep the template variant a thin
  wrapper that only overrides the backtick handler.
- **Keyword vs identifier type.** The current code tags reserved words
  as `"keyword"`; upstream tags every bare word as `IDENTIFIER`.
The + conformance tests always expect `IDENTIFIER`. Mapping both + `"keyword"` and `"ident"` to `"IDENTIFIER"` in the API layer is + safe and does **not** affect the parser, which consumes the raw + stream, not the API stream. +- **Mutable streams.** The API stream is intentionally mutable (cursor + advances on `consumeToken`). SX dicts are mutable via `dict-set!` + today; this is consistent with the rest of the hyperscript runtime, + which uses mutable dicts in `hs-activate!` and the event loop. +- **Do any existing tests depend on token shape?** `parser.sx` reads + `:type :value :pos`. It must **not** see the API-shaped dicts. The + API is strictly additive — `hs-tokenize` is unchanged; `hs-parse` + continues to consume its output directly. Only `hs-api-tokens` + (and its consumers) sees the upstream-shaped dicts. +- **Error-message contract.** Upstream throws on unterminated strings + and bad hex escapes. We currently return an EOF and emit a + trailing fragment. Adding a thrown error is new behaviour; confirm + the parser callers in `hs-compile` still produce useful diagnostics + when the tokenizer raises rather than eats the input. +- **`.list` indexing semantics.** Upstream tests read `.list[3]` and + `.list[4]` directly — these indices reference upstream's raw + token layout. If our SX tokenizer emits a slightly different + layout (e.g. extra whitespace-related tokens, or none where + upstream has one), the index tests fail even though `.type`/`.value` + are correct. Verify on a spike before committing: run + `(hs-tokens-of "(a).a")` and check that index 4 is the + `CLASS_REF`. If indices disagree, add a normalization pass that + strips tokens upstream omits. + +## 8. Implementation checklist + +Ordered smallest-first; each is its own commit. + +1. **Add `hs-api-tokens` and token helpers** (`lib/hyperscript/runtime.sx`). + Includes `hs-raw->api-token`, type-map, op-name table, + `hs-stream-token/consume/has-more`, EOF sentinel with + `"<<>>"` value. 
No test delta yet — API-only.
2. **Extend string-escape table** in `read-string` (tokenizer):
   add `\b \f \r \v \xNN`, keep existing `\n \t \\ `. Emit
   structured error messages `"Invalid hexadecimal escape: ..."` and
   `"Unterminated string"`. In the same commit, add the `\$` / `\${`
   escapes to `read-template`. Unlocks tests 2, 6, 13, 14 and the
   escape half of test 17.
3. **Add `==` / `===` / `<=` / `>=` lookahead** in the tokenizer's
   `scan!`. Currently only `[=!<>]=` is matched. Unlocks test 12.
4. **Add `--` line-comment support** to `scan!`. Currently only `//`
   (through selector disambiguation) is handled. Unlocks test 5.
5. **Add `hs-tokenize-template`** variant for template-bootstrap
   lexical mode. Shared scan helpers extracted. Unlocks tests 1, 15.
6. **Generator pattern** in `tests/playwright/generate-sx-tests.py`:
   recognise `_hyperscript.internals.tokenizer.tokenize(src[, true])`
   + consumer chain, emit SX `deftest` using the helpers from step 1.
   Unlocks the 16 remaining eval-only tests (test 17 already has DOM
   shape).
7. **Regenerate `spec/tests/test-hyperscript-behavioral.sx`** and run
   `mcp__hs-test__hs_test_run(suite="hs-upstream-core/tokenizer")`.
   Expected: 17/17, with test 17 also passing thanks to step 2's
   escape fixes (it depends on `\$` / `\${` in `read-template`).
8. **Update** `plans/hs-conformance-to-100.md` row 37 to
   `done (+17)` and tick the scoreboard in the same commit.

Work stays inside `lib/hyperscript/**`, `shared/static/wasm/sx/hs-*`,
`tests/playwright/generate-sx-tests.py`, and the two plan files —
matching the scope rule in the conformance plan. `shared/static/wasm/
sx/hs-runtime.sx` must be re-copied after each runtime edit.
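As a closing sketch of the generator pattern (step 6), the consumer-chain half of the translation can be a single regex rewrite; `translate_chain`, its regex, and the default stream name are illustrative assumptions, not the final `generate-sx-tests.py` code, and cover only a subset of the pattern tree in section 5:

```python
import re
from typing import Optional

# Illustrative subset of the section-5 pattern tree: translate one JS
# consumer chain over a stream variable into an SX expression.
# Not the real generate-sx-tests.py code.
def translate_chain(js: str, stream: str = "s") -> Optional[str]:
    m = re.fullmatch(
        rf"{stream}\.(consumeToken\(\)|token\((\d+)\)|hasMore\(\)|list\.length)"
        r"(\.(type|value|op))?",
        js,
    )
    if m is None:
        return None                    # unrecognised shape -> caller bails
    head, idx, _, field = m.groups()
    if head == "hasMore()":
        return f"(hs-stream-has-more {stream})"
    if head == "list.length":
        return f"(len (dict-get {stream} :list))"
    if head == "consumeToken()":
        sx = f"(hs-stream-consume {stream})"
    else:  # token(N)
        sx = f"(hs-stream-token {stream} {idx})"
    if field == "type":
        sx = f"(hs-token-type {sx})"
    elif field == "value":
        sx = f"(hs-token-value {sx})"
    elif field == "op":
        sx = f"(hs-token-op? {sx})"
    return sx

assert translate_chain("s.consumeToken().value") == \
    "(hs-token-value (hs-stream-consume s))"
assert translate_chain("s.token(0).type") == \
    "(hs-token-type (hs-stream-token s 0))"
assert translate_chain("s.somethingElse()") is None   # bail -> SKIP
```

The `None` return mirrors the existing bail path: any shape outside the grammar keeps the test emitted as `SKIP (untranslated)`.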