# E37 — Tokenizer-as-API

Cluster 37 of `plans/hs-conformance-to-100.md`. 17 tests in `hs-upstream-core/tokenizer`. All 17 are emitted as `SKIP (untranslated)` by `tests/playwright/generate-sx-tests.py`: the JS bodies call `_hyperscript.internals.tokenizer.tokenize(...)` and inspect a token-stream surface the SX port does not expose.

Work breaks into: (1) an SX API over the existing `hs-tokenize` mimicking the upstream stream object; (2) a compatibility shim over token fields; (3) a generator pattern recognising `_hyperscript.internals.tokenizer.tokenize(src[, templateMode])`. No tokenizer-grammar rewrite is required. Position tracking (`start/end/line/column`) is scoped to E38 (SourceInfo API).

## 1. Failing tests

Every eval-only test calls `_hyperscript.internals.tokenizer.tokenize` plus one or more of `.token(i)`, `.consumeToken()`, `.hasMore()`, `.list`, `.type`, `.value`, `.op`.

1. **handles $ in template properly** — `tokenize('"', true).token(0).value` → `'"'`. templateMode + `token(i)`.
2. **handles all special escapes** — 6 × `tokenize('"\\X"').consumeToken().value` for `\b \f \n \r \t \v`.
3. **handles basic token types** — 15 asserts for `IDENTIFIER NUMBER CLASS_REF ID_REF STRING`; includes `1e6`, `1e-6`, `1.1e6`, `1.1e-6`; plus `.hasMore()`.
4. **handles class identifiers** — 9 `.a`-style; uses `.consumeToken()` and `.list[3]`/`.list[4]`.
5. **handles comments properly** — 13 asserts on `tokenize(src).list.length`; `--` / `//` to EOL emit nothing.
6. **handles hex escapes** — 3 `\\xNN` decodes + 4 error-path asserts matching `/Invalid hexadecimal escape/`.
7. **handles id references** — mirror of 4 for `#a` → `ID_REF`.
8. **handles identifiers properly** — whitespace + comment skipping between multiple `consumeToken()` calls.
9. **handles identifiers with numbers** — `f1oo / fo1o / foo1` → `IDENTIFIER`.
10. **handles look ahead property** — `tokenize("a 1 + 1").token(0..4)` → `["a" "1" "+" "1" "<<>>"]`.
11. **handles numbers properly** — 8 asserts incl. `1.1.1` → `NUMBER PERIOD NUMBER`.
12. **handles operators properly** — iterates 27 ops (`+ - * . \\ : % | ! ? # & ; , ( ) < > { } [ ] = <= >= == ===`) asserting `token.op === true` and `token.value === key`.
13. **handles strings properly** — single/double quotes, embedded other-quote, escaped same-quote, + two unterminated throws matching `/Unterminated string/`.
14. **handles strings properly 2** — subset of 13.
15. **handles template bootstrap** — 5 `tokenize(src, true)` cases asserting the lexical char-level stream (`"`, `$`, `{`, inner, `}`, `"`).
16. **handles whitespace properly** — 16 asserts on `.list.length` for space / `\n` / `\r` / `\t`.
17. **string interpolation isnt surprising** — DOM-shaped (not eval-only); asserts `\$`/`\${` escapes in templates. Touches `read-template`, not the stream API.

## 2. Upstream API shape

From `https://hyperscript.org/docs/#api` and `node_modules/hyperscript.org/src/_hyperscript.js`:

```js
const tokens = _hyperscript.internals.tokenizer.tokenize(src, templateMode?)
// → { list, source, hasMore, matchTokenType, token, consumeToken,
//     requireTokenType, ... }
tokens.list            // Array — lookahead window
tokens.source          // original src string
tokens.token(i)        // i-th un-consumed token (0 = current); returns
                       // { type: "EOF", value: "<<>>" } past end
tokens.consumeToken()  // shift + return; throws on empty for required tokens
tokens.hasMore()       // true if a non-EOF token remains
tokens.matchTokenType(type) / requireTokenType(type) / etc.
```

Each `Token` is:

```js
{
  type: "IDENTIFIER" | "NUMBER" | "STRING" | "CLASS_REF" | "ID_REF" | "EOF"
      | "PLUS" | "MINUS" | ... /* op names */,
  value: string,
  op: boolean,    // true for punctuation/operator tokens
  start: number,  // char offset
  end: number,
  line: number,
  column: number,
  source: string, // reference to full src
}
```

The conformance tests only read `type`, `value`, `op`, and occasionally random-index into `.list`.
They never read `start/end/line/column`, so position tracking is **not** required for cluster E37.

## 3. Proposed SX surface

Add three things to `lib/hyperscript/runtime.sx` (exposed by name, so SX test bodies can call them directly through `eval-hs` or `assert=`):

```
(hs-tokens-of src)           ; => dict — new token-stream object
(hs-tokens-of src :template) ; templateMode variant
(hs-token-type tok)          ; upstream-style type name
(hs-token-value tok)         ; string value
(hs-token-op? tok)           ; bool
```

A token stream is a mutable dict:

```
{ :source src
  :list (list-of-tokens) ; upstream-shaped, :type :value :op
  :pos 0 }               ; cursor into :list
```

With three pure-SX consumer helpers:

```
(hs-stream-token stream i)  ; lookahead; returns EOF sentinel past end
(hs-stream-consume stream)  ; returns current token, advances :pos
(hs-stream-has-more stream) ; not EOF and pos < len
```

### Worked example

```
(let ((s (hs-tokens-of "1.1")))
  (hs-token-type (hs-stream-consume s)))          ; => "NUMBER"

(let ((s (hs-tokens-of "a 1 + 1")))
  (list (hs-token-value (hs-stream-token s 0))    ; "a"
        (hs-token-value (hs-stream-token s 4))))  ; "<<>>"
```

All helpers are ordinary `define`s — no platform primitives, no FFI. The generator emits them as bare calls inside `deftest` bodies.

## 4. Runtime architecture

The existing `hs-tokenize` emits tokens with:

```
{ :type "keyword" | "ident" | "number" | "string" | "class" | "id"
        | "op" | "paren-open" | ... | "eof"
  :value V
  :pos P }
```

The upstream contract uses `SCREAMING_SNAKE_CASE` type names and a dedicated boolean `.op` flag rather than a merged type/punctuation taxonomy. Rather than rewrite the tokenizer, add a translation layer.
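Before the exact tables are pinned down, the shape of that layer is worth seeing in miniature. The sketch below is a Python model, illustrative only: the real implementation is the SX `hs-raw->api-token`, and the map shown here is an abbreviated subset of the full type-map table.

```python
# Illustrative Python model of the raw -> API token translation layer.
# Names and the table subset are for demonstration; the real code is SX.

EOF_TOKEN = {"type": "EOF", "value": "<<>>", "op": False}

# Abbreviated subset of the SX-native -> upstream type map.
TYPE_MAP = {
    "ident": "IDENTIFIER",
    "keyword": "IDENTIFIER",  # upstream tokenizes keywords as idents
    "number": "NUMBER",
    "string": "STRING",
    "class": "CLASS_REF",
    "dot": "PERIOD",
}
# Raw types that carry :op true on the upstream side.
PUNCTUATION = {"dot", "comma", "op", "paren-open", "paren-close"}
# Abbreviated subset of the op-name table.
OP_NAMES = {"+": "PLUS", "-": "MINUS", "==": "EQ", "===": "EQQ"}

def raw_to_api_token(raw: dict) -> dict:
    """Map one SX-native token dict onto an upstream-shaped dict."""
    t, v = raw["type"], raw["value"]
    if t == "eof":
        return dict(EOF_TOKEN)  # EOF carries the "<<>>" sentinel value
    if t == "op":
        return {"type": OP_NAMES[v], "value": v, "op": True}
    return {"type": TYPE_MAP[t], "value": v, "op": t in PUNCTUATION}

# A raw stream for "1 + 1" (EOF already appended by the tokenizer):
raw = [
    {"type": "number", "value": "1"},
    {"type": "op", "value": "+"},
    {"type": "number", "value": "1"},
    {"type": "eof", "value": ""},
]
api = [raw_to_api_token(tok) for tok in raw]
# api[1] == {"type": "PLUS", "value": "+", "op": True}
# api[3] == {"type": "EOF", "value": "<<>>", "op": False}
```

The point of the model: the translation is a pure per-token map plus two lookup tables, so it can sit entirely outside the tokenizer.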
### Type map (SX-native → upstream)

```
"ident"         → "IDENTIFIER"   (keywords too: see note)
"keyword"       → "IDENTIFIER"   (upstream tokenizes keywords as idents)
"number"        → "NUMBER"
"string"        → "STRING"
"class"         → "CLASS_REF"    (:value becomes ".a" with leading dot)
"id"            → "ID_REF"       (:value becomes "#a" with leading hash)
"attr"          → "ATTRIBUTE_REF"
"style"         → "STYLE_REF"
"selector"      → "QUERY_REF"    (used by tests? upstream calls it QUERY_REF)
"template"      → one-shot: see templateMode below
"eof"           → "EOF" with :value "<<>>"
"paren-open"    → "L_PAREN"   + :op true
"paren-close"   → "R_PAREN"   + :op true
"bracket-open"  → "L_BRACKET" + :op true
"bracket-close" → "R_BRACKET" + :op true
"brace-open"    → "L_BRACE"   + :op true
"brace-close"   → "R_BRACE"   + :op true
"comma"         → "COMMA"     + :op true
"dot"           → "PERIOD"    + :op true
"op"            → name-by-value lookup (see below) + :op true
```

A tiny op-name table (15–25 entries) maps `:value` strings to the upstream token type name:

```
"+"   → "PLUS"
"-"   → "MINUS"
"*"   → "MULTIPLY"
"/"   → "SLASH"         ; current code uses "op"/"/"
":"   → "COLON"         ; not yet emitted as its own token — fix below
"%"   → "PERCENT"
"|"   → "PIPE"
"!"   → "EXCLAMATION"
"?"   → "QUESTION"
"#"   → "POUND"
"&"   → "AMPERSAND"
";"   → "SEMI"
"="   → "EQUALS"
"<"   → "L_ANG"
">"   → "R_ANG"
"<="  → "LTE_ANG"
">="  → "GTE_ANG"
"=="  → "EQ"
"===" → "EQQ"
"\\"  → "BACKSLASH"
"'s"  → "APOSTROPHE_S"  ; not a true operator — elided from test 12
```

### Conversion entry point

```
(define (hs-api-tokens src template-mode?)
  (let ((raw (if template-mode?
                 (hs-tokenize-template src) ; new variant
                 (hs-tokenize src))))
    {:source src
     :list (map hs-raw->api-token raw)
     :pos 0}))
```

`hs-raw->api-token` is a pure mapping function using the tables above. An EOF token is always present at the end (the current tokenizer already emits one).

### Token gaps to fix

Three issues turn up while writing the map; all are trivial one-site fixes in `tokenizer.sx`:

- **`:` is currently consumed as part of the local prefix (`:name`).**
  Upstream tests expect bare `:` alone to produce `COLON`; only when followed by `ident-start` does it combine. The test suite does not exercise the bare form (it is only covered by the operator table in test 12). Fix by emitting `"op" ":"` when the next char is not an ident start — already what the code does; the op-name map above covers it.
- **`===` and `==`** — the current tokenizer emits `"op" "="` plus another `"="`, not `"=="`. Extend the `=`/`!`/`<`/`>` lookahead clause to also match a third `=` after `==`.
- **Template mode** — upstream `tokenize(src, true)` splits backtick-templates into their lexical parts rather than the single `"template"` token the current code emits. Add a second top-level scanner `hs-tokenize-template` used only for the API wrapper; the primary parser continues to call `hs-tokenize` unchanged. The template-mode tests (1, 15) only require character-level emission of the `" $ { inner } "` sequence — no semantic re-use by the parser.

### Stream consumer helpers

```
(define (hs-stream-token s i)
  (let ((list (dict-get s :list))
        (pos  (dict-get s :pos)))
    (or (nth list (+ pos i))
        (hs-eof-sentinel))))

(define (hs-stream-consume s)
  (let ((tok (hs-stream-token s 0)))
    (when (not (= (hs-token-type tok) "EOF"))
      (dict-set! s :pos (+ (dict-get s :pos) 1)))
    tok))

(define (hs-stream-has-more s)
  (not (= (hs-token-type (hs-stream-token s 0)) "EOF")))
```

## 5. Test mock strategy

All 17 tests are `complexity: eval-only` with empty `html`. They do not need the DOM runner — they only need SX expressions that resolve to the same values the JS asserts check.

Add a generator pattern to `generate-sx-tests.py`, slotted into `generate_eval_only_test` or as a new pre-pass ahead of it, that matches bodies containing `_hyperscript.internals.tokenizer.tokenize`. The pattern tree, by precedence:

1. `tokenize(SRC[, true])` → emit an SX `let` that binds a fresh stream name to `(hs-tokens-of SRC [:template])`.
2. `.consumeToken()` → `(hs-stream-consume s)`.
3. `.token(N)` → `(hs-stream-token s N)`.
4. `.list` → `(dict-get s :list)`.
5. `.list.length` → `(len (dict-get s :list))`.
6. `.list[N]` → `(nth (dict-get s :list) N)`.
7. `.hasMore()` → `(hs-stream-has-more s)`.
8. `.type` / `.value` / `.op` → `(hs-token-type tok)` / `(hs-token-value tok)` / `(hs-token-op? tok)`.
9. `expect(X).toBe(V)` and `expect(X).toEqual({...})` → `assert=`.
10. `try { ... } catch (e) { errors.push(e.message) }` plus `expect(msg).toMatch(/pat/)` → `(assert (regex-match? pat (guard-msg (hs-stream-consume s))))`. A tiny `guard-msg` helper runs the expr under `guard` and returns the caught error's message.

The generator should emit a new deftest prologue:

```
(deftest ""
  (let ((s1 (hs-tokens-of ""))
        (s2 (hs-tokens-of "" :template)))
    (assert= (hs-token-type (hs-stream-consume s1)) "NUMBER")
    ...))
```

When the test builds a `results` object/array of `{type, value}` dicts, emit one `assert=` per field instead of materialising a dict — simpler to debug when it fails. `toEqual({type: "X", value: "Y"})` becomes two `assert=` lines.

The generator continues to bail (`return None` / emit `SKIP (untranslated)`) if any unrecognised JS shape appears; the 17 bodies all fit the grammar above.

## 6. Test delta estimate

| # | Test | Feasible? | Blockers |
|---|------|-----------|----------|
| 1 | handles $ in template properly | yes | templateMode impl |
| 2 | handles all special escapes | yes | extend `read-string` escapes (+4 cases) |
| 3 | handles basic token types | yes | type-map + scientific-notation float (already in `read-number`? verify) |
| 4 | handles class identifiers | yes | type-map + `.list[i]` access |
| 5 | handles comments properly | yes | type-map; `//` comments already handled, `--` not — add |
| 6 | handles hex escapes | yes | new `\xNN` escape + structured error |
| 7 | handles id references | yes | mirror of 4 |
| 8 | handles identifiers properly | yes | type-map only |
| 9 | handles identifiers with numbers | yes | type-map only |
| 10 | handles look ahead property | yes | EOF sentinel with `"<<>>"` value |
| 11 | handles numbers properly | yes | fix `1.1.1` scan (stop at second dot); already appears OK |
| 12 | handles operators properly | yes | op-name map, `==`/`===`/`<=`/`>=` lookahead |
| 13 | handles strings properly | yes | structured unterminated-string error |
| 14 | handles strings properly 2 | yes | subset of 13 |
| 15 | handles template bootstrap | yes | templateMode lexical emission |
| 16 | handles whitespace properly | yes | type-map only |
| 17 | string interpolation isnt surprising | already translatable | needs `read-template` `\$`/`\${` escape |

Expected: **+16 to +17**. Test 17 is already runnable (it is the one non-eval-only case) but depends on template-escape handling that lives in the same commit.

## 7. Risks / open questions

- **Position tracking.** The tokenizer currently stores `:pos P`. Tests do not read it, so we leave it alone. E38 (SourceInfo API) will add `start/end/line/column`; when that lands, `hs-raw->api-token` should copy those through.
- **Template mode churn.** Introducing `hs-tokenize-template` risks divergence from the main tokenizer. Mitigation: factor shared scan helpers (whitespace, identifier, operator dispatch) into named functions both variants call; keep the template variant a thin wrapper that only overrides the backtick handler.
- **Keyword vs identifier type.** The current code tags reserved words as `"keyword"`; upstream tags every bare word as `IDENTIFIER`. The conformance tests always expect `IDENTIFIER`.
  Mapping both `"keyword"` and `"ident"` to `"IDENTIFIER"` in the API layer is safe and does **not** affect the parser, which consumes the raw stream, not the API stream.
- **Mutable streams.** The API stream is intentionally mutable (the cursor advances on `consumeToken`). SX dicts are mutable via `dict-set!` today; this is consistent with the rest of the hyperscript runtime, which uses mutable dicts in `hs-activate!` and the event loop.
- **Do any existing tests depend on token shape?** `parser.sx` reads `:type :value :pos`. It must **not** see the API-shaped dicts. The API is strictly additive — `hs-tokenize` is unchanged; `hs-parse` continues to consume its output directly. Only `hs-api-tokens` (and its consumers) sees the upstream-shaped dicts.
- **Error-message contract.** Upstream throws on unterminated strings and bad hex escapes. We currently return an EOF and emit a trailing fragment. Adding a thrown error is new behaviour; confirm the parser callers in `hs-compile` still produce useful diagnostics when the tokenizer raises rather than eats the input.
- **`.list` indexing semantics.** Upstream tests read `.list[3]` and `.list[4]` directly — these indices reference upstream's raw token layout. If our SX tokenizer emits a slightly different layout (e.g. extra whitespace-related tokens, or none where upstream has one), the index tests fail even though `.type`/`.value` are correct. Verify on a spike before committing: run `(hs-tokens-of "(a).a")` and check that index 4 is the `CLASS_REF`. If indices disagree, add a normalization pass that strips tokens upstream omits.

## 8. Implementation checklist

Ordered smallest-first; each is its own commit.

1. **Add `hs-api-tokens` and token helpers** (`lib/hyperscript/runtime.sx`). Includes `hs-raw->api-token`, the type map, the op-name table, `hs-stream-token/consume/has-more`, and the EOF sentinel with `"<<>>"` value. No test delta yet — API-only.
2. **Extend the string-escape table** in `read-string` (tokenizer): add `\b \f \r \v \xNN`; keep the existing `\n \t \\`. Emit the structured error messages `"Invalid hexadecimal escape: ..."` and `"Unterminated string"`. Unlocks tests 2, 6, 13, 14.
3. **Add `==` / `===` / `<=` / `>=` lookahead** in the tokenizer's scan!. Currently only `[=!<>]=` is matched. Unlocks test 12.
4. **Add `--` line-comment support** to scan!. Currently only `//` (through selector disambiguation) is handled. Unlocks test 5.
5. **Add `hs-tokenize-template`**, the variant for template-bootstrap lexical mode, with the shared scan helpers extracted. Unlocks tests 1, 15.
6. **Generator pattern** in `tests/playwright/generate-sx-tests.py`: recognise `_hyperscript.internals.tokenizer.tokenize(src[, true])` plus the consumer chain, and emit an SX `deftest` using the helpers from step 1. Unlocks the 16 remaining eval-only tests (test 17 already has DOM shape).
7. **Regenerate `spec/tests/test-hyperscript-behavioral.sx`** and run `mcp__hs-test__hs_test_run(suite="hs-upstream-core/tokenizer")`. Expected: 17/17, with test 17 also passing thanks to step 2's escape fixes (it depends on `\$` / `\${` in `read-template`).
8. **Update** `plans/hs-conformance-to-100.md` row 37 to `done (+17)` and tick the scoreboard in the same commit.

Work stays inside `lib/hyperscript/**`, `shared/static/wasm/sx/hs-*`, `tests/playwright/generate-sx-tests.py`, and the two plan files — matching the scope rule in the conformance plan. `shared/static/wasm/sx/hs-runtime.sx` must be re-copied after each runtime edit.