# E37 — Tokenizer-as-API
Cluster 37 of `plans/hs-conformance-to-100.md`. 17 tests in
`hs-upstream-core/tokenizer`. All 17 are emitted as `SKIP
(untranslated)` by `tests/playwright/generate-sx-tests.py`: the JS
bodies call `_hyperscript.internals.tokenizer.tokenize(...)` and
inspect a token-stream surface the SX port does not expose.
Work breaks into: (1) an SX API over the existing `hs-tokenize`
mimicking the upstream stream object; (2) a compatibility shim over
token fields; (3) a generator pattern recognising
`_hyperscript.internals.tokenizer.tokenize(src[, templateMode])`. No
tokenizer-grammar rewrite is required. Position tracking
(`start/end/line/column`) is scoped to E38 (SourceInfo API).
## 1. Failing tests
Every eval-only test calls `_hyperscript.internals.tokenizer.tokenize`
plus one or more of `.token(i)`, `.consumeToken()`, `.hasMore()`,
`.list`, `.type`, `.value`, `.op`.
1. **handles $ in template properly** — `tokenize('"', true).token(0).value` → `'"'`. templateMode + `token(i)`.
2. **handles all special escapes** — 6 × `tokenize('"\\X"').consumeToken().value` for `\b \f \n \r \t \v`.
3. **handles basic token types** — 15 asserts for `IDENTIFIER NUMBER CLASS_REF ID_REF STRING`; includes `1e6`, `1e-6`, `1.1e6`, `1.1e-6`; plus `.hasMore()`.
4. **handles class identifiers** — 9 `.a`-style asserts; uses `.consumeToken()` and `.list[3]/.list[4]`.
5. **handles comments properly** — 13 asserts on `tokenize(src).list.length`; `--` / `//` to EOL emit nothing.
6. **handles hex escapes** — 3 `\\xNN` decodes + 4 error-path asserts matching `/Invalid hexadecimal escape/`.
7. **handles id references** — mirror of 4 for `#a` → `ID_REF`.
8. **handles identifiers properly** — whitespace + comment skipping between multiple `consumeToken()` calls.
9. **handles identifiers with numbers** — `f1oo / fo1o / foo1` → `IDENTIFIER`.
10. **handles look ahead property** — `tokenize("a 1 + 1").token(0..4)` → `["a" "1" "+" "1" "<<<EOF>>>"]`.
11. **handles numbers properly** — 8 asserts incl. `1.1.1` → `NUMBER PERIOD NUMBER`.
12. **handles operators properly** — iterates 27 ops (`+ - * . \\ : % | ! ? # & ; , ( ) < > { } [ ] = <= >= == ===`) asserting `token.op === true` and `token.value === key`.
13. **handles strings properly** — single/double quotes, embedded other-quote, escaped same-quote, plus two unterminated throws matching `/Unterminated string/`.
14. **handles strings properly 2** — subset of 13.
15. **handles template bootstrap** — 5 `tokenize(src, true)` cases asserting the lexical char-level stream (`"`, `$`, `{`, inner, `}`, `"`).
16. **handles whitespace properly** — 16 asserts on `.list.length` for space / `\n` / `\r` / `\t`.
17. **string interpolation isnt surprising** — DOM-shaped (not eval-only); asserts `\$`/`\${` escapes in templates. Touches `read-template`, not the stream API.
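For orientation, each numbered case above reduces to plain SX expressions once the §3 helpers exist. A hedged sketch of test 3's scientific-notation assert, using the proposed (not yet implemented) helper names:
```
;; Sketch only — helpers are the §3 proposals.
(let ((s (hs-tokens-of "1e6")))
  (assert= (hs-token-type (hs-stream-consume s)) "NUMBER")
  (assert (not (hs-stream-has-more s))))
```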
## 2. Upstream API shape
From `https://hyperscript.org/docs/#api` and
`node_modules/hyperscript.org/src/_hyperscript.js`:
```js
const tokens = _hyperscript.internals.tokenizer.tokenize(src, templateMode?)
// → { list, source, hasMore, matchTokenType, token, consumeToken,
//     requireTokenType, ... }
tokens.list           // Array<Token> — lookahead window
tokens.source         // original src string
tokens.token(i)       // i-th un-consumed token (0 = current); returns
                      // { type: "EOF", value: "<<<EOF>>>" } past end
tokens.consumeToken() // shift + return; throws on empty for required
tokens.hasMore()      // true if a non-EOF token remains
tokens.matchTokenType(type) / requireTokenType(type) / etc.
```
Each `Token` is:
```js
{
  type: "IDENTIFIER" | "NUMBER" | "STRING" | "CLASS_REF"
      | "ID_REF" | "EOF" | "PLUS" | "MINUS" | ... /* op names */,
  value: string,
  op: boolean,    // true for punctuation/operator tokens
  start: number,  // char offset
  end: number,
  line: number,
  column: number,
  source: string, // reference to full src
}
```
The conformance tests only read `type`, `value`, `op`, and occasionally
random-index into `.list`. They never read `start/end/line/column`, so
position tracking is **not** required for cluster E37.
## 3. Proposed SX surface
Add three things to `lib/hyperscript/runtime.sx` (exposed by name, so
SX test bodies can call them directly through `eval-hs` or `assert=`):
```
(hs-tokens-of src) ; => dict — new token-stream object
(hs-tokens-of src :template) ; templateMode variant
(hs-token-type tok) ; upstream-style type name
(hs-token-value tok) ; string value
(hs-token-op? tok) ; bool
```
A token stream is a mutable dict:
```
{ :source src
  :list   (list-of-tokens) ; upstream-shaped, :type :value :op
  :pos    0 }              ; cursor into :list
```
With three pure-SX consumer helpers:
```
(hs-stream-token stream i) ; lookahead; returns EOF sentinel past end
(hs-stream-consume stream) ; returns current token, advances :pos
(hs-stream-has-more stream) ; not EOF and pos < len
```
### Worked example
```
(let ((s (hs-tokens-of "1.1")))
  (hs-token-type (hs-stream-consume s)))         ; => "NUMBER"

(let ((s (hs-tokens-of "a 1 + 1")))
  (list (hs-token-value (hs-stream-token s 0))   ; "a"
        (hs-token-value (hs-stream-token s 4)))) ; "<<<EOF>>>"
```
All helpers are ordinary `define`s — no platform primitives, no FFI.
The generator emits them as bare calls inside `deftest` bodies.
## 4. Runtime architecture
The existing `hs-tokenize` emits tokens with:
```
{ :type  "keyword" | "ident" | "number" | "string" | "class" | "id"
         | "op" | "paren-open" | ... | "eof"
  :value V
  :pos   P }
```
The upstream contract uses `SCREAMING_SNAKE_CASE` and a dedicated
boolean `.op` flag rather than a merged type/punctuation taxonomy.
Rather than rewrite the tokenizer, add a translation layer.
### Type map (SX-native → upstream)
```
"ident" → "IDENTIFIER" (keywords too: see note)
"keyword" → "IDENTIFIER" (upstream tokenizes keywords as idents)
"number" → "NUMBER"
"string" → "STRING"
"class" → "CLASS_REF" (:value becomes ".a" with leading dot)
"id" → "ID_REF" (:value becomes "#a" with leading hash)
"attr" → "ATTRIBUTE_REF"
"style" → "STYLE_REF"
"selector" → "QUERY_REF" (used by tests? upstream calls it QUERY_REF)
"template" → one-shot: see templateMode below
"eof" → "EOF" with :value "<<<EOF>>>"
"paren-open" → "L_PAREN" + :op true
"paren-close" → "R_PAREN" + :op true
"bracket-open" → "L_BRACKET" + :op true
"bracket-close" → "R_BRACKET" + :op true
"brace-open" → "L_BRACE" + :op true
"brace-close" → "R_BRACE" + :op true
"comma" → "COMMA" + :op true
"dot" → "PERIOD" + :op true
"op" → name-by-value lookup (see below) + :op true
```
A tiny op-name table (15–25 entries) maps `:value` strings to the
upstream token type name:
```
"+" → "PLUS"
"-" → "MINUS"
"*" → "MULTIPLY"
"/" → "SLASH" ; current code uses "op"/"/"
":" → "COLON" ; not yet emitted as own token — fix below
"%" → "PERCENT"
"|" → "PIPE"
"!" → "EXCLAMATION"
"?" → "QUESTION"
"#" → "POUND"
"&" → "AMPERSAND"
";" → "SEMI"
"=" → "EQUALS"
"<" → "L_ANG"
">" → "R_ANG"
"<=" → "LTE_ANG"
">=" → "GTE_ANG"
"==" → "EQ"
"===" → "EQQ"
"\\" → "BACKSLASH"
"'s" → "APOSTROPHE_S" ; not a true operator — elided from test 12
```
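For concreteness, a sketch of how both tables could be declared in the
runtime. String-keyed dict literals and `dict-get` over string keys are
assumptions about SX here, and only a few rows are shown:
```
(define hs-api-type-map          ; abridged — full rows in the map above
  {"ident"  "IDENTIFIER"  "keyword" "IDENTIFIER"
   "number" "NUMBER"      "string"  "STRING"
   "class"  "CLASS_REF"   "id"      "ID_REF"})

(define hs-op-name-table         ; abridged — full rows in the table above
  {"+" "PLUS"  "-" "MINUS"  "==" "EQ"  "===" "EQQ"})
```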
### Conversion entry point
```
(define (hs-api-tokens src template-mode?)
  (let ((raw (if template-mode?
                 (hs-tokenize-template src) ; new variant
                 (hs-tokenize src))))
    {:source src
     :list (map hs-raw->api-token raw)
     :pos 0}))
```
`hs-raw->api-token` is a pure mapping function using the tables above.
An EOF token is always present at the end (the current tokenizer
already emits one).
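A minimal sketch of `hs-raw->api-token` itself, assuming the table
dicts above plus a `cond` form and a `string-append` helper (both
unverified assumptions about the SX stdlib); the punctuation rows are
elided:
```
(define (hs-raw->api-token tok)
  (let ((ty (dict-get tok :type))
        (v  (dict-get tok :value)))
    (cond
      ((= ty "eof")   {:type "EOF" :value "<<<EOF>>>" :op false})
      ((= ty "op")    {:type (dict-get hs-op-name-table v) :value v :op true})
      ((= ty "class") {:type "CLASS_REF" :value (string-append "." v) :op false})
      ((= ty "id")    {:type "ID_REF" :value (string-append "#" v) :op false})
      ;; paren/bracket/brace/comma/dot rows follow the ":op true" shape
      (else           {:type (dict-get hs-api-type-map ty) :value v :op false}))))
```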
### Token gaps to fix
Three issues turn up while writing the map. The first needs no scan
change on inspection; the other two are trivial one-site fixes in
`tokenizer.sx`:
- **`:` and the local prefix (`:name`).** Upstream produces a bare
  `COLON` for `:` alone; it combines into a local name only when the
  next char is an ident start. The suite exercises the bare form only
  through the operator table in test 12. The scanner already emits
  `"op" ":"` when the next char is not an ident start, so no tokenizer
  change is needed — the op-name map above covers it.
- **`===` and `==`** — the lookahead stops after two characters: `===`
  comes out as `"op" "=="` plus a stray `"="`, not `"==="`. Extend the
  `=`/`!`/`<`/`>` lookahead clause to also match a third `=` after
  `==`.
- **Template mode** — upstream `tokenize(src, true)` splits
backtick-templates into their lexical parts rather than the single
`"template"` token the current code emits. Add a second top-level
scanner `hs-tokenize-template` used only for the API wrapper; the
primary parser continues to call `hs-tokenize` unchanged. The
template-mode tests (1, 15) only require character-level emission
of the `" $ { inner } "` sequence — no semantic re-use by the
parser.
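For reference, the char-level contract the template variant must
satisfy, phrased against the proposed API. The input and token
boundaries are illustrative only; the exact upstream test sources are
not reproduced here:
```
;; Illustrative — exact inner tokenization is not pinned down here.
(let ((s (hs-tokens-of "\"${foo}\"" :template)))
  (map hs-token-value (dict-get s :list)))
;; => ("\"" "$" "{" "foo" "}" "\"" "<<<EOF>>>")
```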
### Stream consumer helpers
```
(define (hs-stream-token s i)
  (let ((list (dict-get s :list))
        (pos  (dict-get s :pos)))
    (or (nth list (+ pos i))
        (hs-eof-sentinel))))

(define (hs-stream-consume s)
  (let ((tok (hs-stream-token s 0)))
    (when (not (= (hs-token-type tok) "EOF"))
      (dict-set! s :pos (+ (dict-get s :pos) 1)))
    tok))

(define (hs-stream-has-more s)
  (not (= (hs-token-type (hs-stream-token s 0)) "EOF")))
```
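The `hs-eof-sentinel` referenced above is not defined elsewhere in this
doc; its shape just mirrors upstream's past-the-end token from §2
(`:op false` on EOF is an assumption — the tests never read `.op` there):
```
(define (hs-eof-sentinel)
  {:type "EOF" :value "<<<EOF>>>" :op false})
```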
## 5. Test mock strategy
All 17 tests are `complexity: eval-only` with empty `html`. They do
not need the DOM runner — they only need SX expressions that resolve
to the same values the JS asserts check.
Add a generator pattern to `generate-sx-tests.py`, slotted into
`generate_eval_only_test` or as a new pre-pass ahead of it, that
matches bodies containing `_hyperscript.internals.tokenizer.tokenize`.
The pattern tree, by precedence:
1. `tokenize(SRC[, true])` → emit an SX `let` that binds a
fresh stream name to `(hs-tokens-of SRC [:template])`.
2. `<stream>.consumeToken()` → `(hs-stream-consume <stream>)`.
3. `<stream>.token(N)` → `(hs-stream-token <stream> N)`.
4. `<stream>.list` → `(dict-get <stream> :list)`.
5. `<stream>.list.length` → `(len (dict-get <stream> :list))`.
6. `<stream>.list[N]` → `(nth (dict-get <stream> :list) N)`.
7. `<stream>.hasMore()` → `(hs-stream-has-more <stream>)`.
8. `<tok>.type` / `.value` / `.op` → `(hs-token-type/value/op? <tok>)`.
9. `expect(X).toBe(V)` and `expect(X).toEqual({...})` → `assert=`.
10. `try { ... } catch (e) { errors.push(e.message) }` plus
    `expect(msg).toMatch(/pat/)` → `(assert (regex-match? pat (guard-msg (hs-stream-consume s))))`.
A tiny `guard-msg` helper runs the expr under `guard` and returns
the caught error's message.
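One possible shape for `guard-msg`. It likely needs to be a macro so
the expression is not evaluated before the guard is installed;
`define-macro`, the Scheme-style `(guard (var clause ...) body ...)`
form, and `error-message` are all guesses at the SX API, not confirmed:
```
;; Hypothetical sketch — every form used here is an assumption.
(define-macro (guard-msg expr)
  `(guard (e (else (error-message e)))  ; caught → return its message
     ,expr
     ""))                               ; no throw → "" so the regex assert fails
```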
The generator should emit a new deftest prologue:
```
(deftest "<name>"
  (let ((s1 (hs-tokens-of "<src1>"))
        (s2 (hs-tokens-of "<src2>" :template)))
    (assert= (hs-token-type (hs-stream-consume s1)) "NUMBER")
    ...))
```
When the test builds a `results` object/array of `{type, value}`
dicts, emit one `assert=` per field instead of materialising a dict —
simpler to debug when it fails. `toEqual({type: "X", value: "Y"})`
becomes two `assert=` lines.
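Concretely, with `tok` bound by the patterns above:
```
;; expect(tokens.consumeToken()).toEqual({type: "NUMBER", value: "1.1"})
;; becomes:
(let ((tok (hs-stream-consume s)))
  (assert= (hs-token-type tok) "NUMBER")
  (assert= (hs-token-value tok) "1.1"))
```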
The generator continues to bail (`return None` / emit
`SKIP (untranslated)`) if any unrecognised JS shape appears; the 17
bodies all fit the grammar above.
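Putting patterns 1 and 10 together, an unterminated-string case from
test 13 would emit roughly the following (source string illustrative;
if tokenization stays eager, the throw happens inside `hs-tokens-of`
and `guard-msg` must wrap that call instead):
```
(deftest "handles strings properly"
  (let ((s (hs-tokens-of "'unterminated")))
    (assert (regex-match? "Unterminated string"
                          (guard-msg (hs-stream-consume s))))))
```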
## 6. Test delta estimate
| # | Test | Feasible? | Blockers |
|---|------|-----------|----------|
| 1 | handles $ in template properly | yes | templateMode impl |
| 2 | handles all special escapes | yes | extend `read-string` escapes (+4 cases) |
| 3 | handles basic token types | yes | type-map + scientific-notation float (already in `read-number`? verify) |
| 4 | handles class identifiers | yes | type-map + `.list[i]` access |
| 5 | handles comments properly | yes | type-map; `//` comments already handled, `--` not — add |
| 6 | handles hex escapes | yes | new `\xNN` escape + structured error |
| 7 | handles id references | yes | mirror of 4 |
| 8 | handles identifiers properly | yes | type-map only |
| 9 | handles identifiers with numbers | yes | type-map only |
| 10 | handles look ahead property | yes | EOF sentinel with `"<<<EOF>>>"` value |
| 11 | handles numbers properly | yes | `1.1.1` scan (stop at second dot) — already appears OK; verify |
| 12 | handles operators properly | yes | op-name map, `==`/`===`/`<=`/`>=` lookahead |
| 13 | handles strings properly | yes | structured unterminated-string error |
| 14 | handles strings properly 2 | yes | subset of 13 |
| 15 | handles template bootstrap | yes | templateMode lexical emission |
| 16 | handles whitespace properly | yes | type-map only |
| 17 | string interpolation isnt surprising | yes (already translatable) | `read-template` `\$`/`\${` escapes |
Expected: **+16 to +17**. Test 17 is already runnable (it is the one
non-eval-only case) but depends on template-escape handling that lives
in the same commit.
## 7. Risks / open questions
- **Position tracking.** The tokenizer currently stores `:pos P`. Tests
do not read it, so we leave it alone. E38 (SourceInfo API) will add
`start/end/line/column`; when that lands, `hs-raw->api-token` should
copy those through.
- **Template mode churn.** Introducing `hs-tokenize-template` risks
divergence from the main tokenizer. Mitigation: factor shared scan
helpers (whitespace, identifier, operator dispatch) into named
functions both variants call; keep the template variant a thin
wrapper that only overrides the backtick handler.
- **Keyword vs identifier type.** The current code tags reserved words
as `"keyword"`; upstream tags every bare word as `IDENTIFIER`. The
conformance tests always expect `IDENTIFIER`. Mapping both
`"keyword"` and `"ident"` to `"IDENTIFIER"` in the API layer is
safe and does **not** affect the parser, which consumes the raw
stream, not the API stream.
- **Mutable streams.** The API stream is intentionally mutable (cursor
advances on `consumeToken`). SX dicts are mutable via `dict-set!`
today; this is consistent with the rest of the hyperscript runtime,
which uses mutable dicts in `hs-activate!` and the event loop.
- **Do any existing tests depend on token shape?** `parser.sx` reads
`:type :value :pos`. It must **not** see the API-shaped dicts. The
API is strictly additive — `hs-tokenize` is unchanged; `hs-parse`
continues to consume its output directly. Only `hs-api-tokens`
(and its consumers) sees the upstream-shaped dicts.
- **Error-message contract.** Upstream throws on unterminated strings
and bad hex escapes. We currently return an EOF and emit a
trailing fragment. Adding a thrown error is new behaviour; confirm
the parser callers in `hs-compile` still produce useful diagnostics
when the tokenizer raises rather than eats the input.
- **`.list` indexing semantics.** Upstream tests read `.list[3]` and
`.list[4]` directly — these indices reference upstream's raw
token layout. If our SX tokenizer emits a slightly different
layout (e.g. extra whitespace-related tokens, or none where
upstream has one), the index tests fail even though `.type`/`.value`
are correct. Verify on a spike before committing: run
`(hs-tokens-of "(a).a")` and check that index 4 is the
`CLASS_REF`. If indices disagree, add a normalization pass that
strips tokens upstream omits.
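The spike from the last bullet as a one-liner; index 4 is the claim
made above — adjust if the real layout differs:
```
(let ((l (dict-get (hs-tokens-of "(a).a") :list)))
  (assert= (hs-token-type (nth l 4)) "CLASS_REF"))
```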
## 8. Implementation checklist
Ordered smallest-first; each is its own commit.
1. **Add `hs-api-tokens` and token helpers** (`lib/hyperscript/runtime.sx`).
Includes `hs-raw->api-token`, type-map, op-name table,
`hs-stream-token/consume/has-more`, EOF sentinel with
`"<<<EOF>>>"` value. No test delta yet — API-only.
2. **Extend string-escape table** in `read-string` (tokenizer):
add `\b \f \r \v \xNN`, keep existing `\n \t \\ <quote>`. Emit
structured error message `"Invalid hexadecimal escape: ..."` or
`"Unterminated string"`. Unlocks tests 2, 6, 13, 14.
3. **Extend the `=`/`!`/`<`/`>` lookahead to `===`** in tokenizer
   `scan!`. Currently only the two-character `[=!<>]=` forms are
   matched. Unlocks test 12.
4. **Add `--` line-comment support** to `scan!`. Currently only `//`
   (through selector disambiguation) is handled. Unlocks test 5.
5. **Add `hs-tokenize-template`** variant for template-bootstrap
lexical mode. Shared scan helpers extracted. Unlocks tests 1, 15.
6. **Generator pattern** in `tests/playwright/generate-sx-tests.py`:
recognise `_hyperscript.internals.tokenizer.tokenize(src[, true])`
+ consumer chain, emit SX `deftest` using the helpers from step 1.
Unlocks the 16 remaining eval-only tests (test 17 already has DOM
shape).
7. **Regenerate `spec/tests/test-hyperscript-behavioral.sx`** and run
`mcp__hs-test__hs_test_run(suite="hs-upstream-core/tokenizer")`.
Expected: 17/17, with test 17 also passing thanks to step 2's
escape fixes (it depends on `\$` / `\${` in `read-template`).
8. **Update** `plans/hs-conformance-to-100.md` row 37 to
`done (+17)` and tick the scoreboard in the same commit.
Work stays inside `lib/hyperscript/**`, `shared/static/wasm/sx/hs-*`,
`tests/playwright/generate-sx-tests.py`, and the two plan files —
matching the scope rule in the conformance plan.
`shared/static/wasm/sx/hs-runtime.sx` must be re-copied after each
runtime edit.