HS-design: E37 Tokenizer-as-API

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 06:58:36 +00:00
parent 3587443742
commit 87cafaaa3f

# E37 — Tokenizer-as-API
Cluster 37 of `plans/hs-conformance-to-100.md`. 17 tests in
`hs-upstream-core/tokenizer`. All 17 are emitted as `SKIP
(untranslated)` by `tests/playwright/generate-sx-tests.py`: the JS
bodies call `_hyperscript.internals.tokenizer.tokenize(...)` and
inspect a token-stream surface the SX port does not expose.
Work breaks into: (1) an SX API over the existing `hs-tokenize`
mimicking the upstream stream object; (2) a compatibility shim over
token fields; (3) a generator pattern recognising
`_hyperscript.internals.tokenizer.tokenize(src[, templateMode])`. No
tokenizer-grammar rewrite is required. Position tracking
(`start/end/line/column`) is scoped to E38 (SourceInfo API).
## 1. Failing tests
Every eval-only test calls `_hyperscript.internals.tokenizer.tokenize`
plus one or more of `.token(i)`, `.consumeToken()`, `.hasMore()`,
`.list`, `.type`, `.value`, `.op`.
1. **handles $ in template properly** — `tokenize('"', true).token(0).value` → `'"'`. templateMode + `token(i)`.
2. **handles all special escapes** — 6 × `tokenize('"\\X"').consumeToken().value` for `\b \f \n \r \t \v`.
3. **handles basic token types** — 15 asserts for `IDENTIFIER NUMBER CLASS_REF ID_REF STRING`; includes `1e6`, `1e-6`, `1.1e6`, `1.1e-6`; plus `.hasMore()`.
4. **handles class identifiers** — 9 `.a`-style; uses `.consumeToken()` and `.list[3]/.list[4]`.
5. **handles comments properly** — 13 asserts on `tokenize(src).list.length`; `--` / `//` to EOL emit nothing.
6. **handles hex escapes** — 3 `\\xNN` decodes + 4 error-path asserts matching `/Invalid hexadecimal escape/`.
7. **handles id references** — mirror of 4 for `#a` → `ID_REF`.
8. **handles identifiers properly** — whitespace + comment skipping between multiple `consumeToken()` calls.
9. **handles identifiers with numbers** — `f1oo / fo1o / foo1` → `IDENTIFIER`.
10. **handles look ahead property** — `tokenize("a 1 + 1").token(0..4)` → `["a" "1" "+" "1" "<<<EOF>>>"]`.
11. **handles numbers properly** — 8 asserts incl. `1.1.1` → `NUMBER PERIOD NUMBER`.
12. **handles operators properly** — iterates 27 ops (`+ - * . \\ : % | ! ? # & ; , ( ) < > { } [ ] = <= >= == ===`) asserting `token.op === true` and `token.value === key`.
13. **handles strings properly** — single/double quotes, embedded other-quote, escaped same-quote, + two unterminated throws matching `/Unterminated string/`.
14. **handles strings properly 2** — subset of 13.
15. **handles template bootstrap** — 5 `tokenize(src, true)` cases asserting the lexical char-level stream (`"`, `$`, `{`, inner, `}`, `"`).
16. **handles whitespace properly** — 16 asserts on `.list.length` for space / `\n` / `\r` / `\t`.
17. **string interpolation isnt surprising** — DOM-shaped (not eval-only); asserts `\$`/`\${` escapes in templates. Touches `read-template`, not the stream API.
## 2. Upstream API shape
From `https://hyperscript.org/docs/#api` and
`node_modules/hyperscript.org/src/_hyperscript.js`:
```js
const tokens = _hyperscript.internals.tokenizer.tokenize(src, templateMode?)
// → { list, source, hasMore, matchTokenType, token, consumeToken,
// requireTokenType, ... }
tokens.list // Array<Token> — lookahead window
tokens.source // original src string
tokens.token(i) // i-th un-consumed token (0 = current); returns
// { type: "EOF", value: "<<<EOF>>>" } past end
tokens.consumeToken() // shift + return; throws on empty for required
tokens.hasMore() // true if a non-EOF token remains
tokens.matchTokenType(type) / requireTokenType(type) / etc.
```
Each `Token` is:
```js
{
type: "IDENTIFIER" | "NUMBER" | "STRING" | "CLASS_REF"
| "ID_REF" | "EOF" | "PLUS" | "MINUS" | ... /* op names */,
value: string,
op: boolean, // true for punctuation/operator tokens
start: number, // char offset
end: number,
line: number,
column: number,
source: string, // reference to full src
}
```
The conformance tests only read `type`, `value`, `op`, and occasionally
random-index into `.list`. They never read `start/end/line/column`, so
position tracking is **not** required for cluster E37.
## 3. Proposed SX surface
Add three things to `lib/hyperscript/runtime.sx` (exposed by name, so
SX test bodies can call them directly through `eval-hs` or `assert=`):
```
(hs-tokens-of src) ; => dict — new token-stream object
(hs-tokens-of src :template) ; templateMode variant
(hs-token-type tok) ; upstream-style type name
(hs-token-value tok) ; string value
(hs-token-op? tok) ; bool
```
A token stream is a mutable dict:
```
{ :source src
:list (list-of-tokens) ; upstream-shaped, :type :value :op
:pos 0 } ; cursor into :list
```
With three pure-SX consumer helpers:
```
(hs-stream-token stream i) ; lookahead; returns EOF sentinel past end
(hs-stream-consume stream) ; returns current token, advances :pos
(hs-stream-has-more stream) ; not EOF and pos < len
```
### Worked example
```
(let ((s (hs-tokens-of "1.1")))
(hs-token-type (hs-stream-consume s))) ; => "NUMBER"
(let ((s (hs-tokens-of "a 1 + 1")))
(list (hs-token-value (hs-stream-token s 0)) ; "a"
(hs-token-value (hs-stream-token s 4)))) ; "<<<EOF>>>"
```
All helpers are ordinary `define`s — no platform primitives, no FFI.
The generator emits them as bare calls inside `deftest` bodies.
## 4. Runtime architecture
The existing `hs-tokenize` emits tokens with:
```
{ :type "keyword" | "ident" | "number" | "string" | "class" | "id"
| "op" | "paren-open" | ... | "eof"
:value V
:pos P }
```
The upstream contract uses `SCREAMING_SNAKE_CASE` and a dedicated
boolean `.op` flag rather than a merged type/punctuation taxonomy.
Rather than rewrite the tokenizer, add a translation layer.
### Type map (SX-native → upstream)
```
"ident" → "IDENTIFIER" (keywords too: see note)
"keyword" → "IDENTIFIER" (upstream tokenizes keywords as idents)
"number" → "NUMBER"
"string" → "STRING"
"class" → "CLASS_REF" (:value becomes ".a" with leading dot)
"id" → "ID_REF" (:value becomes "#a" with leading hash)
"attr" → "ATTRIBUTE_REF"
"style" → "STYLE_REF"
"selector" → "QUERY_REF" (used by tests? upstream calls it QUERY_REF)
"template" → one-shot: see templateMode below
"eof" → "EOF" with :value "<<<EOF>>>"
"paren-open" → "L_PAREN" + :op true
"paren-close" → "R_PAREN" + :op true
"bracket-open" → "L_BRACKET" + :op true
"bracket-close" → "R_BRACKET" + :op true
"brace-open" → "L_BRACE" + :op true
"brace-close" → "R_BRACE" + :op true
"comma" → "COMMA" + :op true
"dot" → "PERIOD" + :op true
"op" → name-by-value lookup (see below) + :op true
```
A tiny op-name table (15–25 entries) maps `:value` strings to the
upstream token type name:
```
"+" → "PLUS"
"-" → "MINUS"
"*" → "MULTIPLY"
"/" → "SLASH" ; current code uses "op"/"/"
":" → "COLON" ; not yet emitted as own token — fix below
"%" → "PERCENT"
"|" → "PIPE"
"!" → "EXCLAMATION"
"?" → "QUESTION"
"#" → "POUND"
"&" → "AMPERSAND"
";" → "SEMI"
"=" → "EQUALS"
"<" → "L_ANG"
">" → "R_ANG"
"<=" → "LTE_ANG"
">=" → "GTE_ANG"
"==" → "EQ"
"===" → "EQQ"
"\\" → "BACKSLASH"
"'s" → "APOSTROPHE_S" ; not a true operator — elided from test 12
```
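Taken together, the two tables above reduce to a single pure mapping. A hypothetical Python model of what `hs-raw->api-token` has to do (the names `TYPE_MAP`, `OP_NAMES`, and `raw_to_api_token` are illustrative, not the SX identifiers):

```python
# Illustrative model of the raw -> upstream token translation layer.
TYPE_MAP = {
    "ident": "IDENTIFIER", "keyword": "IDENTIFIER",
    "number": "NUMBER", "string": "STRING",
    "class": "CLASS_REF", "id": "ID_REF",
    "attr": "ATTRIBUTE_REF", "style": "STYLE_REF", "selector": "QUERY_REF",
    "paren-open": "L_PAREN", "paren-close": "R_PAREN",
    "bracket-open": "L_BRACKET", "bracket-close": "R_BRACKET",
    "brace-open": "L_BRACE", "brace-close": "R_BRACE",
    "comma": "COMMA", "dot": "PERIOD",
}
# Raw types that carry the upstream :op flag besides "op" itself.
PUNCT = {"paren-open", "paren-close", "bracket-open", "bracket-close",
         "brace-open", "brace-close", "comma", "dot"}
OP_NAMES = {
    "+": "PLUS", "-": "MINUS", "*": "MULTIPLY", "/": "SLASH",
    ":": "COLON", "%": "PERCENT", "|": "PIPE", "!": "EXCLAMATION",
    "?": "QUESTION", "#": "POUND", "&": "AMPERSAND", ";": "SEMI",
    "=": "EQUALS", "<": "L_ANG", ">": "R_ANG",
    "<=": "LTE_ANG", ">=": "GTE_ANG", "==": "EQ", "===": "EQQ",
    "\\": "BACKSLASH",
}

def raw_to_api_token(raw: dict) -> dict:
    t, v = raw["type"], raw["value"]
    if t == "op":
        return {"type": OP_NAMES[v], "value": v, "op": True}
    if t == "eof":
        return {"type": "EOF", "value": "<<<EOF>>>", "op": False}
    if t == "class":
        v = "." + v.lstrip(".")   # ensure leading dot
    elif t == "id":
        v = "#" + v.lstrip("#")   # ensure leading hash
    return {"type": TYPE_MAP[t], "value": v, "op": t in PUNCT}
```

The `"template"` raw type is deliberately absent here: in API mode it never reaches the conversion, because templateMode emits the template's lexical parts directly.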
### Conversion entry point
```
(define (hs-api-tokens src template-mode?)
(let ((raw (if template-mode?
(hs-tokenize-template src) ; new variant
(hs-tokenize src))))
{:source src
:list (map hs-raw->api-token raw)
:pos 0}))
```
`hs-raw->api-token` is a pure mapping function using the tables above.
An EOF token is always present at the end (the current tokenizer
already emits one).
### Token gaps to fix
Three issues turn up while writing the map; all are trivial one-site
fixes in `tokenizer.sx`:
- **`:` is currently consumed as part of the local prefix
  (`:name`)**. Upstream expects a bare `:` to produce `COLON`; it
  combines into a local ref only when followed by an ident-start
  char. The bare form is exercised only by the operator table in
  test 12. Fix by emitting `"op" ":"` when the next char is not an
  ident start (the code already does this); the op-name map above
  then covers it.
- **`===` and `==`** — current tokenizer emits `"op" "="` plus another
`"="`, not `"=="`. Extend the `=`/`!`/`<`/`>` lookahead clause to
also match a third `=` after `==`.
- **Template mode** — upstream `tokenize(src, true)` splits
backtick-templates into their lexical parts rather than the single
`"template"` token the current code emits. Add a second top-level
scanner `hs-tokenize-template` used only for the API wrapper; the
primary parser continues to call `hs-tokenize` unchanged. The
template-mode tests (1, 15) only require character-level emission
of the `" $ { inner } "` sequence — no semantic re-use by the
parser.
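The `==`/`===` bullet boils down to longest-match-first (maximal munch) dispatch in the operator scanner. A minimal Python sketch of the idea, assuming only the four multi-char ops from the table above (`scan_op` is a hypothetical name, not the SX `scan!`):

```python
# Longest-match-first operator scan: try "===" before "==" before "="
# so "===" is never split into "==" + "=".
MULTI_OPS = ["===", "==", "<=", ">="]  # ordered longest first

def scan_op(src: str, i: int) -> tuple[str, int]:
    """Return (operator text, index after it) for the op at src[i]."""
    for op in MULTI_OPS:
        if src.startswith(op, i):
            return op, i + len(op)
    return src[i], i + 1  # single-char operator fallback
```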
### Stream consumer helpers
```
(define (hs-stream-token s i)
(let ((list (dict-get s :list))
(pos (dict-get s :pos)))
(or (nth list (+ pos i))
(hs-eof-sentinel))))
(define (hs-stream-consume s)
(let ((tok (hs-stream-token s 0)))
(when (not (= (hs-token-type tok) "EOF"))
(dict-set! s :pos (+ (dict-get s :pos) 1)))
tok))
(define (hs-stream-has-more s)
(not (= (hs-token-type (hs-stream-token s 0)) "EOF")))
```
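The cursor semantics of these three helpers can be pinned down with a small executable model. This Python class is illustrative only (the SX versions operate on the mutable dict shown earlier; `TokenStream` is a hypothetical name):

```python
# Executable model of the stream cursor: token(i) is pure lookahead,
# consume_token() advances the cursor but never moves past EOF.
EOF = {"type": "EOF", "value": "<<<EOF>>>", "op": False}

class TokenStream:
    """Mutable cursor over an upstream-shaped token list."""
    def __init__(self, tokens):
        self.list = tokens
        self.pos = 0

    def token(self, i):
        # i-th un-consumed token; EOF sentinel past the end of the list.
        j = self.pos + i
        return self.list[j] if j < len(self.list) else EOF

    def consume_token(self):
        tok = self.token(0)
        if tok["type"] != "EOF":
            self.pos += 1
        return tok

    def has_more(self):
        return self.token(0)["type"] != "EOF"
```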
## 5. Test mock strategy
All 17 tests are `complexity: eval-only` with empty `html`. They do
not need the DOM runner — they only need SX expressions that resolve
to the same values the JS asserts check.
Add a generator pattern to `generate-sx-tests.py`, slotted into
`generate_eval_only_test` or as a new pre-pass ahead of it, that
matches bodies containing `_hyperscript.internals.tokenizer.tokenize`.
The pattern tree, by precedence:
1. `tokenize(SRC[, true])` → emit an SX `let` that binds a
fresh stream name to `(hs-tokens-of SRC [:template])`.
2. `<stream>.consumeToken()` → `(hs-stream-consume <stream>)`.
3. `<stream>.token(N)` → `(hs-stream-token <stream> N)`.
4. `<stream>.list.length` → `(len (dict-get <stream> :list))`.
5. `<stream>.list[N]` → `(nth (dict-get <stream> :list) N)`.
6. `<stream>.list` → `(dict-get <stream> :list)` (the longer `.list…`
   forms above must match first).
7. `<stream>.hasMore()` → `(hs-stream-has-more <stream>)`.
8. `<tok>.type` / `.value` / `.op` → `(hs-token-type/value/op? <tok>)`.
9. `expect(X).toBe(V)` and `expect(X).toEqual({...})` → `assert=`.
10. `try { ... } catch (e) { errors.push(e.message) }` plus
`expect(msg).toMatch(/pat/)` → `(assert (regex-match? pat (guard-msg (hs-stream-consume s))))`.
A tiny `guard-msg` helper runs the expr under `guard` and returns
the caught error's message.
The generator should emit a new deftest prologue:
```
(deftest "<name>"
(let ((s1 (hs-tokens-of "<src1>"))
(s2 (hs-tokens-of "<src2>" :template)))
(assert= (hs-token-type (hs-stream-consume s1)) "NUMBER")
...))
```
When the test builds a `results` object/array of `{type, value}`
dicts, emit one `assert=` per field instead of materialising a dict —
simpler to debug when it fails. `toEqual({type: "X", value: "Y"})`
becomes two `assert=` lines.
The generator continues to bail (`return None` / emit
`SKIP (untranslated)`) if any unrecognised JS shape appears; the 17
bodies all fit the grammar above.
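A hedged sketch of the recognition pass in the generator's own language. The regexes and the `rewrite` helper are hypothetical, not the actual `generate-sx-tests.py` code, and cover only single-expression chains (the real pattern must also handle the `.list` forms and the try/catch shape):

```python
import re

# Match tokenize("src") or tokenize("src", true) with a literal source.
TOKENIZE_RE = re.compile(
    r"_hyperscript\.internals\.tokenizer\.tokenize\("
    r"(?P<src>\"[^\"]*\"|'[^']*')"
    r"(?:,\s*(?P<tmpl>true))?\)")

# (JS chain suffix, SX wrapper); {e} = expression so far, {n} = index arg.
CHAIN = [
    (re.compile(r"\.consumeToken\(\)"), "(hs-stream-consume {e})"),
    (re.compile(r"\.token\((\d+)\)"),   "(hs-stream-token {e} {n})"),
    (re.compile(r"\.hasMore\(\)"),      "(hs-stream-has-more {e})"),
    (re.compile(r"\.type\b"),           "(hs-token-type {e})"),
    (re.compile(r"\.value\b"),          "(hs-token-value {e})"),
    (re.compile(r"\.op\b"),             "(hs-token-op? {e})"),
]

def rewrite(js_expr: str):
    """Translate one tokenize(...).chain expression to SX, or None."""
    m = TOKENIZE_RE.match(js_expr)
    if not m:
        return None
    mode = " :template" if m.group("tmpl") else ""
    expr = f"(hs-tokens-of {m.group('src')}{mode})"
    rest = js_expr[m.end():]
    while rest:
        for pat, fmt in CHAIN:
            hit = pat.match(rest)
            if hit:
                n = hit.group(1) if hit.groups() else ""
                expr = fmt.format(e=expr, n=n)
                rest = rest[hit.end():]
                break
        else:
            return None  # unrecognised shape: leave as SKIP (untranslated)
    return expr
```

Anything the chain walker cannot consume falls through to `None`, preserving the existing bail-out behaviour.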
## 6. Test delta estimate
| # | Test | Feasible? | Blockers |
|---|------|-----------|----------|
| 1 | handles $ in template properly | yes | templateMode impl |
| 2 | handles all special escapes | yes | extend `read-string` escapes (+4 cases) |
| 3 | handles basic token types | yes | type-map + scientific-notation float (already in `read-number`? verify) |
| 4 | handles class identifiers | yes | type-map + `.list[i]` access |
| 5 | handles comments properly | yes | type-map; `//` comments already handled, `--` not — add |
| 6 | handles hex escapes | yes | new `\xNN` escape + structured error |
| 7 | handles id references | yes | mirror of 4 |
| 8 | handles identifiers properly | yes | type-map only |
| 9 | handles identifiers with numbers | yes | type-map only |
| 10 | handles look ahead property | yes | EOF sentinel with `"<<<EOF>>>"` value |
| 11 | handles numbers properly | yes | verify `1.1.1` scan stops at the second dot (already appears OK) |
| 12 | handles operators properly | yes | op-name map, `==`/`===`/`<=`/`>=` lookahead |
| 13 | handles strings properly | yes | structured unterminated-string error |
| 14 | handles strings properly 2 | yes | subset of 13 |
| 15 | handles template bootstrap | yes | templateMode lexical emission |
| 16 | handles whitespace properly | yes | type-map only |
| 17 | string interpolation isnt surprising | yes | already-translatable; needs `read-template` `\$`/`\${` escape |
Expected: **+16 to +17**. Test 17 is already runnable (it is the one
non-eval-only case) but depends on template-escape handling that lives
in the same commit.
## 7. Risks / open questions
- **Position tracking.** The tokenizer currently stores `:pos P`. Tests
do not read it, so we leave it alone. E38 (SourceInfo API) will add
`start/end/line/column`; when that lands, `hs-raw->api-token` should
copy those through.
- **Template mode churn.** Introducing `hs-tokenize-template` risks
divergence from the main tokenizer. Mitigation: factor shared scan
helpers (whitespace, identifier, operator dispatch) into named
functions both variants call; keep the template variant a thin
wrapper that only overrides the backtick handler.
- **Keyword vs identifier type.** The current code tags reserved words
as `"keyword"`; upstream tags every bare word as `IDENTIFIER`. The
conformance tests always expect `IDENTIFIER`. Mapping both
`"keyword"` and `"ident"` to `"IDENTIFIER"` in the API layer is
safe and does **not** affect the parser, which consumes the raw
stream, not the API stream.
- **Mutable streams.** The API stream is intentionally mutable (cursor
advances on `consumeToken`). SX dicts are mutable via `dict-set!`
today; this is consistent with the rest of the hyperscript runtime,
which uses mutable dicts in `hs-activate!` and the event loop.
- **Do any existing tests depend on token shape?** `parser.sx` reads
`:type :value :pos`. It must **not** see the API-shaped dicts. The
API is strictly additive — `hs-tokenize` is unchanged; `hs-parse`
continues to consume its output directly. Only `hs-api-tokens`
(and its consumers) sees the upstream-shaped dicts.
- **Error-message contract.** Upstream throws on unterminated strings
and bad hex escapes. We currently return an EOF and emit a
trailing fragment. Adding a thrown error is new behaviour; confirm
the parser callers in `hs-compile` still produce useful diagnostics
when the tokenizer raises rather than eats the input.
- **`.list` indexing semantics.** Upstream tests read `.list[3]` and
`.list[4]` directly — these indices reference upstream's raw
token layout. If our SX tokenizer emits a slightly different
layout (e.g. extra whitespace-related tokens, or none where
upstream has one), the index tests fail even though `.type`/`.value`
are correct. Verify on a spike before committing: run
`(hs-tokens-of "(a).a")` and check that index 4 is the
`CLASS_REF`. If indices disagree, add a normalization pass that
strips tokens upstream omits.
## 8. Implementation checklist
Ordered smallest-first; each is its own commit.
1. **Add `hs-api-tokens` and token helpers** (`lib/hyperscript/runtime.sx`).
Includes `hs-raw->api-token`, type-map, op-name table,
`hs-stream-token/consume/has-more`, EOF sentinel with
`"<<<EOF>>>"` value. No test delta yet — API-only.
2. **Extend string-escape table** in `read-string` (tokenizer):
add `\b \f \r \v \xNN`, keep existing `\n \t \\ <quote>`. Emit
structured error message `"Invalid hexadecimal escape: ..."` or
`"Unterminated string"`. Unlocks tests 2, 6, 13, 14.
3. **Add `==` / `===` / `<=` / `>=` lookahead** in the tokenizer's
   `scan!`. Currently only `[=!<>]=` is matched. Unlocks test 12.
4. **Add `--` line-comment support** to `scan!`. Currently only `//`
   (through selector disambiguation) is handled. Unlocks test 5.
5. **Add `hs-tokenize-template`** variant for template-bootstrap
lexical mode. Shared scan helpers extracted. Unlocks tests 1, 15.
6. **Generator pattern** in `tests/playwright/generate-sx-tests.py`:
recognise `_hyperscript.internals.tokenizer.tokenize(src[, true])`
+ consumer chain, emit SX `deftest` using the helpers from step 1.
Unlocks the 16 remaining eval-only tests (test 17 already has DOM
shape).
7. **Regenerate `spec/tests/test-hyperscript-behavioral.sx`** and run
`mcp__hs-test__hs_test_run(suite="hs-upstream-core/tokenizer")`.
Expected: 17/17, with test 17 also passing thanks to step 2's
escape fixes (it depends on `\$` / `\${` in `read-template`).
8. **Update** `plans/hs-conformance-to-100.md` row 37 to
`done (+17)` and tick the scoreboard in the same commit.
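Step 2's escape decoding is easy to get subtly wrong, so it is worth validating in isolation. A hypothetical Python model of the decode rule (the SX `read-string` would do the equivalent; `decode_escape` is an illustrative name):

```python
# Decode one backslash escape; i points at the char after the backslash.
# Mirrors the escape set step 2 adds: \b \f \n \r \t \v plus \xNN.
ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r",
           "t": "\t", "v": "\v", "\\": "\\", '"': '"', "'": "'"}
HEX_DIGITS = "0123456789abcdefABCDEF"

def decode_escape(src: str, i: int) -> tuple[str, int]:
    """Return (decoded char, index after the escape sequence)."""
    c = src[i]
    if c in ESCAPES:
        return ESCAPES[c], i + 1
    if c == "x":
        digits = src[i + 1:i + 3]
        if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
            raise ValueError(f"Invalid hexadecimal escape: \\x{digits}")
        return chr(int(digits, 16)), i + 3
    return c, i + 1  # unknown escape: pass the char through unchanged
```

The error string matches the `/Invalid hexadecimal escape/` pattern test 6 asserts against.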
Work stays inside `lib/hyperscript/**`, `shared/static/wasm/sx/hs-*`,
`tests/playwright/generate-sx-tests.py`, and the two plan files —
matching the scope rule in the conformance plan.
`shared/static/wasm/sx/hs-runtime.sx` must be re-copied after each
runtime edit.