HS-design: E37 Tokenizer-as-API

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 06:58:36 +00:00
parent 3587443742
commit 87cafaaa3f

# E37 — Tokenizer-as-API
Cluster 37 of `plans/hs-conformance-to-100.md`. 17 tests in
`hs-upstream-core/tokenizer`. All 17 are emitted as `SKIP
(untranslated)` by `tests/playwright/generate-sx-tests.py`: the JS
bodies call `_hyperscript.internals.tokenizer.tokenize(...)` and
inspect a token-stream surface the SX port does not expose.
Work breaks into: (1) an SX API over the existing `hs-tokenize`
mimicking the upstream stream object; (2) a compatibility shim over
token fields; (3) a generator pattern recognising
`_hyperscript.internals.tokenizer.tokenize(src[, templateMode])`. No
tokenizer-grammar rewrite is required. Position tracking
(`start/end/line/column`) is scoped to E38 (SourceInfo API).
## 1. Failing tests
Every eval-only test calls `_hyperscript.internals.tokenizer.tokenize`
plus one or more of `.token(i)`, `.consumeToken()`, `.hasMore()`,
`.list`, `.type`, `.value`, `.op`.
1. **handles $ in template properly** — `tokenize('"', true).token(0).value` → `'"'`. templateMode + `token(i)`.
2. **handles all special escapes** — 6 × `tokenize('"\\X"').consumeToken().value` for `\b \f \n \r \t \v`.
3. **handles basic token types** — 15 asserts for `IDENTIFIER NUMBER CLASS_REF ID_REF STRING`; includes `1e6`, `1e-6`, `1.1e6`, `1.1e-6`; plus `.hasMore()`.
4. **handles class identifiers** — 9 `.a`-style; uses `.consumeToken()` and `.list[3]/.list[4]`.
5. **handles comments properly** — 13 asserts on `tokenize(src).list.length`; `--` / `//` to EOL emit nothing.
6. **handles hex escapes** — 3 `\\xNN` decodes + 4 error-path asserts matching `/Invalid hexadecimal escape/`.
7. **handles id references** — mirror of 4 for `#a` → `ID_REF`.
8. **handles identifiers properly** — whitespace + comment skipping between multiple `consumeToken()` calls.
9. **handles identifiers with numbers** — `f1oo / fo1o / foo1` → `IDENTIFIER`.
10. **handles look ahead property** — `tokenize("a 1 + 1").token(0..4)` → `["a" "1" "+" "1" "<<<EOF>>>"]`.
11. **handles numbers properly** — 8 asserts incl. `1.1.1` → `NUMBER PERIOD NUMBER`.
12. **handles operators properly** — iterates 27 ops (`+ - * . \\ : % | ! ? # & ; , ( ) < > { } [ ] = <= >= == ===`) asserting `token.op === true` and `token.value === key`.
13. **handles strings properly** — single/double quotes, embedded other-quote, escaped same-quote, + two unterminated throws matching `/Unterminated string/`.
14. **handles strings properly 2** — subset of 13.
15. **handles template bootstrap** — 5 `tokenize(src, true)` cases asserting the lexical char-level stream (`"`, `$`, `{`, inner, `}`, `"`).
16. **handles whitespace properly** — 16 asserts on `.list.length` for space / `\n` / `\r` / `\t`.
17. **string interpolation isnt surprising** — DOM-shaped (not eval-only); asserts `\$`/`\${` escapes in templates. Touches `read-template`, not the stream API.
## 2. Upstream API shape
From `https://hyperscript.org/docs/#api` and
`node_modules/hyperscript.org/src/_hyperscript.js`:
```js
const tokens = _hyperscript.internals.tokenizer.tokenize(src, templateMode?)
// → { list, source, hasMore, matchTokenType, token, consumeToken,
// requireTokenType, ... }
tokens.list // Array<Token> — lookahead window
tokens.source // original src string
tokens.token(i) // i-th un-consumed token (0 = current); returns
// { type: "EOF", value: "<<<EOF>>>" } past end
tokens.consumeToken() // shift + return; throws on empty for required
tokens.hasMore() // true if a non-EOF token remains
tokens.matchTokenType(type) / requireTokenType(type) / etc.
```
Each `Token` is:
```js
{
type: "IDENTIFIER" | "NUMBER" | "STRING" | "CLASS_REF"
| "ID_REF" | "EOF" | "PLUS" | "MINUS" | ... /* op names */,
value: string,
op: boolean, // true for punctuation/operator tokens
start: number, // char offset
end: number,
line: number,
column: number,
source: string, // reference to full src
}
```
The conformance tests only read `type`, `value`, `op`, and occasionally
random-index into `.list`. They never read `start/end/line/column`, so
position tracking is **not** required for cluster E37.
## 3. Proposed SX surface
Add three things to `lib/hyperscript/runtime.sx` (exposed by name, so
SX test bodies can call them directly through `eval-hs` or `assert=`):
```
(hs-tokens-of src) ; => dict — new token-stream object
(hs-tokens-of src :template) ; templateMode variant
(hs-token-type tok) ; upstream-style type name
(hs-token-value tok) ; string value
(hs-token-op? tok) ; bool
```
A token stream is a mutable dict:
```
{ :source src
:list (list-of-tokens) ; upstream-shaped, :type :value :op
:pos 0 } ; cursor into :list
```
With three pure-SX consumer helpers:
```
(hs-stream-token stream i) ; lookahead; returns EOF sentinel past end
(hs-stream-consume stream) ; returns current token, advances :pos
(hs-stream-has-more stream) ; not EOF and pos < len
```
### Worked example
```
(let ((s (hs-tokens-of "1.1")))
(hs-token-type (hs-stream-consume s))) ; => "NUMBER"
(let ((s (hs-tokens-of "a 1 + 1")))
(list (hs-token-value (hs-stream-token s 0)) ; "a"
(hs-token-value (hs-stream-token s 4)))) ; "<<<EOF>>>"
```
All helpers are ordinary `define`s — no platform primitives, no FFI.
The generator emits them as bare calls inside `deftest` bodies.
## 4. Runtime architecture
The existing `hs-tokenize` emits tokens with:
```
{ :type "keyword" | "ident" | "number" | "string" | "class" | "id"
| "op" | "paren-open" | ... | "eof"
:value V
:pos P }
```
The upstream contract uses `SCREAMING_SNAKE_CASE` and a dedicated
boolean `.op` flag rather than a merged type/punctuation taxonomy.
Rather than rewrite the tokenizer, add a translation layer.
### Type map (SX-native → upstream)
```
"ident" → "IDENTIFIER" (keywords too: see note)
"keyword" → "IDENTIFIER" (upstream tokenizes keywords as idents)
"number" → "NUMBER"
"string" → "STRING"
"class" → "CLASS_REF" (:value becomes ".a" with leading dot)
"id" → "ID_REF" (:value becomes "#a" with leading hash)
"attr" → "ATTRIBUTE_REF"
"style" → "STYLE_REF"
"selector" → "QUERY_REF" (used by tests? upstream calls it QUERY_REF)
"template" → one-shot: see templateMode below
"eof" → "EOF" with :value "<<<EOF>>>"
"paren-open" → "L_PAREN" + :op true
"paren-close" → "R_PAREN" + :op true
"bracket-open" → "L_BRACKET" + :op true
"bracket-close" → "R_BRACKET" + :op true
"brace-open" → "L_BRACE" + :op true
"brace-close" → "R_BRACE" + :op true
"comma" → "COMMA" + :op true
"dot" → "PERIOD" + :op true
"op" → name-by-value lookup (see below) + :op true
```
A tiny op-name table (15–25 entries) maps `:value` strings to the
upstream token type name:
```
"+" → "PLUS"
"-" → "MINUS"
"*" → "MULTIPLY"
"/" → "SLASH" ; current code uses "op"/"/"
":" → "COLON" ; not yet emitted as own token — fix below
"%" → "PERCENT"
"|" → "PIPE"
"!" → "EXCLAMATION"
"?" → "QUESTION"
"#" → "POUND"
"&" → "AMPERSAND"
";" → "SEMI"
"=" → "EQUALS"
"<" → "L_ANG"
">" → "R_ANG"
"<=" → "LTE_ANG"
">=" → "GTE_ANG"
"==" → "EQ"
"===" → "EQQ"
"\\" → "BACKSLASH"
"'s" → "APOSTROPHE_S" ; not a true operator — elided from test 12
```
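Taken together, the two tables above reduce to a single pure mapping. A hypothetical Python model of what `hs-raw->api-token` has to do (the names `TYPE_MAP`, `OP_NAMES`, and `raw_to_api_token` are illustrative, not the SX identifiers):

```python
# Illustrative model of the raw -> upstream token translation layer.
TYPE_MAP = {
    "ident": "IDENTIFIER", "keyword": "IDENTIFIER",
    "number": "NUMBER", "string": "STRING",
    "class": "CLASS_REF", "id": "ID_REF",
    "attr": "ATTRIBUTE_REF", "style": "STYLE_REF", "selector": "QUERY_REF",
    "paren-open": "L_PAREN", "paren-close": "R_PAREN",
    "bracket-open": "L_BRACKET", "bracket-close": "R_BRACKET",
    "brace-open": "L_BRACE", "brace-close": "R_BRACE",
    "comma": "COMMA", "dot": "PERIOD",
}
# Raw types that carry the upstream :op flag besides "op" itself.
PUNCT = {"paren-open", "paren-close", "bracket-open", "bracket-close",
         "brace-open", "brace-close", "comma", "dot"}
OP_NAMES = {
    "+": "PLUS", "-": "MINUS", "*": "MULTIPLY", "/": "SLASH",
    ":": "COLON", "%": "PERCENT", "|": "PIPE", "!": "EXCLAMATION",
    "?": "QUESTION", "#": "POUND", "&": "AMPERSAND", ";": "SEMI",
    "=": "EQUALS", "<": "L_ANG", ">": "R_ANG",
    "<=": "LTE_ANG", ">=": "GTE_ANG", "==": "EQ", "===": "EQQ",
    "\\": "BACKSLASH",
}

def raw_to_api_token(raw: dict) -> dict:
    t, v = raw["type"], raw["value"]
    if t == "op":
        return {"type": OP_NAMES[v], "value": v, "op": True}
    if t == "eof":
        return {"type": "EOF", "value": "<<<EOF>>>", "op": False}
    if t == "class":
        v = "." + v.lstrip(".")   # ensure leading dot
    elif t == "id":
        v = "#" + v.lstrip("#")   # ensure leading hash
    return {"type": TYPE_MAP[t], "value": v, "op": t in PUNCT}
```

The `"template"` raw type is deliberately absent here: in API mode it never reaches the conversion, because templateMode emits the template's lexical parts directly.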
### Conversion entry point
```
(define (hs-api-tokens src template-mode?)
(let ((raw (if template-mode?
(hs-tokenize-template src) ; new variant
(hs-tokenize src))))
{:source src
:list (map hs-raw->api-token raw)
:pos 0}))
```
`hs-raw->api-token` is a pure mapping function using the tables above.
An EOF token is always present at the end (the current tokenizer
already emits one).
### Token gaps to fix
Three issues turn up while writing the map; all are trivial one-site
fixes in `tokenizer.sx`:
- **`:` is currently consumed as part of the local prefix
  (`:name`)**. Upstream expects a bare `:` to produce `COLON`; it
  combines into a local ref only when followed by an ident-start
  char. The bare form is exercised only by the operator table in
  test 12. Fix by emitting `"op" ":"` when the next char is not an
  ident start (the code already does this); the op-name map above
  then covers it.
- **`===` and `==`** — current tokenizer emits `"op" "="` plus another
`"="`, not `"=="`. Extend the `=`/`!`/`<`/`>` lookahead clause to
also match a third `=` after `==`.
- **Template mode** — upstream `tokenize(src, true)` splits
backtick-templates into their lexical parts rather than the single
`"template"` token the current code emits. Add a second top-level
scanner `hs-tokenize-template` used only for the API wrapper; the
primary parser continues to call `hs-tokenize` unchanged. The
template-mode tests (1, 15) only require character-level emission
of the `" $ { inner } "` sequence — no semantic re-use by the
parser.
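The `==`/`===` bullet boils down to longest-match-first (maximal munch) dispatch in the operator scanner. A minimal Python sketch of the idea, assuming only the four multi-char ops from the table above (`scan_op` is a hypothetical name, not the SX `scan!`):

```python
# Longest-match-first operator scan: try "===" before "==" before "="
# so "===" is never split into "==" + "=".
MULTI_OPS = ["===", "==", "<=", ">="]  # ordered longest first

def scan_op(src: str, i: int) -> tuple[str, int]:
    """Return (operator text, index after it) for the op at src[i]."""
    for op in MULTI_OPS:
        if src.startswith(op, i):
            return op, i + len(op)
    return src[i], i + 1  # single-char operator fallback
```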
### Stream consumer helpers
```
(define (hs-stream-token s i)
(let ((list (dict-get s :list))
(pos (dict-get s :pos)))
(or (nth list (+ pos i))
(hs-eof-sentinel))))
(define (hs-stream-consume s)
(let ((tok (hs-stream-token s 0)))
(when (not (= (hs-token-type tok) "EOF"))
(dict-set! s :pos (+ (dict-get s :pos) 1)))
tok))
(define (hs-stream-has-more s)
(not (= (hs-token-type (hs-stream-token s 0)) "EOF")))
```
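The cursor semantics of these three helpers can be pinned down with a small executable model. This Python class is illustrative only (the SX versions operate on the mutable dict shown earlier; `TokenStream` is a hypothetical name):

```python
# Executable model of the stream cursor: token(i) is pure lookahead,
# consume_token() advances the cursor but never moves past EOF.
EOF = {"type": "EOF", "value": "<<<EOF>>>", "op": False}

class TokenStream:
    """Mutable cursor over an upstream-shaped token list."""
    def __init__(self, tokens):
        self.list = tokens
        self.pos = 0

    def token(self, i):
        # i-th un-consumed token; EOF sentinel past the end of the list.
        j = self.pos + i
        return self.list[j] if j < len(self.list) else EOF

    def consume_token(self):
        tok = self.token(0)
        if tok["type"] != "EOF":
            self.pos += 1
        return tok

    def has_more(self):
        return self.token(0)["type"] != "EOF"
```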
## 5. Test mock strategy
All 17 tests are `complexity: eval-only` with empty `html`. They do
not need the DOM runner — they only need SX expressions that resolve
to the same values the JS asserts check.
Add a generator pattern to `generate-sx-tests.py`, slotted into
`generate_eval_only_test` or as a new pre-pass ahead of it, that
matches bodies containing `_hyperscript.internals.tokenizer.tokenize`.
The pattern tree, by precedence:
1. `tokenize(SRC[, true])` → emit an SX `let` that binds a
fresh stream name to `(hs-tokens-of SRC [:template])`.
2. `<stream>.consumeToken()` → `(hs-stream-consume <stream>)`.
3. `<stream>.token(N)` → `(hs-stream-token <stream> N)`.
4. `<stream>.list.length` → `(len (dict-get <stream> :list))`.
5. `<stream>.list[N]` → `(nth (dict-get <stream> :list) N)`.
6. `<stream>.list` → `(dict-get <stream> :list)` (the longer `.list…`
   forms above must match first).
7. `<stream>.hasMore()` → `(hs-stream-has-more <stream>)`.
8. `<tok>.type` / `.value` / `.op` → `(hs-token-type/value/op? <tok>)`.
9. `expect(X).toBe(V)` and `expect(X).toEqual({...})` → `assert=`.
10. `try { ... } catch (e) { errors.push(e.message) }` plus
`expect(msg).toMatch(/pat/)` → `(assert (regex-match? pat (guard-msg (hs-stream-consume s))))`.
A tiny `guard-msg` helper runs the expr under `guard` and returns
the caught error's message.
The generator should emit a new deftest prologue:
```
(deftest "<name>"
(let ((s1 (hs-tokens-of "<src1>"))
(s2 (hs-tokens-of "<src2>" :template)))
(assert= (hs-token-type (hs-stream-consume s1)) "NUMBER")
...))
```
When the test builds a `results` object/array of `{type, value}`
dicts, emit one `assert=` per field instead of materialising a dict —
simpler to debug when it fails. `toEqual({type: "X", value: "Y"})`
becomes two `assert=` lines.
The generator continues to bail (`return None` / emit
`SKIP (untranslated)`) if any unrecognised JS shape appears; the 17
bodies all fit the grammar above.
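A hedged sketch of the recognition pass in the generator's own language. The regexes and the `rewrite` helper are hypothetical, not the actual `generate-sx-tests.py` code, and cover only single-expression chains (the real pattern must also handle the `.list` forms and the try/catch shape):

```python
import re

# Match tokenize("src") or tokenize("src", true) with a literal source.
TOKENIZE_RE = re.compile(
    r"_hyperscript\.internals\.tokenizer\.tokenize\("
    r"(?P<src>\"[^\"]*\"|'[^']*')"
    r"(?:,\s*(?P<tmpl>true))?\)")

# (JS chain suffix, SX wrapper); {e} = expression so far, {n} = index arg.
CHAIN = [
    (re.compile(r"\.consumeToken\(\)"), "(hs-stream-consume {e})"),
    (re.compile(r"\.token\((\d+)\)"),   "(hs-stream-token {e} {n})"),
    (re.compile(r"\.hasMore\(\)"),      "(hs-stream-has-more {e})"),
    (re.compile(r"\.type\b"),           "(hs-token-type {e})"),
    (re.compile(r"\.value\b"),          "(hs-token-value {e})"),
    (re.compile(r"\.op\b"),             "(hs-token-op? {e})"),
]

def rewrite(js_expr: str):
    """Translate one tokenize(...).chain expression to SX, or None."""
    m = TOKENIZE_RE.match(js_expr)
    if not m:
        return None
    mode = " :template" if m.group("tmpl") else ""
    expr = f"(hs-tokens-of {m.group('src')}{mode})"
    rest = js_expr[m.end():]
    while rest:
        for pat, fmt in CHAIN:
            hit = pat.match(rest)
            if hit:
                n = hit.group(1) if hit.groups() else ""
                expr = fmt.format(e=expr, n=n)
                rest = rest[hit.end():]
                break
        else:
            return None  # unrecognised shape: leave as SKIP (untranslated)
    return expr
```

Anything the chain walker cannot consume falls through to `None`, preserving the existing bail-out behaviour.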
## 6. Test delta estimate
| # | Test | Feasible? | Blockers |
|---|------|-----------|----------|
| 1 | handles $ in template properly | yes | templateMode impl |
| 2 | handles all special escapes | yes | extend `read-string` escapes (+4 cases) |
| 3 | handles basic token types | yes | type-map + scientific-notation float (already in `read-number`? verify) |
| 4 | handles class identifiers | yes | type-map + `.list[i]` access |
| 5 | handles comments properly | yes | type-map; `//` comments already handled, `--` not — add |
| 6 | handles hex escapes | yes | new `\xNN` escape + structured error |
| 7 | handles id references | yes | mirror of 4 |
| 8 | handles identifiers properly | yes | type-map only |
| 9 | handles identifiers with numbers | yes | type-map only |
| 10 | handles look ahead property | yes | EOF sentinel with `"<<<EOF>>>"` value |
| 11 | handles numbers properly | yes | verify `1.1.1` scan stops at the second dot (already appears OK) |
| 12 | handles operators properly | yes | op-name map, `==`/`===`/`<=`/`>=` lookahead |
| 13 | handles strings properly | yes | structured unterminated-string error |
| 14 | handles strings properly 2 | yes | subset of 13 |
| 15 | handles template bootstrap | yes | templateMode lexical emission |
| 16 | handles whitespace properly | yes | type-map only |
| 17 | string interpolation isnt surprising | yes | already-translatable; needs `read-template` `\$`/`\${` escape |
Expected: **+16 to +17**. Test 17 is already runnable (it is the one
non-eval-only case) but depends on template-escape handling that lives
in the same commit.
## 7. Risks / open questions
- **Position tracking.** The tokenizer currently stores `:pos P`. Tests
do not read it, so we leave it alone. E38 (SourceInfo API) will add
`start/end/line/column`; when that lands, `hs-raw->api-token` should
copy those through.
- **Template mode churn.** Introducing `hs-tokenize-template` risks
divergence from the main tokenizer. Mitigation: factor shared scan
helpers (whitespace, identifier, operator dispatch) into named
functions both variants call; keep the template variant a thin
wrapper that only overrides the backtick handler.
- **Keyword vs identifier type.** The current code tags reserved words
as `"keyword"`; upstream tags every bare word as `IDENTIFIER`. The
conformance tests always expect `IDENTIFIER`. Mapping both
`"keyword"` and `"ident"` to `"IDENTIFIER"` in the API layer is
safe and does **not** affect the parser, which consumes the raw
stream, not the API stream.
- **Mutable streams.** The API stream is intentionally mutable (cursor
advances on `consumeToken`). SX dicts are mutable via `dict-set!`
today; this is consistent with the rest of the hyperscript runtime,
which uses mutable dicts in `hs-activate!` and the event loop.
- **Do any existing tests depend on token shape?** `parser.sx` reads
`:type :value :pos`. It must **not** see the API-shaped dicts. The
API is strictly additive — `hs-tokenize` is unchanged; `hs-parse`
continues to consume its output directly. Only `hs-api-tokens`
(and its consumers) sees the upstream-shaped dicts.
- **Error-message contract.** Upstream throws on unterminated strings
and bad hex escapes. We currently return an EOF and emit a
trailing fragment. Adding a thrown error is new behaviour; confirm
the parser callers in `hs-compile` still produce useful diagnostics
when the tokenizer raises rather than eats the input.
- **`.list` indexing semantics.** Upstream tests read `.list[3]` and
`.list[4]` directly — these indices reference upstream's raw
token layout. If our SX tokenizer emits a slightly different
layout (e.g. extra whitespace-related tokens, or none where
upstream has one), the index tests fail even though `.type`/`.value`
are correct. Verify on a spike before committing: run
`(hs-tokens-of "(a).a")` and check that index 4 is the
`CLASS_REF`. If indices disagree, add a normalization pass that
strips tokens upstream omits.
## 8. Implementation checklist
Ordered smallest-first; each is its own commit.
1. **Add `hs-api-tokens` and token helpers** (`lib/hyperscript/runtime.sx`).
Includes `hs-raw->api-token`, type-map, op-name table,
`hs-stream-token/consume/has-more`, EOF sentinel with
`"<<<EOF>>>"` value. No test delta yet — API-only.
2. **Extend string-escape table** in `read-string` (tokenizer):
add `\b \f \r \v \xNN`, keep existing `\n \t \\ <quote>`. Emit
structured error message `"Invalid hexadecimal escape: ..."` or
`"Unterminated string"`. Unlocks tests 2, 6, 13, 14.
3. **Add `==` / `===` / `<=` / `>=` lookahead** in the tokenizer's
   `scan!`. Currently only `[=!<>]=` is matched. Unlocks test 12.
4. **Add `--` line-comment support** to `scan!`. Currently only `//`
   (through selector disambiguation) is handled. Unlocks test 5.
5. **Add `hs-tokenize-template`** variant for template-bootstrap
lexical mode. Shared scan helpers extracted. Unlocks tests 1, 15.
6. **Generator pattern** in `tests/playwright/generate-sx-tests.py`:
recognise `_hyperscript.internals.tokenizer.tokenize(src[, true])`
+ consumer chain, emit SX `deftest` using the helpers from step 1.
Unlocks the 16 remaining eval-only tests (test 17 already has DOM
shape).
7. **Regenerate `spec/tests/test-hyperscript-behavioral.sx`** and run
`mcp__hs-test__hs_test_run(suite="hs-upstream-core/tokenizer")`.
Expected: 17/17, with test 17 also passing thanks to step 2's
escape fixes (it depends on `\$` / `\${` in `read-template`).
8. **Update** `plans/hs-conformance-to-100.md` row 37 to
`done (+17)` and tick the scoreboard in the same commit.
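Step 2's escape decoding is easy to get subtly wrong, so it is worth validating in isolation. A hypothetical Python model of the decode rule (the SX `read-string` would do the equivalent; `decode_escape` is an illustrative name):

```python
# Decode one backslash escape; i points at the char after the backslash.
# Mirrors the escape set step 2 adds: \b \f \n \r \t \v plus \xNN.
ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r",
           "t": "\t", "v": "\v", "\\": "\\", '"': '"', "'": "'"}
HEX_DIGITS = "0123456789abcdefABCDEF"

def decode_escape(src: str, i: int) -> tuple[str, int]:
    """Return (decoded char, index after the escape sequence)."""
    c = src[i]
    if c in ESCAPES:
        return ESCAPES[c], i + 1
    if c == "x":
        digits = src[i + 1:i + 3]
        if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
            raise ValueError(f"Invalid hexadecimal escape: \\x{digits}")
        return chr(int(digits, 16)), i + 3
    return c, i + 1  # unknown escape: pass the char through unchanged
```

The error string matches the `/Invalid hexadecimal escape/` pattern test 6 asserts against.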
Work stays inside `lib/hyperscript/**`, `shared/static/wasm/sx/hs-*`,
`tests/playwright/generate-sx-tests.py`, and the two plan files —
matching the scope rule in the conformance plan.
`shared/static/wasm/sx/hs-runtime.sx` must be re-copied after each
runtime edit.