rose-ash/plans/designs/e37-tokenizer-api.md
giles 87cafaaa3f HS-design: E37 Tokenizer-as-API
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:08:02 +00:00


E37 — Tokenizer-as-API

Cluster 37 of plans/hs-conformance-to-100.md. 17 tests in hs-upstream-core/tokenizer. All 17 are emitted as SKIP (untranslated) by tests/playwright/generate-sx-tests.py: the JS bodies call _hyperscript.internals.tokenizer.tokenize(...) and inspect a token-stream surface the SX port does not expose.

Work breaks into: (1) an SX API over the existing hs-tokenize mimicking the upstream stream object; (2) a compatibility shim over token fields; (3) a generator pattern recognising _hyperscript.internals.tokenizer.tokenize(src[, templateMode]). No tokenizer-grammar rewrite is required. Position tracking (start/end/line/column) is scoped to E38 (SourceInfo API).

1. Failing tests

Every eval-only test calls _hyperscript.internals.tokenizer.tokenize plus one or more of .token(i), .consumeToken(), .hasMore(), .list, .type, .value, .op.

  1. handles $ in template properly — tokenize('"', true).token(0).value → '"'. templateMode + token(i).
  2. handles all special escapes — 6 × tokenize('"\\X"').consumeToken().value for \b \f \n \r \t \v.
  3. handles basic token types — 15 asserts for IDENTIFIER NUMBER CLASS_REF ID_REF STRING; includes 1e6, 1e-6, 1.1e6, 1.1e-6; plus .hasMore().
  4. handles class identifiers — 9 .a-style asserts; uses .consumeToken() and .list[3]/.list[4].
  5. handles comments properly — 13 asserts on tokenize(src).list.length; -- / // to EOL emit nothing.
  6. handles hex escapes — 3 \\xNN decodes + 4 error-path asserts matching /Invalid hexadecimal escape/.
  7. handles id references — mirror of 4 for #a → ID_REF.
  8. handles identifiers properly — whitespace + comment skipping between multiple consumeToken() calls.
  9. handles identifiers with numbers — f1oo / fo1o / foo1 → IDENTIFIER.
  10. handles look ahead property — tokenize("a 1 + 1").token(0..4) → ["a" "1" "+" "1" "<<<EOF>>>"].
  11. handles numbers properly — 8 asserts incl. 1.1.1 → NUMBER PERIOD NUMBER.
  12. handles operators properly — iterates 27 ops (+ - * . \\ : % | ! ? # & ; , ( ) < > { } [ ] = <= >= == ===) asserting token.op === true and token.value === key.
  13. handles strings properly — single/double quotes, embedded other-quote, escaped same-quote, + two unterminated throws matching /Unterminated string/.
  14. handles strings properly 2 — subset of 13.
  15. handles template bootstrap — 5 tokenize(src, true) cases asserting the lexical char-level stream (", $, {, inner, }, ").
  16. handles whitespace properly — 16 asserts on .list.length for space / \n / \r / \t.
  17. string interpolation isnt surprising — DOM-shaped (not eval-only); asserts \$/\${ escapes in templates. Touches read-template, not the stream API.

2. Upstream API shape

From https://hyperscript.org/docs/#api and node_modules/hyperscript.org/src/_hyperscript.js:

const tokens = _hyperscript.internals.tokenizer.tokenize(src, templateMode?)
//   → { list, source, hasMore, matchTokenType, token, consumeToken,
//       requireTokenType, ... }
tokens.list           // Array<Token> — lookahead window
tokens.source         // original src string
tokens.token(i)       // i-th un-consumed token (0 = current); returns
                      //   { type: "EOF", value: "<<<EOF>>>" } past end
tokens.consumeToken() // shift + return; throws on empty for required
tokens.hasMore()      // true if a non-EOF token remains
tokens.matchTokenType(type) / requireTokenType(type) / etc.

Each Token is:

{
  type:   "IDENTIFIER" | "NUMBER" | "STRING" | "CLASS_REF"
        | "ID_REF" | "EOF" | "PLUS" | "MINUS" | ... /* op names */,
  value:  string,
  op:     boolean,   // true for punctuation/operator tokens
  start:  number,    // char offset
  end:    number,
  line:   number,
  column: number,
  source: string,    // reference to full src
}

The conformance tests only read type, value, op, and occasionally index directly into .list. They never read start/end/line/column, so position tracking is not required for cluster E37.
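The stream contract above fits in a few lines. A Python model of the semantics — illustrative only, not the upstream source; class and sentinel names are hypothetical:

```python
# Model of the upstream stream contract: token(i) looks ahead without
# consuming, consumeToken() shifts, and reads past the end yield an EOF
# sentinel rather than raising.
EOF = {"type": "EOF", "value": "<<<EOF>>>", "op": False}

class TokenStream:
    def __init__(self, tokens, source):
        self.list = list(tokens)   # lookahead window
        self.source = source       # original src string
        self._pos = 0              # cursor into self.list

    def token(self, i):
        """i-th un-consumed token (0 = current); EOF sentinel past end."""
        j = self._pos + i
        return self.list[j] if j < len(self.list) else EOF

    def consumeToken(self):
        tok = self.token(0)
        if tok["type"] != "EOF":
            self._pos += 1
        return tok

    def hasMore(self):
        return self.token(0)["type"] != "EOF"
```

Note that consuming past the end is a no-op returning EOF, which is what makes `token(4)` on a three-token stream safe in test 10.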

3. Proposed SX surface

Add three things to lib/hyperscript/runtime.sx (exposed by name, so SX test bodies can call them directly through eval-hs or assert=):

(hs-tokens-of src)              ; => dict — new token-stream object
(hs-tokens-of src :template)    ; templateMode variant
(hs-token-type tok)             ; upstream-style type name
(hs-token-value tok)            ; string value
(hs-token-op? tok)              ; bool

A token stream is a mutable dict:

{ :source  src
  :list    (list-of-tokens)   ; upstream-shaped, :type :value :op
  :pos     0 }                ; cursor into :list

With three pure-SX consumer helpers:

(hs-stream-token  stream i)   ; lookahead; returns EOF sentinel past end
(hs-stream-consume stream)    ; returns current token, advances :pos
(hs-stream-has-more stream)   ; not EOF and pos < len

Worked example

(let ((s (hs-tokens-of "1.1")))
  (hs-token-type (hs-stream-consume s)))        ; => "NUMBER"

(let ((s (hs-tokens-of "a 1 + 1")))
  (list (hs-token-value (hs-stream-token s 0))   ; "a"
        (hs-token-value (hs-stream-token s 4)))) ; "<<<EOF>>>"

All helpers are ordinary defines — no platform primitives, no FFI. The generator emits them as bare calls inside deftest bodies.

4. Runtime architecture

The existing hs-tokenize emits tokens with:

{ :type  "keyword" | "ident" | "number" | "string" | "class" | "id"
       | "op" | "paren-open" | ... | "eof"
  :value V
  :pos   P }

The upstream contract uses SCREAMING_SNAKE_CASE and a dedicated boolean .op flag rather than a merged type/punctuation taxonomy. Rather than rewrite the tokenizer, add a translation layer.

Type map (SX-native → upstream)

"ident"         → "IDENTIFIER"           (keywords too: see note)
"keyword"       → "IDENTIFIER"           (upstream tokenizes keywords as idents)
"number"        → "NUMBER"
"string"        → "STRING"
"class"         → "CLASS_REF"            (:value becomes ".a" with leading dot)
"id"            → "ID_REF"               (:value becomes "#a" with leading hash)
"attr"          → "ATTRIBUTE_REF"
"style"         → "STYLE_REF"
"selector"      → "QUERY_REF"            (upstream's name; not exercised by these 17 tests)
"template"      → one-shot: see templateMode below
"eof"           → "EOF"   with :value "<<<EOF>>>"
"paren-open"    → "L_PAREN"     + :op true
"paren-close"   → "R_PAREN"     + :op true
"bracket-open"  → "L_BRACKET"   + :op true
"bracket-close" → "R_BRACKET"   + :op true
"brace-open"    → "L_BRACE"     + :op true
"brace-close"   → "R_BRACE"      + :op true
"comma"         → "COMMA"       + :op true
"dot"           → "PERIOD"      + :op true
"op"            → name-by-value lookup (see below) + :op true

A tiny op-name table (15–25 entries) maps :value strings to the upstream token type name:

"+"   → "PLUS"
"-"   → "MINUS"
"*"   → "MULTIPLY"
"/"   → "SLASH"        ; current code uses "op"/"/"
":"   → "COLON"        ; not yet emitted as own token — fix below
"%"   → "PERCENT"
"|"   → "PIPE"
"!"   → "EXCLAMATION"
"?"   → "QUESTION"
"#"   → "POUND"
"&"   → "AMPERSAND"
";"   → "SEMI"
"="   → "EQUALS"
"<"   → "L_ANG"
">"   → "R_ANG"
"<="  → "LTE_ANG"
">="  → "GTE_ANG"
"=="  → "EQ"
"===" → "EQQ"
"\\"  → "BACKSLASH"
"'s"  → "APOSTROPHE_S" ; not a true operator — elided from test 12
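In Python terms the two tables compose as below — purely illustrative (the real mapping is the SX hs-raw->api-token); only a few rows of each table are shown, and the assumption that raw class/id tokens carry their value without the sigil follows the notes in the type map above:

```python
# Hypothetical sketch of the raw -> upstream-shaped token translation.
TYPE_MAP = {
    "ident": "IDENTIFIER", "keyword": "IDENTIFIER",
    "number": "NUMBER", "string": "STRING",
    "class": "CLASS_REF", "id": "ID_REF",
    "paren-open": "L_PAREN", "paren-close": "R_PAREN",
    "comma": "COMMA", "dot": "PERIOD",
}
PUNCT = {"paren-open", "paren-close", "comma", "dot"}  # :op true types
OP_NAMES = {"+": "PLUS", "-": "MINUS", "*": "MULTIPLY", "/": "SLASH",
            "==": "EQ", "===": "EQQ", "<=": "LTE_ANG", ">=": "GTE_ANG"}

def raw_to_api_token(tok):
    t, v = tok["type"], tok["value"]
    if t == "op":
        return {"type": OP_NAMES[v], "value": v, "op": True}
    if t == "eof":
        return {"type": "EOF", "value": "<<<EOF>>>", "op": False}
    if t == "class":
        v = "." + v   # :value becomes ".a" with leading dot
    elif t == "id":
        v = "#" + v   # :value becomes "#a" with leading hash
    return {"type": TYPE_MAP[t], "value": v, "op": t in PUNCT}
```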

Conversion entry point

(define (hs-api-tokens src template-mode?)
  (let ((raw (if template-mode?
                 (hs-tokenize-template src)    ; new variant
                 (hs-tokenize src))))
    {:source  src
     :list    (map hs-raw->api-token raw)
     :pos     0}))

hs-raw->api-token is a pure mapping function using the tables above. An EOF token is always present at the end (the current tokenizer already emits one).

Token gaps to fix

Three issues turn up while writing the map; all are trivial one-site fixes in tokenizer.sx:

  • : is currently consumed as part of the local prefix (:name). Upstream tests expect bare : alone to produce COLON; only when followed by ident-start does it combine. The test suite does not exercise the bare form (it is only covered by the operator table in test 12). Fix by emitting "op" ":" when the next char is not an ident start — already what the code does; the op-name map above covers it.
  • === and == — current tokenizer emits "op" "=" plus another "=", not "==". Extend the =/!/</> lookahead clause to also match a third = after ==.
  • Template mode — upstream tokenize(src, true) splits backtick-templates into their lexical parts rather than the single "template" token the current code emits. Add a second top-level scanner hs-tokenize-template used only for the API wrapper; the primary parser continues to call hs-tokenize unchanged. The template-mode tests (1, 15) only require character-level emission of the " $ { inner } " sequence — no semantic re-use by the parser.
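The == / === gap in the second bullet is a maximal-munch lookahead. A sketch in Python for clarity — the actual change lives in tokenizer.sx's scan loop, and the helper name here is hypothetical:

```python
def scan_equals(src, i):
    """At src[i] == '=', consume the longest of '===', '==', '='.

    Returns (token, next_index). Maximal munch: keep taking '=' up to
    three characters, then name the lexeme.
    """
    names = {"=": "EQUALS", "==": "EQ", "===": "EQQ"}
    n = 1
    while n < 3 and i + n < len(src) and src[i + n] == "=":
        n += 1
    lexeme = src[i:i + n]
    return {"type": names[lexeme], "value": lexeme, "op": True}, i + n
```

The same shape extends to the < / > / ! starters, which only ever take one trailing =.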

Stream consumer helpers

(define (hs-stream-token s i)
  (let ((list (dict-get s :list))
        (pos  (dict-get s :pos)))
    (or (nth list (+ pos i))
        (hs-eof-sentinel))))

(define (hs-stream-consume s)
  (let ((tok (hs-stream-token s 0)))
    (when (not (= (hs-token-type tok) "EOF"))
      (dict-set! s :pos (+ (dict-get s :pos) 1)))
    tok))

(define (hs-stream-has-more s)
  (not (= (hs-token-type (hs-stream-token s 0)) "EOF")))

5. Test mock strategy

All 17 tests are complexity: eval-only with empty html. They do not need the DOM runner — they only need SX expressions that resolve to the same values the JS asserts check.

Add a generator pattern to generate-sx-tests.py, slotted into generate_eval_only_test or as a new pre-pass ahead of it, that matches bodies containing _hyperscript.internals.tokenizer.tokenize. The pattern tree, by precedence:

  1. tokenize(SRC[, true]) → emit an SX let that binds a fresh stream name to (hs-tokens-of SRC [:template]).
  2. <stream>.consumeToken() → (hs-stream-consume <stream>).
  3. <stream>.token(N) → (hs-stream-token <stream> N).
  4. <stream>.list → (dict-get <stream> :list).
  5. <stream>.list.length → (len (dict-get <stream> :list)).
  6. <stream>.list[N] → (nth (dict-get <stream> :list) N).
  7. <stream>.hasMore() → (hs-stream-has-more <stream>).
  8. <tok>.type / .value / .op → (hs-token-type/value/op? <tok>).
  9. expect(X).toBe(V) and expect(X).toEqual({...}) → assert=.
  10. try { ... } catch (e) { errors.push(e.message) } plus expect(msg).toMatch(/pat/) → (assert (regex-match? pat (guard-msg (hs-stream-consume s)))). A tiny guard-msg helper runs the expr under guard and returns the caught error's message.
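A toy version of the core match, in the generator's language. It nests the SX form directly instead of binding the stream in a let, covers only a few chain rules, and the regex and function names are illustrative, not the actual generate-sx-tests.py code:

```python
import re

# Recognise _hyperscript.internals.tokenizer.tokenize("src"[, true])
# followed by an optional consumer chain.
TOKENIZE_RE = re.compile(
    r'_hyperscript\.internals\.tokenizer\.tokenize\('
    r'(?P<src>"(?:[^"\\]|\\.)*")(?P<tmpl>,\s*true)?\)'
    r'(?P<chain>(?:\.\w+\(\d*\)|\.\w+)*)'
)

CHAIN_MAP = {
    ".consumeToken()": "(hs-stream-consume {s})",
    ".hasMore()":      "(hs-stream-has-more {s})",
    ".type":           "(hs-token-type {s})",
    ".value":          "(hs-token-value {s})",
    ".op":             "(hs-token-op? {s})",
}

def translate(js_expr):
    m = TOKENIZE_RE.search(js_expr)
    if m is None:
        return None  # unrecognised shape: caller emits SKIP (untranslated)
    tmpl = " :template" if m.group("tmpl") else ""
    sx = f'(hs-tokens-of {m.group("src")}{tmpl})'
    for step in re.findall(r'\.\w+\(\d*\)|\.\w+', m.group("chain")):
        if step.startswith(".token("):
            sx = f'(hs-stream-token {sx} {step[7:-1]})'
        elif step in CHAIN_MAP:
            sx = CHAIN_MAP[step].format(s=sx)
        else:
            return None  # e.g. .list indexing: handled by other rules
    return sx
```

For example, `tokenize("1.1").consumeToken().value` comes out as `(hs-token-value (hs-stream-consume (hs-tokens-of "1.1")))`.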

The generator should emit a new deftest prologue:

  (deftest "<name>"
    (let ((s1 (hs-tokens-of "<src1>"))
          (s2 (hs-tokens-of "<src2>" :template)))
      (assert= (hs-token-type (hs-stream-consume s1)) "NUMBER")
      ...))

When the test builds a results object/array of {type, value} dicts, emit one assert= per field instead of materialising a dict — simpler to debug when it fails. toEqual({type: "X", value: "Y"}) becomes two assert= lines.
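The toEqual expansion could look like the following — an illustrative helper, not the actual generator code:

```python
def expand_to_equal(sx_expr, expected):
    """Turn expect(X).toEqual({type: ..., value: ...}) into one assert=
    per field, so a failure names the exact mismatching field."""
    field_fns = {"type": "hs-token-type", "value": "hs-token-value",
                 "op": "hs-token-op?"}
    lines = []
    for field, want in expected.items():
        # Render the expected value as an SX literal (bool or string).
        want_sx = ("true" if want is True
                   else "false" if want is False
                   else f'"{want}"')
        lines.append(f'(assert= ({field_fns[field]} {sx_expr}) {want_sx})')
    return lines
```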

The generator continues to bail (return None / emit SKIP (untranslated)) if any unrecognised JS shape appears; the 17 bodies all fit the grammar above.

6. Test delta estimate

| #  | Test                                 | Feasible? | Blockers |
|----|--------------------------------------|-----------|----------|
| 1  | handles $ in template properly       | yes       | templateMode impl |
| 2  | handles all special escapes          | yes       | extend read-string escapes (+4 cases) |
| 3  | handles basic token types            | yes       | type-map + scientific-notation float (already in read-number? verify) |
| 4  | handles class identifiers            | yes       | type-map + .list[i] access |
| 5  | handles comments properly            | yes       | type-map; // comments already handled, -- not — add |
| 6  | handles hex escapes                  | yes       | new \xNN escape + structured error |
| 7  | handles id references                | yes       | mirror of 4 |
| 8  | handles identifiers properly         | yes       | type-map only |
| 9  | handles identifiers with numbers     | yes       | type-map only |
| 10 | handles look ahead property          | yes       | EOF sentinel with "<<<EOF>>>" value |
| 11 | handles numbers properly             | yes       | fix 1.1.1 scan (stop at second dot); already appears OK |
| 12 | handles operators properly           | yes       | op-name map, ==/===/<=/>= lookahead |
| 13 | handles strings properly             | yes       | structured unterminated-string error |
| 14 | handles strings properly 2           | yes       | subset of 13 |
| 15 | handles template bootstrap           | yes       | templateMode lexical emission |
| 16 | handles whitespace properly          | yes       | type-map only |
| 17 | string interpolation isnt surprising | yes       | already translatable; needs read-template \$/\${ escape |

Expected: +16 to +17. Test 17 is already runnable (it is the one non-eval-only case) but depends on template-escape handling that lives in the same commit.

7. Risks / open questions

  • Position tracking. The tokenizer currently stores :pos P. Tests do not read it, so we leave it alone. E38 (SourceInfo API) will add start/end/line/column; when that lands, hs-raw->api-token should copy those through.
  • Template mode churn. Introducing hs-tokenize-template risks divergence from the main tokenizer. Mitigation: factor shared scan helpers (whitespace, identifier, operator dispatch) into named functions both variants call; keep the template variant a thin wrapper that only overrides the backtick handler.
  • Keyword vs identifier type. The current code tags reserved words as "keyword"; upstream tags every bare word as IDENTIFIER. The conformance tests always expect IDENTIFIER. Mapping both "keyword" and "ident" to "IDENTIFIER" in the API layer is safe and does not affect the parser, which consumes the raw stream, not the API stream.
  • Mutable streams. The API stream is intentionally mutable (cursor advances on consumeToken). SX dicts are mutable via dict-set! today; this is consistent with the rest of the hyperscript runtime, which uses mutable dicts in hs-activate! and the event loop.
  • Do any existing tests depend on token shape? parser.sx reads :type :value :pos. It must not see the API-shaped dicts. The API is strictly additive — hs-tokenize is unchanged; hs-parse continues to consume its output directly. Only hs-api-tokens (and its consumers) sees the upstream-shaped dicts.
  • Error-message contract. Upstream throws on unterminated strings and bad hex escapes. We currently return an EOF and emit a trailing fragment. Adding a thrown error is new behaviour; confirm the parser callers in hs-compile still produce useful diagnostics when the tokenizer raises rather than eats the input.
  • .list indexing semantics. Upstream tests read .list[3] and .list[4] directly — these indices reference upstream's raw token layout. If our SX tokenizer emits a slightly different layout (e.g. extra whitespace-related tokens, or none where upstream has one), the index tests fail even though .type/.value are correct. Verify on a spike before committing: run (hs-tokens-of "(a).a") and check that index 4 is the CLASS_REF. If indices disagree, add a normalization pass that strips tokens upstream omits.

8. Implementation checklist

Ordered smallest-first; each is its own commit.

  1. Add hs-api-tokens and token helpers (lib/hyperscript/runtime.sx). Includes hs-raw->api-token, type-map, op-name table, hs-stream-token/consume/has-more, EOF sentinel with "<<<EOF>>>" value. No test delta yet — API-only.
  2. Extend string-escape table in read-string (tokenizer): add \b \f \r \v \xNN, keep existing \n \t \\ <quote>. Emit structured error message "Invalid hexadecimal escape: ..." or "Unterminated string". Unlocks tests 2, 6, 13, 14.
  3. Add == / === / <= / >= lookahead in tokenizer scan!. Currently only [=!<>]= is matched. Unlocks test 12.
  4. Add -- line-comment support to scan!. Currently only // (through selector disambiguation) is handled. Unlocks test 5.
  5. Add hs-tokenize-template variant for template-bootstrap lexical mode. Shared scan helpers extracted. Unlocks tests 1, 15.
  6. Generator pattern in tests/playwright/generate-sx-tests.py: recognise the _hyperscript.internals.tokenizer.tokenize(src[, true]) consumer chain and emit an SX deftest using the helpers from step 1. Unlocks the 16 remaining eval-only tests (test 17 already has DOM shape).
  7. Regenerate spec/tests/test-hyperscript-behavioral.sx and run mcp__hs-test__hs_test_run(suite="hs-upstream-core/tokenizer"). Expected: 17/17, with test 17 also passing thanks to step 2's escape fixes (it depends on \$ / \${ in read-template).
  8. Update plans/hs-conformance-to-100.md row 37 to done (+17) and tick the scoreboard in the same commit.
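Step 2's escape handling, sketched in Python — error messages follow the regexes the tests match against; the helper name and exact offsets are hypothetical, the real change is in read-string in tokenizer.sx:

```python
SIMPLE_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r",
                  "t": "\t", "v": "\v", "\\": "\\", '"': '"', "'": "'"}

def decode_escape(src, i):
    """Decode the escape starting at src[i] (the char after the
    backslash). Returns (decoded_char, next_index); raises with
    messages matching the upstream test regexes on bad input."""
    if i >= len(src):
        raise ValueError("Unterminated string")
    c = src[i]
    if c in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[c], i + 1
    if c == "x":
        hex_digits = src[i + 1:i + 3]
        if len(hex_digits) < 2 or any(
                d not in "0123456789abcdefABCDEF" for d in hex_digits):
            raise ValueError(f"Invalid hexadecimal escape at {i}")
        return chr(int(hex_digits, 16)), i + 3
    return c, i + 1  # unknown escape: pass the char through
```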

Work stays inside lib/hyperscript/**, shared/static/wasm/sx/hs-*, tests/playwright/generate-sx-tests.py, and the two plan files — matching the scope rule in the conformance plan. shared/static/wasm/sx/hs-runtime.sx must be re-copied after each runtime edit.