rose-ash/plans/designs/e37-tokenizer-api.md
giles 87cafaaa3f HS-design: E37 Tokenizer-as-API
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:08:02 +00:00


E37 — Tokenizer-as-API

Cluster 37 of plans/hs-conformance-to-100.md. 17 tests in hs-upstream-core/tokenizer. All 17 are emitted as SKIP (untranslated) by tests/playwright/generate-sx-tests.py: the JS bodies call _hyperscript.internals.tokenizer.tokenize(...) and inspect a token-stream surface the SX port does not expose.

Work breaks into: (1) an SX API over the existing hs-tokenize mimicking the upstream stream object; (2) a compatibility shim over token fields; (3) a generator pattern recognising _hyperscript.internals.tokenizer.tokenize(src[, templateMode]). No tokenizer-grammar rewrite is required. Position tracking (start/end/line/column) is scoped to E38 (SourceInfo API).

1. Failing tests

Every eval-only test calls _hyperscript.internals.tokenizer.tokenize plus one or more of .token(i), .consumeToken(), .hasMore(), .list, .type, .value, .op.

  1. handles $ in template properly — tokenize('"', true).token(0).value → '"'. templateMode + token(i).
  2. handles all special escapes — 6 × tokenize('"\\X"').consumeToken().value for \b \f \n \r \t \v.
  3. handles basic token types — 15 asserts for IDENTIFIER NUMBER CLASS_REF ID_REF STRING; includes 1e6, 1e-6, 1.1e6, 1.1e-6; plus .hasMore().
  4. handles class identifiers — 9 .a-style asserts; uses .consumeToken() and .list[3]/.list[4].
  5. handles comments properly — 13 asserts on tokenize(src).list.length; -- / // to EOL emit nothing.
  6. handles hex escapes — 3 \\xNN decodes + 4 error-path asserts matching /Invalid hexadecimal escape/.
  7. handles id references — mirror of 4 for #a → ID_REF.
  8. handles identifiers properly — whitespace + comment skipping between multiple consumeToken() calls.
  9. handles identifiers with numbers — f1oo / fo1o / foo1 → IDENTIFIER.
  10. handles look ahead property — tokenize("a 1 + 1").token(0..4) → ["a" "1" "+" "1" "<<<EOF>>>"].
  11. handles numbers properly — 8 asserts incl. 1.1.1 → NUMBER PERIOD NUMBER.
  12. handles operators properly — iterates 27 ops (+ - * . \\ : % | ! ? # & ; , ( ) < > { } [ ] = <= >= == ===) asserting token.op === true and token.value === key.
  13. handles strings properly — single/double quotes, embedded other-quote, escaped same-quote, + two unterminated throws matching /Unterminated string/.
  14. handles strings properly 2 — subset of 13.
  15. handles template bootstrap — 5 tokenize(src, true) cases asserting the lexical char-level stream (", $, {, inner, }, ").
  16. handles whitespace properly — 16 asserts on .list.length for space / \n / \r / \t.
  17. string interpolation isnt surprising — DOM-shaped (not eval-only); asserts \$/\${ escapes in templates. Touches read-template, not the stream API.

2. Upstream API shape

From https://hyperscript.org/docs/#api and node_modules/hyperscript.org/src/_hyperscript.js:

const tokens = _hyperscript.internals.tokenizer.tokenize(src, templateMode?)
//   → { list, source, hasMore, matchTokenType, token, consumeToken,
//       requireTokenType, ... }
tokens.list           // Array<Token> — lookahead window
tokens.source         // original src string
tokens.token(i)       // i-th un-consumed token (0 = current); returns
                      //   { type: "EOF", value: "<<<EOF>>>" } past end
tokens.consumeToken() // shift + return; throws on empty for required
tokens.hasMore()      // true if a non-EOF token remains
tokens.matchTokenType(type) / requireTokenType(type) / etc.

Each Token is:

{
  type:   "IDENTIFIER" | "NUMBER" | "STRING" | "CLASS_REF"
        | "ID_REF" | "EOF" | "PLUS" | "MINUS" | ... /* op names */,
  value:  string,
  op:     boolean,   // true for punctuation/operator tokens
  start:  number,    // char offset
  end:    number,
  line:   number,
  column: number,
  source: string,    // reference to full src
}

The conformance tests only read type, value, op, and occasionally index directly into .list. They never read start/end/line/column, so position tracking is not required for cluster E37.
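The stream contract above fits in a few lines. A Python model of the semantics — illustrative only, not the upstream source; class and sentinel names are hypothetical:

```python
# Model of the upstream stream contract: token(i) looks ahead without
# consuming, consumeToken() shifts, and reads past the end yield an EOF
# sentinel rather than raising.
EOF = {"type": "EOF", "value": "<<<EOF>>>", "op": False}

class TokenStream:
    def __init__(self, tokens, source):
        self.list = list(tokens)   # lookahead window
        self.source = source       # original src string
        self._pos = 0              # cursor into self.list

    def token(self, i):
        """i-th un-consumed token (0 = current); EOF sentinel past end."""
        j = self._pos + i
        return self.list[j] if j < len(self.list) else EOF

    def consumeToken(self):
        tok = self.token(0)
        if tok["type"] != "EOF":
            self._pos += 1
        return tok

    def hasMore(self):
        return self.token(0)["type"] != "EOF"
```

Note that consuming past the end is a no-op returning EOF, which is what makes `token(4)` on a three-token stream safe in test 10.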

3. Proposed SX surface

Add three things to lib/hyperscript/runtime.sx (exposed by name, so SX test bodies can call them directly through eval-hs or assert=):

(hs-tokens-of src)              ; => dict — new token-stream object
(hs-tokens-of src :template)    ; templateMode variant
(hs-token-type tok)             ; upstream-style type name
(hs-token-value tok)            ; string value
(hs-token-op? tok)              ; bool

A token stream is a mutable dict:

{ :source  src
  :list    (list-of-tokens)   ; upstream-shaped, :type :value :op
  :pos     0 }                ; cursor into :list

With three pure-SX consumer helpers:

(hs-stream-token  stream i)   ; lookahead; returns EOF sentinel past end
(hs-stream-consume stream)    ; returns current token, advances :pos
(hs-stream-has-more stream)   ; not EOF and pos < len

Worked example

(let ((s (hs-tokens-of "1.1")))
  (hs-token-type (hs-stream-consume s)))        ; => "NUMBER"

(let ((s (hs-tokens-of "a 1 + 1")))
  (list (hs-token-value (hs-stream-token s 0))   ; "a"
        (hs-token-value (hs-stream-token s 4)))) ; "<<<EOF>>>"

All helpers are ordinary defines — no platform primitives, no FFI. The generator emits them as bare calls inside deftest bodies.

4. Runtime architecture

The existing hs-tokenize emits tokens with:

{ :type  "keyword" | "ident" | "number" | "string" | "class" | "id"
       | "op" | "paren-open" | ... | "eof"
  :value V
  :pos   P }

The upstream contract uses SCREAMING_SNAKE_CASE and a dedicated boolean .op flag rather than a merged type/punctuation taxonomy. Rather than rewrite the tokenizer, add a translation layer.

Type map (SX-native → upstream)

"ident"         → "IDENTIFIER"           (keywords too: see note)
"keyword"       → "IDENTIFIER"           (upstream tokenizes keywords as idents)
"number"        → "NUMBER"
"string"        → "STRING"
"class"         → "CLASS_REF"            (:value becomes ".a" with leading dot)
"id"            → "ID_REF"               (:value becomes "#a" with leading hash)
"attr"          → "ATTRIBUTE_REF"
"style"         → "STYLE_REF"
"selector"      → "QUERY_REF"            (upstream's name; not exercised by these 17 tests)
"template"      → one-shot: see templateMode below
"eof"           → "EOF"   with :value "<<<EOF>>>"
"paren-open"    → "L_PAREN"     + :op true
"paren-close"   → "R_PAREN"     + :op true
"bracket-open"  → "L_BRACKET"   + :op true
"bracket-close" → "R_BRACKET"   + :op true
"brace-open"    → "L_BRACE"     + :op true
"brace-close"   → "R_BRACE"      + :op true
"comma"         → "COMMA"       + :op true
"dot"           → "PERIOD"      + :op true
"op"            → name-by-value lookup (see below) + :op true

A tiny op-name table (15–25 entries) maps :value strings to the upstream token type name:

"+"   → "PLUS"
"-"   → "MINUS"
"*"   → "MULTIPLY"
"/"   → "SLASH"        ; current code uses "op"/"/"
":"   → "COLON"        ; not yet emitted as own token — fix below
"%"   → "PERCENT"
"|"   → "PIPE"
"!"   → "EXCLAMATION"
"?"   → "QUESTION"
"#"   → "POUND"
"&"   → "AMPERSAND"
";"   → "SEMI"
"="   → "EQUALS"
"<"   → "L_ANG"
">"   → "R_ANG"
"<="  → "LTE_ANG"
">="  → "GTE_ANG"
"=="  → "EQ"
"===" → "EQQ"
"\\"  → "BACKSLASH"
"'s"  → "APOSTROPHE_S" ; not a true operator — elided from test 12
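In Python terms the two tables compose as below — purely illustrative (the real mapping is the SX hs-raw->api-token); only a few rows of each table are shown, and the assumption that raw class/id tokens carry their value without the sigil follows the notes in the type map above:

```python
# Hypothetical sketch of the raw -> upstream-shaped token translation.
TYPE_MAP = {
    "ident": "IDENTIFIER", "keyword": "IDENTIFIER",
    "number": "NUMBER", "string": "STRING",
    "class": "CLASS_REF", "id": "ID_REF",
    "paren-open": "L_PAREN", "paren-close": "R_PAREN",
    "comma": "COMMA", "dot": "PERIOD",
}
PUNCT = {"paren-open", "paren-close", "comma", "dot"}  # :op true types
OP_NAMES = {"+": "PLUS", "-": "MINUS", "*": "MULTIPLY", "/": "SLASH",
            "==": "EQ", "===": "EQQ", "<=": "LTE_ANG", ">=": "GTE_ANG"}

def raw_to_api_token(tok):
    t, v = tok["type"], tok["value"]
    if t == "op":
        return {"type": OP_NAMES[v], "value": v, "op": True}
    if t == "eof":
        return {"type": "EOF", "value": "<<<EOF>>>", "op": False}
    if t == "class":
        v = "." + v   # :value becomes ".a" with leading dot
    elif t == "id":
        v = "#" + v   # :value becomes "#a" with leading hash
    return {"type": TYPE_MAP[t], "value": v, "op": t in PUNCT}
```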

Conversion entry point

(define (hs-api-tokens src template-mode?)
  (let ((raw (if template-mode?
                 (hs-tokenize-template src)    ; new variant
                 (hs-tokenize src))))
    {:source  src
     :list    (map hs-raw->api-token raw)
     :pos     0}))

hs-raw->api-token is a pure mapping function using the tables above. An EOF token is always present at the end (the current tokenizer already emits one).

Token gaps to fix

Three issues turn up while writing the map; all are trivial one-site fixes in tokenizer.sx:

  • : is currently consumed as part of the local prefix (:name). Upstream tests expect bare : alone to produce COLON; only when followed by ident-start does it combine. The test suite does not exercise the bare form (it is only covered by the operator table in test 12). Fix by emitting "op" ":" when the next char is not an ident start — already what the code does; the op-name map above covers it.
  • === and == — current tokenizer emits "op" "=" plus another "=", not "==". Extend the =/!/</> lookahead clause to also match a third = after ==.
  • Template mode — upstream tokenize(src, true) splits backtick-templates into their lexical parts rather than the single "template" token the current code emits. Add a second top-level scanner hs-tokenize-template used only for the API wrapper; the primary parser continues to call hs-tokenize unchanged. The template-mode tests (1, 15) only require character-level emission of the " $ { inner } " sequence — no semantic re-use by the parser.
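The == / === gap in the second bullet is a maximal-munch lookahead. A sketch in Python for clarity — the actual change lives in tokenizer.sx's scan loop, and the helper name here is hypothetical:

```python
def scan_equals(src, i):
    """At src[i] == '=', consume the longest of '===', '==', '='.

    Returns (token, next_index). Maximal munch: keep taking '=' up to
    three characters, then name the lexeme.
    """
    names = {"=": "EQUALS", "==": "EQ", "===": "EQQ"}
    n = 1
    while n < 3 and i + n < len(src) and src[i + n] == "=":
        n += 1
    lexeme = src[i:i + n]
    return {"type": names[lexeme], "value": lexeme, "op": True}, i + n
```

The same shape extends to the < / > / ! starters, which only ever take one trailing =.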

Stream consumer helpers

(define (hs-stream-token s i)
  (let ((list (dict-get s :list))
        (pos  (dict-get s :pos)))
    (or (nth list (+ pos i))
        (hs-eof-sentinel))))

(define (hs-stream-consume s)
  (let ((tok (hs-stream-token s 0)))
    (when (not (= (hs-token-type tok) "EOF"))
      (dict-set! s :pos (+ (dict-get s :pos) 1)))
    tok))

(define (hs-stream-has-more s)
  (not (= (hs-token-type (hs-stream-token s 0)) "EOF")))

5. Test mock strategy

All 17 tests are complexity: eval-only with empty html. They do not need the DOM runner — they only need SX expressions that resolve to the same values the JS asserts check.

Add a generator pattern to generate-sx-tests.py, slotted into generate_eval_only_test or as a new pre-pass ahead of it, that matches bodies containing _hyperscript.internals.tokenizer.tokenize. The pattern tree, by precedence:

  1. tokenize(SRC[, true]) → emit an SX let that binds a fresh stream name to (hs-tokens-of SRC [:template]).
  2. <stream>.consumeToken() → (hs-stream-consume <stream>).
  3. <stream>.token(N) → (hs-stream-token <stream> N).
  4. <stream>.list → (dict-get <stream> :list).
  5. <stream>.list.length → (len (dict-get <stream> :list)).
  6. <stream>.list[N] → (nth (dict-get <stream> :list) N).
  7. <stream>.hasMore() → (hs-stream-has-more <stream>).
  8. <tok>.type / .value / .op → (hs-token-type/value/op? <tok>).
  9. expect(X).toBe(V) and expect(X).toEqual({...}) → assert=.
  10. try { ... } catch (e) { errors.push(e.message) } plus expect(msg).toMatch(/pat/) → (assert (regex-match? pat (guard-msg (hs-stream-consume s)))). A tiny guard-msg helper runs the expr under guard and returns the caught error's message.
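A toy version of the core match, in the generator's language. It nests the SX form directly instead of binding the stream in a let, covers only a few chain rules, and the regex and function names are illustrative, not the actual generate-sx-tests.py code:

```python
import re

# Recognise _hyperscript.internals.tokenizer.tokenize("src"[, true])
# followed by an optional consumer chain.
TOKENIZE_RE = re.compile(
    r'_hyperscript\.internals\.tokenizer\.tokenize\('
    r'(?P<src>"(?:[^"\\]|\\.)*")(?P<tmpl>,\s*true)?\)'
    r'(?P<chain>(?:\.\w+\(\d*\)|\.\w+)*)'
)

CHAIN_MAP = {
    ".consumeToken()": "(hs-stream-consume {s})",
    ".hasMore()":      "(hs-stream-has-more {s})",
    ".type":           "(hs-token-type {s})",
    ".value":          "(hs-token-value {s})",
    ".op":             "(hs-token-op? {s})",
}

def translate(js_expr):
    m = TOKENIZE_RE.search(js_expr)
    if m is None:
        return None  # unrecognised shape: caller emits SKIP (untranslated)
    tmpl = " :template" if m.group("tmpl") else ""
    sx = f'(hs-tokens-of {m.group("src")}{tmpl})'
    for step in re.findall(r'\.\w+\(\d*\)|\.\w+', m.group("chain")):
        if step.startswith(".token("):
            sx = f'(hs-stream-token {sx} {step[7:-1]})'
        elif step in CHAIN_MAP:
            sx = CHAIN_MAP[step].format(s=sx)
        else:
            return None  # e.g. .list indexing: handled by other rules
    return sx
```

For example, `tokenize("1.1").consumeToken().value` comes out as `(hs-token-value (hs-stream-consume (hs-tokens-of "1.1")))`.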

The generator should emit a new deftest prologue:

  (deftest "<name>"
    (let ((s1 (hs-tokens-of "<src1>"))
          (s2 (hs-tokens-of "<src2>" :template)))
      (assert= (hs-token-type (hs-stream-consume s1)) "NUMBER")
      ...))

When the test builds a results object/array of {type, value} dicts, emit one assert= per field instead of materialising a dict — simpler to debug when it fails. toEqual({type: "X", value: "Y"}) becomes two assert= lines.
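The toEqual expansion could look like the following — an illustrative helper, not the actual generator code:

```python
def expand_to_equal(sx_expr, expected):
    """Turn expect(X).toEqual({type: ..., value: ...}) into one assert=
    per field, so a failure names the exact mismatching field."""
    field_fns = {"type": "hs-token-type", "value": "hs-token-value",
                 "op": "hs-token-op?"}
    lines = []
    for field, want in expected.items():
        # Render the expected value as an SX literal (bool or string).
        want_sx = ("true" if want is True
                   else "false" if want is False
                   else f'"{want}"')
        lines.append(f'(assert= ({field_fns[field]} {sx_expr}) {want_sx})')
    return lines
```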

The generator continues to bail (return None / emit SKIP (untranslated)) if any unrecognised JS shape appears; the 17 bodies all fit the grammar above.

6. Test delta estimate

| #  | Test                                 | Feasible? | Blockers |
|----|--------------------------------------|-----------|----------|
| 1  | handles $ in template properly       | yes       | templateMode impl |
| 2  | handles all special escapes          | yes       | extend read-string escapes (+4 cases) |
| 3  | handles basic token types            | yes       | type-map + scientific-notation float (already in read-number? verify) |
| 4  | handles class identifiers            | yes       | type-map + .list[i] access |
| 5  | handles comments properly            | yes       | type-map; // comments already handled, -- not — add |
| 6  | handles hex escapes                  | yes       | new \xNN escape + structured error |
| 7  | handles id references                | yes       | mirror of 4 |
| 8  | handles identifiers properly         | yes       | type-map only |
| 9  | handles identifiers with numbers     | yes       | type-map only |
| 10 | handles look ahead property          | yes       | EOF sentinel with "<<<EOF>>>" value |
| 11 | handles numbers properly             | yes       | fix 1.1.1 scan (stop at second dot); already appears OK |
| 12 | handles operators properly           | yes       | op-name map, ==/===/<=/>= lookahead |
| 13 | handles strings properly             | yes       | structured unterminated-string error |
| 14 | handles strings properly 2           | yes       | subset of 13 |
| 15 | handles template bootstrap           | yes       | templateMode lexical emission |
| 16 | handles whitespace properly          | yes       | type-map only |
| 17 | string interpolation isnt surprising | yes       | already translatable; needs read-template \$/\${ escape |

Expected: +16 to +17. Test 17 is already runnable (it is the one non-eval-only case) but depends on template-escape handling that lives in the same commit.

7. Risks / open questions

  • Position tracking. The tokenizer currently stores :pos P. Tests do not read it, so we leave it alone. E38 (SourceInfo API) will add start/end/line/column; when that lands, hs-raw->api-token should copy those through.
  • Template mode churn. Introducing hs-tokenize-template risks divergence from the main tokenizer. Mitigation: factor shared scan helpers (whitespace, identifier, operator dispatch) into named functions both variants call; keep the template variant a thin wrapper that only overrides the backtick handler.
  • Keyword vs identifier type. The current code tags reserved words as "keyword"; upstream tags every bare word as IDENTIFIER. The conformance tests always expect IDENTIFIER. Mapping both "keyword" and "ident" to "IDENTIFIER" in the API layer is safe and does not affect the parser, which consumes the raw stream, not the API stream.
  • Mutable streams. The API stream is intentionally mutable (cursor advances on consumeToken). SX dicts are mutable via dict-set! today; this is consistent with the rest of the hyperscript runtime, which uses mutable dicts in hs-activate! and the event loop.
  • Do any existing tests depend on token shape? parser.sx reads :type :value :pos. It must not see the API-shaped dicts. The API is strictly additive — hs-tokenize is unchanged; hs-parse continues to consume its output directly. Only hs-api-tokens (and its consumers) sees the upstream-shaped dicts.
  • Error-message contract. Upstream throws on unterminated strings and bad hex escapes. We currently return an EOF and emit a trailing fragment. Adding a thrown error is new behaviour; confirm the parser callers in hs-compile still produce useful diagnostics when the tokenizer raises rather than eats the input.
  • .list indexing semantics. Upstream tests read .list[3] and .list[4] directly — these indices reference upstream's raw token layout. If our SX tokenizer emits a slightly different layout (e.g. extra whitespace-related tokens, or none where upstream has one), the index tests fail even though .type/.value are correct. Verify on a spike before committing: run (hs-tokens-of "(a).a") and check that index 4 is the CLASS_REF. If indices disagree, add a normalization pass that strips tokens upstream omits.

8. Implementation checklist

Ordered smallest-first; each is its own commit.

  1. Add hs-api-tokens and token helpers (lib/hyperscript/runtime.sx). Includes hs-raw->api-token, type-map, op-name table, hs-stream-token/consume/has-more, EOF sentinel with "<<<EOF>>>" value. No test delta yet — API-only.
  2. Extend string-escape table in read-string (tokenizer): add \b \f \r \v \xNN, keep existing \n \t \\ <quote>. Emit structured error message "Invalid hexadecimal escape: ..." or "Unterminated string". Unlocks tests 2, 6, 13, 14.
  3. Add == / === / <= / >= lookahead in tokenizer scan!. Currently only [=!<>]= is matched. Unlocks test 12.
  4. Add -- line-comment support to scan!. Currently only // (through selector disambiguation) is handled. Unlocks test 5.
  5. Add hs-tokenize-template variant for template-bootstrap lexical mode. Shared scan helpers extracted. Unlocks tests 1, 15.
  6. Generator pattern in tests/playwright/generate-sx-tests.py: recognise the _hyperscript.internals.tokenizer.tokenize(src[, true]) consumer chain and emit an SX deftest using the helpers from step 1. Unlocks the 16 remaining eval-only tests (test 17 already has DOM shape).
  7. Regenerate spec/tests/test-hyperscript-behavioral.sx and run mcp__hs-test__hs_test_run(suite="hs-upstream-core/tokenizer"). Expected: 17/17, with test 17 also passing thanks to step 2's escape fixes (it depends on \$ / \${ in read-template).
  8. Update plans/hs-conformance-to-100.md row 37 to done (+17) and tick the scoreboard in the same commit.
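Step 2's escape handling, sketched in Python — error messages follow the regexes the tests match against; the helper name and exact offsets are hypothetical, the real change is in read-string in tokenizer.sx:

```python
SIMPLE_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r",
                  "t": "\t", "v": "\v", "\\": "\\", '"': '"', "'": "'"}

def decode_escape(src, i):
    """Decode the escape starting at src[i] (the char after the
    backslash). Returns (decoded_char, next_index); raises with
    messages matching the upstream test regexes on bad input."""
    if i >= len(src):
        raise ValueError("Unterminated string")
    c = src[i]
    if c in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[c], i + 1
    if c == "x":
        hex_digits = src[i + 1:i + 3]
        if len(hex_digits) < 2 or any(
                d not in "0123456789abcdefABCDEF" for d in hex_digits):
            raise ValueError(f"Invalid hexadecimal escape at {i}")
        return chr(int(hex_digits, 16)), i + 3
    return c, i + 1  # unknown escape: pass the char through
```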

Work stays inside lib/hyperscript/**, shared/static/wasm/sx/hs-*, tests/playwright/generate-sx-tests.py, and the two plan files — matching the scope rule in the conformance plan. shared/static/wasm/sx/hs-runtime.sx must be re-copied after each runtime edit.