E37 — Tokenizer-as-API
Cluster 37 of `plans/hs-conformance-to-100.md`. 17 tests in
`hs-upstream-core/tokenizer`. All 17 are emitted as SKIP (untranslated)
by `tests/playwright/generate-sx-tests.py`: the JS bodies call
`_hyperscript.internals.tokenizer.tokenize(...)` and inspect a
token-stream surface the SX port does not expose.
The work breaks into: (1) an SX API over the existing `hs-tokenize`
mimicking the upstream stream object; (2) a compatibility shim over
token fields; (3) a generator pattern recognising
`_hyperscript.internals.tokenizer.tokenize(src[, templateMode])`. No
tokenizer-grammar rewrite is required. Position tracking
(start/end/line/column) is scoped to E38 (SourceInfo API).
1. Failing tests
Every eval-only test calls `_hyperscript.internals.tokenizer.tokenize`
plus one or more of `.token(i)`, `.consumeToken()`, `.hasMore()`,
`.list`, `.type`, `.value`, `.op`.
- handles $ in template properly — `tokenize('"', true).token(0).value` → `'"'`; templateMode + `token(i)`.
- handles all special escapes — 6 × `tokenize('"\\X"').consumeToken().value` for `\b \f \n \r \t \v`.
- handles basic token types — 15 asserts for IDENTIFIER NUMBER CLASS_REF ID_REF STRING; includes `1e6`, `1e-6`, `1.1e6`, `1.1e-6`; plus `.hasMore()`.
- handles class identifiers — 9 `.a`-style asserts; uses `.consumeToken()` and `.list[3]`/`.list[4]`.
- handles comments properly — 13 asserts on `tokenize(src).list.length`; `--`/`//` to EOL emit nothing.
- handles hex escapes — 3 `\xNN` decodes + 4 error-path asserts matching `/Invalid hexadecimal escape/`.
- handles id references — mirror of test 4 for `#a` → ID_REF.
- handles identifiers properly — whitespace + comment skipping between multiple `consumeToken()` calls.
- handles identifiers with numbers — `f1oo` / `fo1o` / `foo1` → IDENTIFIER.
- handles look ahead property — `tokenize("a 1 + 1").token(0..4)` → `["a" "1" "+" "1" "<<<EOF>>>"]`.
- handles numbers properly — 8 asserts incl. `1.1.1` → NUMBER PERIOD NUMBER.
- handles operators properly — iterates 27 ops (`+ - * . \ : % | ! ? # & ; , ( ) < > { } [ ] = <= >= == ===`) asserting `token.op === true` and `token.value === key`.
- handles strings properly — single/double quotes, embedded other-quote, escaped same-quote, plus two unterminated throws matching `/Unterminated string/`.
- handles strings properly 2 — subset of test 13.
- handles template bootstrap — 5 `tokenize(src, true)` cases asserting the lexical char-level stream (`"`, `$`, `{`, inner, `}`, `"`).
- handles whitespace properly — 16 asserts on `.list.length` for space / `\n` / `\r` / `\t`.
- string interpolation isnt surprising — DOM-shaped (not eval-only); asserts `\$`/`\${` escapes in templates. Touches `read-template`, not the stream API.
2. Upstream API shape
From https://hyperscript.org/docs/#api and
node_modules/hyperscript.org/src/_hyperscript.js:
const tokens = _hyperscript.internals.tokenizer.tokenize(src, templateMode?)
// → { list, source, hasMore, matchTokenType, token, consumeToken,
// requireTokenType, ... }
tokens.list // Array<Token> — lookahead window
tokens.source // original src string
tokens.token(i) // i-th un-consumed token (0 = current); returns
// { type: "EOF", value: "<<<EOF>>>" } past end
tokens.consumeToken() // shift + return; throws on empty for required
tokens.hasMore() // true if a non-EOF token remains
tokens.matchTokenType(type) / requireTokenType(type) / etc.
Each Token is:
{
type: "IDENTIFIER" | "NUMBER" | "STRING" | "CLASS_REF"
| "ID_REF" | "EOF" | "PLUS" | "MINUS" | ... /* op names */,
value: string,
op: boolean, // true for punctuation/operator tokens
start: number, // char offset
end: number,
line: number,
column: number,
source: string, // reference to full src
}
The conformance tests only read type, value, op, and occasionally
random-index into .list. They never read start/end/line/column, so
position tracking is not required for cluster E37.
3. Proposed SX surface
Add the following helpers to `lib/hyperscript/runtime.sx` (exposed by
name, so SX test bodies can call them directly through eval-hs or
`assert=`):
(hs-tokens-of src) ; => dict — new token-stream object
(hs-tokens-of src :template) ; templateMode variant
(hs-token-type tok) ; upstream-style type name
(hs-token-value tok) ; string value
(hs-token-op? tok) ; bool
A token stream is a mutable dict:
{ :source src
:list (list-of-tokens) ; upstream-shaped, :type :value :op
:pos 0 } ; cursor into :list
With three pure-SX consumer helpers:
(hs-stream-token stream i) ; lookahead; returns EOF sentinel past end
(hs-stream-consume stream) ; returns current token, advances :pos
(hs-stream-has-more stream) ; not EOF and pos < len
Worked example
(let ((s (hs-tokens-of "1.1")))
(hs-token-type (hs-stream-consume s))) ; => "NUMBER"
(let ((s (hs-tokens-of "a 1 + 1")))
(list (hs-token-value (hs-stream-token s 0)) ; "a"
(hs-token-value (hs-stream-token s 4)))) ; "<<<EOF>>>"
All helpers are ordinary defines — no platform primitives, no FFI.
The generator emits them as bare calls inside deftest bodies.
4. Runtime architecture
The existing hs-tokenize emits tokens with:
{ :type "keyword" | "ident" | "number" | "string" | "class" | "id"
| "op" | "paren-open" | ... | "eof"
:value V
:pos P }
The upstream contract uses SCREAMING_SNAKE_CASE and a dedicated
boolean .op flag rather than a merged type/punctuation taxonomy.
Rather than rewrite the tokenizer, add a translation layer.
Type map (SX-native → upstream)
"ident" → "IDENTIFIER" (keywords too: see note)
"keyword" → "IDENTIFIER" (upstream tokenizes keywords as idents)
"number" → "NUMBER"
"string" → "STRING"
"class" → "CLASS_REF" (:value becomes ".a" with leading dot)
"id" → "ID_REF" (:value becomes "#a" with leading hash)
"attr" → "ATTRIBUTE_REF"
"style" → "STYLE_REF"
"selector" → "QUERY_REF" (used by tests? upstream calls it QUERY_REF)
"template" → one-shot: see templateMode below
"eof" → "EOF" with :value "<<<EOF>>>"
"paren-open" → "L_PAREN" + :op true
"paren-close" → "R_PAREN" + :op true
"bracket-open" → "L_BRACKET" + :op true
"bracket-close" → "R_BRACKET" + :op true
"brace-open" → "L_BRACE" + :op true
"brace-close" → "R_BRACE" + :op true
"comma" → "COMMA" + :op true
"dot" → "PERIOD" + :op true
"op" → name-by-value lookup (see below) + :op true
A tiny op-name table (15–25 entries) maps :value strings to the
upstream token type name:
"+" → "PLUS"
"-" → "MINUS"
"*" → "MULTIPLY"
"/" → "SLASH" ; current code uses "op"/"/"
":" → "COLON" ; not yet emitted as own token — fix below
"%" → "PERCENT"
"|" → "PIPE"
"!" → "EXCLAMATION"
"?" → "QUESTION"
"#" → "POUND"
"&" → "AMPERSAND"
";" → "SEMI"
"=" → "EQUALS"
"<" → "L_ANG"
">" → "R_ANG"
"<=" → "LTE_ANG"
">=" → "GTE_ANG"
"==" → "EQ"
"===" → "EQQ"
"\\" → "BACKSLASH"
"'s" → "APOSTROPHE_S" ; not a true operator — elided from test 12
Conversion entry point
(define (hs-api-tokens src template-mode?)
(let ((raw (if template-mode?
(hs-tokenize-template src) ; new variant
(hs-tokenize src))))
{:source src
:list (map hs-raw->api-token raw)
:pos 0}))
hs-raw->api-token is a pure mapping function using the tables above.
An EOF token is always present at the end (the current tokenizer
already emits one).
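A sketch of `hs-raw->api-token`, assuming the two tables above are materialised as dicts; the names `hs-type-map`, `hs-punct-type-map`, `hs-op-name-map`, the `str` concatenation helper, and `dict-has?` are illustrative assumptions, to be aligned with the real runtime.sx idioms:

```
;; Sketch only — table/helper names assumed, not final.
(define hs-op-name-map
  {"+" "PLUS"  "-" "MINUS"  "*" "MULTIPLY"  "/" "SLASH"
   "==" "EQ"  "===" "EQQ"  "<=" "LTE_ANG"  ">=" "GTE_ANG"})
  ;; ... remaining entries from the op-name table above

(define (hs-raw->api-token raw)
  (let ((t (dict-get raw :type))
        (v (dict-get raw :value)))
    (cond
      ((= t "eof")   {:type "EOF" :value "<<<EOF>>>" :op false})
      ((= t "op")    {:type (dict-get hs-op-name-map v) :value v :op true})
      ((= t "class") {:type "CLASS_REF" :value (str "." v) :op false})
      ((= t "id")    {:type "ID_REF" :value (str "#" v) :op false})
      ;; paren/bracket/brace/comma/dot rows of the type map carry :op true
      ((dict-has? hs-punct-type-map t)
       {:type (dict-get hs-punct-type-map t) :value v :op true})
      ;; "keyword" and "ident" both collapse to IDENTIFIER via hs-type-map
      (else {:type (dict-get hs-type-map t) :value v :op false}))))
```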
Token gaps to fix
Three issues turn up while writing the map; all are trivial one-site
fixes in tokenizer.sx:
- `:` — currently consumed as part of the local prefix (`:name`).
  Upstream tests expect a bare `:` alone to produce COLON; only when
  followed by an ident-start char does it combine. The test suite does
  not exercise the bare form (it is only covered by the operator table
  in test 12). Fix by emitting `"op" ":"` when the next char is not an
  ident start — already what the code does; the op-name map above
  covers it.
- `===` and `==` — the current tokenizer emits `"op" "="` plus another
  `"="`, not `"=="`. Extend the `=`/`!`/`<`/`>` lookahead clause to
  also match a third `=` after `==`.
- Template mode — upstream `tokenize(src, true)` splits backtick
  templates into their lexical parts rather than the single
  `"template"` token the current code emits. Add a second top-level
  scanner `hs-tokenize-template` used only by the API wrapper; the
  primary parser continues to call `hs-tokenize` unchanged. The
  template-mode tests (1, 15) only require character-level emission of
  the `" $ { inner } "` sequence — no semantic reuse by the parser.
Stream consumer helpers
(define (hs-stream-token s i)
(let ((list (dict-get s :list))
(pos (dict-get s :pos)))
(or (nth list (+ pos i))
(hs-eof-sentinel))))
(define (hs-stream-consume s)
(let ((tok (hs-stream-token s 0)))
(when (not (= (hs-token-type tok) "EOF"))
(dict-set! s :pos (+ (dict-get s :pos) 1)))
tok))
(define (hs-stream-has-more s)
(not (= (hs-token-type (hs-stream-token s 0)) "EOF")))
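The helpers above lean on an EOF sentinel and the three token accessors from section 3; minimal sketches (field names match the API-token shape, function names as proposed):

```
;; Sketch — trivially thin wrappers over the API-token dict shape.
(define (hs-eof-sentinel)
  {:type "EOF" :value "<<<EOF>>>" :op false})

(define (hs-token-type tok)  (dict-get tok :type))
(define (hs-token-value tok) (dict-get tok :value))
(define (hs-token-op? tok)   (dict-get tok :op))
```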
5. Test mock strategy
All 17 tests are complexity: eval-only with empty html. They do
not need the DOM runner — they only need SX expressions that resolve
to the same values the JS asserts check.
Add a generator pattern to generate-sx-tests.py, slotted into
generate_eval_only_test or as a new pre-pass ahead of it, that
matches bodies containing _hyperscript.internals.tokenizer.tokenize.
The pattern tree, by precedence:
- `tokenize(SRC[, true])` → emit an SX `let` that binds a fresh stream name to `(hs-tokens-of SRC [:template])`.
- `<stream>.consumeToken()` → `(hs-stream-consume <stream>)`.
- `<stream>.token(N)` → `(hs-stream-token <stream> N)`.
- `<stream>.list` → `(dict-get <stream> :list)`.
- `<stream>.list.length` → `(len (dict-get <stream> :list))`.
- `<stream>.list[N]` → `(nth (dict-get <stream> :list) N)`.
- `<stream>.hasMore()` → `(hs-stream-has-more <stream>)`.
- `<tok>.type` / `.value` / `.op` → `(hs-token-type/value/op? <tok>)`.
- `expect(X).toBe(V)` and `expect(X).toEqual({...})` → `assert=`.
- `try { ... } catch (e) { errors.push(e.message) }` plus `expect(msg).toMatch(/pat/)` → `(assert (regex-match? pat (guard-msg (hs-stream-consume s))))`. A tiny `guard-msg` helper runs the expr under `guard` and returns the caught error's message.
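Because `guard-msg` must wrap the expression before it is evaluated, it has to be a macro rather than a function. A sketch, assuming a Scheme-style `guard` form, an `error-message` accessor, and a `define-macro` facility — all three spellings are assumptions to be aligned with the real SX forms:

```
;; Sketch — guard / error-message / define-macro spellings assumed.
(define-macro (guard-msg expr)
  `(guard (e (error-message e))  ; on throw: return the error's message
     ,expr
     ""))                        ; no throw: empty string fails /pat/ asserts
```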
The generator should emit a new deftest prologue:
(deftest "<name>"
(let ((s1 (hs-tokens-of "<src1>"))
(s2 (hs-tokens-of "<src2>" :template)))
(assert= (hs-token-type (hs-stream-consume s1)) "NUMBER")
...))
When the test builds a results object/array of {type, value}
dicts, emit one assert= per field instead of materialising a dict —
simpler to debug when it fails. toEqual({type: "X", value: "Y"})
becomes two assert= lines.
The generator continues to bail (return None / emit
SKIP (untranslated)) if any unrecognised JS shape appears; the 17
bodies all fit the grammar above.
6. Test delta estimate
| # | Test | Feasible? | Blockers |
|---|---|---|---|
| 1 | handles $ in template properly | yes | templateMode impl |
| 2 | handles all special escapes | yes | extend read-string escapes (+4 cases) |
| 3 | handles basic token types | yes | type-map + scientific-notation float (already in read-number? verify) |
| 4 | handles class identifiers | yes | type-map + .list[i] access |
| 5 | handles comments properly | yes | type-map; // comments already handled, -- not — add |
| 6 | handles hex escapes | yes | new \xNN escape + structured error |
| 7 | handles id references | yes | mirror of 4 |
| 8 | handles identifiers properly | yes | type-map only |
| 9 | handles identifiers with numbers | yes | type-map only |
| 10 | handles look ahead property | yes | EOF sentinel with "<<<EOF>>>" value |
| 11 | handles numbers properly | yes | fix 1.1.1 scan (stop at second dot); already appears OK |
| 12 | handles operators properly | yes | op-name map, ==/===/<=/>= lookahead |
| 13 | handles strings properly | yes | structured unterminated-string error |
| 14 | handles strings properly 2 | yes | subset of 13 |
| 15 | handles template bootstrap | yes | templateMode lexical emission |
| 16 | handles whitespace properly | yes | type-map only |
| 17 | string interpolation isnt surprising | already translatable | read-template \$/\${ escape |
Expected: +16 to +17. Test 17 is already runnable (it is the one non-eval-only case) but depends on template-escape handling that lives in the same commit.
7. Risks / open questions
- Position tracking. The tokenizer currently stores `:pos P`. Tests do
  not read it, so we leave it alone. E38 (SourceInfo API) will add
  `start`/`end`/`line`/`column`; when that lands, `hs-raw->api-token`
  should copy those through.
- Template mode churn. Introducing `hs-tokenize-template` risks
  divergence from the main tokenizer. Mitigation: factor shared scan
  helpers (whitespace, identifier, operator dispatch) into named
  functions both variants call; keep the template variant a thin
  wrapper that only overrides the backtick handler.
- Keyword vs identifier type. The current code tags reserved words as
  `"keyword"`; upstream tags every bare word as IDENTIFIER. The
  conformance tests always expect IDENTIFIER. Mapping both `"keyword"`
  and `"ident"` to `"IDENTIFIER"` in the API layer is safe and does not
  affect the parser, which consumes the raw stream, not the API stream.
- Mutable streams. The API stream is intentionally mutable (the cursor
  advances on `consumeToken`). SX dicts are mutable via `dict-set!`
  today; this is consistent with the rest of the hyperscript runtime,
  which uses mutable dicts in `hs-activate!` and the event loop.
- Do any existing tests depend on token shape? `parser.sx` reads
  `:type :value :pos`. It must not see the API-shaped dicts. The API is
  strictly additive — `hs-tokenize` is unchanged; `hs-parse` continues
  to consume its output directly. Only `hs-api-tokens` (and its
  consumers) sees the upstream-shaped dicts.
- Error-message contract. Upstream throws on unterminated strings and
  bad hex escapes. We currently return an EOF and emit a trailing
  fragment. Adding a thrown error is new behaviour; confirm the parser
  callers in `hs-compile` still produce useful diagnostics when the
  tokenizer raises rather than eats the input.
- `.list` indexing semantics. Upstream tests read `.list[3]` and
  `.list[4]` directly — these indices reference upstream's raw token
  layout. If our SX tokenizer emits a slightly different layout (e.g.
  extra whitespace-related tokens, or none where upstream has one), the
  index tests fail even though `.type`/`.value` are correct. Verify on
  a spike before committing: run `(hs-tokens-of "(a).a")` and check
  that index 4 is the CLASS_REF. If indices disagree, add a
  normalization pass that strips tokens upstream omits.
8. Implementation checklist
Ordered smallest-first; each is its own commit.
- Add `hs-api-tokens` and token helpers (`lib/hyperscript/runtime.sx`).
  Includes `hs-raw->api-token`, type-map, op-name table,
  `hs-stream-token/consume/has-more`, EOF sentinel with `"<<<EOF>>>"`
  value. No test delta yet — API-only.
- Extend the string-escape table in `read-string` (tokenizer): add
  `\b \f \r \v \xNN`, keep existing `\n \t \\ <quote>`. Emit structured
  error messages `"Invalid hexadecimal escape: ..."` and
  `"Unterminated string"`. Unlocks tests 2, 6, 13, 14.
- Add `==`/`===`/`<=`/`>=` lookahead in tokenizer `scan!`. Currently
  only `[=!<>]=` is matched. Unlocks test 12.
- Add `--` line-comment support to `scan!`. Currently only `//`
  (through selector disambiguation) is handled. Unlocks test 5.
- Add the `hs-tokenize-template` variant for template-bootstrap lexical
  mode, with shared scan helpers extracted. Unlocks tests 1, 15.
- Generator pattern in `tests/playwright/generate-sx-tests.py`:
  recognise the `_hyperscript.internals.tokenizer.tokenize(src[, true])`
  + consumer chain, and emit an SX `deftest` using the helpers from
  step 1. Unlocks the 16 remaining eval-only tests (test 17 already has
  DOM shape).
- Regenerate `spec/tests/test-hyperscript-behavioral.sx` and run
  `mcp__hs-test__hs_test_run(suite="hs-upstream-core/tokenizer")`.
  Expected: 17/17, with test 17 also passing thanks to step 2's escape
  fixes (it depends on `\$`/`\${` in `read-template`).
- Update `plans/hs-conformance-to-100.md` row 37 to `done (+17)` and
  tick the scoreboard in the same commit.
Work stays inside `lib/hyperscript/**`, `shared/static/wasm/sx/hs-*`,
`tests/playwright/generate-sx-tests.py`, and the two plan files —
matching the scope rule in the conformance plan.
`shared/static/wasm/sx/hs-runtime.sx` must be re-copied after each
runtime edit.