# E37 — Tokenizer-as-API
Cluster 37 of `plans/hs-conformance-to-100.md`. 17 tests in
`hs-upstream-core/tokenizer`. All 17 are emitted as `SKIP
(untranslated)` by `tests/playwright/generate-sx-tests.py`: the JS
bodies call `_hyperscript.internals.tokenizer.tokenize(...)` and
inspect a token-stream surface the SX port does not expose.
Work breaks into: (1) an SX API over the existing `hs-tokenize`
mimicking the upstream stream object; (2) a compatibility shim over
token fields; (3) a generator pattern recognising
`_hyperscript.internals.tokenizer.tokenize(src[, templateMode])`. No
tokenizer-grammar rewrite is required. Position tracking
(`start/end/line/column`) is scoped to E38 (SourceInfo API).
## 1. Failing tests
Every eval-only test calls `_hyperscript.internals.tokenizer.tokenize`
plus one or more of `.token(i)`, `.consumeToken()`, `.hasMore()`,
`.list`, `.type`, `.value`, `.op`.
1. **handles $ in template properly** — `tokenize('"', true).token(0).value` → `'"'`. templateMode + `token(i)`.
2. **handles all special escapes** — 6 × `tokenize('"\\X"').consumeToken().value` for `\b \f \n \r \t \v`.
3. **handles basic token types** — 15 asserts for `IDENTIFIER NUMBER CLASS_REF ID_REF STRING`; includes `1e6`, `1e-6`, `1.1e6`, `1.1e-6`; plus `.hasMore()`.
4. **handles class identifiers** — 9 `.a`-style asserts; uses `.consumeToken()` and `.list[3]/.list[4]`.
5. **handles comments properly** — 13 asserts on `tokenize(src).list.length`; `--` / `//` to EOL emit nothing.
6. **handles hex escapes** — 3 `\\xNN` decodes + 4 error-path asserts matching `/Invalid hexadecimal escape/`.
7. **handles id references** — mirror of 4 for `#a` → `ID_REF`.
8. **handles identifiers properly** — whitespace + comment skipping between multiple `consumeToken()` calls.
9. **handles identifiers with numbers** — `f1oo / fo1o / foo1` → `IDENTIFIER`.
10. **handles look ahead property** — `tokenize("a 1 + 1").token(0..4)` → `["a" "1" "+" "1" "<<<EOF>>>"]`.
11. **handles numbers properly** — 8 asserts incl. `1.1.1` → `NUMBER PERIOD NUMBER`.
12. **handles operators properly** — iterates 27 ops (`+ - * . \\ : % | ! ? # & ; , ( ) < > { } [ ] = <= >= == ===`) asserting `token.op === true` and `token.value === key`.
13. **handles strings properly** — single/double quotes, embedded other-quote, escaped same-quote, plus two unterminated throws matching `/Unterminated string/`.
14. **handles strings properly 2** — subset of 13.
15. **handles template bootstrap** — 5 `tokenize(src, true)` cases asserting the lexical char-level stream (`"`, `$`, `{`, inner, `}`, `"`).
16. **handles whitespace properly** — 16 asserts on `.list.length` for space / `\n` / `\r` / `\t`.
17. **string interpolation isnt surprising** — DOM-shaped (not eval-only); asserts `\$`/`\${` escapes in templates. Touches `read-template`, not the stream API.
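For orientation, each numbered case above reduces to plain SX expressions once the §3 helpers exist. A hedged sketch of test 3's scientific-notation assert, using the proposed (not yet implemented) helper names:
```
;; Sketch only — helpers are the §3 proposals.
(let ((s (hs-tokens-of "1e6")))
  (assert= (hs-token-type (hs-stream-consume s)) "NUMBER")
  (assert (not (hs-stream-has-more s))))
```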
## 2. Upstream API shape
From `https://hyperscript.org/docs/#api` and
`node_modules/hyperscript.org/src/_hyperscript.js`:
```js
const tokens = _hyperscript.internals.tokenizer.tokenize(src, templateMode?)
// → { list, source, hasMore, matchTokenType, token, consumeToken,
//     requireTokenType, ... }
tokens.list           // Array<Token> — lookahead window
tokens.source         // original src string
tokens.token(i)       // i-th un-consumed token (0 = current); returns
                      // { type: "EOF", value: "<<<EOF>>>" } past end
tokens.consumeToken() // shift + return; throws on empty for required
tokens.hasMore()      // true if a non-EOF token remains
tokens.matchTokenType(type) / requireTokenType(type) / etc.
```
Each `Token` is:
```js
{
  type: "IDENTIFIER" | "NUMBER" | "STRING" | "CLASS_REF"
      | "ID_REF" | "EOF" | "PLUS" | "MINUS" | ... /* op names */,
  value: string,
  op: boolean,    // true for punctuation/operator tokens
  start: number,  // char offset
  end: number,
  line: number,
  column: number,
  source: string, // reference to full src
}
```
The conformance tests only read `type`, `value`, `op`, and occasionally
random-index into `.list`. They never read `start/end/line/column`, so
position tracking is **not** required for cluster E37.
## 3. Proposed SX surface
Add three things to `lib/hyperscript/runtime.sx` (exposed by name, so
SX test bodies can call them directly through `eval-hs` or `assert=`):
```
(hs-tokens-of src) ; => dict — new token-stream object
(hs-tokens-of src :template) ; templateMode variant
(hs-token-type tok) ; upstream-style type name
(hs-token-value tok) ; string value
(hs-token-op? tok) ; bool
```
A token stream is a mutable dict:
```
{ :source src
  :list   (list-of-tokens) ; upstream-shaped, :type :value :op
  :pos    0 }              ; cursor into :list
```
With three pure-SX consumer helpers:
```
(hs-stream-token stream i) ; lookahead; returns EOF sentinel past end
(hs-stream-consume stream) ; returns current token, advances :pos
(hs-stream-has-more stream) ; not EOF and pos < len
```
### Worked example
```
(let ((s (hs-tokens-of "1.1")))
  (hs-token-type (hs-stream-consume s)))         ; => "NUMBER"

(let ((s (hs-tokens-of "a 1 + 1")))
  (list (hs-token-value (hs-stream-token s 0))   ; "a"
        (hs-token-value (hs-stream-token s 4)))) ; "<<<EOF>>>"
```
All helpers are ordinary `define`s — no platform primitives, no FFI.
The generator emits them as bare calls inside `deftest` bodies.
## 4. Runtime architecture
The existing `hs-tokenize` emits tokens with:
```
{ :type  "keyword" | "ident" | "number" | "string" | "class" | "id"
         | "op" | "paren-open" | ... | "eof"
  :value V
  :pos   P }
```
The upstream contract uses `SCREAMING_SNAKE_CASE` and a dedicated
boolean `.op` flag rather than a merged type/punctuation taxonomy.
Rather than rewrite the tokenizer, add a translation layer.
### Type map (SX-native → upstream)
```
"ident" → "IDENTIFIER" (keywords too: see note)
"keyword" → "IDENTIFIER" (upstream tokenizes keywords as idents)
"number" → "NUMBER"
"string" → "STRING"
"class" → "CLASS_REF" (:value becomes ".a" with leading dot)
"id" → "ID_REF" (:value becomes "#a" with leading hash)
"attr" → "ATTRIBUTE_REF"
"style" → "STYLE_REF"
"selector" → "QUERY_REF" (used by tests? upstream calls it QUERY_REF)
"template" → one-shot: see templateMode below
"eof" → "EOF" with :value "<<<EOF>>>"
"paren-open" → "L_PAREN" + :op true
"paren-close" → "R_PAREN" + :op true
"bracket-open" → "L_BRACKET" + :op true
"bracket-close" → "R_BRACKET" + :op true
"brace-open" → "L_BRACE" + :op true
"brace-close" → "R_BRACE" + :op true
"comma" → "COMMA" + :op true
"dot" → "PERIOD" + :op true
"op" → name-by-value lookup (see below) + :op true
```
A tiny op-name table (15–25 entries) maps `:value` strings to the
upstream token type name:
```
"+" → "PLUS"
"-" → "MINUS"
"*" → "MULTIPLY"
"/" → "SLASH" ; current code uses "op"/"/"
":" → "COLON" ; not yet emitted as own token — fix below
"%" → "PERCENT"
"|" → "PIPE"
"!" → "EXCLAMATION"
"?" → "QUESTION"
"#" → "POUND"
"&" → "AMPERSAND"
";" → "SEMI"
"=" → "EQUALS"
"<" → "L_ANG"
">" → "R_ANG"
"<=" → "LTE_ANG"
">=" → "GTE_ANG"
"==" → "EQ"
"===" → "EQQ"
"\\" → "BACKSLASH"
"'s" → "APOSTROPHE_S" ; not a true operator — elided from test 12
```
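For concreteness, a sketch of how both tables could be declared in the
runtime. String-keyed dict literals and `dict-get` over string keys are
assumptions about SX here, and only a few rows are shown:
```
(define hs-api-type-map          ; abridged — full rows in the map above
  {"ident"  "IDENTIFIER"  "keyword" "IDENTIFIER"
   "number" "NUMBER"      "string"  "STRING"
   "class"  "CLASS_REF"   "id"      "ID_REF"})

(define hs-op-name-table         ; abridged — full rows in the table above
  {"+" "PLUS"  "-" "MINUS"  "==" "EQ"  "===" "EQQ"})
```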
### Conversion entry point
```
(define (hs-api-tokens src template-mode?)
  (let ((raw (if template-mode?
                 (hs-tokenize-template src) ; new variant
                 (hs-tokenize src))))
    {:source src
     :list (map hs-raw->api-token raw)
     :pos 0}))
```
`hs-raw->api-token` is a pure mapping function using the tables above.
An EOF token is always present at the end (the current tokenizer
already emits one).
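A minimal sketch of `hs-raw->api-token` itself, assuming the table
dicts above plus a `cond` form and a `string-append` helper (both
unverified assumptions about the SX stdlib); the punctuation rows are
elided:
```
(define (hs-raw->api-token tok)
  (let ((ty (dict-get tok :type))
        (v  (dict-get tok :value)))
    (cond
      ((= ty "eof")   {:type "EOF" :value "<<<EOF>>>" :op false})
      ((= ty "op")    {:type (dict-get hs-op-name-table v) :value v :op true})
      ((= ty "class") {:type "CLASS_REF" :value (string-append "." v) :op false})
      ((= ty "id")    {:type "ID_REF" :value (string-append "#" v) :op false})
      ;; paren/bracket/brace/comma/dot rows follow the ":op true" shape
      (else           {:type (dict-get hs-api-type-map ty) :value v :op false}))))
```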
### Token gaps to fix
Three issues turn up while writing the map. The first needs no scan
change on inspection; the other two are trivial one-site fixes in
`tokenizer.sx`:
- **`:` and the local prefix (`:name`).** Upstream produces a bare
  `COLON` for `:` alone; it combines into a local name only when the
  next char is an ident start. The suite exercises the bare form only
  through the operator table in test 12. The scanner already emits
  `"op" ":"` when the next char is not an ident start, so no tokenizer
  change is needed — the op-name map above covers it.
- **`===` and `==`** — the lookahead stops after two characters: `===`
  comes out as `"op" "=="` plus a stray `"="`, not `"==="`. Extend the
  `=`/`!`/`<`/`>` lookahead clause to also match a third `=` after
  `==`.
- **Template mode** — upstream `tokenize(src, true)` splits
backtick-templates into their lexical parts rather than the single
`"template"` token the current code emits. Add a second top-level
scanner `hs-tokenize-template` used only for the API wrapper; the
primary parser continues to call `hs-tokenize` unchanged. The
template-mode tests (1, 15) only require character-level emission
of the `" $ { inner } "` sequence — no semantic re-use by the
parser.
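For reference, the char-level contract the template variant must
satisfy, phrased against the proposed API. The input and token
boundaries are illustrative only; the exact upstream test sources are
not reproduced here:
```
;; Illustrative — exact inner tokenization is not pinned down here.
(let ((s (hs-tokens-of "\"${foo}\"" :template)))
  (map hs-token-value (dict-get s :list)))
;; => ("\"" "$" "{" "foo" "}" "\"" "<<<EOF>>>")
```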
### Stream consumer helpers
```
(define (hs-stream-token s i)
  (let ((list (dict-get s :list))
        (pos  (dict-get s :pos)))
    (or (nth list (+ pos i))
        (hs-eof-sentinel))))

(define (hs-stream-consume s)
  (let ((tok (hs-stream-token s 0)))
    (when (not (= (hs-token-type tok) "EOF"))
      (dict-set! s :pos (+ (dict-get s :pos) 1)))
    tok))

(define (hs-stream-has-more s)
  (not (= (hs-token-type (hs-stream-token s 0)) "EOF")))
```
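The `hs-eof-sentinel` referenced above is not defined elsewhere in this
doc; its shape just mirrors upstream's past-the-end token from §2
(`:op false` on EOF is an assumption — the tests never read `.op` there):
```
(define (hs-eof-sentinel)
  {:type "EOF" :value "<<<EOF>>>" :op false})
```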
## 5. Test mock strategy
All 17 tests are `complexity: eval-only` with empty `html`. They do
not need the DOM runner — they only need SX expressions that resolve
to the same values the JS asserts check.
Add a generator pattern to `generate-sx-tests.py`, slotted into
`generate_eval_only_test` or as a new pre-pass ahead of it, that
matches bodies containing `_hyperscript.internals.tokenizer.tokenize`.
The pattern tree, by precedence:
1. `tokenize(SRC[, true])` → emit an SX `let` that binds a
fresh stream name to `(hs-tokens-of SRC [:template])`.
2. `<stream>.consumeToken()` → `(hs-stream-consume <stream>)`.
3. `<stream>.token(N)` → `(hs-stream-token <stream> N)`.
4. `<stream>.list` → `(dict-get <stream> :list)`.
5. `<stream>.list.length` → `(len (dict-get <stream> :list))`.
6. `<stream>.list[N]` → `(nth (dict-get <stream> :list) N)`.
7. `<stream>.hasMore()` → `(hs-stream-has-more <stream>)`.
8. `<tok>.type` / `.value` / `.op` → `(hs-token-type/value/op? <tok>)`.
9. `expect(X).toBe(V)` and `expect(X).toEqual({...})` → `assert=`.
10. `try { ... } catch (e) { errors.push(e.message) }` plus
    `expect(msg).toMatch(/pat/)` → `(assert (regex-match? pat (guard-msg (hs-stream-consume s))))`.
A tiny `guard-msg` helper runs the expr under `guard` and returns
the caught error's message.
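One possible shape for `guard-msg`. It likely needs to be a macro so
the expression is not evaluated before the guard is installed;
`define-macro`, the Scheme-style `(guard (var clause ...) body ...)`
form, and `error-message` are all guesses at the SX API, not confirmed:
```
;; Hypothetical sketch — every form used here is an assumption.
(define-macro (guard-msg expr)
  `(guard (e (else (error-message e)))  ; caught → return its message
     ,expr
     ""))                               ; no throw → "" so the regex assert fails
```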
The generator should emit a new deftest prologue:
```
(deftest "<name>"
  (let ((s1 (hs-tokens-of "<src1>"))
        (s2 (hs-tokens-of "<src2>" :template)))
    (assert= (hs-token-type (hs-stream-consume s1)) "NUMBER")
    ...))
```
When the test builds a `results` object/array of `{type, value}`
dicts, emit one `assert=` per field instead of materialising a dict —
simpler to debug when it fails. `toEqual({type: "X", value: "Y"})`
becomes two `assert=` lines.
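Concretely, with `tok` bound by the patterns above:
```
;; expect(tokens.consumeToken()).toEqual({type: "NUMBER", value: "1.1"})
;; becomes:
(let ((tok (hs-stream-consume s)))
  (assert= (hs-token-type tok) "NUMBER")
  (assert= (hs-token-value tok) "1.1"))
```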
The generator continues to bail (`return None` / emit
`SKIP (untranslated)`) if any unrecognised JS shape appears; the 17
bodies all fit the grammar above.
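Putting patterns 1 and 10 together, an unterminated-string case from
test 13 would emit roughly the following (source string illustrative;
if tokenization stays eager, the throw happens inside `hs-tokens-of`
and `guard-msg` must wrap that call instead):
```
(deftest "handles strings properly"
  (let ((s (hs-tokens-of "'unterminated")))
    (assert (regex-match? "Unterminated string"
                          (guard-msg (hs-stream-consume s))))))
```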
## 6. Test delta estimate
| # | Test | Feasible? | Blockers |
|---|------|-----------|----------|
| 1 | handles $ in template properly | yes | templateMode impl |
| 2 | handles all special escapes | yes | extend `read-string` escapes (+4 cases) |
| 3 | handles basic token types | yes | type-map + scientific-notation float (already in `read-number`? verify) |
| 4 | handles class identifiers | yes | type-map + `.list[i]` access |
| 5 | handles comments properly | yes | type-map; `//` comments already handled, `--` not — add |
| 6 | handles hex escapes | yes | new `\xNN` escape + structured error |
| 7 | handles id references | yes | mirror of 4 |
| 8 | handles identifiers properly | yes | type-map only |
| 9 | handles identifiers with numbers | yes | type-map only |
| 10 | handles look ahead property | yes | EOF sentinel with `"<<<EOF>>>"` value |
| 11 | handles numbers properly | yes | `1.1.1` scan (stop at second dot) — already appears OK; verify |
| 12 | handles operators properly | yes | op-name map, `==`/`===`/`<=`/`>=` lookahead |
| 13 | handles strings properly | yes | structured unterminated-string error |
| 14 | handles strings properly 2 | yes | subset of 13 |
| 15 | handles template bootstrap | yes | templateMode lexical emission |
| 16 | handles whitespace properly | yes | type-map only |
| 17 | string interpolation isnt surprising | yes (already translatable) | `read-template` `\$`/`\${` escapes |
Expected: **+16 to +17**. Test 17 is already runnable (it is the one
non-eval-only case) but depends on template-escape handling that lives
in the same commit.
## 7. Risks / open questions
- **Position tracking.** The tokenizer currently stores `:pos P`. Tests
do not read it, so we leave it alone. E38 (SourceInfo API) will add
`start/end/line/column`; when that lands, `hs-raw->api-token` should
copy those through.
- **Template mode churn.** Introducing `hs-tokenize-template` risks
divergence from the main tokenizer. Mitigation: factor shared scan
helpers (whitespace, identifier, operator dispatch) into named
functions both variants call; keep the template variant a thin
wrapper that only overrides the backtick handler.
- **Keyword vs identifier type.** The current code tags reserved words
as `"keyword"`; upstream tags every bare word as `IDENTIFIER`. The
conformance tests always expect `IDENTIFIER`. Mapping both
`"keyword"` and `"ident"` to `"IDENTIFIER"` in the API layer is
safe and does **not** affect the parser, which consumes the raw
stream, not the API stream.
- **Mutable streams.** The API stream is intentionally mutable (cursor
advances on `consumeToken`). SX dicts are mutable via `dict-set!`
today; this is consistent with the rest of the hyperscript runtime,
which uses mutable dicts in `hs-activate!` and the event loop.
- **Do any existing tests depend on token shape?** `parser.sx` reads
`:type :value :pos`. It must **not** see the API-shaped dicts. The
API is strictly additive — `hs-tokenize` is unchanged; `hs-parse`
continues to consume its output directly. Only `hs-api-tokens`
(and its consumers) sees the upstream-shaped dicts.
- **Error-message contract.** Upstream throws on unterminated strings
and bad hex escapes. We currently return an EOF and emit a
trailing fragment. Adding a thrown error is new behaviour; confirm
the parser callers in `hs-compile` still produce useful diagnostics
when the tokenizer raises rather than eats the input.
- **`.list` indexing semantics.** Upstream tests read `.list[3]` and
`.list[4]` directly — these indices reference upstream's raw
token layout. If our SX tokenizer emits a slightly different
layout (e.g. extra whitespace-related tokens, or none where
upstream has one), the index tests fail even though `.type`/`.value`
are correct. Verify on a spike before committing: run
`(hs-tokens-of "(a).a")` and check that index 4 is the
`CLASS_REF`. If indices disagree, add a normalization pass that
strips tokens upstream omits.
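The spike from the last bullet as a one-liner; index 4 is the claim
made above — adjust if the real layout differs:
```
(let ((l (dict-get (hs-tokens-of "(a).a") :list)))
  (assert= (hs-token-type (nth l 4)) "CLASS_REF"))
```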
## 8. Implementation checklist
Ordered smallest-first; each is its own commit.
1. **Add `hs-api-tokens` and token helpers** (`lib/hyperscript/runtime.sx`).
Includes `hs-raw->api-token`, type-map, op-name table,
`hs-stream-token/consume/has-more`, EOF sentinel with
`"<<<EOF>>>"` value. No test delta yet — API-only.
2. **Extend string-escape table** in `read-string` (tokenizer):
add `\b \f \r \v \xNN`, keep existing `\n \t \\ <quote>`. Emit
structured error message `"Invalid hexadecimal escape: ..."` or
`"Unterminated string"`. Unlocks tests 2, 6, 13, 14.
3. **Extend the `=`/`!`/`<`/`>` lookahead to `===`** in tokenizer
   `scan!`. Currently only the two-character `[=!<>]=` forms are
   matched. Unlocks test 12.
4. **Add `--` line-comment support** to `scan!`. Currently only `//`
   (through selector disambiguation) is handled. Unlocks test 5.
5. **Add `hs-tokenize-template`** variant for template-bootstrap
lexical mode. Shared scan helpers extracted. Unlocks tests 1, 15.
6. **Generator pattern** in `tests/playwright/generate-sx-tests.py`:
recognise `_hyperscript.internals.tokenizer.tokenize(src[, true])`
+ consumer chain, emit SX `deftest` using the helpers from step 1.
Unlocks the 16 remaining eval-only tests (test 17 already has DOM
shape).
7. **Regenerate `spec/tests/test-hyperscript-behavioral.sx`** and run
`mcp__hs-test__hs_test_run(suite="hs-upstream-core/tokenizer")`.
Expected: 17/17, with test 17 also passing thanks to step 2's
escape fixes (it depends on `\$` / `\${` in `read-template`).
8. **Update** `plans/hs-conformance-to-100.md` row 37 to
`done (+17)` and tick the scoreboard in the same commit.
Work stays inside `lib/hyperscript/**`, `shared/static/wasm/sx/hs-*`,
`tests/playwright/generate-sx-tests.py`, and the two plan files —
matching the scope rule in the conformance plan.
`shared/static/wasm/sx/hs-runtime.sx` must be re-copied after each
runtime edit.