rose-ash/plans/designs/e38-sourceinfo.md
giles 67d4b9dae5 HS-design: E38 SourceInfo API
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:08:02 +00:00

# E38 — SourceInfo API (design)
Cluster 38 of `plans/hs-conformance-to-100.md`. Goal: 4 tests in `hs-upstream-core/sourceInfo` that exercise `_hyperscript.parse(src).sourceFor()` and `.lineFor()`.
Upstream reference: `/tmp/hs-upstream/test/core/sourceInfo.js`, `/tmp/hs-upstream/src/parsetree/base.js` (29 lines of impl — `sourceFor()` slices `programSource[startToken.start..endToken.end]`, `lineFor()` returns `programSource.split("\n")[startToken.line-1]`).
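As a reading aid, the upstream behaviour described above can be modelled in a few lines of Python. This is a sketch, not the upstream code: the `Token` class and the function names are hypothetical, and it assumes `end` is an exclusive offset and `line` is 1-based, as the slice and the `line-1` index suggest.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical token record; field names mirror the upstream description."""
    start: int  # offset of the token's first character
    end: int    # offset one past the token's last character (assumed exclusive)
    line: int   # 1-based line number of the token


def source_for(program_source, start_token, end_token):
    # Slice from the first token's start to the last token's end.
    return program_source[start_token.start:end_token.end]


def line_for(program_source, start_token):
    # Return the whole source line on which the node starts.
    return program_source.split("\n")[start_token.line - 1]
```

Note that `line_for` returns the full line, including leading whitespace, rather than the node's own slice.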
## 1. Failing tests
All four currently `SKIP (untranslated)` (lines 2434-2442 of `spec/tests/test-hyperscript-behavioral.sx`).
| # | Test name | What it asserts |
|---|-----------|-----------------|
| 1 | `debug` | `parse("<button.foo/>").sourceFor() == "<button.foo/>"` — single-token round-trip. |
| 2 | `get source works for expressions` | 8 separate `parse(…).sourceFor()` checks over `1`, `a.b`, `a.b()`, `<button.foo/>`, `x + y`, `'foo'`, `.foo`, `#bar`. Also navigates: `elt.root.sourceFor()` → `"a"` for `"a.b"`; `elt.root.root` for `"a.b()"`; `elt.lhs`/`elt.rhs` for `"x + y"`. |
| 3 | `get source works for statements` | `if true log 'it was true'` and `for x in [1, 2, 3] log x then log x end` each round-trip through `sourceFor()`. |
| 4 | `get line works for statements` | `parse("if true\n log 'it was true'\n log 'it was true'")` → `elt.lineFor()` → `"if true"`, `elt.trueBranch.lineFor()` → `" log 'it was true'"`, `elt.trueBranch.next.lineFor()` → `" log 'it was true'"`. |
Key demand: the AST must (a) retain a `{start, end, line}` span per node; (b) expose navigable sub-nodes (`root`, `lhs`, `rhs`, `trueBranch`, `next`); (c) provide `sourceFor`/`lineFor` keyed off the original program source.
## 2. Proposed API
User-visible surface, kept minimal:
```
(hs-parse-ast "SRC") ; → parsed node (an AST handle, see §3)
(hs-source-for NODE) ; → substring of original source
(hs-line-for NODE) ; → full source line containing NODE's start
(hs-node-get NODE KEY) ; → child AST node at field (root / lhs / rhs / true-branch / next …)
```
`NODE` is a **parsed-but-uncompiled** AST. It is not a compiled handler, not a runtime event. The upstream API mirrors this: `_hyperscript.parse(src)` returns a parse tree, never a closure. Keeping the feature scoped to parser output avoids retro-fitting spans onto bytecode or closures.
For the generator's benefit we expose three thin helpers at the test layer only:
```
(hs-src src) ; = (hs-source-for (hs-parse-ast src))
(hs-src-at src field-path) ; = walk (hs-node-get … key) then source-for
(hs-line-at src field-path) ; = walk (hs-node-get … key) then line-for
```
We do **not** add `(get line thing)` as a DSL keyword. That phrase in the plan row was shorthand — the tests actually call host methods `.sourceFor()` / `.lineFor()`, not hyperscript statements. Keeping this out of the HS grammar keeps the surface area near zero.
## 3. Attach strategy
The tokenizer and parser already have the raw material; the information is dropped at two points.
### Walk-through
| Stage | File | State today | Change |
|-------|------|-------------|--------|
| Tokenize | `lib/hyperscript/tokenizer.sx` | Tokens are `{:type T :value V :pos P}`. Only the start offset (`:pos`) is tracked; no `end`, no `line`. | Extend `hs-make-token` → `{:type :value :pos :end :line}`. Track a `current-line` counter in `hs-tokenize` that increments on `\n`. `:end` = index after last consumed char. |
| Parse | `lib/hyperscript/parser.sx` | `hs-parse` takes `(tokens src)`, returns bare SX lists/symbols. Source offsets are consumed internally (see `collect-sx-source` at path `(0 2 2 69)`) but never stored on the output AST. | For every production that returns a node, attach a span dict: wrap the output in `{:hs-ast true :kind … :start START :end END :line LINE :src SRC :children CHILDREN :fields FIELDS}`. `:children` preserves the SX list that `hs-compile` currently consumes downstream; `:fields` is a small dict mapping `:root :lhs :rhs :true-branch :next …` to sub-nodes. |
| Compile | `lib/hyperscript/compiler.sx` | `hs-to-sx` consumes the bare list AST and emits runtime calls. | Add a thin unwrap step at the entry: if the AST is a span-wrapped dict, pull `:children` (or equivalent raw list) and continue. No per-production rewiring — the wrapped form passes through unchanged for every existing callsite. |
| Runtime | `lib/hyperscript/runtime.sx` | Compiled code never sees AST nodes. | No change. SourceInfo lives on the parse tree, not on compiled handlers. |
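The tokenizer change in the first row can be illustrated with a toy tokenizer in Python. This is only a sketch of the bookkeeping: it splits on whitespace, ignores string literals and every real hyperscript token rule, and the dict keys simply mirror the proposed `{:type :value :pos :end :line}` shape.

```python
def tokenize(src):
    """Toy whitespace tokenizer that records start, end and line per token."""
    tokens = []
    pos = 0
    line = 1  # lines are counted from 1, matching upstream
    n = len(src)
    while pos < n:
        ch = src[pos]
        if ch == "\n":
            # Increment after consuming the newline, so the next token
            # is attributed to the following line.
            line += 1
            pos += 1
        elif ch.isspace():
            pos += 1
        else:
            start = pos
            while pos < n and not src[pos].isspace():
                pos += 1
            tokens.append({
                "type": "word",
                "value": src[start:pos],
                "pos": start,   # offset of first character
                "end": pos,     # index after the last consumed character
                "line": line,
            })
    return tokens
```

With this shape, `src[tok["pos"]:tok["end"]]` reproduces each token's text, which is the property `sourceFor` relies on.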
### Side-channel vs inline
**Inline wrapper dict is the cheaper option**, because:
- Parser output is already heterogeneous (lists, symbols, strings, numbers). A dict wrapper is distinguishable by `(dict? x)` + `(dict-get x :hs-ast)` — no risk of collision.
- A side-channel `(map node → span)` would need identity semantics, and SX lists don't have stable identity after any structural transform. We would end up cloning everything.
- The compiler's existing `hs-to-sx` dispatch is on `(first ast)`. The unwrap step is a single `cond` branch at its top.
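A rough Python model of the inline wrapper and the compiler-side unwrap is given here. The function names are hypothetical, Python dicts and lists stand in for SX dicts and lists, and the recursive unwrap of nested wrapped children is an assumption; the design only says the entry point pulls `:children` and continues.

```python
def ast_wrap(raw, kind, start, end, line, src, fields=None):
    """Wrap a raw parser result in a span-carrying dict."""
    return {
        "hs-ast": True,
        "kind": kind,
        "start": start,
        "end": end,
        "line": line,
        "src": src,            # shared reference to the program source
        "children": raw,       # the bare form the compiler already understands
        "fields": fields or {},
    }


def is_wrapped(x):
    # Mirrors the (dict? x) + (dict-get x :hs-ast) check.
    return isinstance(x, dict) and x.get("hs-ast") is True


def unwrap(ast):
    """Strip wrappers so downstream code sees the bare list form."""
    if is_wrapped(ast):
        return unwrap(ast["children"])
    if isinstance(ast, list):
        return [unwrap(child) for child in ast]
    return ast
```

Because the check is a single predicate, the unwrap can sit in front of the existing dispatch without touching any individual production.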
### Field dictionary
The parser emits nodes in many shapes. `:fields` names a handful of them so `hs-node-get` can navigate without the caller learning SX shape. Mapping (from the upstream tests):
| Upstream accessor | Our field key | Produced by |
|-------------------|---------------|-------------|
| `.root` | `:root` | symbol-with-member / call expressions (`a.b`, `a.b()`). For `a.b` the root is `a`; for `a.b()` the root is `a.b`. |
| `.lhs` / `.rhs` | `:lhs` / `:rhs` | binary operators (`x + y`). |
| `.trueBranch` | `:true-branch` | `if` command; the first command in the consequent. |
| `.next` | `:next` | any command; the following command in a `CommandList`. |
Only these five fields (the four table rows above) are needed for the 4 tests. Others are deferred.
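Field navigation can be sketched in Python as a lookup plus a path walk. The helper names and the hand-built nodes are illustrative only; the spans for `a.b()` are written out by hand under the assumption that `end` is exclusive.

```python
def make_node(src, start, end, line, fields=None):
    """Hand-built stand-in for a span-wrapped AST node."""
    return {"hs-ast": True, "src": src, "start": start, "end": end,
            "line": line, "fields": fields or {}}


def node_get(node, key):
    # Counterpart of (hs-node-get NODE KEY): look up a named child.
    return node["fields"][key]


def walk(node, path):
    # Follow a list of field keys, e.g. ["root", "root"].
    for key in path:
        node = node_get(node, key)
    return node


def source_for(node):
    return node["src"][node["start"]:node["end"]]


SRC = "a.b()"
a = make_node(SRC, 0, 1, 1)                      # receiver `a`
member = make_node(SRC, 0, 3, 1, {"root": a})    # member access `a.b`
call = make_node(SRC, 0, 5, 1, {"root": member}) # call `a.b()`
```

Here `walk(call, ["root"])` reaches the `a.b` node and `walk(call, ["root", "root"])` reaches `a`, matching the `.root` / `.root.root` rows of the table.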
### Span capture
The parser already tracks start offsets via its token cursor; `collect-sx-source` shows the end-substring pattern. Pattern for every production:
```
(let ((start (current-pos))
      (start-line (current-line)))
  (let ((raw (… existing production …)))
    (let ((end (previous-pos)))
      (hs-ast-wrap raw :kind "…" :start start :end end :line start-line :src src))))
```
Two tiny helpers (`current-pos`, `current-line`) added to the parser's inner `let` scope. `hs-ast-wrap` lives alongside `collect-sx-source`.
## 4. Test mock / generator strategy
Add one pattern to `tests/playwright/generate-sx-tests.py` (cluster: sourceInfo). Recognise:
```js
_hyperscript.parse("SRC").sourceFor()                →  (hs-src "SRC")
_hyperscript.parse("SRC").root.sourceFor()           →  (hs-src-at "SRC" (list :root))
_hyperscript.parse("SRC").root.root.sourceFor()      →  (hs-src-at "SRC" (list :root :root))
_hyperscript.parse("SRC").lhs.sourceFor()            →  (hs-src-at "SRC" (list :lhs))
_hyperscript.parse("SRC").rhs.sourceFor()            →  (hs-src-at "SRC" (list :rhs))
_hyperscript.parse("SRC").lineFor()                  →  (hs-line-at "SRC" (list))
_hyperscript.parse("SRC").trueBranch.lineFor()       →  (hs-line-at "SRC" (list :true-branch))
_hyperscript.parse("SRC").trueBranch.next.lineFor()  →  (hs-line-at "SRC" (list :true-branch :next))
```
Object-returning patterns (`return { src: …, rootSrc: … }`) become one `assert=` per member. The generator already has the newline escaping infrastructure for string bodies (cluster 17 etc. exercised it).
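One way the recognition table could be expressed in the Python generator is a single regular expression plus a key map. This is a sketch under assumptions: the names `PATTERN`, `ACCESSOR_KEYS` and `translate` are hypothetical and not taken from `generate-sx-tests.py`, it only handles the fully inlined form shown in the table (not an intermediate `elt` variable), and it passes the string body through verbatim rather than using the generator's existing escaping helpers.

```python
import re

# Upstream accessor name -> SX field keyword (from the field dictionary).
ACCESSOR_KEYS = {
    "root": ":root",
    "lhs": ":lhs",
    "rhs": ":rhs",
    "trueBranch": ":true-branch",
    "next": ":next",
}

PATTERN = re.compile(
    r'_hyperscript\.parse\("(?P<src>(?:[^"\\]|\\.)*)"\)'   # parse("SRC")
    r'(?P<path>(?:\.(?:root|lhs|rhs|trueBranch|next))*)'   # .root.root ...
    r'\.(?P<method>sourceFor|lineFor)\(\)'                 # terminal call
)


def translate(js_expr):
    """Return the SX helper call for a recognised JS expression, else None."""
    m = PATTERN.fullmatch(js_expr.strip())
    if m is None:
        return None
    src = m.group("src")
    keys = [ACCESSOR_KEYS[name] for name in m.group("path").split(".") if name]
    joined = " ".join(keys)
    if m.group("method") == "sourceFor":
        if not keys:
            return '(hs-src "%s")' % src
        return '(hs-src-at "%s" (list %s))' % (src, joined)
    # lineFor always goes through hs-line-at, with an empty list at the root.
    if not keys:
        return '(hs-line-at "%s" (list))' % src
    return '(hs-line-at "%s" (list %s))' % (src, joined)
```

Returning `None` for anything unrecognised lets the generator fall back to its existing untranslated path.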
No mock-DOM changes required — SourceInfo does not touch the DOM. `hs-cleanup!` is unused here.
## 5. Test-delta estimate
| Test | Sub-assertions | Blockers today | Delta |
|------|----------------|----------------|-------|
| `debug` | 1 | Parser must accept `<button.foo/>` as a full expression (already does — it's a CSS-literal). Needs `sourceFor`. | +1 |
| `get source works for expressions` | ~9 | Adds binary operator span (`x + y`) and nested-member navigation (`.root.root`). | +1 (one test, all assertions must pass) |
| `get source works for statements` | 2 | Needs statement-level span; `if … log …` and `for … end` already parse. | +1 |
| `get line works for statements` | 3 | Needs `:line`, `:true-branch`, `:next` field navigation, and the `lineFor` semantics (newline-indexed string split, not just the node's own source slice). | +1 |
Total: **+4** (matches the plan's cluster row).
## 6. Risks
- **AST equality.** Wrapping every parser node in a dict changes `equal?` semantics for any caller that does structural comparison on AST output. Mitigation: the compiler's entry unwrap means all downstream code sees the bare form. Only new `hs-parse-ast` callers see the wrapped form. Direct `hs-parse`/`hs-compile`/`hs-to-sx-from-source` keep their existing return shape.
- **Serialisation.** If AST nodes are ever sent over the wire (they are not today, but the `spec/tests` runner serialises results for error printing), the wrapper dict grows the payload. Mitigation: keep `:src` as a reference to the shared program source string (one copy) rather than slicing per node; SX dicts share values.
- **Memory.** One extra dict per node. The parser currently allocates a list per node; we double that. For the largest test program (`for x in [1, 2, 3] log x then log x end`) this is ~15 nodes. Negligible.
- **`lineFor` off-by-one.** Upstream uses `programSource.split("\n")[startToken.line - 1]` and counts lines from 1. Our `current-line` must mirror exactly — increment *after* `\n`, first line is `1`. Unit-test the tokenizer on the `"if true\n log …\n log …"` fixture before wiring the parser.
- **Operator associativity and `.root`.** Upstream's `a.b()` gives `.root = (a.b)` and `.root.root = a`. Our parser must record the callee sub-expression as `:root` of a call node, and the receiver as `:root` of a member node. A one-liner slip here would fail test 2 silently.
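The `lineFor` off-by-one risk can be checked independently of the tokenizer with a small Python reference. The helper `line_of` is hypothetical; it derives the 1-based line of an offset by counting preceding newlines, which is what a `current-line` counter that starts at `1` and increments after each `\n` should agree with.

```python
def line_of(src, offset):
    """1-based line number of the character at `offset`."""
    # Every newline strictly before the offset pushes it one line down.
    return src.count("\n", 0, offset) + 1


def line_text(src, line):
    # Same lookup as upstream: split on newlines, index with line - 1.
    return src.split("\n")[line - 1]


FIXTURE = "if true\n log 'it was true'\n log 'it was true'"
first_log = FIXTURE.index("log")
second_log = FIXTURE.index("log", first_log + 1)
```

Comparing the tokenizer's `:line` values against `line_of` on this fixture would catch a counter that starts at `0` or increments before the newline is consumed.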
## 7. Implementation checklist
Four commits. Each commit passes the baseline smoke range (0-195) before moving on.
1. **Tokenizer: add `:end` and `:line` to tokens.** Extend `hs-make-token`; track `current-line` in `hs-tokenize`; update every emission site (there are ~20). No parser changes yet. Unit-test via a small ad-hoc `deftest` in the tokenizer's own test fixture (or inline in `behavioral.sx` under a throwaway suite — remove before commit). Commit: `HS: tokenizer tracks :end and :line`.
2. **Parser: wrap output nodes with span dict + fields.** Introduce `hs-ast-wrap`, `current-pos`, `current-line`. Wrap expression and statement productions. Populate `:root :lhs :rhs :true-branch :next` for the handful of node shapes the tests exercise. Add entry-unwrap to `hs-to-sx` so downstream consumers are unaffected. Commit: `HS: parser attaches source spans to AST nodes`.
3. **API: `hs-parse-ast`, `hs-source-for`, `hs-line-for`, `hs-node-get` + test helpers `hs-src`, `hs-src-at`, `hs-line-at`.** Thin functions. Place `hs-parse-ast` in `parser.sx`, accessors in `runtime.sx` (so they're auto-loaded by the behavioral runner), helpers inline in `test-hyperscript-behavioral.sx` via the generator. Commit: `HS: sourceInfo API (sourceFor / lineFor / node-get)`.
4. **Generator: sourceInfo pattern + regenerate 4 tests.** Add the pattern matchers from §4 to `generate-sx-tests.py`. Regenerate `spec/tests/test-hyperscript-behavioral.sx`. Verify `hs-upstream-core/sourceInfo` goes from 0/4 to 4/4 and no regression in the 0-195 smoke range. Remember: `cp lib/hyperscript/<f>.sx shared/static/wasm/sx/hs-<f>.sx` after each `.sx` touch. Commit: `HS: sourceInfo (+4 tests)`.
## Notes
- No runtime changes. SourceInfo is purely a parser-side facility.
- No changes to the HS DSL grammar. `get line` / `get source` are *not* added as hyperscript keywords — the upstream test file exclusively calls host-side methods on parse-tree objects.
- Upstream's impl is 7 lines of host JS. Ours lands in about 30 lines of SX plus a generator pattern.