blogimport: lexical->persist genesis-import + at-rest parity verifier (55/55)
Some checks failed
Test, Build, and Deploy / test-build-deploy (push) Failing after 1m9s

Implements plans/migration/data-migration.md (the un-started long-pole) and the
data-layer half of slice-01-blog §4. Host-ops migration module composing
content-on-sx + persist public APIs; isolated from lib/host and lib/content.

- lexical.sx: Ghost lexical (as SX dicts) -> content block list, deterministic ids
- import.sx: genesis import into content:<id> op-log, idempotent, + postmeta stream
- verify.sx: replay-and-diff vs row-derived oracle (proves round-trip lossless)

Inline formatting flattens to plain text (Phase-5 runs swap-point isolated in
lex-inline-text); live Postgres source (Q-M4) + improved-converter re-import (Q-M5)
flagged in README. 55/55 conformance: lexical 23, import 21, verify 11.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-30 13:14:30 +00:00
parent 1597eaa4f8
commit a4d93c61cc
10 changed files with 683 additions and 0 deletions

47
lib/blogimport/README.md Normal file
View File

@@ -0,0 +1,47 @@
# lib/blogimport — blog Postgres → persist genesis-import + parity verifier
Implements **`plans/migration/data-migration.md`** (the "long-pole nobody had
started") and the at-rest half of **`slice-01-blog.md` §4** — the data layer of the
blog read-path migration. Host-ops migration tooling, **not** a domain core: it
composes the public APIs of content-on-sx (`lib/content`) and persist
(`lib/persist`). Kept in its own module (not `lib/host`, not `lib/content`) so it
doesn't collide with the loops that own those.
Status: **machinery complete, 55/55 conformance** (lexical 23, import 21, verify 11).
## What it does
| Module | Role |
|---|---|
| `lexical.sx` | `blogimport/lex-blocks doc` — Ghost **lexical** body (as SX dicts) → content-on-sx **block list**, ids deterministic by position (`b0,b1,…`). |
| `import.sx` | `blogimport/import-post! b post at` — genesis import: convert the post's lexical, commit blocks as ordered `op-insert`s into the `content:<id>` op-log stream, record metadata in a sibling `postmeta:<id>` stream. Idempotent (skip-if-exists). `import-all!` → coverage scoreboard. |
| `verify.sx` | `blogimport/verify-post b post` — replay the stream → block model, diff vs the row-derived oracle with `=`. `verify-all``{:total :ok :mismatched}` coverage. |
## What is proven
The verifier holds **`lexical → import → persist → replay → block-model`** equal to
**`lexical → block-model`** computed directly. I.e. **the genesis import + op-log
replay is lossless** — "did the backfill corrupt anything" at rest
(`data-migration.md` §6). The `verify.sx` corruption test confirms a diverging stream
is *detected*, not silently passed.
## Known limitations / TODO (carry into the plan)
- **Inline formatting is flattened to plain text.** Architecture's content model holds
plain-string text (`mk-text id text`); Phase-5 rich inline runs are not merged here.
The single swap-point is `lex-inline-text` in `lexical.sx` — return runs there once
content-on-sx Phase 5 lands on `architecture`. Bold/italic/links currently collapse
to their plain concatenation (drift-proof, == `asText`). (slice-01-blog Q-B1.)
- **Oracle is the in-memory lexical→blocks, not the live Python block model.** This
proves round-trip fidelity through persist. The "does SX match Python" half of Q-D2
needs the **live source**: read real `Post` rows via the internal-data query
(`/internal/data/…`) or direct Postgres (**Q-M4**, undecided) and feed them as `post`
dicts. The diff plumbing here is the twin that step reuses.
- **Re-import with an improved converter (Q-M5)** is import-once today (skip-if-exists).
Superseding prior genesis events (vs truncate+re-import) is future work.
## Run
```bash
bash lib/blogimport/conformance.sh # 55/55; writes scoreboard.{json,md}
```

119
lib/blogimport/conformance.sh Executable file
View File

@@ -0,0 +1,119 @@
#!/usr/bin/env bash
# lib/blogimport/conformance.sh — run blog-import suites, emit scoreboard.
# Mirrors lib/content/conformance.sh: epoch-loaded modules + a bi-test counter.
set -uo pipefail
cd "$(git rev-parse --show-toplevel)"
SX_SERVER="hosts/ocaml/_build/default/bin/sx_server.exe"
if [ ! -x "$SX_SERVER" ]; then
MAIN_ROOT=$(git worktree list | head -1 | awk '{print $1}')
if [ -x "$MAIN_ROOT/$SX_SERVER" ]; then
SX_SERVER="$MAIN_ROOT/$SX_SERVER"
else
echo "ERROR: sx_server.exe not found." >&2
exit 1
fi
fi
SUITES=(lexical import verify)
OUT_JSON="lib/blogimport/scoreboard.json"
OUT_MD="lib/blogimport/scoreboard.md"
run_suite() {
local suite=$1
local file="lib/blogimport/tests/${suite}.sx"
[ -f "$file" ] || { echo "0 0"; return; }
local TMP
TMP=$(mktemp)
cat > "$TMP" << EPOCHS
(epoch 1)
(load "lib/smalltalk/tokenizer.sx")
(load "lib/smalltalk/parser.sx")
(load "lib/guest/reflective/class-chain.sx")
(load "lib/smalltalk/runtime.sx")
(load "lib/guest/reflective/env.sx")
(load "lib/smalltalk/eval.sx")
(load "lib/persist/event.sx")
(load "lib/persist/backend.sx")
(load "lib/persist/log.sx")
(load "lib/persist/kv.sx")
(load "lib/persist/api.sx")
(load "lib/content/block.sx")
(load "lib/content/doc.sx")
(load "lib/content/render.sx")
(load "lib/content/api.sx")
(load "lib/content/meta.sx")
(load "lib/content/section.sx")
(load "lib/content/callout.sx")
(load "lib/content/media.sx")
(load "lib/content/store.sx")
(load "lib/blogimport/lexical.sx")
(load "lib/blogimport/import.sx")
(load "lib/blogimport/verify.sx")
(epoch 2)
(eval "(define bi-test-pass 0)")
(eval "(define bi-test-fail 0)")
(eval "(define bi-test-fails (list))")
(eval "(define bi-test (fn (name got expected) (if (= got expected) (set! bi-test-pass (+ bi-test-pass 1)) (begin (set! bi-test-fail (+ bi-test-fail 1)) (set! bi-test-fails (cons name bi-test-fails))))))")
(epoch 3)
(load "${file}")
(epoch 4)
(eval "(list bi-test-pass bi-test-fail)")
EPOCHS
local OUTPUT
OUTPUT=$(timeout 240 "$SX_SERVER" < "$TMP" 2>/dev/null)
rm -f "$TMP"
local LINE
LINE=$(echo "$OUTPUT" | awk '/^\(ok-len 4 / {getline; print; exit}')
if [ -z "$LINE" ]; then
LINE=$(echo "$OUTPUT" | grep -E '^\(ok 4 \([0-9]+ [0-9]+\)\)' | tail -1 \
| sed -E 's/^\(ok 4 //; s/\)$//')
fi
local P F
P=$(echo "$LINE" | sed -E 's/^\(([0-9]+) ([0-9]+)\).*/\1/')
F=$(echo "$LINE" | sed -E 's/^\(([0-9]+) ([0-9]+)\).*/\2/')
echo "${P:-0} ${F:-0}"
}
declare -A SUITE_PASS SUITE_FAIL
TOTAL_PASS=0
TOTAL_FAIL=0
echo "Running blogimport conformance suite..." >&2
for s in "${SUITES[@]}"; do
read -r p f < <(run_suite "$s")
SUITE_PASS[$s]=$p
SUITE_FAIL[$s]=$f
TOTAL_PASS=$((TOTAL_PASS + p))
TOTAL_FAIL=$((TOTAL_FAIL + f))
printf " %-10s %d/%d\n" "$s" "$p" "$((p+f))" >&2
done
{
printf '{\n "suites": {\n'
first=1
for s in "${SUITES[@]}"; do
if [ $first -eq 0 ]; then printf ',\n'; fi
printf ' "%s": {"pass": %d, "fail": %d}' "$s" "${SUITE_PASS[$s]}" "${SUITE_FAIL[$s]}"
first=0
done
printf '\n },\n "total_pass": %d,\n "total_fail": %d,\n "total": %d\n}\n' \
"$TOTAL_PASS" "$TOTAL_FAIL" "$((TOTAL_PASS + TOTAL_FAIL))"
} > "$OUT_JSON"
{
printf '# blogimport Conformance Scoreboard\n\n_Generated by `lib/blogimport/conformance.sh`_\n\n'
printf '| Suite | Pass | Fail | Total |\n|-------|-----:|-----:|------:|\n'
for s in "${SUITES[@]}"; do
printf '| %s | %d | %d | %d |\n' "$s" "${SUITE_PASS[$s]}" "${SUITE_FAIL[$s]}" "$(( ${SUITE_PASS[$s]} + ${SUITE_FAIL[$s]} ))"
done
printf '| **Total** | **%d** | **%d** | **%d** |\n' "$TOTAL_PASS" "$TOTAL_FAIL" "$((TOTAL_PASS + TOTAL_FAIL))"
} > "$OUT_MD"
echo "Total: $TOTAL_PASS pass, $TOTAL_FAIL fail" >&2
[ "$TOTAL_FAIL" -eq 0 ]

84
lib/blogimport/import.sx Normal file
View File

@@ -0,0 +1,84 @@
; lib/blogimport/import.sx
; Genesis import: a blog Post row -> a persist content op-log stream.
;
; Per plans/migration/data-migration.md §3-5: for each Post, convert its lexical
; body to content blocks and commit them as genesis insert ops into the
; content:<id> stream, idempotently, with post metadata recorded as an event in a
; sibling stream. The same code runs on mem and durable persist backends (every fn
; takes the backend `b`, the acl.sx design principle).
;
; A `post` is a dict mirroring the blog Post row:
; {:id "uuid" :slug "hello" :title "Hello" :status "published"
; :visibility "public" :tags (list "a") :authors (list "u1")
; :lexical <lexical-doc-as-sx-dict>}
; Reading real rows (internal-data query vs direct Postgres, Q-M4) is the live-source
; edge, out of scope here; this drives content/commit! given a `post` dict.
; --- genesis ops: insert each block in document order (deterministic) -----------
; first block after nil (prepend), each subsequent after the previous block's id,
; reproducing source order so re-import yields the same sequence (data-migration §5).
(define
blogimport/genesis-ops
(fn (blocks)
(let ((ids (map blk-id blocks)))
(map-indexed
(fn (i blk) (op-insert blk (if (= i 0) nil (nth ids (- i 1)))))
blocks))))
; --- post metadata (title/slug/status/visibility/tags/authors) ------------------
(define
blogimport/post-meta
(fn (post)
{:title (or (get post :title) "")
:slug (or (get post :slug) "")
:status (or (get post :status) "")
:visibility (or (get post :visibility) "")
:tags (or (get post :tags) (list))
:authors (or (get post :authors) (list))}))
; metadata is not a content op, so it rides a sibling event stream postmeta:<id>;
; latest event wins (LWW). Replayable + durable like the block op-log.
(define blogimport/meta-stream (fn (id) (str "postmeta:" id)))
(define
blogimport/commit-meta!
(fn (b id meta at)
(persist/append b (blogimport/meta-stream id) "post-meta" at meta)))
(define
blogimport/load-meta
(fn (b id)
(let ((evs (persist/read b (blogimport/meta-stream id))))
(if (= (len evs) 0) nil (persist/event-data (nth evs (- (len evs) 1)))))))
; --- idempotency: a stream already holding events is already imported -----------
; (host-persist guarantees monotonic seq but NOT dedupe — skip-if-exists is the
; importer's dedupe, so re-running the backfill never double-imports. data-migration
; §5.) Re-import with an improved converter (Q-M5) is future work — superseding,
; not duplicating; this build is import-once.
(define
blogimport/imported?
(fn (b id) (> (content/version-count b id) 0)))
; --- import one post ------------------------------------------------------------
(define
blogimport/import-post!
(fn (b post at)
(let ((id (get post :id)))
(if
(blogimport/imported? b id)
{:id id :imported false :reason "exists"}
(let ((blocks (blogimport/lex-blocks (get post :lexical))))
(begin
(content/commit-all! b id (blogimport/genesis-ops blocks) at)
(blogimport/commit-meta! b id (blogimport/post-meta post) at)
{:id id :imported true :blocks (len blocks)}))))))
; --- import many: coverage scoreboard -------------------------------------------
(define
blogimport/import-all!
(fn (b posts at)
(let ((results (map (fn (p) (blogimport/import-post! b p at)) posts)))
{:total (len results)
:imported (len (filter (fn (r) (get r :imported)) results))
:skipped (len (filter (fn (r) (not (get r :imported))) results))})))

129
lib/blogimport/lexical.sx Normal file
View File

@@ -0,0 +1,129 @@
; lib/blogimport/lexical.sx
; Lexical (Ghost editor JSON, as SX dicts) -> content-on-sx block list.
;
; The blog migration's lexical->blocks converter. Lives on the blog/migration
; side (NOT lib/content, NOT lib/host) per plans/migration/data-migration.md §7.
;
; Input shape: a lexical document is an SX dict mirroring the JSON 1:1, e.g.
; {:root {:children (list
; {:type "heading" :tag "h2" :children (list {:type "text" :text "Hi"})}
; {:type "paragraph" :children (list
; {:type "text" :text "plain "}
; {:type "text" :text "bold" :format 1}
; {:type "link" :url "/x" :children (list {:type "text" :text "here"})})})}}
;
; Block ids are assigned deterministically by top-level position ("b0","b1",...)
; so a re-import yields the SAME block sequence (data-migration.md §5 ordering rule).
;
; INLINE FORMATTING: architecture's content model holds PLAIN-STRING text
; (mk-text id text). Phase-5 rich inline runs are not merged here yet, so inline
; nodes are flattened to their plain concatenation (== asText, drift-proof). The
; single swap-point for the runs upgrade is `lex-inline-text` below — when
; content-on-sx Phase 5 lands on architecture, return runs there instead of a
; string. (slice-01-blog.md Q-B1; "prove the machinery first, then swap".)
; Inline format bitmask (lexical): bold=1 italic=2 strikethrough=4 underline=8
; code=16 subscript=32 superscript=64. Decoding the bitmask into mark keywords is
; deferred to the Phase-5 runs upgrade (no bitwise prim on architecture, and the
; active path flattens to plain text anyway). The :format field is read at the
; swap-point `lex-inline-text` when runs land.
; --- inline node -> plain text --------------------------------------------------
(define
lex-inline-node-text
(fn (node)
(let ((t (get node :type)))
(cond
((equal? t "text") (or (get node :text) ""))
((equal? t "linebreak") "\n")
((equal? t "tab") "\t")
((equal? t "link") (lex-inline-text (or (get node :children) (list))))
((equal? t "autolink") (lex-inline-text (or (get node :children) (list))))
((equal? t "at-link") (lex-inline-text (or (get node :children) (list))))
((equal? t "code-highlight") (or (get node :text) ""))
(else "")))))
; flatten a list of inline nodes to one plain string.
; *** Phase-5 swap-point: return a runs list here once mk-text accepts runs. ***
(define
lex-inline-text
(fn (children)
(reduce
(fn (acc n) (str acc (lex-inline-node-text n)))
""
children)))
; --- helpers --------------------------------------------------------------------
(define
lex-heading-level
(fn (tag)
(cond
((equal? tag "h1") 1)
((equal? tag "h2") 2)
((equal? tag "h3") 3)
((equal? tag "h4") 4)
((equal? tag "h5") 5)
((equal? tag "h6") 6)
(else 2))))
(define
lex-listitem-text
(fn (item)
(lex-inline-text (or (get item :children) (list)))))
; --- one lexical block node -> a content block (id assigned by caller) ----------
(define
lex-block
(fn (node id)
(let ((t (get node :type)))
(cond
((equal? t "paragraph")
(mk-text id (lex-inline-text (or (get node :children) (list)))))
((equal? t "extended-text")
(mk-text id (lex-inline-text (or (get node :children) (list)))))
((equal? t "heading")
(mk-heading id (lex-heading-level (get node :tag))
(lex-inline-text (or (get node :children) (list)))))
((equal? t "extended-heading")
(mk-heading id (lex-heading-level (get node :tag))
(lex-inline-text (or (get node :children) (list)))))
((equal? t "quote")
(mk-quote id "" (lex-inline-text (or (get node :children) (list)))))
((equal? t "extended-quote")
(mk-quote id "" (lex-inline-text (or (get node :children) (list)))))
((equal? t "codeblock")
(mk-code id (or (get node :language) "") (or (get node :code) "")))
((equal? t "list")
(mk-list id
(equal? (get node :listType) "number")
(map lex-listitem-text (or (get node :children) (list)))))
((equal? t "horizontalrule") (mk-divider id))
((equal? t "image")
(mk-image id (or (get node :src) "") (or (get node :alt) "")))
((equal? t "callout")
(mk-callout id (or (get node :backgroundColor) "grey")
(lex-inline-text (or (get node :children) (list)))))
((equal? t "video") (mk-media id "video" (or (get node :src) "")))
((equal? t "audio") (mk-media id "audio" (or (get node :src) "")))
((equal? t "embed") (mk-embed id (or (get node :url) "") "embed"))
((equal? t "bookmark") (mk-embed id (or (get node :url) "") "bookmark"))
; unknown/unsupported card: route to a generic embed tagged by type so
; nothing is silently dropped (provider records the original node type).
(else (mk-embed id "" (or t "unknown")))))))
; --- doc -> top-level children list ---------------------------------------------
(define
lex-doc-children
(fn (doc)
(cond
((not (equal? (get doc :root) nil)) (or (get (get doc :root) :children) (list)))
((not (equal? (get doc :children) nil)) (get doc :children))
(else (list)))))
; --- doc -> content block list (deterministic ids by position) ------------------
(define
blogimport/lex-blocks
(fn (doc)
(map-indexed
(fn (i node) (lex-block node (str "b" i)))
(lex-doc-children doc))))

View File

@@ -0,0 +1,10 @@
{
"suites": {
"lexical": {"pass": 23, "fail": 0},
"import": {"pass": 21, "fail": 0},
"verify": {"pass": 11, "fail": 0}
},
"total_pass": 55,
"total_fail": 0,
"total": 55
}

View File

@@ -0,0 +1,10 @@
# blogimport Conformance Scoreboard
_Generated by `lib/blogimport/conformance.sh`_
| Suite | Pass | Fail | Total |
|-------|-----:|-----:|------:|
| lexical | 23 | 0 | 23 |
| import | 21 | 0 | 21 |
| verify | 11 | 0 | 11 |
| **Total** | **55** | **0** | **55** |

View File

@@ -0,0 +1,62 @@
; lib/blogimport/tests/import.sx — genesis import + idempotency
(st-bootstrap-classes!)
(content-bootstrap-blocks!)
(content-bootstrap-doc!)
(content-bootstrap-callout!)
(content-bootstrap-media!)
(define
p1
{:id "post-1" :slug "hello" :title "Hello" :status "published"
:visibility "public" :tags (list "news") :authors (list "u1")
:lexical {:root {:children (list
{:type "heading" :tag "h1" :children (list {:type "text" :text "Hello"})}
{:type "paragraph" :children (list {:type "text" :text "world"})})}}})
(define
p2
{:id "post-2" :slug "two" :title "Two" :status "published"
:lexical {:children (list
{:type "paragraph" :children (list {:type "text" :text "second"})})}})
; ---- genesis-ops ordering ----
(define ops1 (blogimport/genesis-ops (blogimport/lex-blocks (get p1 :lexical))))
(bi-test "genesis op kinds" (map (fn (o) (get o :op)) ops1) (list "insert" "insert"))
(bi-test "genesis first after nil" (get (nth ops1 0) :after) nil)
(bi-test "genesis second after first id" (get (nth ops1 1) :after) "b0")
; ---- import one ----
(define B (persist/open))
(define r1 (blogimport/import-post! B p1 10))
(bi-test "import imported flag" (get r1 :imported) true)
(bi-test "import block count" (get r1 :blocks) 2)
(bi-test "stream version-count" (content/version-count B "post-1") 2)
(bi-test "head ids" (doc-ids (content/head B "post-1")) (list "b0" "b1"))
(bi-test "head body text"
(str (blk-send (doc-find (content/head B "post-1") "b1") "text")) "world")
(bi-test "head heading level"
(blk-send (doc-find (content/head B "post-1") "b0") "level") 1)
; ---- metadata round-trip ----
(bi-test "meta round-trip" (blogimport/load-meta B "post-1") (blogimport/post-meta p1))
(bi-test "meta title" (get (blogimport/load-meta B "post-1") :title) "Hello")
(bi-test "meta tags" (get (blogimport/load-meta B "post-1") :tags) (list "news"))
; ---- idempotent re-import (skip-if-exists, no duplication) ----
(define r1b (blogimport/import-post! B p1 99))
(bi-test "reimport skipped" (get r1b :imported) false)
(bi-test "reimport reason" (get r1b :reason) "exists")
(bi-test "version-count unchanged after reimport" (content/version-count B "post-1") 2)
(bi-test "head ids unchanged after reimport"
(doc-ids (content/head B "post-1")) (list "b0" "b1"))
; ---- import-all! coverage scoreboard ----
(define B2 (persist/open))
(define cov1 (blogimport/import-all! B2 (list p1 p2) 5))
(bi-test "import-all total" (get cov1 :total) 2)
(bi-test "import-all imported" (get cov1 :imported) 2)
(bi-test "import-all skipped" (get cov1 :skipped) 0)
; re-run is fully idempotent
(define cov2 (blogimport/import-all! B2 (list p1 p2) 6))
(bi-test "import-all rerun imported" (get cov2 :imported) 0)
(bi-test "import-all rerun skipped" (get cov2 :skipped) 2)

View File

@@ -0,0 +1,92 @@
; lib/blogimport/tests/lexical.sx — lexical -> content block converter
(st-bootstrap-classes!)
(content-bootstrap-blocks!)
(content-bootstrap-doc!)
(content-bootstrap-callout!)
(content-bootstrap-media!)
; ---- a representative lexical document (Ghost editor JSON, as SX dicts) ----
(define
doc
{:root {:children (list
{:type "heading" :tag "h2" :children (list {:type "text" :text "Title"})}
{:type "paragraph" :children (list
{:type "text" :text "plain "}
{:type "text" :text "bold" :format 1}
{:type "text" :text " then "}
{:type "link" :url "/x" :children (list {:type "text" :text "a link"})})}
{:type "quote" :children (list {:type "text" :text "wise words"})}
{:type "list" :listType "number" :children (list
{:type "listitem" :children (list {:type "text" :text "one"})}
{:type "listitem" :children (list {:type "text" :text "two"})})}
{:type "codeblock" :language "python" :code "print(1)"}
{:type "horizontalrule"}
{:type "image" :src "/c.png" :alt "a cat"}
{:type "callout" :backgroundColor "blue" :children (list {:type "text" :text "note!"})}
{:type "twitter" :url "https://t/x"})}})
(define blocks (blogimport/lex-blocks doc))
; ---- structure ----
(bi-test "block count" (len blocks) 9)
(bi-test "ids by position" (map blk-id blocks)
(list "b0" "b1" "b2" "b3" "b4" "b5" "b6" "b7" "b8"))
(bi-test "types in order" (map blk-type blocks)
(list "heading" "text" "quote" "list" "code" "divider" "image" "callout" "embed"))
; ---- heading ----
(bi-test "heading level" (blk-send (nth blocks 0) "level") 2)
(bi-test "heading text" (str (blk-send (nth blocks 0) "text")) "Title")
; ---- paragraph with inline bold + link, flattened to plain concatenation ----
(bi-test "paragraph flattened text"
(str (blk-send (nth blocks 1) "text")) "plain bold then a link")
; ---- quote ----
(bi-test "quote text" (str (blk-send (nth blocks 2) "text")) "wise words")
; ---- ordered list with items ----
(bi-test "list ordered" (blk-send (nth blocks 3) "ordered") true)
(bi-test "list items" (blk-send (nth blocks 3) "items") (list "one" "two"))
; ---- code block ----
(bi-test "code language" (str (blk-send (nth blocks 4) "language")) "python")
(bi-test "code text" (str (blk-send (nth blocks 4) "text")) "print(1)")
; ---- image ----
(bi-test "image src" (str (blk-send (nth blocks 6) "src")) "/c.png")
(bi-test "image alt" (str (blk-send (nth blocks 6) "alt")) "a cat")
; ---- callout ----
(bi-test "callout kind" (str (blk-send (nth blocks 7) "kind")) "blue")
(bi-test "callout text" (str (blk-send (nth blocks 7) "text")) "note!")
; ---- unknown card routed to embed, provider records original type ----
(bi-test "unknown -> embed provider" (str (blk-send (nth blocks 8) "provider")) "twitter")
; ---- heading level mapping ----
(bi-test "h1 level" (lex-heading-level "h1") 1)
(bi-test "h4 level" (lex-heading-level "h4") 4)
(bi-test "unknown tag default" (lex-heading-level "hx") 2)
; ---- bullet list ----
(define
bdoc
{:children (list {:type "list" :listType "bullet" :children (list
{:type "listitem" :children (list {:type "text" :text "x"})})})})
(bi-test "bullet not ordered" (blk-send (nth (blogimport/lex-blocks bdoc) 0) "ordered") false)
; ---- empty doc ----
(bi-test "empty doc -> no blocks" (len (blogimport/lex-blocks {:root {:children (list)}})) 0)
; ---- bare-children doc (no :root wrapper) ----
(bi-test "bare children doc"
(map blk-type (blogimport/lex-blocks {:children (list {:type "paragraph" :children (list {:type "text" :text "hi"})})}))
(list "text"))
; ---- linebreak/tab in inline flattening ----
(bi-test "linebreak flatten"
(str (blk-send (nth (blogimport/lex-blocks
{:children (list {:type "paragraph" :children (list
{:type "text" :text "a"} {:type "linebreak"} {:type "text" :text "b"})})}) 0) "text"))
"a\nb")

View File

@@ -0,0 +1,57 @@
; lib/blogimport/tests/verify.sx — shadow-diff at rest (round-trip parity)
(st-bootstrap-classes!)
(content-bootstrap-blocks!)
(content-bootstrap-doc!)
(content-bootstrap-callout!)
(content-bootstrap-media!)
(define
p1
{:id "post-1" :slug "hello" :title "Hello" :status "published"
:visibility "public" :tags (list "news") :authors (list "u1")
:lexical {:root {:children (list
{:type "heading" :tag "h2" :children (list {:type "text" :text "Title"})}
{:type "paragraph" :children (list
{:type "text" :text "plain "}
{:type "text" :text "bold" :format 1})}
{:type "list" :listType "number" :children (list
{:type "listitem" :children (list {:type "text" :text "one"})}
{:type "listitem" :children (list {:type "text" :text "two"})})}
{:type "image" :src "/c.png" :alt "cat"})}}})
(define
px
{:id "post-x" :slug "ghost" :title "Ghost" :status "published"
:lexical {:children (list {:type "paragraph" :children (list {:type "text" :text "never imported"})})}})
; ---- happy path: replayed == oracle ----
(define B (persist/open))
(blogimport/import-post! B p1 10)
(define v1 (blogimport/verify-post B p1))
(bi-test "verify ok" (get v1 :ok) true)
(bi-test "verify block-ok" (get v1 :block-ok) true)
(bi-test "verify meta-ok" (get v1 :meta-ok) true)
; ---- oracle block model is what we expect (inline bold flattened) ----
(define orc (blogimport/oracle p1))
(bi-test "oracle types"
(get (get orc :blocks) :types) (list "heading" "text" "list" "image"))
(bi-test "oracle contents"
(get (get orc :blocks) :contents) (list "Title" "plain bold" (list "one" "two") "/c.png"))
; ---- corruption is DETECTED (op-log diverges from oracle) ----
(content/commit! B "post-1" (op-update "b1" "text" "CORRUPTED") 100)
(define v2 (blogimport/verify-post B p1))
(bi-test "verify detects corruption" (get v2 :ok) false)
(bi-test "verify corruption is block-level" (get v2 :block-ok) false)
; ---- an un-imported post fails verification (empty replay vs non-empty oracle) ----
(bi-test "unimported not ok" (get (blogimport/verify-post B px) :ok) false)
; ---- verify-all coverage scoreboard ----
(define B3 (persist/open))
(blogimport/import-post! B3 p1 10)
(define cov (blogimport/verify-all B3 (list p1 px)))
(bi-test "verify-all total" (get cov :total) 2)
(bi-test "verify-all ok count" (get cov :ok) 1)
(bi-test "verify-all mismatched" (get cov :mismatched) (list "post-x"))

73
lib/blogimport/verify.sx Normal file
View File

@@ -0,0 +1,73 @@
; lib/blogimport/verify.sx
; Shadow-diff at rest (plans/migration/data-migration.md §6, slice-01-blog.md §4).
;
; After backfill, replay each content:<id> stream -> materialized doc -> block
; model, and diff against the row-derived oracle (lexical->blocks computed directly).
; Structural compare with `=` (not equal?). This proves the genesis import + op-log
; replay is LOSSLESS — "did the backfill corrupt anything" at rest.
;
; The oracle here is the in-memory lexical->blocks of the SAME post, so the property
; verified is round-trip fidelity through persist. Cross-checking against the LIVE
; Python block model (the "does SX match Python" half of Q-D2) is a later wiring
; step that needs the Python oracle via the internal-data query (Q-M4) — flagged,
; not built. The diff plumbing here is the twin that step reuses.
; --- salient content per block (normalized; same on both sides) -----------------
; ids are deterministic + identical on both sides, so they are kept (not stripped).
(define
blogimport/blk-content
(fn (b)
(let ((t (blk-type b)))
(cond
((equal? t "image") (str (blk-send b "src")))
((equal? t "media") (str (blk-send b "src")))
((equal? t "embed") (str (blk-send b "url")))
((equal? t "list") (blk-send b "items"))
((equal? t "divider") "")
(else (str (blk-send b "text")))))))
; --- block model of a block list ------------------------------------------------
(define
blogimport/blocks-model
(fn (blocks)
{:ids (map blk-id blocks)
:types (map blk-type blocks)
:contents (map blogimport/blk-content blocks)}))
; --- oracle: lexical->blocks computed directly from the post (no persist) --------
(define
blogimport/oracle
(fn (post)
{:blocks (blogimport/blocks-model (blogimport/lex-blocks (get post :lexical)))
:meta (blogimport/post-meta post)}))
; --- replayed: from the persisted stream ----------------------------------------
(define
blogimport/replayed
(fn (b id)
{:blocks (blogimport/blocks-model (content/blocks (content/head b id)))
:meta (blogimport/load-meta b id)}))
; --- verify one post: replayed must equal oracle --------------------------------
(define
blogimport/verify-post
(fn (b post)
(let ((id (get post :id)))
(let ((orc (blogimport/oracle post))
(rep (blogimport/replayed b id)))
(let ((block-ok (= (get orc :blocks) (get rep :blocks)))
(meta-ok (= (get orc :meta) (get rep :meta))))
{:id id
:ok (and block-ok meta-ok)
:block-ok block-ok
:meta-ok meta-ok})))))
; --- verify many: coverage scoreboard -------------------------------------------
(define
blogimport/verify-all
(fn (b posts)
(let ((results (map (fn (p) (blogimport/verify-post b p)) posts)))
{:total (len results)
:ok (len (filter (fn (r) (get r :ok)) results))
:mismatched (map (fn (r) (get r :id))
(filter (fn (r) (not (get r :ok))) results))})))