Phase 3 Lean 4.30 String-internals analysis

Attempted to prove `readIdent_app` (the key distribution lemma)
but hit a wall on `String.push c ++ xs.asString = s ++ (c :: xs).asString`
which is not definitionally true in Lean 4.30 (String is
UTF-8 ByteArray-backed, not a List Char structure).

Documents two cleaner paths forward:
  (a) Refactor `readIdent`/`readStrLit` to accumulate into
      `List Char` instead of `String`, decoupling proofs from
      String internals.
  (b) Import Mathlib's richer String API which provides the
      needed structural lemmas.

The committed state:
  · Atomic Phase 3 witnesses via decide (5 theorems, kernel-rooted).
  · 3 foundation tokenize lemmas (lparen/rparen/space).
  · The full Phase 3 universal lemmas documented inline as
    open work, with a proof sketch for both paths.

The token-level universal (already proven) plus closed-instance
decide tests cover the round-trip operationally.  Adding Phase 3
proper is a focused multi-day refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Maximus Gorog 2026-05-01 13:12:38 -06:00
parent 62b73bb176
commit 665046a353

View file

@ -821,21 +821,28 @@ theorem tokenize_render_artifact_empty :
MetaArtifact.toTokens MetaArtifact.empty := by
decide
-- The recursive arms (`.str`, `.app`, `.lam`, etc.) require the
-- four substantial lemmas sketched below — multi-page Lean
-- reasoning about `readIdent` / `readStrLit` distribution.
-- Documented as future work; the token-level universal above
-- plus `decide` on closed instances cover the round-trip
-- operationally in the meantime.
-- ── Phase 3 distribution lemmas — open question ────────────────────────────
-- Proving `tokenize ∘ render = toTokens` requires four substantial
-- distribution lemmas:
--
-- readIdent_app : readIdent on (ident-chars ++ rest) where
-- rest starts cleanly returns (acc ++ ident, rest).
-- readStrLit_app : readStrLit on (escapeStrLit body ++ '\"' ++ rest)
-- returns (body, rest).
-- readIdent_app : reading an ident sequence followed by a clean
-- boundary returns the accumulated ident.
-- readStrLit_app: reading an escapeStrLit-encoded body returns
-- the original String.
-- tokenize_app_clean: tokenize distributes over a concatenation
-- where the prefix ends "cleanly".
-- tokenize_render_X: induction over each meta-mirror type using
-- the above plus IH on sub-values.
-- where the prefix ends "cleanly".
-- tokenize_render_X: induction over each meta-mirror type.
--
-- Each requires reasoning about Lean 4.30's `String` API (UTF-8
-- ByteArray-backed) — specifically the equation
-- `s.push c ++ xs.asString = s ++ (c :: xs).asString`, which is
-- *not* definitionally true and needs a String-internals lemma.
--
-- Cleaner paths forward: (a) refactor `readIdent`/`readStrLit` to
-- accumulate into a `List Char` rather than a `String`, then build
-- the String at the end via `List.asString` — decouples the proof
-- from String internals; or (b) import Mathlib's richer String
-- API which provides the needed lemmas.
-- ── Round-trip — atomic kernel-reducible witnesses ─────────────────────────
-- For non-recursive shapes the round-trip is closed by `rfl` (or