Comprehensive proof plan for the remaining Phase 3 lemmas (the
String-level universal round-trip). Covers:
· Current state — Phase 1, 2, length bounds, token-level
universal, atomic decide witnesses, foundation tokenize lemmas.
· The obstacle — Lean 4.30 String is UTF-8 ByteArray-backed,
so `s.push c ++ xs.asString = s ++ (c :: xs).asString` is
not definitionally true.
· Two solution paths:
A. Refactor `readIdent`/`readStrLit` to `List Char`
accumulators (~20 line refactor + ~470 lines proof).
B. Import Mathlib's structural String lemmas (~500 lines
proof, no refactor).
· Lemma-by-lemma proof plan with statements, sketches, and
line-count estimates for each of the 7 lemmas.
· Recommendation: Path A — stays in Lean core, decouples all
future Phase 3 proofs from String internals.
Estimated effort: 2–3 focused days for either path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
13 KiB
Phase 3 — String-Level Universal Round-Trip Proof Plan
Goal
Prove the universal theorem
∀ v, MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource v) = some v
(and parallel theorems for Lean.Name, MetaClassifier, MetaArtifact)
without using decide, native_decide, or closed-instance hacks. Full
structural induction over the meta-mirror types.
Current state
Proven (Foundation/MetaParse.lean)
Phase 1 — Canonical token forms (definitions):
nameToTokens : Lean.Name → List TokenMetaClassifier.toTokens : MetaClassifier → List TokenMetaCTerm.toTokens : MetaCTerm → List TokenMetaArtifact.toTokens : MetaArtifact → List Token
Phase 2 — Parser correctness on toTokens (universal, structural induction):
parseName?Aux_correctparseClassifier?Aux_correctparseMetaCTerm?Aux_correctparseArtifact?Aux_correct(modulo.declAt)
Length-vs-depth lemmas:
nameToTokens_length_bound,classifierToTokens_length_bound,cTermToTokens_length_bound
Token-level universal round-trip (the headline results):
nameFromTokens?_round_trip : ∀ n, fromTokens? (toTokens n) = some n
classifierFromTokens?_round_trip : ∀ φ, fromTokens? φ.toTokens = some φ
cTermFromTokens?_round_trip : ∀ t, fromTokens? t.toTokens = some t
artifactFromTokens?_round_trip : ∀ a, a.supported → fromTokens? a.toTokens = some a
Atomic Phase 3 witnesses via decide (kernel-rooted, closed inputs):
tokenize_render_name_anonymoustokenize_render_classifier_always/_nevertokenize_render_cterm_emptytokenize_render_artifact_empty
Foundation tokenize lemmas:
tokenize_lparen_cons : tokenize ('(' :: rest) = .lparen :: tokenize resttokenize_rparen_cons(analogous for))tokenize_space_cons(whitespace skip)
Open (the four distribution lemmas)
readIdent_app : -- ident chars + clean boundary → exact accumulation
readStrLit_app : -- escapeStrLit body + close → recovers original String
tokenize_app_clean: -- tokenize distributes over clean concatenation
tokenize_render_X: -- by induction over each meta-mirror type
These compose to ∀ v, tokenize (toLeanSource v).toList = toTokens v,
which combined with Phase 2 yields the full String-level universal.
The obstacle
Lean 4.30's String is UTF-8 ByteArray-backed, not a List Char
structure. Consequently:
example (s : String) (c : Char) (xs : List Char) :
s.push c ++ xs.asString = s ++ (c :: xs).asString := by rfl -- FAILS
This identity is not definitionally true. The kernel sees String.push,
String.append, and List.asString as primitive (or natively-implemented)
operations that don't unfold to a common normal form via rfl.
This blocks the inductive step of readIdent_app: the inductive
hypothesis returns (acc.push d ++ ds.asString, rest) but we want to
conclude (acc ++ (d :: ds).asString, rest). The two are
propositionally equal but require a String-internals lemma to bridge.
Two solution paths
Path A: Refactor to List Char accumulators
Replace the String accumulator in readIdent and readStrLit with a
List Char. Convert to String only at the emission point in
tokenizeAux.
Before:
def readIdent : Nat → List Char → String → String × List Char
| n+1, c :: rest, acc =>
if isIdentRestChar c then readIdent n rest (acc.push c)
else (acc, c :: rest)
After:
def readIdent : Nat → List Char → List Char → List Char × List Char
| n+1, c :: rest, acc =>
if isIdentRestChar c then readIdent n rest (acc ++ [c])
else (acc, c :: rest)
-- in tokenizeAux:
| n+1, c :: rest =>
if isIdentStartChar c then
let (consumed, rest') := readIdent (n+1) (c :: rest) []
Token.ident consumed.asString :: tokenizeAux n rest'
Now readIdent_app becomes a pure List Char lemma, structurally provable:
theorem readIdent_app (chars rest : List Char) (acc : List Char) (fuel : Nat)
(hf : fuel > chars.length)
(hcs : ∀ c ∈ chars, isIdentRestChar c = true)
(hr : rest = [] ∨ ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) :
readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest) := by
induction chars generalizing acc fuel with
| nil => rcases fuel with _ | f
· simp at hf
· simp [readIdent]
cases hr with
| inl h => subst h; simp [readIdent]
| inr h => obtain ⟨c, r', hcr, hcb⟩ := h
subst hcr; simp [readIdent, hcb]
| cons d ds ih =>
rcases fuel with _ | f
· simp at hf
· have hd : isIdentRestChar d = true := hcs d (List.mem_cons_self _ _)
have hds : ∀ c ∈ ds, isIdentRestChar c = true :=
fun c hc => hcs c (List.mem_cons_of_mem _ hc)
have hf' : f > ds.length := by simp [List.length_cons] at hf; omega
have ih' := ih (acc ++ [d]) f hf' hds
simp [readIdent, hd]
rw [ih']
simp [List.append_assoc]
The String-internals lemma is replaced by List.append_assoc —
trivially provable.
Cost: ~50 lines refactor of readIdent/readStrLit/tokenizeAux,
~200 lines of proof.
Path B: Use Mathlib's String API
Mathlib provides the structural String lemmas needed:
String.toList_append : (s ++ t).toList = s.toList ++ t.toListString.push_eq : s.push c = ⟨s.toList ++ [c]⟩(or equivalent)String.eq_iff_toList_eq(extensionality)
With these, the original String-accumulator readIdent can be
proven correct without refactor. The proof structure is the same as
Path A but uses the Mathlib lemmas to bridge between String.push
and List.cons.
Cost: Add mathlib as a dependency to infoductor (already
present transitively via Mathlib import in some Foundation files,
but not directly used). Then ~200 lines of proof using Mathlib's
String API.
Lemma-by-lemma proof plan
Lemma 1: readIdent_app
Statement (Path A version):
theorem readIdent_app : ∀ (chars rest : List Char) (acc : List Char) (fuel : Nat),
fuel > chars.length →
(∀ c ∈ chars, isIdentRestChar c = true) →
(rest = [] ∨ ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) →
readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest)
Proof: structural induction on chars. Base case ([]) splits on
the boundary form. Inductive case unfolds readIdent once, applies the
ident-rest hypothesis, and recurses via IH with acc ++ [d].
Difficulty: Low (Path A) / Medium (Path B).
Lemma 2: readStrLit_app
Statement:
theorem readStrLit_app (s : String) (rest : List Char) (acc : List Char) (fuel : Nat) :
fuel > (escapeStrLitBody s).length + 1 →
readStrLit fuel ((escapeStrLitBody s).toList ++ '"' :: rest) acc =
some (acc ++ s.toList, rest)
Where escapeStrLitBody s is the inner part of escapeStrLit (without
the surrounding ").
Proof: induction on s.toList. Each char either escapes (" → \",
\ → \\) or passes through. The readStrLit decoder reverses the
escape exactly.
Difficulty: Medium. The escape/unescape pairing is straightforward
but tedious: 3 sub-cases per char (escape ", escape \, normal).
Lemma 3: tokenize_app_clean
Statement:
theorem tokenize_app_clean (xs ys : List Char) :
isCleanlyTerminated xs →
tokenize (xs ++ ys) = tokenize xs ++ tokenize ys
Where isCleanlyTerminated xs says xs ends in a single-char-token
(paren), whitespace, or string-literal close — not in the middle of
an ident, num, or open string.
Proof: induction on xs. Use the foundation tokenize lemmas
(tokenize_lparen_cons etc.) for the constructor cases. The
"cleanly terminated" predicate ensures no token spans the boundary.
Difficulty: Medium. Defining isCleanlyTerminated precisely and
proving each clean-prefix case takes care.
Lemma 4: tokenize_render_name
Statement:
theorem tokenize_render_name : ∀ (n : Lean.Name),
tokenize (nameToLeanSource n).toList = nameToTokens n
Proof: induction on n.
-
.anonymous:nameToLeanSource .anonymous = "Lean.Name.anonymous". Tokenize this viareadIdent_app(Lemma 1) withrest = [], gives[.ident "Lean.Name.anonymous"]. EqualsnameToTokens .anonymous. -
.str p s:nameToLeanSource (.str p s) = "(Lean.Name.str " ++ nameToLeanSource p ++ " " ++ escapeStrLit s ++ ")".Use
String.toList_append(Mathlib) or the chars-version of the renderer to break(...).toList into:['('] ++ "Lean.Name.str ".toList ++ (nameToLeanSource p).toList ++ [' '] ++ (escapeStrLit s).toList ++ [')']Then apply
tokenize_app_clean(Lemma 3) repeatedly:'('→lparen :: tokenize ..."Lean.Name.str ".toList→ident "Lean.Name.str" :: tokenize ...(viareadIdent_appwith rest starting at' ')(nameToLeanSource p).toList→nameToTokens p ++ tokenize ...(via IH)[' ']→tokenize ...(whitespace skip)(escapeStrLit s).toList→[strLit s] ++ tokenize ...(viareadStrLit_app)[')']→[rparen]
Combine:
[lparen, ident "Lean.Name.str"] ++ nameToTokens p ++ [strLit s, rparen]=nameToTokens (.str p s). ✓ -
.num p k: analogous to.str, usingnumLitinstead ofstrLit.
Difficulty: Medium-High. Each step requires careful invocation of Lemmas 1–3.
Lemma 5: tokenize_render_classifier
Statement:
theorem tokenize_render_classifier : ∀ (φ : MetaClassifier),
tokenize (MetaClassifier.toLeanSource φ).toList = MetaClassifier.toTokens φ
Proof: 9 cases mirroring the 9 classifier arms. The atomic arms
(.always, .never) are direct. The .atDecl/.inFile/.underAttribute/.dependencyOf/.inNamespace cases use Lemmas 1+3+4. The .meet/.join
cases use Lemmas 1+3+IH.
Difficulty: Medium. Mostly mechanical after Lemmas 1–4.
Lemma 6: tokenize_render_cterm
Statement:
theorem tokenize_render_cterm : ∀ (t : MetaCTerm),
tokenize (MetaCTerm.toLeanSource t).toList = MetaCTerm.toTokens t
Proof: 8 cases. Atomic: .empty. Compositional: .ident (uses
Lemma 4), .sym (uses Lemma 2), .app (uses Lemma 6 IH), .lam/.plam
(uses Lemma 2 + IH), .comp/.transp (uses Lemmas 2 + 5 + IH).
Difficulty: Medium-High. Most complex case is .comp with 5
sub-fields chained.
Lemma 7: tokenize_render_artifact
Statement:
theorem tokenize_render_artifact : ∀ (a : MetaArtifact),
a.supported = true →
tokenize (MetaArtifact.toLeanSource a).toList = MetaArtifact.toTokens a
Proof: 5 cases. .empty/.source: direct. .cterm: uses Lemma 6.
.refTo: uses Lemma 4. .declAt: excluded by hypothesis.
Difficulty: Low after Lemmas 4–6.
Composing: the universal String-level round-trip
theorem MetaCTerm.toLeanSource_round_trip : ∀ (t : MetaCTerm),
MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource t) = some t := by
intro t
unfold MetaCTerm.fromLeanSource? tokenizeStr
rw [tokenize_render_cterm t] -- Phase 3: Lemma 6
exact cTermFromTokens?_round_trip t -- Phase 2 (already proven)
Two-line proof composing Phase 2 with Lemma 6. Same shape for the other three meta-mirror types.
Estimated effort
| Lemma | Path A (refactor) | Path B (Mathlib) |
|---|---|---|
1. readIdent_app |
30 lines proof + 20 lines refactor | 50 lines proof |
2. readStrLit_app |
50 lines | 70 lines |
3. tokenize_app_clean |
80 lines (incl. predicate def) | 80 lines |
4. tokenize_render_name |
40 lines | 40 lines |
5. tokenize_render_classifier |
90 lines | 90 lines |
6. tokenize_render_cterm |
120 lines | 120 lines |
7. tokenize_render_artifact |
30 lines | 30 lines |
| Composition | 16 lines | 16 lines |
| Total | ~470 lines | ~500 lines |
Approximately 2–3 focused days of Lean proof work for either path.
Recommendation
Path A (refactor) is preferred:
- Stays within Lean core (no new dependencies).
- The refactor is small (~20 lines) and independently sound.
- Decouples all Phase 3 proofs from String internals — no future
proof has to wrestle with
String.push/String.appendagain. - The
List Characcumulator is arguably more Eulerian: tokens are conceptually sequences of characters, and operating on lists makes the equational reasoning transparent.
The refactor is itself the small Eulerian move; the rest of the proofs follow naturally.
What this unlocks
With Phase 3 complete, the bridge has:
∀ t : MetaCTerm, fromLeanSource? (toLeanSource t) = some t
∀ φ : MetaClassifier, fromLeanSource? (toLeanSource φ) = some φ
∀ n : Lean.Name, nameFromLeanSource? (nameToLeanSource n) = some n
∀ a : MetaArtifact, a.supported → fromLeanSource? (toLeanSource a) = some a
The full String-level universal round-trip — provably, by structural
induction, with no closed-instance restrictions. The Edit.apply
output (Lean source files written to disk) is then provably
round-trippable: any consumer reading the file back gets the original
MetaCTerm.
This is the operational soundness of the entire bridge at the source level — what topolei or any external consumer needs to trust the edit-render-parse loop.