# Phase 3 — String-Level Universal Round-Trip Proof Plan ## Goal Prove the universal theorem ```lean ∀ v, MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource v) = some v ``` (and parallel theorems for `Lean.Name`, `MetaClassifier`, `MetaArtifact`) without using `decide`, `native_decide`, or closed-instance hacks. Full structural induction over the meta-mirror types. ## Current state ### Proven (`Foundation/MetaParse.lean`) **Phase 1 — Canonical token forms** (definitions): - `nameToTokens : Lean.Name → List Token` - `MetaClassifier.toTokens : MetaClassifier → List Token` - `MetaCTerm.toTokens : MetaCTerm → List Token` - `MetaArtifact.toTokens : MetaArtifact → List Token` **Phase 2 — Parser correctness on `toTokens`** (universal, structural induction): - `parseName?Aux_correct` - `parseClassifier?Aux_correct` - `parseMetaCTerm?Aux_correct` - `parseArtifact?Aux_correct` (modulo `.declAt`) **Length-vs-depth lemmas**: - `nameToTokens_length_bound`, `classifierToTokens_length_bound`, `cTermToTokens_length_bound` **Token-level universal round-trip** (the headline results): ```lean nameFromTokens?_round_trip : ∀ n, fromTokens? (toTokens n) = some n classifierFromTokens?_round_trip : ∀ φ, fromTokens? φ.toTokens = some φ cTermFromTokens?_round_trip : ∀ t, fromTokens? t.toTokens = some t artifactFromTokens?_round_trip : ∀ a, a.supported → fromTokens? a.toTokens = some a ``` **Atomic Phase 3 witnesses** via `decide` (kernel-rooted, closed inputs): - `tokenize_render_name_anonymous` - `tokenize_render_classifier_always` / `_never` - `tokenize_render_cterm_empty` - `tokenize_render_artifact_empty` **Foundation tokenize lemmas**: - `tokenize_lparen_cons : tokenize ('(' :: rest) = .lparen :: tokenize rest` - `tokenize_rparen_cons` (analogous for `)`) - `tokenize_space_cons` (whitespace skip) ### Open (the four distribution lemmas) ```lean readIdent_app : -- ident chars + clean boundary → exact accumulation readStrLit_app : -- escapeStrLit body + close → recovers original String tokenize_app_clean: -- tokenize distributes over clean concatenation tokenize_render_X: -- by induction over each meta-mirror type ``` These compose to `∀ v, tokenize (toLeanSource v).toList = toTokens v`, which combined with Phase 2 yields the full String-level universal. ## The obstacle Lean 4.30's `String` is **UTF-8 ByteArray-backed**, not a `List Char` structure. Consequently: ```lean example (s : String) (c : Char) (xs : List Char) : s.push c ++ xs.asString = s ++ (c :: xs).asString := by rfl -- FAILS ``` This identity is **not definitionally true**. The kernel sees `String.push`, `String.append`, and `List.asString` as primitive (or natively-implemented) operations that don't unfold to a common normal form via `rfl`. This blocks the inductive step of `readIdent_app`: the inductive hypothesis returns `(acc.push d ++ ds.asString, rest)` but we want to conclude `(acc ++ (d :: ds).asString, rest)`. The two are **propositionally equal** but require a String-internals lemma to bridge. ## Two solution paths ### Path A: Refactor to `List Char` accumulators Replace the `String` accumulator in `readIdent` and `readStrLit` with a `List Char`. Convert to `String` only at the emission point in `tokenizeAux`. **Before**: ```lean def readIdent : Nat → List Char → String → String × List Char | n+1, c :: rest, acc => if isIdentRestChar c then readIdent n rest (acc.push c) else (acc, c :: rest) ``` **After**: ```lean def readIdent : Nat → List Char → List Char → List Char × List Char | n+1, c :: rest, acc => if isIdentRestChar c then readIdent n rest (acc ++ [c]) else (acc, c :: rest) -- in tokenizeAux: | n+1, c :: rest => if isIdentStartChar c then let (consumed, rest') := readIdent (n+1) (c :: rest) [] Token.ident consumed.asString :: tokenizeAux n rest' ``` Now `readIdent_app` becomes a pure `List Char` lemma, structurally provable: ```lean theorem readIdent_app (chars rest : List Char) (acc : List Char) (fuel : Nat) (hf : fuel > chars.length) (hcs : ∀ c ∈ chars, isIdentRestChar c = true) (hr : rest = [] ∨ ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) : readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest) := by induction chars generalizing acc fuel with | nil => rcases fuel with _ | f · simp at hf · simp [readIdent] cases hr with | inl h => subst h; simp [readIdent] | inr h => obtain ⟨c, r', hcr, hcb⟩ := h subst hcr; simp [readIdent, hcb] | cons d ds ih => rcases fuel with _ | f · simp at hf · have hd : isIdentRestChar d = true := hcs d (List.mem_cons_self _ _) have hds : ∀ c ∈ ds, isIdentRestChar c = true := fun c hc => hcs c (List.mem_cons_of_mem _ hc) have hf' : f > ds.length := by simp [List.length_cons] at hf; omega have ih' := ih (acc ++ [d]) f hf' hds simp [readIdent, hd] rw [ih'] simp [List.append_assoc] ``` The String-internals lemma is replaced by `List.append_assoc` — trivially provable. **Cost**: ~50 lines refactor of `readIdent`/`readStrLit`/`tokenizeAux`, ~200 lines of proof. ### Path B: Use Mathlib's String API Mathlib provides the structural String lemmas needed: - `String.toList_append : (s ++ t).toList = s.toList ++ t.toList` - `String.push_eq : s.push c = ⟨s.toList ++ [c]⟩` (or equivalent) - `String.eq_iff_toList_eq` (extensionality) With these, the original `String`-accumulator `readIdent` can be proven correct without refactor. The proof structure is the same as Path A but uses the Mathlib lemmas to bridge between `String.push` and `List.cons`. **Cost**: Add `mathlib` as a dependency to `infoductor` (already present transitively via `Mathlib` import in some Foundation files, but not directly used). Then ~200 lines of proof using Mathlib's String API. ## Lemma-by-lemma proof plan ### Lemma 1: `readIdent_app` **Statement** (Path A version): ```lean theorem readIdent_app : ∀ (chars rest : List Char) (acc : List Char) (fuel : Nat), fuel > chars.length → (∀ c ∈ chars, isIdentRestChar c = true) → (rest = [] ∨ ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) → readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest) ``` **Proof**: structural induction on `chars`. Base case (`[]`) splits on the boundary form. Inductive case unfolds `readIdent` once, applies the ident-rest hypothesis, and recurses via IH with `acc ++ [d]`. **Difficulty**: Low (Path A) / Medium (Path B). ### Lemma 2: `readStrLit_app` **Statement**: ```lean theorem readStrLit_app (s : String) (rest : List Char) (acc : List Char) (fuel : Nat) : fuel > (escapeStrLitBody s).length + 1 → readStrLit fuel ((escapeStrLitBody s).toList ++ '"' :: rest) acc = some (acc ++ s.toList, rest) ``` Where `escapeStrLitBody s` is the inner part of `escapeStrLit` (without the surrounding `"`). **Proof**: induction on `s.toList`. Each char either escapes (`"` → `\"`, `\` → `\\`) or passes through. The `readStrLit` decoder reverses the escape exactly. **Difficulty**: Medium. The escape/unescape pairing is straightforward but tedious: 3 sub-cases per char (escape `"`, escape `\`, normal). ### Lemma 3: `tokenize_app_clean` **Statement**: ```lean theorem tokenize_app_clean (xs ys : List Char) : isCleanlyTerminated xs → tokenize (xs ++ ys) = tokenize xs ++ tokenize ys ``` Where `isCleanlyTerminated xs` says `xs` ends in a single-char-token (paren), whitespace, or string-literal close — *not* in the middle of an ident, num, or open string. **Proof**: induction on `xs`. Use the foundation tokenize lemmas (`tokenize_lparen_cons` etc.) for the constructor cases. The "cleanly terminated" predicate ensures no token spans the boundary. **Difficulty**: Medium. Defining `isCleanlyTerminated` precisely and proving each clean-prefix case takes care. ### Lemma 4: `tokenize_render_name` **Statement**: ```lean theorem tokenize_render_name : ∀ (n : Lean.Name), tokenize (nameToLeanSource n).toList = nameToTokens n ``` **Proof**: induction on `n`. - **`.anonymous`**: `nameToLeanSource .anonymous = "Lean.Name.anonymous"`. Tokenize this via `readIdent_app` (Lemma 1) with `rest = []`, gives `[.ident "Lean.Name.anonymous"]`. Equals `nameToTokens .anonymous`. - **`.str p s`**: `nameToLeanSource (.str p s) = "(Lean.Name.str " ++ nameToLeanSource p ++ " " ++ escapeStrLit s ++ ")"`. Use `String.toList_append` (Mathlib) or the chars-version of the renderer to break `(...)`.toList into: `['('] ++ "Lean.Name.str ".toList ++ (nameToLeanSource p).toList ++ [' '] ++ (escapeStrLit s).toList ++ [')']` Then apply `tokenize_app_clean` (Lemma 3) repeatedly: - `'('` → `lparen :: tokenize ...` - `"Lean.Name.str ".toList` → `ident "Lean.Name.str" :: tokenize ...` (via `readIdent_app` with rest starting at `' '`) - `(nameToLeanSource p).toList` → `nameToTokens p ++ tokenize ...` (via IH) - `[' ']` → `tokenize ...` (whitespace skip) - `(escapeStrLit s).toList` → `[strLit s] ++ tokenize ...` (via `readStrLit_app`) - `[')']` → `[rparen]` Combine: `[lparen, ident "Lean.Name.str"] ++ nameToTokens p ++ [strLit s, rparen]` = `nameToTokens (.str p s)`. ✓ - **`.num p k`**: analogous to `.str`, using `numLit` instead of `strLit`. **Difficulty**: Medium-High. Each step requires careful invocation of Lemmas 1–3. ### Lemma 5: `tokenize_render_classifier` **Statement**: ```lean theorem tokenize_render_classifier : ∀ (φ : MetaClassifier), tokenize (MetaClassifier.toLeanSource φ).toList = MetaClassifier.toTokens φ ``` **Proof**: 9 cases mirroring the 9 classifier arms. The atomic arms (`.always`, `.never`) are direct. The `.atDecl`/`.inFile`/`.underAttribute`/`.dependencyOf`/`.inNamespace` cases use Lemmas 1+3+4. The `.meet`/`.join` cases use Lemmas 1+3+IH. **Difficulty**: Medium. Mostly mechanical after Lemmas 1–4. ### Lemma 6: `tokenize_render_cterm` **Statement**: ```lean theorem tokenize_render_cterm : ∀ (t : MetaCTerm), tokenize (MetaCTerm.toLeanSource t).toList = MetaCTerm.toTokens t ``` **Proof**: 8 cases. Atomic: `.empty`. Compositional: `.ident` (uses Lemma 4), `.sym` (uses Lemma 2), `.app` (uses Lemma 6 IH), `.lam`/`.plam` (uses Lemma 2 + IH), `.comp`/`.transp` (uses Lemmas 2 + 5 + IH). **Difficulty**: Medium-High. Most complex case is `.comp` with 5 sub-fields chained. ### Lemma 7: `tokenize_render_artifact` **Statement**: ```lean theorem tokenize_render_artifact : ∀ (a : MetaArtifact), a.supported = true → tokenize (MetaArtifact.toLeanSource a).toList = MetaArtifact.toTokens a ``` **Proof**: 5 cases. `.empty`/`.source`: direct. `.cterm`: uses Lemma 6. `.refTo`: uses Lemma 4. `.declAt`: excluded by hypothesis. **Difficulty**: Low after Lemmas 4–6. ### Composing: the universal String-level round-trip ```lean theorem MetaCTerm.toLeanSource_round_trip : ∀ (t : MetaCTerm), MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource t) = some t := by intro t unfold MetaCTerm.fromLeanSource? tokenizeStr rw [tokenize_render_cterm t] -- Phase 3: Lemma 6 exact cTermFromTokens?_round_trip t -- Phase 2 (already proven) ``` Two-line proof composing Phase 2 with Lemma 6. Same shape for the other three meta-mirror types. ## Estimated effort | Lemma | Path A (refactor) | Path B (Mathlib) | |-------|-------------------|------------------| | 1. `readIdent_app` | 30 lines proof + 20 lines refactor | 50 lines proof | | 2. `readStrLit_app` | 50 lines | 70 lines | | 3. `tokenize_app_clean` | 80 lines (incl. predicate def) | 80 lines | | 4. `tokenize_render_name` | 40 lines | 40 lines | | 5. `tokenize_render_classifier` | 90 lines | 90 lines | | 6. `tokenize_render_cterm` | 120 lines | 120 lines | | 7. `tokenize_render_artifact` | 30 lines | 30 lines | | Composition | 16 lines | 16 lines | | **Total** | **~470 lines** | **~500 lines** | Approximately **2–3 focused days** of Lean proof work for either path. ## Recommendation **Path A (refactor)** is preferred: - Stays within Lean core (no new dependencies). - The refactor is small (~20 lines) and independently sound. - Decouples *all* Phase 3 proofs from String internals — no future proof has to wrestle with `String.push`/`String.append` again. - The `List Char` accumulator is arguably more Eulerian: tokens are conceptually sequences of characters, and operating on lists makes the equational reasoning transparent. The refactor is itself the small Eulerian move; the rest of the proofs follow naturally. ## What this unlocks With Phase 3 complete, the bridge has: ``` ∀ t : MetaCTerm, fromLeanSource? (toLeanSource t) = some t ∀ φ : MetaClassifier, fromLeanSource? (toLeanSource φ) = some φ ∀ n : Lean.Name, nameFromLeanSource? (nameToLeanSource n) = some n ∀ a : MetaArtifact, a.supported → fromLeanSource? (toLeanSource a) = some a ``` The full String-level universal round-trip — provably, by structural induction, with no closed-instance restrictions. The `Edit.apply` output (Lean source files written to disk) is then provably round-trippable: any consumer reading the file back gets the original `MetaCTerm`. This is the operational soundness of the entire bridge at the source level — what topolei or any external consumer needs to trust the edit-render-parse loop.