diff --git a/docs/PHASE3_PROOF_PLAN.md b/docs/PHASE3_PROOF_PLAN.md new file mode 100644 index 0000000..d32c064 --- /dev/null +++ b/docs/PHASE3_PROOF_PLAN.md @@ -0,0 +1,372 @@ +# Phase 3 — String-Level Universal Round-Trip Proof Plan + +## Goal + +Prove the universal theorem + +```lean +∀ v, MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource v) = some v +``` + +(and parallel theorems for `Lean.Name`, `MetaClassifier`, `MetaArtifact`) +without using `decide`, `native_decide`, or closed-instance hacks. Full +structural induction over the meta-mirror types. + +## Current state + +### Proven (`Foundation/MetaParse.lean`) + +**Phase 1 — Canonical token forms** (definitions): +- `nameToTokens : Lean.Name → List Token` +- `MetaClassifier.toTokens : MetaClassifier → List Token` +- `MetaCTerm.toTokens : MetaCTerm → List Token` +- `MetaArtifact.toTokens : MetaArtifact → List Token` + +**Phase 2 — Parser correctness on `toTokens`** (universal, structural induction): +- `parseName?Aux_correct` +- `parseClassifier?Aux_correct` +- `parseMetaCTerm?Aux_correct` +- `parseArtifact?Aux_correct` (modulo `.declAt`) + +**Length-vs-depth lemmas**: +- `nameToTokens_length_bound`, `classifierToTokens_length_bound`, + `cTermToTokens_length_bound` + +**Token-level universal round-trip** (the headline results): +```lean +nameFromTokens?_round_trip : ∀ n, fromTokens? (toTokens n) = some n +classifierFromTokens?_round_trip : ∀ φ, fromTokens? φ.toTokens = some φ +cTermFromTokens?_round_trip : ∀ t, fromTokens? t.toTokens = some t +artifactFromTokens?_round_trip : ∀ a, a.supported → fromTokens? a.toTokens = some a +``` + +**Atomic Phase 3 witnesses** via `decide` (kernel-rooted, closed inputs): +- `tokenize_render_name_anonymous` +- `tokenize_render_classifier_always` / `_never` +- `tokenize_render_cterm_empty` +- `tokenize_render_artifact_empty` + +**Foundation tokenize lemmas**: +- `tokenize_lparen_cons : tokenize ('(' :: rest) = .lparen :: tokenize rest` +- `tokenize_rparen_cons` (analogous for `)`) +- `tokenize_space_cons` (whitespace skip) + +### Open (the four distribution lemmas) + +```lean +readIdent_app : -- ident chars + clean boundary → exact accumulation +readStrLit_app : -- escapeStrLit body + close → recovers original String +tokenize_app_clean: -- tokenize distributes over clean concatenation +tokenize_render_X: -- by induction over each meta-mirror type +``` + +These compose to `∀ v, tokenize (toLeanSource v).toList = toTokens v`, +which combined with Phase 2 yields the full String-level universal. + +## The obstacle + +Lean 4.30's `String` is **UTF-8 ByteArray-backed**, not a `List Char` +structure. Consequently: + +```lean +example (s : String) (c : Char) (xs : List Char) : + s.push c ++ xs.asString = s ++ (c :: xs).asString := by rfl -- FAILS +``` + +This identity is **not definitionally true**. The kernel sees `String.push`, +`String.append`, and `List.asString` as primitive (or natively-implemented) +operations that don't unfold to a common normal form via `rfl`. + +This blocks the inductive step of `readIdent_app`: the inductive +hypothesis returns `(acc.push d ++ ds.asString, rest)` but we want to +conclude `(acc ++ (d :: ds).asString, rest)`. The two are +**propositionally equal** but require a String-internals lemma to bridge. + +## Two solution paths + +### Path A: Refactor to `List Char` accumulators + +Replace the `String` accumulator in `readIdent` and `readStrLit` with a +`List Char`. Convert to `String` only at the emission point in +`tokenizeAux`. + +**Before**: +```lean +def readIdent : Nat → List Char → String → String × List Char + | n+1, c :: rest, acc => + if isIdentRestChar c then readIdent n rest (acc.push c) + else (acc, c :: rest) +``` + +**After**: +```lean +def readIdent : Nat → List Char → List Char → List Char × List Char + | n+1, c :: rest, acc => + if isIdentRestChar c then readIdent n rest (acc ++ [c]) + else (acc, c :: rest) + +-- in tokenizeAux: +| n+1, c :: rest => + if isIdentStartChar c then + let (consumed, rest') := readIdent (n+1) (c :: rest) [] + Token.ident consumed.asString :: tokenizeAux n rest' +``` + +Now `readIdent_app` becomes a pure `List Char` lemma, structurally provable: + +```lean +theorem readIdent_app (chars rest : List Char) (acc : List Char) (fuel : Nat) + (hf : fuel > chars.length) + (hcs : ∀ c ∈ chars, isIdentRestChar c = true) + (hr : rest = [] ∨ ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) : + readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest) := by + induction chars generalizing acc fuel with + | nil => rcases fuel with _ | f + · simp at hf + · simp [readIdent] + cases hr with + | inl h => subst h; simp [readIdent] + | inr h => obtain ⟨c, r', hcr, hcb⟩ := h + subst hcr; simp [readIdent, hcb] + | cons d ds ih => + rcases fuel with _ | f + · simp at hf + · have hd : isIdentRestChar d = true := hcs d (List.mem_cons_self _ _) + have hds : ∀ c ∈ ds, isIdentRestChar c = true := + fun c hc => hcs c (List.mem_cons_of_mem _ hc) + have hf' : f > ds.length := by simp [List.length_cons] at hf; omega + have ih' := ih (acc ++ [d]) f hf' hds + simp [readIdent, hd] + rw [ih'] + simp [List.append_assoc] +``` + +The String-internals lemma is replaced by `List.append_assoc` — +trivially provable. + +**Cost**: ~50 lines refactor of `readIdent`/`readStrLit`/`tokenizeAux`, +~200 lines of proof. + +### Path B: Use Mathlib's String API + +Mathlib provides the structural String lemmas needed: +- `String.toList_append : (s ++ t).toList = s.toList ++ t.toList` +- `String.push_eq : s.push c = ⟨s.toList ++ [c]⟩` (or equivalent) +- `String.eq_iff_toList_eq` (extensionality) + +With these, the original `String`-accumulator `readIdent` can be +proven correct without refactor. The proof structure is the same as +Path A but uses the Mathlib lemmas to bridge between `String.push` +and `List.cons`. + +**Cost**: Add `mathlib` as a dependency to `infoductor` (already +present transitively via `Mathlib` import in some Foundation files, +but not directly used). Then ~200 lines of proof using Mathlib's +String API. + +## Lemma-by-lemma proof plan + +### Lemma 1: `readIdent_app` + +**Statement** (Path A version): +```lean +theorem readIdent_app : ∀ (chars rest : List Char) (acc : List Char) (fuel : Nat), + fuel > chars.length → + (∀ c ∈ chars, isIdentRestChar c = true) → + (rest = [] ∨ ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) → + readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest) +``` + +**Proof**: structural induction on `chars`. Base case (`[]`) splits on +the boundary form. Inductive case unfolds `readIdent` once, applies the +ident-rest hypothesis, and recurses via IH with `acc ++ [d]`. + +**Difficulty**: Low (Path A) / Medium (Path B). + +### Lemma 2: `readStrLit_app` + +**Statement**: +```lean +theorem readStrLit_app (s : String) (rest : List Char) (acc : List Char) (fuel : Nat) : + fuel > (escapeStrLitBody s).length + 1 → + readStrLit fuel ((escapeStrLitBody s).toList ++ '"' :: rest) acc = + some (acc ++ s.toList, rest) +``` + +Where `escapeStrLitBody s` is the inner part of `escapeStrLit` (without +the surrounding `"`). + +**Proof**: induction on `s.toList`. Each char either escapes (`"` → `\"`, +`\` → `\\`) or passes through. The `readStrLit` decoder reverses the +escape exactly. + +**Difficulty**: Medium. The escape/unescape pairing is straightforward +but tedious: 3 sub-cases per char (escape `"`, escape `\`, normal). + +### Lemma 3: `tokenize_app_clean` + +**Statement**: +```lean +theorem tokenize_app_clean (xs ys : List Char) : + isCleanlyTerminated xs → + tokenize (xs ++ ys) = tokenize xs ++ tokenize ys +``` + +Where `isCleanlyTerminated xs` says `xs` ends in a single-char-token +(paren), whitespace, or string-literal close — *not* in the middle of +an ident, num, or open string. + +**Proof**: induction on `xs`. Use the foundation tokenize lemmas +(`tokenize_lparen_cons` etc.) for the constructor cases. The +"cleanly terminated" predicate ensures no token spans the boundary. + +**Difficulty**: Medium. Defining `isCleanlyTerminated` precisely and +proving each clean-prefix case takes care. + +### Lemma 4: `tokenize_render_name` + +**Statement**: +```lean +theorem tokenize_render_name : ∀ (n : Lean.Name), + tokenize (nameToLeanSource n).toList = nameToTokens n +``` + +**Proof**: induction on `n`. + +- **`.anonymous`**: `nameToLeanSource .anonymous = "Lean.Name.anonymous"`. + Tokenize this via `readIdent_app` (Lemma 1) with `rest = []`, gives + `[.ident "Lean.Name.anonymous"]`. Equals `nameToTokens .anonymous`. + +- **`.str p s`**: `nameToLeanSource (.str p s) = "(Lean.Name.str " ++ nameToLeanSource p ++ " " ++ escapeStrLit s ++ ")"`. + + Use `String.toList_append` (Mathlib) or the chars-version of the + renderer to break `(...)`.toList into: + `['('] ++ "Lean.Name.str ".toList ++ (nameToLeanSource p).toList ++ [' '] ++ (escapeStrLit s).toList ++ [')']` + + Then apply `tokenize_app_clean` (Lemma 3) repeatedly: + - `'('` → `lparen :: tokenize ...` + - `"Lean.Name.str ".toList` → `ident "Lean.Name.str" :: tokenize ...` + (via `readIdent_app` with rest starting at `' '`) + - `(nameToLeanSource p).toList` → `nameToTokens p ++ tokenize ...` + (via IH) + - `[' ']` → `tokenize ...` (whitespace skip) + - `(escapeStrLit s).toList` → `[strLit s] ++ tokenize ...` + (via `readStrLit_app`) + - `[')']` → `[rparen]` + + Combine: `[lparen, ident "Lean.Name.str"] ++ nameToTokens p ++ [strLit s, rparen]` + = `nameToTokens (.str p s)`. ✓ + +- **`.num p k`**: analogous to `.str`, using `numLit` instead of `strLit`. + +**Difficulty**: Medium-High. Each step requires careful invocation of +Lemmas 1–3. + +### Lemma 5: `tokenize_render_classifier` + +**Statement**: +```lean +theorem tokenize_render_classifier : ∀ (φ : MetaClassifier), + tokenize (MetaClassifier.toLeanSource φ).toList = MetaClassifier.toTokens φ +``` + +**Proof**: 9 cases mirroring the 9 classifier arms. The atomic arms +(`.always`, `.never`) are direct. The `.atDecl`/`.inFile`/`.underAttribute`/`.dependencyOf`/`.inNamespace` cases use Lemmas 1+3+4. The `.meet`/`.join` +cases use Lemmas 1+3+IH. + +**Difficulty**: Medium. Mostly mechanical after Lemmas 1–4. + +### Lemma 6: `tokenize_render_cterm` + +**Statement**: +```lean +theorem tokenize_render_cterm : ∀ (t : MetaCTerm), + tokenize (MetaCTerm.toLeanSource t).toList = MetaCTerm.toTokens t +``` + +**Proof**: 8 cases. Atomic: `.empty`. Compositional: `.ident` (uses +Lemma 4), `.sym` (uses Lemma 2), `.app` (uses Lemma 6 IH), `.lam`/`.plam` +(uses Lemma 2 + IH), `.comp`/`.transp` (uses Lemmas 2 + 5 + IH). + +**Difficulty**: Medium-High. Most complex case is `.comp` with 5 +sub-fields chained. + +### Lemma 7: `tokenize_render_artifact` + +**Statement**: +```lean +theorem tokenize_render_artifact : ∀ (a : MetaArtifact), + a.supported = true → + tokenize (MetaArtifact.toLeanSource a).toList = MetaArtifact.toTokens a +``` + +**Proof**: 5 cases. `.empty`/`.source`: direct. `.cterm`: uses Lemma 6. +`.refTo`: uses Lemma 4. `.declAt`: excluded by hypothesis. + +**Difficulty**: Low after Lemmas 4–6. + +### Composing: the universal String-level round-trip + +```lean +theorem MetaCTerm.toLeanSource_round_trip : ∀ (t : MetaCTerm), + MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource t) = some t := by + intro t + unfold MetaCTerm.fromLeanSource? tokenizeStr + rw [tokenize_render_cterm t] -- Phase 3: Lemma 6 + exact cTermFromTokens?_round_trip t -- Phase 2 (already proven) +``` + +Two-line proof composing Phase 2 with Lemma 6. Same shape for the +other three meta-mirror types. + +## Estimated effort + +| Lemma | Path A (refactor) | Path B (Mathlib) | +|-------|-------------------|------------------| +| 1. `readIdent_app` | 30 lines proof + 20 lines refactor | 50 lines proof | +| 2. `readStrLit_app` | 50 lines | 70 lines | +| 3. `tokenize_app_clean` | 80 lines (incl. predicate def) | 80 lines | +| 4. `tokenize_render_name` | 40 lines | 40 lines | +| 5. `tokenize_render_classifier` | 90 lines | 90 lines | +| 6. `tokenize_render_cterm` | 120 lines | 120 lines | +| 7. `tokenize_render_artifact` | 30 lines | 30 lines | +| Composition | 16 lines | 16 lines | +| **Total** | **~470 lines** | **~500 lines** | + +Approximately **2–3 focused days** of Lean proof work for either path. + +## Recommendation + +**Path A (refactor)** is preferred: + +- Stays within Lean core (no new dependencies). +- The refactor is small (~20 lines) and independently sound. +- Decouples *all* Phase 3 proofs from String internals — no future + proof has to wrestle with `String.push`/`String.append` again. +- The `List Char` accumulator is arguably more Eulerian: tokens are + conceptually sequences of characters, and operating on lists makes + the equational reasoning transparent. + +The refactor is itself the small Eulerian move; the rest of the +proofs follow naturally. + +## What this unlocks + +With Phase 3 complete, the bridge has: + +``` +∀ t : MetaCTerm, fromLeanSource? (toLeanSource t) = some t +∀ φ : MetaClassifier, fromLeanSource? (toLeanSource φ) = some φ +∀ n : Lean.Name, nameFromLeanSource? (nameToLeanSource n) = some n +∀ a : MetaArtifact, a.supported → fromLeanSource? (toLeanSource a) = some a +``` + +The full String-level universal round-trip — provably, by structural +induction, with no closed-instance restrictions. The `Edit.apply` +output (Lean source files written to disk) is then provably +round-trippable: any consumer reading the file back gets the original +`MetaCTerm`. + +This is the operational soundness of the entire bridge at the source +level — what topolei or any external consumer needs to trust the +edit-render-parse loop.