infoductor/docs/PHASE3_PROOF_PLAN.md
Maximus Gorog e908fc442f Document Phase 3 proof plan in docs/PHASE3_PROOF_PLAN.md
Comprehensive proof plan for the remaining Phase 3 lemmas (the
String-level universal round-trip).  Covers:

  · Current state — Phase 1, 2, length bounds, token-level
    universal, atomic decide witnesses, foundation tokenize lemmas.
  · The obstacle — Lean 4.30 String is UTF-8 ByteArray-backed,
    so `s.push c ++ xs.asString = s ++ (c :: xs).asString` is
    not definitionally true.
  · Two solution paths:
      A. Refactor `readIdent`/`readStrLit` to `List Char`
         accumulators (~20 line refactor + ~470 lines proof).
      B. Import Mathlib's structural String lemmas (~500 lines
         proof, no refactor).
  · Lemma-by-lemma proof plan with statements, sketches, and
    line-count estimates for each of the 7 lemmas.
  · Recommendation: Path A — stays in Lean core, decouples all
    future Phase 3 proofs from String internals.

Estimated effort: 2–3 focused days for either path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:15:32 -06:00

372 lines
13 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase 3 — String-Level Universal Round-Trip Proof Plan
## Goal
Prove the universal theorem
```lean
∀ v, MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource v) = some v
```
(and parallel theorems for `Lean.Name`, `MetaClassifier`, `MetaArtifact`)
without using `decide`, `native_decide`, or closed-instance hacks. Full
structural induction over the meta-mirror types.
## Current state
### Proven (`Foundation/MetaParse.lean`)
**Phase 1 — Canonical token forms** (definitions):
- `nameToTokens : Lean.Name → List Token`
- `MetaClassifier.toTokens : MetaClassifier → List Token`
- `MetaCTerm.toTokens : MetaCTerm → List Token`
- `MetaArtifact.toTokens : MetaArtifact → List Token`
**Phase 2 — Parser correctness on `toTokens`** (universal, structural induction):
- `parseName?Aux_correct`
- `parseClassifier?Aux_correct`
- `parseMetaCTerm?Aux_correct`
- `parseArtifact?Aux_correct` (modulo `.declAt`)
**Length-vs-depth lemmas**:
- `nameToTokens_length_bound`, `classifierToTokens_length_bound`,
`cTermToTokens_length_bound`
**Token-level universal round-trip** (the headline results):
```lean
nameFromTokens?_round_trip : ∀ n, fromTokens? (toTokens n) = some n
classifierFromTokens?_round_trip : ∀ φ, fromTokens? φ.toTokens = some φ
cTermFromTokens?_round_trip : ∀ t, fromTokens? t.toTokens = some t
artifactFromTokens?_round_trip : ∀ a, a.supported → fromTokens? a.toTokens = some a
```
**Atomic Phase 3 witnesses** via `decide` (kernel-rooted, closed inputs):
- `tokenize_render_name_anonymous`
- `tokenize_render_classifier_always` / `_never`
- `tokenize_render_cterm_empty`
- `tokenize_render_artifact_empty`
**Foundation tokenize lemmas**:
- `tokenize_lparen_cons : tokenize ('(' :: rest) = .lparen :: tokenize rest`
- `tokenize_rparen_cons` (analogous for `)`)
- `tokenize_space_cons` (whitespace skip)
### Open (the four distribution lemmas)
```lean
readIdent_app : -- ident chars + clean boundary → exact accumulation
readStrLit_app : -- escapeStrLit body + close → recovers original String
tokenize_app_clean: -- tokenize distributes over clean concatenation
tokenize_render_X: -- by induction over each meta-mirror type
```
These compose to `∀ v, tokenize (toLeanSource v).toList = toTokens v`,
which combined with Phase 2 yields the full String-level universal.
## The obstacle
Lean 4.30's `String` is **UTF-8 ByteArray-backed**, not a `List Char`
structure. Consequently:
```lean
example (s : String) (c : Char) (xs : List Char) :
s.push c ++ xs.asString = s ++ (c :: xs).asString := by rfl -- FAILS
```
This identity is **not definitionally true**. The kernel sees `String.push`,
`String.append`, and `List.asString` as primitive (or natively-implemented)
operations that don't unfold to a common normal form via `rfl`.
This blocks the inductive step of `readIdent_app`: the inductive
hypothesis returns `(acc.push d ++ ds.asString, rest)` but we want to
conclude `(acc ++ (d :: ds).asString, rest)`. The two are
**propositionally equal** but require a String-internals lemma to bridge.
## Two solution paths
### Path A: Refactor to `List Char` accumulators
Replace the `String` accumulator in `readIdent` and `readStrLit` with a
`List Char`. Convert to `String` only at the emission point in
`tokenizeAux`.
**Before**:
```lean
def readIdent : Nat → List Char → String → String × List Char
| n+1, c :: rest, acc =>
if isIdentRestChar c then readIdent n rest (acc.push c)
else (acc, c :: rest)
```
**After**:
```lean
def readIdent : Nat → List Char → List Char → List Char × List Char
| n+1, c :: rest, acc =>
if isIdentRestChar c then readIdent n rest (acc ++ [c])
else (acc, c :: rest)
-- in tokenizeAux:
| n+1, c :: rest =>
if isIdentStartChar c then
let (consumed, rest') := readIdent (n+1) (c :: rest) []
Token.ident consumed.asString :: tokenizeAux n rest'
```
Now `readIdent_app` becomes a pure `List Char` lemma, structurally provable:
```lean
theorem readIdent_app (chars rest : List Char) (acc : List Char) (fuel : Nat)
(hf : fuel > chars.length)
(hcs : ∀ c ∈ chars, isIdentRestChar c = true)
(hr : rest = [] ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) :
readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest) := by
induction chars generalizing acc fuel with
| nil => rcases fuel with _ | f
· simp at hf
· simp [readIdent]
cases hr with
| inl h => subst h; simp [readIdent]
| inr h => obtain ⟨c, r', hcr, hcb⟩ := h
subst hcr; simp [readIdent, hcb]
| cons d ds ih =>
rcases fuel with _ | f
· simp at hf
· have hd : isIdentRestChar d = true := hcs d (List.mem_cons_self _ _)
have hds : ∀ c ∈ ds, isIdentRestChar c = true :=
fun c hc => hcs c (List.mem_cons_of_mem _ hc)
have hf' : f > ds.length := by simp [List.length_cons] at hf; omega
have ih' := ih (acc ++ [d]) f hf' hds
simp [readIdent, hd]
rw [ih']
simp [List.append_assoc]
```
The String-internals lemma is replaced by `List.append_assoc`
trivially provable.
**Cost**: ~50 lines refactor of `readIdent`/`readStrLit`/`tokenizeAux`,
~200 lines of proof.
### Path B: Use Mathlib's String API
Mathlib provides the structural String lemmas needed:
- `String.toList_append : (s ++ t).toList = s.toList ++ t.toList`
- `String.push_eq : s.push c = ⟨s.toList ++ [c]⟩` (or equivalent)
- `String.eq_iff_toList_eq` (extensionality)
With these, the original `String`-accumulator `readIdent` can be
proven correct without refactor. The proof structure is the same as
Path A but uses the Mathlib lemmas to bridge between `String.push`
and `List.cons`.
**Cost**: Add `mathlib` as a dependency to `infoductor` (already
present transitively via `Mathlib` import in some Foundation files,
but not directly used). Then ~200 lines of proof using Mathlib's
String API.
## Lemma-by-lemma proof plan
### Lemma 1: `readIdent_app`
**Statement** (Path A version):
```lean
theorem readIdent_app : ∀ (chars rest : List Char) (acc : List Char) (fuel : Nat),
fuel > chars.length →
(∀ c ∈ chars, isIdentRestChar c = true) →
(rest = [] ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) →
readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest)
```
**Proof**: structural induction on `chars`. Base case (`[]`) splits on
the boundary form. Inductive case unfolds `readIdent` once, applies the
ident-rest hypothesis, and recurses via IH with `acc ++ [d]`.
**Difficulty**: Low (Path A) / Medium (Path B).
### Lemma 2: `readStrLit_app`
**Statement**:
```lean
theorem readStrLit_app (s : String) (rest : List Char) (acc : List Char) (fuel : Nat) :
fuel > (escapeStrLitBody s).length + 1 →
readStrLit fuel ((escapeStrLitBody s).toList ++ '"' :: rest) acc =
some (acc ++ s.toList, rest)
```
Where `escapeStrLitBody s` is the inner part of `escapeStrLit` (without
the surrounding `"`).
**Proof**: induction on `s.toList`. Each char either escapes (`"``\"`,
`\``\\`) or passes through. The `readStrLit` decoder reverses the
escape exactly.
**Difficulty**: Medium. The escape/unescape pairing is straightforward
but tedious: 3 sub-cases per char (escape `"`, escape `\`, normal).
### Lemma 3: `tokenize_app_clean`
**Statement**:
```lean
theorem tokenize_app_clean (xs ys : List Char) :
isCleanlyTerminated xs →
tokenize (xs ++ ys) = tokenize xs ++ tokenize ys
```
Where `isCleanlyTerminated xs` says `xs` ends in a single-char-token
(paren), whitespace, or string-literal close — *not* in the middle of
an ident, num, or open string.
**Proof**: induction on `xs`. Use the foundation tokenize lemmas
(`tokenize_lparen_cons` etc.) for the constructor cases. The
"cleanly terminated" predicate ensures no token spans the boundary.
**Difficulty**: Medium. Defining `isCleanlyTerminated` precisely and
proving each clean-prefix case takes care.
### Lemma 4: `tokenize_render_name`
**Statement**:
```lean
theorem tokenize_render_name : ∀ (n : Lean.Name),
tokenize (nameToLeanSource n).toList = nameToTokens n
```
**Proof**: induction on `n`.
- **`.anonymous`**: `nameToLeanSource .anonymous = "Lean.Name.anonymous"`.
Tokenize this via `readIdent_app` (Lemma 1) with `rest = []`, gives
`[.ident "Lean.Name.anonymous"]`. Equals `nameToTokens .anonymous`.
- **`.str p s`**: `nameToLeanSource (.str p s) = "(Lean.Name.str " ++ nameToLeanSource p ++ " " ++ escapeStrLit s ++ ")"`.
Use `String.toList_append` (Mathlib) or the chars-version of the
renderer to break `(...)`.toList into:
`['('] ++ "Lean.Name.str ".toList ++ (nameToLeanSource p).toList ++ [' '] ++ (escapeStrLit s).toList ++ [')']`
Then apply `tokenize_app_clean` (Lemma 3) repeatedly:
- `'('``lparen :: tokenize ...`
- `"Lean.Name.str ".toList``ident "Lean.Name.str" :: tokenize ...`
(via `readIdent_app` with rest starting at `' '`)
- `(nameToLeanSource p).toList``nameToTokens p ++ tokenize ...`
(via IH)
- `[' ']``tokenize ...` (whitespace skip)
- `(escapeStrLit s).toList``[strLit s] ++ tokenize ...`
(via `readStrLit_app`)
- `[')']``[rparen]`
Combine: `[lparen, ident "Lean.Name.str"] ++ nameToTokens p ++ [strLit s, rparen]`
= `nameToTokens (.str p s)`. ✓
- **`.num p k`**: analogous to `.str`, using `numLit` instead of `strLit`.
**Difficulty**: Medium-High. Each step requires careful invocation of
Lemmas 13.
### Lemma 5: `tokenize_render_classifier`
**Statement**:
```lean
theorem tokenize_render_classifier : ∀ (φ : MetaClassifier),
tokenize (MetaClassifier.toLeanSource φ).toList = MetaClassifier.toTokens φ
```
**Proof**: 9 cases mirroring the 9 classifier arms. The atomic arms
(`.always`, `.never`) are direct. The `.atDecl`/`.inFile`/`.underAttribute`/`.dependencyOf`/`.inNamespace` cases use Lemmas 1+3+4. The `.meet`/`.join`
cases use Lemmas 1+3+IH.
**Difficulty**: Medium. Mostly mechanical after Lemmas 14.
### Lemma 6: `tokenize_render_cterm`
**Statement**:
```lean
theorem tokenize_render_cterm : ∀ (t : MetaCTerm),
tokenize (MetaCTerm.toLeanSource t).toList = MetaCTerm.toTokens t
```
**Proof**: 8 cases. Atomic: `.empty`. Compositional: `.ident` (uses
Lemma 4), `.sym` (uses Lemma 2), `.app` (uses Lemma 6 IH), `.lam`/`.plam`
(uses Lemma 2 + IH), `.comp`/`.transp` (uses Lemmas 2 + 5 + IH).
**Difficulty**: Medium-High. Most complex case is `.comp` with 5
sub-fields chained.
### Lemma 7: `tokenize_render_artifact`
**Statement**:
```lean
theorem tokenize_render_artifact : ∀ (a : MetaArtifact),
a.supported = true →
tokenize (MetaArtifact.toLeanSource a).toList = MetaArtifact.toTokens a
```
**Proof**: 5 cases. `.empty`/`.source`: direct. `.cterm`: uses Lemma 6.
`.refTo`: uses Lemma 4. `.declAt`: excluded by hypothesis.
**Difficulty**: Low after Lemmas 46.
### Composing: the universal String-level round-trip
```lean
theorem MetaCTerm.toLeanSource_round_trip : ∀ (t : MetaCTerm),
MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource t) = some t := by
intro t
unfold MetaCTerm.fromLeanSource? tokenizeStr
rw [tokenize_render_cterm t] -- Phase 3: Lemma 6
exact cTermFromTokens?_round_trip t -- Phase 2 (already proven)
```
Two-line proof composing Phase 2 with Lemma 6. Same shape for the
other three meta-mirror types.
## Estimated effort
| Lemma | Path A (refactor) | Path B (Mathlib) |
|-------|-------------------|------------------|
| 1. `readIdent_app` | 30 lines proof + 20 lines refactor | 50 lines proof |
| 2. `readStrLit_app` | 50 lines | 70 lines |
| 3. `tokenize_app_clean` | 80 lines (incl. predicate def) | 80 lines |
| 4. `tokenize_render_name` | 40 lines | 40 lines |
| 5. `tokenize_render_classifier` | 90 lines | 90 lines |
| 6. `tokenize_render_cterm` | 120 lines | 120 lines |
| 7. `tokenize_render_artifact` | 30 lines | 30 lines |
| Composition | 16 lines | 16 lines |
| **Total** | **~470 lines** | **~500 lines** |
Approximately **23 focused days** of Lean proof work for either path.
## Recommendation
**Path A (refactor)** is preferred:
- Stays within Lean core (no new dependencies).
- The refactor is small (~20 lines) and independently sound.
- Decouples *all* Phase 3 proofs from String internals — no future
proof has to wrestle with `String.push`/`String.append` again.
- The `List Char` accumulator is arguably more Eulerian: tokens are
conceptually sequences of characters, and operating on lists makes
the equational reasoning transparent.
The refactor is itself the small Eulerian move; the rest of the
proofs follow naturally.
## What this unlocks
With Phase 3 complete, the bridge has:
```
∀ t : MetaCTerm, fromLeanSource? (toLeanSource t) = some t
∀ φ : MetaClassifier, fromLeanSource? (toLeanSource φ) = some φ
∀ n : Lean.Name, nameFromLeanSource? (nameToLeanSource n) = some n
∀ a : MetaArtifact, a.supported → fromLeanSource? (toLeanSource a) = some a
```
The full String-level universal round-trip — provably, by structural
induction, with no closed-instance restrictions. The `Edit.apply`
output (Lean source files written to disk) is then provably
round-trippable: any consumer reading the file back gets the original
`MetaCTerm`.
This is the operational soundness of the entire bridge at the source
level — what topolei or any external consumer needs to trust the
edit-render-parse loop.