infoductor/docs/PHASE3_PROOF_PLAN.md
Maximus Gorog e908fc442f Document Phase 3 proof plan in docs/PHASE3_PROOF_PLAN.md
Comprehensive proof plan for the remaining Phase 3 lemmas (the
String-level universal round-trip).  Covers:

  · Current state — Phase 1, 2, length bounds, token-level
    universal, atomic decide witnesses, foundation tokenize lemmas.
  · The obstacle — Lean 4.30 String is UTF-8 ByteArray-backed,
    so `s.push c ++ xs.asString = s ++ (c :: xs).asString` is
    not definitionally true.
  · Two solution paths:
      A. Refactor `readIdent`/`readStrLit` to `List Char`
         accumulators (~20 line refactor + ~470 lines proof).
      B. Import Mathlib's structural String lemmas (~500 lines
         proof, no refactor).
  · Lemma-by-lemma proof plan with statements, sketches, and
    line-count estimates for each of the 7 lemmas.
  · Recommendation: Path A — stays in Lean core, decouples all
    future Phase 3 proofs from String internals.

Estimated effort: 2–3 focused days for either path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:15:32 -06:00

13 KiB
Raw Permalink Blame History

Phase 3 — String-Level Universal Round-Trip Proof Plan

Goal

Prove the universal theorem

∀ v, MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource v) = some v

(and parallel theorems for Lean.Name, MetaClassifier, MetaArtifact) without using decide, native_decide, or closed-instance hacks. Full structural induction over the meta-mirror types.

Current state

Proven (Foundation/MetaParse.lean)

Phase 1 — Canonical token forms (definitions):

  • nameToTokens : Lean.Name → List Token
  • MetaClassifier.toTokens : MetaClassifier → List Token
  • MetaCTerm.toTokens : MetaCTerm → List Token
  • MetaArtifact.toTokens : MetaArtifact → List Token

Phase 2 — Parser correctness on toTokens (universal, structural induction):

  • parseName?Aux_correct
  • parseClassifier?Aux_correct
  • parseMetaCTerm?Aux_correct
  • parseArtifact?Aux_correct (modulo .declAt)

Length-vs-depth lemmas:

  • nameToTokens_length_bound, classifierToTokens_length_bound, cTermToTokens_length_bound

Token-level universal round-trip (the headline results):

nameFromTokens?_round_trip       : ∀ n,        fromTokens? (toTokens n)   = some n
classifierFromTokens?_round_trip : ∀ φ,        fromTokens? φ.toTokens     = some φ
cTermFromTokens?_round_trip      : ∀ t,        fromTokens? t.toTokens     = some t
artifactFromTokens?_round_trip   : ∀ a, a.supported → fromTokens? a.toTokens = some a

Atomic Phase 3 witnesses via decide (kernel-rooted, closed inputs):

  • tokenize_render_name_anonymous
  • tokenize_render_classifier_always / _never
  • tokenize_render_cterm_empty
  • tokenize_render_artifact_empty

Foundation tokenize lemmas:

  • tokenize_lparen_cons : tokenize ('(' :: rest) = .lparen :: tokenize rest
  • tokenize_rparen_cons (analogous for ))
  • tokenize_space_cons (whitespace skip)

Open (the four distribution lemmas)

readIdent_app    : -- ident chars + clean boundary → exact accumulation
readStrLit_app   : -- escapeStrLit body + close → recovers original String
tokenize_app_clean: -- tokenize distributes over clean concatenation
tokenize_render_X: -- by induction over each meta-mirror type

These compose to ∀ v, tokenize (toLeanSource v).toList = toTokens v, which combined with Phase 2 yields the full String-level universal.

The obstacle

Lean 4.30's String is UTF-8 ByteArray-backed, not a List Char structure. Consequently:

example (s : String) (c : Char) (xs : List Char) :
    s.push c ++ xs.asString = s ++ (c :: xs).asString := by rfl  -- FAILS

This identity is not definitionally true. The kernel sees String.push, String.append, and List.asString as primitive (or natively-implemented) operations that don't unfold to a common normal form via rfl.

This blocks the inductive step of readIdent_app: the inductive hypothesis returns (acc.push d ++ ds.asString, rest) but we want to conclude (acc ++ (d :: ds).asString, rest). The two are propositionally equal but require a String-internals lemma to bridge.

Two solution paths

Path A: Refactor to List Char accumulators

Replace the String accumulator in readIdent and readStrLit with a List Char. Convert to String only at the emission point in tokenizeAux.

Before:

def readIdent : Nat → List Char → String → String × List Char
  | n+1, c :: rest, acc =>
      if isIdentRestChar c then readIdent n rest (acc.push c)
      else (acc, c :: rest)

After:

def readIdent : Nat → List Char → List Char → List Char × List Char
  | n+1, c :: rest, acc =>
      if isIdentRestChar c then readIdent n rest (acc ++ [c])
      else (acc, c :: rest)

-- in tokenizeAux:
| n+1, c :: rest =>
    if isIdentStartChar c then
      let (consumed, rest') := readIdent (n+1) (c :: rest) []
      Token.ident consumed.asString :: tokenizeAux n rest'

Now readIdent_app becomes a pure List Char lemma, structurally provable:

theorem readIdent_app (chars rest : List Char) (acc : List Char) (fuel : Nat)
    (hf : fuel > chars.length)
    (hcs : ∀ c ∈ chars, isIdentRestChar c = true)
    (hr : rest = []  ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) :
    readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest) := by
  induction chars generalizing acc fuel with
  | nil => rcases fuel with _ | f
           · simp at hf
           · simp [readIdent]
             cases hr with
             | inl h => subst h; simp [readIdent]
             | inr h => obtain ⟨c, r', hcr, hcb⟩ := h
                        subst hcr; simp [readIdent, hcb]
  | cons d ds ih =>
      rcases fuel with _ | f
      · simp at hf
      · have hd : isIdentRestChar d = true := hcs d (List.mem_cons_self _ _)
        have hds : ∀ c ∈ ds, isIdentRestChar c = true :=
          fun c hc => hcs c (List.mem_cons_of_mem _ hc)
        have hf' : f > ds.length := by simp [List.length_cons] at hf; omega
        have ih' := ih (acc ++ [d]) f hf' hds
        simp [readIdent, hd]
        rw [ih']
        simp [List.append_assoc]

The String-internals lemma is replaced by List.append_assoc — trivially provable.

Cost: ~50 lines refactor of readIdent/readStrLit/tokenizeAux, ~200 lines of proof.

Path B: Use Mathlib's String API

Mathlib provides the structural String lemmas needed:

  • String.toList_append : (s ++ t).toList = s.toList ++ t.toList
  • String.push_eq : s.push c = ⟨s.toList ++ [c]⟩ (or equivalent)
  • String.eq_iff_toList_eq (extensionality)

With these, the original String-accumulator readIdent can be proven correct without refactor. The proof structure is the same as Path A but uses the Mathlib lemmas to bridge between String.push and List.cons.

Cost: Add mathlib as a dependency to infoductor (already present transitively via Mathlib import in some Foundation files, but not directly used). Then ~200 lines of proof using Mathlib's String API.

Lemma-by-lemma proof plan

Lemma 1: readIdent_app

Statement (Path A version):

theorem readIdent_app : ∀ (chars rest : List Char) (acc : List Char) (fuel : Nat),
    fuel > chars.length →
    (∀ c ∈ chars, isIdentRestChar c = true) →
    (rest = []  ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) →
    readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest)

Proof: structural induction on chars. Base case ([]) splits on the boundary form. Inductive case unfolds readIdent once, applies the ident-rest hypothesis, and recurses via IH with acc ++ [d].

Difficulty: Low (Path A) / Medium (Path B).

Lemma 2: readStrLit_app

Statement:

theorem readStrLit_app (s : String) (rest : List Char) (acc : List Char) (fuel : Nat) :
    fuel > (escapeStrLitBody s).length + 1 →
    readStrLit fuel ((escapeStrLitBody s).toList ++ '"' :: rest) acc =
      some (acc ++ s.toList, rest)

Where escapeStrLitBody s is the inner part of escapeStrLit (without the surrounding ").

Proof: induction on s.toList. Each char either escapes ("\", \\\) or passes through. The readStrLit decoder reverses the escape exactly.

Difficulty: Medium. The escape/unescape pairing is straightforward but tedious: 3 sub-cases per char (escape ", escape \, normal).

Lemma 3: tokenize_app_clean

Statement:

theorem tokenize_app_clean (xs ys : List Char) :
    isCleanlyTerminated xs →
    tokenize (xs ++ ys) = tokenize xs ++ tokenize ys

Where isCleanlyTerminated xs says xs ends in a single-char-token (paren), whitespace, or string-literal close — not in the middle of an ident, num, or open string.

Proof: induction on xs. Use the foundation tokenize lemmas (tokenize_lparen_cons etc.) for the constructor cases. The "cleanly terminated" predicate ensures no token spans the boundary.

Difficulty: Medium. Defining isCleanlyTerminated precisely and proving each clean-prefix case takes care.

Lemma 4: tokenize_render_name

Statement:

theorem tokenize_render_name : ∀ (n : Lean.Name),
    tokenize (nameToLeanSource n).toList = nameToTokens n

Proof: induction on n.

  • .anonymous: nameToLeanSource .anonymous = "Lean.Name.anonymous". Tokenize this via readIdent_app (Lemma 1) with rest = [], gives [.ident "Lean.Name.anonymous"]. Equals nameToTokens .anonymous.

  • .str p s: nameToLeanSource (.str p s) = "(Lean.Name.str " ++ nameToLeanSource p ++ " " ++ escapeStrLit s ++ ")".

    Use String.toList_append (Mathlib) or the chars-version of the renderer to break (...).toList into: ['('] ++ "Lean.Name.str ".toList ++ (nameToLeanSource p).toList ++ [' '] ++ (escapeStrLit s).toList ++ [')']

    Then apply tokenize_app_clean (Lemma 3) repeatedly:

    • '('lparen :: tokenize ...
    • "Lean.Name.str ".toListident "Lean.Name.str" :: tokenize ... (via readIdent_app with rest starting at ' ')
    • (nameToLeanSource p).toListnameToTokens p ++ tokenize ... (via IH)
    • [' ']tokenize ... (whitespace skip)
    • (escapeStrLit s).toList[strLit s] ++ tokenize ... (via readStrLit_app)
    • [')'][rparen]

    Combine: [lparen, ident "Lean.Name.str"] ++ nameToTokens p ++ [strLit s, rparen] = nameToTokens (.str p s). ✓

  • .num p k: analogous to .str, using numLit instead of strLit.

Difficulty: Medium-High. Each step requires careful invocation of Lemmas 13.

Lemma 5: tokenize_render_classifier

Statement:

theorem tokenize_render_classifier : ∀ (φ : MetaClassifier),
    tokenize (MetaClassifier.toLeanSource φ).toList = MetaClassifier.toTokens φ

Proof: 9 cases mirroring the 9 classifier arms. The atomic arms (.always, .never) are direct. The .atDecl/.inFile/.underAttribute/.dependencyOf/.inNamespace cases use Lemmas 1+3+4. The .meet/.join cases use Lemmas 1+3+IH.

Difficulty: Medium. Mostly mechanical after Lemmas 14.

Lemma 6: tokenize_render_cterm

Statement:

theorem tokenize_render_cterm : ∀ (t : MetaCTerm),
    tokenize (MetaCTerm.toLeanSource t).toList = MetaCTerm.toTokens t

Proof: 8 cases. Atomic: .empty. Compositional: .ident (uses Lemma 4), .sym (uses Lemma 2), .app (uses Lemma 6 IH), .lam/.plam (uses Lemma 2 + IH), .comp/.transp (uses Lemmas 2 + 5 + IH).

Difficulty: Medium-High. Most complex case is .comp with 5 sub-fields chained.

Lemma 7: tokenize_render_artifact

Statement:

theorem tokenize_render_artifact : ∀ (a : MetaArtifact),
    a.supported = true →
    tokenize (MetaArtifact.toLeanSource a).toList = MetaArtifact.toTokens a

Proof: 5 cases. .empty/.source: direct. .cterm: uses Lemma 6. .refTo: uses Lemma 4. .declAt: excluded by hypothesis.

Difficulty: Low after Lemmas 46.

Composing: the universal String-level round-trip

theorem MetaCTerm.toLeanSource_round_trip : ∀ (t : MetaCTerm),
    MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource t) = some t := by
  intro t
  unfold MetaCTerm.fromLeanSource? tokenizeStr
  rw [tokenize_render_cterm t]                   -- Phase 3: Lemma 6
  exact cTermFromTokens?_round_trip t            -- Phase 2 (already proven)

Two-line proof composing Phase 2 with Lemma 6. Same shape for the other three meta-mirror types.

Estimated effort

Lemma Path A (refactor) Path B (Mathlib)
1. readIdent_app 30 lines proof + 20 lines refactor 50 lines proof
2. readStrLit_app 50 lines 70 lines
3. tokenize_app_clean 80 lines (incl. predicate def) 80 lines
4. tokenize_render_name 40 lines 40 lines
5. tokenize_render_classifier 90 lines 90 lines
6. tokenize_render_cterm 120 lines 120 lines
7. tokenize_render_artifact 30 lines 30 lines
Composition 16 lines 16 lines
Total ~470 lines ~500 lines

Approximately 23 focused days of Lean proof work for either path.

Recommendation

Path A (refactor) is preferred:

  • Stays within Lean core (no new dependencies).
  • The refactor is small (~20 lines) and independently sound.
  • Decouples all Phase 3 proofs from String internals — no future proof has to wrestle with String.push/String.append again.
  • The List Char accumulator is arguably more Eulerian: tokens are conceptually sequences of characters, and operating on lists makes the equational reasoning transparent.

The refactor is itself the small Eulerian move; the rest of the proofs follow naturally.

What this unlocks

With Phase 3 complete, the bridge has:

∀ t : MetaCTerm,     fromLeanSource? (toLeanSource t) = some t
∀ φ : MetaClassifier, fromLeanSource? (toLeanSource φ) = some φ
∀ n : Lean.Name,     nameFromLeanSource? (nameToLeanSource n) = some n
∀ a : MetaArtifact,  a.supported → fromLeanSource? (toLeanSource a) = some a

The full String-level universal round-trip — provably, by structural induction, with no closed-instance restrictions. The Edit.apply output (Lean source files written to disk) is then provably round-trippable: any consumer reading the file back gets the original MetaCTerm.

This is the operational soundness of the entire bridge at the source level — what topolei or any external consumer needs to trust the edit-render-parse loop.