Document Phase 3 proof plan in docs/PHASE3_PROOF_PLAN.md
Comprehensive proof plan for the remaining Phase 3 lemmas (the
String-level universal round-trip). Covers:
· Current state — Phase 1, 2, length bounds, token-level
universal, atomic decide witnesses, foundation tokenize lemmas.
· The obstacle — Lean 4.30 String is UTF-8 ByteArray-backed,
so `s.push c ++ xs.asString = s ++ (c :: xs).asString` is
not definitionally true.
· Two solution paths:
A. Refactor `readIdent`/`readStrLit` to `List Char`
accumulators (~20 line refactor + ~470 lines proof).
B. Import Mathlib's structural String lemmas (~500 lines
proof, no refactor).
· Lemma-by-lemma proof plan with statements, sketches, and
line-count estimates for each of the 7 lemmas.
· Recommendation: Path A — stays in Lean core, decouples all
future Phase 3 proofs from String internals.
Estimated effort: 2–3 focused days for either path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
665046a353
commit
e908fc442f
1 changed files with 372 additions and 0 deletions
372
docs/PHASE3_PROOF_PLAN.md
Normal file
372
docs/PHASE3_PROOF_PLAN.md
Normal file
|
|
@ -0,0 +1,372 @@
|
||||||
|
# Phase 3 — String-Level Universal Round-Trip Proof Plan
|
||||||
|
|
||||||
|
## Goal
|
||||||
|
|
||||||
|
Prove the universal theorem
|
||||||
|
|
||||||
|
```lean
|
||||||
|
∀ v, MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource v) = some v
|
||||||
|
```
|
||||||
|
|
||||||
|
(and parallel theorems for `Lean.Name`, `MetaClassifier`, `MetaArtifact`)
|
||||||
|
without using `decide`, `native_decide`, or closed-instance hacks. Full
|
||||||
|
structural induction over the meta-mirror types.
|
||||||
|
|
||||||
|
## Current state
|
||||||
|
|
||||||
|
### Proven (`Foundation/MetaParse.lean`)
|
||||||
|
|
||||||
|
**Phase 1 — Canonical token forms** (definitions):
|
||||||
|
- `nameToTokens : Lean.Name → List Token`
|
||||||
|
- `MetaClassifier.toTokens : MetaClassifier → List Token`
|
||||||
|
- `MetaCTerm.toTokens : MetaCTerm → List Token`
|
||||||
|
- `MetaArtifact.toTokens : MetaArtifact → List Token`
|
||||||
|
|
||||||
|
**Phase 2 — Parser correctness on `toTokens`** (universal, structural induction):
|
||||||
|
- `parseName?Aux_correct`
|
||||||
|
- `parseClassifier?Aux_correct`
|
||||||
|
- `parseMetaCTerm?Aux_correct`
|
||||||
|
- `parseArtifact?Aux_correct` (modulo `.declAt`)
|
||||||
|
|
||||||
|
**Length-vs-depth lemmas**:
|
||||||
|
- `nameToTokens_length_bound`, `classifierToTokens_length_bound`,
|
||||||
|
`cTermToTokens_length_bound`
|
||||||
|
|
||||||
|
**Token-level universal round-trip** (the headline results):
|
||||||
|
```lean
|
||||||
|
nameFromTokens?_round_trip : ∀ n, fromTokens? (toTokens n) = some n
|
||||||
|
classifierFromTokens?_round_trip : ∀ φ, fromTokens? φ.toTokens = some φ
|
||||||
|
cTermFromTokens?_round_trip : ∀ t, fromTokens? t.toTokens = some t
|
||||||
|
artifactFromTokens?_round_trip : ∀ a, a.supported → fromTokens? a.toTokens = some a
|
||||||
|
```
|
||||||
|
|
||||||
|
**Atomic Phase 3 witnesses** via `decide` (kernel-rooted, closed inputs):
|
||||||
|
- `tokenize_render_name_anonymous`
|
||||||
|
- `tokenize_render_classifier_always` / `_never`
|
||||||
|
- `tokenize_render_cterm_empty`
|
||||||
|
- `tokenize_render_artifact_empty`
|
||||||
|
|
||||||
|
**Foundation tokenize lemmas**:
|
||||||
|
- `tokenize_lparen_cons : tokenize ('(' :: rest) = .lparen :: tokenize rest`
|
||||||
|
- `tokenize_rparen_cons` (analogous for `)`)
|
||||||
|
- `tokenize_space_cons` (whitespace skip)
|
||||||
|
|
||||||
|
### Open (the four distribution lemmas)
|
||||||
|
|
||||||
|
```lean
|
||||||
|
readIdent_app : -- ident chars + clean boundary → exact accumulation
|
||||||
|
readStrLit_app : -- escapeStrLit body + close → recovers original String
|
||||||
|
tokenize_app_clean: -- tokenize distributes over clean concatenation
|
||||||
|
tokenize_render_X: -- by induction over each meta-mirror type
|
||||||
|
```
|
||||||
|
|
||||||
|
These compose to `∀ v, tokenize (toLeanSource v).toList = toTokens v`,
|
||||||
|
which combined with Phase 2 yields the full String-level universal.
|
||||||
|
|
||||||
|
## The obstacle
|
||||||
|
|
||||||
|
Lean 4.30's `String` is **UTF-8 ByteArray-backed**, not a `List Char`
|
||||||
|
structure. Consequently:
|
||||||
|
|
||||||
|
```lean
|
||||||
|
example (s : String) (c : Char) (xs : List Char) :
|
||||||
|
s.push c ++ xs.asString = s ++ (c :: xs).asString := by rfl -- FAILS
|
||||||
|
```
|
||||||
|
|
||||||
|
This identity is **not definitionally true**. The kernel sees `String.push`,
|
||||||
|
`String.append`, and `List.asString` as primitive (or natively-implemented)
|
||||||
|
operations that don't unfold to a common normal form via `rfl`.
|
||||||
|
|
||||||
|
This blocks the inductive step of `readIdent_app`: the inductive
|
||||||
|
hypothesis returns `(acc.push d ++ ds.asString, rest)` but we want to
|
||||||
|
conclude `(acc ++ (d :: ds).asString, rest)`. The two are
|
||||||
|
**propositionally equal** but require a String-internals lemma to bridge.
|
||||||
|
|
||||||
|
## Two solution paths
|
||||||
|
|
||||||
|
### Path A: Refactor to `List Char` accumulators
|
||||||
|
|
||||||
|
Replace the `String` accumulator in `readIdent` and `readStrLit` with a
|
||||||
|
`List Char`. Convert to `String` only at the emission point in
|
||||||
|
`tokenizeAux`.
|
||||||
|
|
||||||
|
**Before**:
|
||||||
|
```lean
|
||||||
|
def readIdent : Nat → List Char → String → String × List Char
|
||||||
|
| n+1, c :: rest, acc =>
|
||||||
|
if isIdentRestChar c then readIdent n rest (acc.push c)
|
||||||
|
else (acc, c :: rest)
|
||||||
|
```
|
||||||
|
|
||||||
|
**After**:
|
||||||
|
```lean
|
||||||
|
def readIdent : Nat → List Char → List Char → List Char × List Char
|
||||||
|
| n+1, c :: rest, acc =>
|
||||||
|
if isIdentRestChar c then readIdent n rest (acc ++ [c])
|
||||||
|
else (acc, c :: rest)
|
||||||
|
|
||||||
|
-- in tokenizeAux:
|
||||||
|
| n+1, c :: rest =>
|
||||||
|
if isIdentStartChar c then
|
||||||
|
let (consumed, rest') := readIdent (n+1) (c :: rest) []
|
||||||
|
Token.ident consumed.asString :: tokenizeAux n rest'
|
||||||
|
```
|
||||||
|
|
||||||
|
Now `readIdent_app` becomes a pure `List Char` lemma, structurally provable:
|
||||||
|
|
||||||
|
```lean
|
||||||
|
theorem readIdent_app (chars rest : List Char) (acc : List Char) (fuel : Nat)
|
||||||
|
(hf : fuel > chars.length)
|
||||||
|
(hcs : ∀ c ∈ chars, isIdentRestChar c = true)
|
||||||
|
(hr : rest = [] ∨ ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) :
|
||||||
|
readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest) := by
|
||||||
|
induction chars generalizing acc fuel with
|
||||||
|
| nil => rcases fuel with _ | f
|
||||||
|
· simp at hf
|
||||||
|
· simp [readIdent]
|
||||||
|
cases hr with
|
||||||
|
| inl h => subst h; simp [readIdent]
|
||||||
|
| inr h => obtain ⟨c, r', hcr, hcb⟩ := h
|
||||||
|
subst hcr; simp [readIdent, hcb]
|
||||||
|
| cons d ds ih =>
|
||||||
|
rcases fuel with _ | f
|
||||||
|
· simp at hf
|
||||||
|
· have hd : isIdentRestChar d = true := hcs d (List.mem_cons_self _ _)
|
||||||
|
have hds : ∀ c ∈ ds, isIdentRestChar c = true :=
|
||||||
|
fun c hc => hcs c (List.mem_cons_of_mem _ hc)
|
||||||
|
have hf' : f > ds.length := by simp [List.length_cons] at hf; omega
|
||||||
|
have ih' := ih (acc ++ [d]) f hf' hds
|
||||||
|
simp [readIdent, hd]
|
||||||
|
rw [ih']
|
||||||
|
simp [List.append_assoc]
|
||||||
|
```
|
||||||
|
|
||||||
|
The String-internals lemma is replaced by `List.append_assoc` —
|
||||||
|
trivially provable.
|
||||||
|
|
||||||
|
**Cost**: ~50 lines refactor of `readIdent`/`readStrLit`/`tokenizeAux`,
|
||||||
|
~200 lines of proof.
|
||||||
|
|
||||||
|
### Path B: Use Mathlib's String API
|
||||||
|
|
||||||
|
Mathlib provides the structural String lemmas needed:
|
||||||
|
- `String.toList_append : (s ++ t).toList = s.toList ++ t.toList`
|
||||||
|
- `String.push_eq : s.push c = ⟨s.toList ++ [c]⟩` (or equivalent)
|
||||||
|
- `String.eq_iff_toList_eq` (extensionality)
|
||||||
|
|
||||||
|
With these, the original `String`-accumulator `readIdent` can be
|
||||||
|
proven correct without refactor. The proof structure is the same as
|
||||||
|
Path A but uses the Mathlib lemmas to bridge between `String.push`
|
||||||
|
and `List.cons`.
|
||||||
|
|
||||||
|
**Cost**: Add `mathlib` as a dependency to `infoductor` (already
|
||||||
|
present transitively via `Mathlib` import in some Foundation files,
|
||||||
|
but not directly used). Then ~200 lines of proof using Mathlib's
|
||||||
|
String API.
|
||||||
|
|
||||||
|
## Lemma-by-lemma proof plan
|
||||||
|
|
||||||
|
### Lemma 1: `readIdent_app`
|
||||||
|
|
||||||
|
**Statement** (Path A version):
|
||||||
|
```lean
|
||||||
|
theorem readIdent_app : ∀ (chars rest : List Char) (acc : List Char) (fuel : Nat),
|
||||||
|
fuel > chars.length →
|
||||||
|
(∀ c ∈ chars, isIdentRestChar c = true) →
|
||||||
|
(rest = [] ∨ ∃ c rest', rest = c :: rest' ∧ isIdentRestChar c = false) →
|
||||||
|
readIdent fuel (chars ++ rest) acc = (acc ++ chars, rest)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Proof**: structural induction on `chars`. Base case (`[]`) splits on
|
||||||
|
the boundary form. Inductive case unfolds `readIdent` once, applies the
|
||||||
|
ident-rest hypothesis, and recurses via IH with `acc ++ [d]`.
|
||||||
|
|
||||||
|
**Difficulty**: Low (Path A) / Medium (Path B).
|
||||||
|
|
||||||
|
### Lemma 2: `readStrLit_app`
|
||||||
|
|
||||||
|
**Statement**:
|
||||||
|
```lean
|
||||||
|
theorem readStrLit_app (s : String) (rest : List Char) (acc : List Char) (fuel : Nat) :
|
||||||
|
fuel > (escapeStrLitBody s).length + 1 →
|
||||||
|
readStrLit fuel ((escapeStrLitBody s).toList ++ '"' :: rest) acc =
|
||||||
|
some (acc ++ s.toList, rest)
|
||||||
|
```
|
||||||
|
|
||||||
|
Where `escapeStrLitBody s` is the inner part of `escapeStrLit` (without
|
||||||
|
the surrounding `"`).
|
||||||
|
|
||||||
|
**Proof**: induction on `s.toList`. Each char either escapes (`"` → `\"`,
|
||||||
|
`\` → `\\`) or passes through. The `readStrLit` decoder reverses the
|
||||||
|
escape exactly.
|
||||||
|
|
||||||
|
**Difficulty**: Medium. The escape/unescape pairing is straightforward
|
||||||
|
but tedious: 3 sub-cases per char (escape `"`, escape `\`, normal).
|
||||||
|
|
||||||
|
### Lemma 3: `tokenize_app_clean`
|
||||||
|
|
||||||
|
**Statement**:
|
||||||
|
```lean
|
||||||
|
theorem tokenize_app_clean (xs ys : List Char) :
|
||||||
|
isCleanlyTerminated xs →
|
||||||
|
tokenize (xs ++ ys) = tokenize xs ++ tokenize ys
|
||||||
|
```
|
||||||
|
|
||||||
|
Where `isCleanlyTerminated xs` says `xs` ends in a single-char-token
|
||||||
|
(paren), whitespace, or string-literal close — *not* in the middle of
|
||||||
|
an ident, num, or open string.
|
||||||
|
|
||||||
|
**Proof**: induction on `xs`. Use the foundation tokenize lemmas
|
||||||
|
(`tokenize_lparen_cons` etc.) for the constructor cases. The
|
||||||
|
"cleanly terminated" predicate ensures no token spans the boundary.
|
||||||
|
|
||||||
|
**Difficulty**: Medium. Defining `isCleanlyTerminated` precisely and
|
||||||
|
proving each clean-prefix case takes care.
|
||||||
|
|
||||||
|
### Lemma 4: `tokenize_render_name`
|
||||||
|
|
||||||
|
**Statement**:
|
||||||
|
```lean
|
||||||
|
theorem tokenize_render_name : ∀ (n : Lean.Name),
|
||||||
|
tokenize (nameToLeanSource n).toList = nameToTokens n
|
||||||
|
```
|
||||||
|
|
||||||
|
**Proof**: induction on `n`.
|
||||||
|
|
||||||
|
- **`.anonymous`**: `nameToLeanSource .anonymous = "Lean.Name.anonymous"`.
|
||||||
|
Tokenize this via `readIdent_app` (Lemma 1) with `rest = []`, gives
|
||||||
|
`[.ident "Lean.Name.anonymous"]`. Equals `nameToTokens .anonymous`.
|
||||||
|
|
||||||
|
- **`.str p s`**: `nameToLeanSource (.str p s) = "(Lean.Name.str " ++ nameToLeanSource p ++ " " ++ escapeStrLit s ++ ")"`.
|
||||||
|
|
||||||
|
Use `String.toList_append` (Mathlib) or the chars-version of the
|
||||||
|
renderer to break `(...)`.toList into:
|
||||||
|
`['('] ++ "Lean.Name.str ".toList ++ (nameToLeanSource p).toList ++ [' '] ++ (escapeStrLit s).toList ++ [')']`
|
||||||
|
|
||||||
|
Then apply `tokenize_app_clean` (Lemma 3) repeatedly:
|
||||||
|
- `'('` → `lparen :: tokenize ...`
|
||||||
|
- `"Lean.Name.str ".toList` → `ident "Lean.Name.str" :: tokenize ...`
|
||||||
|
(via `readIdent_app` with rest starting at `' '`)
|
||||||
|
- `(nameToLeanSource p).toList` → `nameToTokens p ++ tokenize ...`
|
||||||
|
(via IH)
|
||||||
|
- `[' ']` → `tokenize ...` (whitespace skip)
|
||||||
|
- `(escapeStrLit s).toList` → `[strLit s] ++ tokenize ...`
|
||||||
|
(via `readStrLit_app`)
|
||||||
|
- `[')']` → `[rparen]`
|
||||||
|
|
||||||
|
Combine: `[lparen, ident "Lean.Name.str"] ++ nameToTokens p ++ [strLit s, rparen]`
|
||||||
|
= `nameToTokens (.str p s)`. ✓
|
||||||
|
|
||||||
|
- **`.num p k`**: analogous to `.str`, using `numLit` instead of `strLit`.
|
||||||
|
|
||||||
|
**Difficulty**: Medium-High. Each step requires careful invocation of
|
||||||
|
Lemmas 1–3.
|
||||||
|
|
||||||
|
### Lemma 5: `tokenize_render_classifier`
|
||||||
|
|
||||||
|
**Statement**:
|
||||||
|
```lean
|
||||||
|
theorem tokenize_render_classifier : ∀ (φ : MetaClassifier),
|
||||||
|
tokenize (MetaClassifier.toLeanSource φ).toList = MetaClassifier.toTokens φ
|
||||||
|
```
|
||||||
|
|
||||||
|
**Proof**: 9 cases mirroring the 9 classifier arms. The atomic arms
|
||||||
|
(`.always`, `.never`) are direct. The `.atDecl`/`.inFile`/`.underAttribute`/`.dependencyOf`/`.inNamespace` cases use Lemmas 1+3+4. The `.meet`/`.join`
|
||||||
|
cases use Lemmas 1+3+IH.
|
||||||
|
|
||||||
|
**Difficulty**: Medium. Mostly mechanical after Lemmas 1–4.
|
||||||
|
|
||||||
|
### Lemma 6: `tokenize_render_cterm`
|
||||||
|
|
||||||
|
**Statement**:
|
||||||
|
```lean
|
||||||
|
theorem tokenize_render_cterm : ∀ (t : MetaCTerm),
|
||||||
|
tokenize (MetaCTerm.toLeanSource t).toList = MetaCTerm.toTokens t
|
||||||
|
```
|
||||||
|
|
||||||
|
**Proof**: 8 cases. Atomic: `.empty`. Compositional: `.ident` (uses
|
||||||
|
Lemma 4), `.sym` (uses Lemma 2), `.app` (uses Lemma 6 IH), `.lam`/`.plam`
|
||||||
|
(uses Lemma 2 + IH), `.comp`/`.transp` (uses Lemmas 2 + 5 + IH).
|
||||||
|
|
||||||
|
**Difficulty**: Medium-High. Most complex case is `.comp` with 5
|
||||||
|
sub-fields chained.
|
||||||
|
|
||||||
|
### Lemma 7: `tokenize_render_artifact`
|
||||||
|
|
||||||
|
**Statement**:
|
||||||
|
```lean
|
||||||
|
theorem tokenize_render_artifact : ∀ (a : MetaArtifact),
|
||||||
|
a.supported = true →
|
||||||
|
tokenize (MetaArtifact.toLeanSource a).toList = MetaArtifact.toTokens a
|
||||||
|
```
|
||||||
|
|
||||||
|
**Proof**: 5 cases. `.empty`/`.source`: direct. `.cterm`: uses Lemma 6.
|
||||||
|
`.refTo`: uses Lemma 4. `.declAt`: excluded by hypothesis.
|
||||||
|
|
||||||
|
**Difficulty**: Low after Lemmas 4–6.
|
||||||
|
|
||||||
|
### Composing: the universal String-level round-trip
|
||||||
|
|
||||||
|
```lean
|
||||||
|
theorem MetaCTerm.toLeanSource_round_trip : ∀ (t : MetaCTerm),
|
||||||
|
MetaCTerm.fromLeanSource? (MetaCTerm.toLeanSource t) = some t := by
|
||||||
|
intro t
|
||||||
|
unfold MetaCTerm.fromLeanSource? tokenizeStr
|
||||||
|
rw [tokenize_render_cterm t] -- Phase 3: Lemma 6
|
||||||
|
exact cTermFromTokens?_round_trip t -- Phase 2 (already proven)
|
||||||
|
```
|
||||||
|
|
||||||
|
Two-line proof composing Phase 2 with Lemma 6. Same shape for the
|
||||||
|
other three meta-mirror types.
|
||||||
|
|
||||||
|
## Estimated effort
|
||||||
|
|
||||||
|
| Lemma | Path A (refactor) | Path B (Mathlib) |
|
||||||
|
|-------|-------------------|------------------|
|
||||||
|
| 1. `readIdent_app` | 30 lines proof + 20 lines refactor | 50 lines proof |
|
||||||
|
| 2. `readStrLit_app` | 50 lines | 70 lines |
|
||||||
|
| 3. `tokenize_app_clean` | 80 lines (incl. predicate def) | 80 lines |
|
||||||
|
| 4. `tokenize_render_name` | 40 lines | 40 lines |
|
||||||
|
| 5. `tokenize_render_classifier` | 90 lines | 90 lines |
|
||||||
|
| 6. `tokenize_render_cterm` | 120 lines | 120 lines |
|
||||||
|
| 7. `tokenize_render_artifact` | 30 lines | 30 lines |
|
||||||
|
| Composition | 16 lines | 16 lines |
|
||||||
|
| **Total** | **~470 lines** | **~500 lines** |
|
||||||
|
|
||||||
|
Approximately **2–3 focused days** of Lean proof work for either path.
|
||||||
|
|
||||||
|
## Recommendation
|
||||||
|
|
||||||
|
**Path A (refactor)** is preferred:
|
||||||
|
|
||||||
|
- Stays within Lean core (no new dependencies).
|
||||||
|
- The refactor is small (~20 lines) and independently sound.
|
||||||
|
- Decouples *all* Phase 3 proofs from String internals — no future
|
||||||
|
proof has to wrestle with `String.push`/`String.append` again.
|
||||||
|
- The `List Char` accumulator is arguably more Eulerian: tokens are
|
||||||
|
conceptually sequences of characters, and operating on lists makes
|
||||||
|
the equational reasoning transparent.
|
||||||
|
|
||||||
|
The refactor is itself the small Eulerian move; the rest of the
|
||||||
|
proofs follow naturally.
|
||||||
|
|
||||||
|
## What this unlocks
|
||||||
|
|
||||||
|
With Phase 3 complete, the bridge has:
|
||||||
|
|
||||||
|
```
|
||||||
|
∀ t : MetaCTerm, fromLeanSource? (toLeanSource t) = some t
|
||||||
|
∀ φ : MetaClassifier, fromLeanSource? (toLeanSource φ) = some φ
|
||||||
|
∀ n : Lean.Name, nameFromLeanSource? (nameToLeanSource n) = some n
|
||||||
|
∀ a : MetaArtifact, a.supported → fromLeanSource? (toLeanSource a) = some a
|
||||||
|
```
|
||||||
|
|
||||||
|
The full String-level universal round-trip — provably, by structural
|
||||||
|
induction, with no closed-instance restrictions. The `Edit.apply`
|
||||||
|
output (Lean source files written to disk) is then provably
|
||||||
|
round-trippable: any consumer reading the file back gets the original
|
||||||
|
`MetaCTerm`.
|
||||||
|
|
||||||
|
This is the operational soundness of the entire bridge at the source
|
||||||
|
level — what topolei or any external consumer needs to trust the
|
||||||
|
edit-render-parse loop.
|
||||||
Loading…
Add table
Reference in a new issue