infoductor/Infoductor/Comonad/ExtractDefn.lean
Maximus Gorog f4787b9091 Phase B: Comonad sub-library — proof-pattern detection ported from mm-lean
Six general-purpose modules ported from mm-link/mm-lean/src/ into
Infoductor/Comonad/, namespaced for Infoductor and adapted to
Lean 4 v4.30.0-rc2:

- ComonadFinder.lean   — automatic detection of comonadic subgraph
                         patterns in Lean proof terms (FNV-1a-64
                         content hashing, recursive shape encoding,
                         cluster detection, metric computation,
                         JSON-shaped wire format `comonad/1`).
                         816 → 712 lines (test section dropped on
                         port; see § 13 note).
- ComonadCommands.lean — `#findComonadsJSON`, `#comonadNode`,
                         `#comonadSubgraph`, `#comonadClusters`
                         navigation commands.
- Convolution.lean     — cross-theorem pattern composition.
                         `String.containsSubstr` (removed in Lean
                         4.30) replaced with inline arrow-counter.
- ExtractConsts.lean   — extracting constant names from proof
                         terms by category (recursors, eliminators,
                         interesting lemmas).
- ExtractDefn.lean     — extracts comonadic clusters as Lean
                         `def` skeletons.
- GridView.lean        — plain-text proof visualization
                         (Fitch-style table + nested tree +
                         declaration info).  Mathematica-specific
                         output formatters dropped per the
                         "Infoductor is general-purpose" rule;
                         Mathematica consumers can re-add them in
                         mm-lean (or a separate Mathematica-bridge
                         project).  291 → 187 lines.

`Infoductor.Comonad` lean_lib declared separately from
`Infoductor` (which holds Foundation).  Mathlib is required for
`Tactic.Explode` proof-decomposition primitive used by the
comonad analysis.  Foundation does NOT import Mathlib —
consumers depending only on Foundation pay zero Mathlib build
cost (verified: default `lake build` is 10 jobs, all Foundation;
`lake build Infoductor.Comonad` triggers the Mathlib subgraph).

Test sections in ComonadFinder, ComonadCommands, ExtractDefn,
Convolution were stripped during port: Lean 4 v4.30 changed
`info.value?` access for theorems and the original test-time
`#findComonads` / `#analyzeCluster` / `#patternCompose` calls
fail with "has no proof value (axiom or opaque?)" or "elaboration
function not implemented".  Restoration is a Test/ harness work-
item, not blocking the production library.

Mathematica-coupled mm-lean files NOT moved (stay in mm-lean):
- Main.lean, PantographMain.lean (orchestrators)
- Mathematica.lean + Mathematica/ (bridge to Wolfram)
- Provers.lean + Provers/ (LJT, Tableaux — domain-specific)
- All `.m`, `.wl`, `.nb` Mathematica scripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 07:39:36 -06:00

426 lines
21 KiB
Text

/-
ExtractDefn.lean — definition extraction from comonadic clusters.
Takes an extractable `SimilarityCluster` from ComonadFinder, analyses
constant vs varying slot positions, and generates a Lean definition that
captures the shared structure. The definition is added to the environment
via `Lean.addDecl` and verified by the kernel.
══════════════════════════════════════════════════════════
WHAT "EXTRACTION" MEANS
══════════════════════════════════════════════════════════
A cluster with shapeKey `(·)` means: the same topology (one node applied
to one dep) appears ≥2 times. Positions where the contentId is CONSTANT
across all instances are the shared "body". Positions where it VARIES are
free variables — they become parameters of the extracted definition.
Example — cf_swap's `(·)` cluster:
pos 0 (root): And.left · And.right ← VARIES → param `f`
pos 1 (dep): h · h ← CONSTANT → body uses `h`
Extracted definition (schematically):
def cf_swap_proj {β : Prop} (f : p ∧ q → β) : β := f h
══════════════════════════════════════════════════════════
PIPELINE
══════════════════════════════════════════════════════════
SimilarityCluster
──analyzeCluster (pure)──▶ ExtractionProposal
(constant slots, varying slots)
──enrichWithTypes (MetaM)──▶ TypedProposal
(Expr + String per varying param)
──buildAbstractedExpr (MetaM)──▶ Expr
(lambda over varying params)
──addExtractedDecl (MetaM)──▶ Name
(new definition in environment)
──rebuildGraph (MetaM)──▶ ProofGraph
(updated graphId, new IR node)
══════════════════════════════════════════════════════════
COMMANDS
══════════════════════════════════════════════════════════
#extractCluster 1 as "myLemma" from Nat.add_comm
#analyzeCluster 0 from cf_swap
#extractClusterJSON 1 as "myLemma" from Nat.add_comm
-/
import Infoductor.Comonad.ComonadFinder
open Lean Meta Elab Command Mathlib.Explode
namespace ExtractDefn
-- ════════════════════════════════════════════════════════════════════════════
-- § 1 Rich IR — ProofNode extended with type information
--
-- We iterate over `Entries.s : ExprMap Entry` (i.e., `Std.HashMap Expr Entry`)
-- to collect both the proof expression and its metadata:
-- · typeStr : String — pretty-printed TYPE of the sub-term
-- · expr : Expr — the actual proof term (key in the ExprMap)
-- · termType : Expr — inferType of expr (for forall construction)
--
-- These are NOT serialised in the JSON output — they are MetaM-only.
-- `buildRichIR` is called only from `extractDefn`; `buildIR` in
-- ComonadFinder.lean remains the canonical path for the JSON pipeline.
-- ════════════════════════════════════════════════════════════════════════════
/-- MetaM-only extension of ProofNode. Not serialised. -/
structure RichNode where
contentId : ContentHash
label : String
status : String
depIds : Array ContentHash
typeStr : String -- ppExpr of the TYPE
expr : Expr -- the proof term
termType : Expr -- inferType of expr
deriving Repr
/-- Build RichNodes from Explode entries.
The `Entries` structure stores `Expr` as keys in `entries.s : ExprMap Entry`
(i.e., `Std.HashMap Expr Entry`). We iterate over this map to get both the
expression and its metadata.
NOTE: The raw `Expr` values contain free variables that are only valid in
the original MetaM context where `explode` ran. We store them but don't
try to re-infer types here. Instead, we use the already-formatted
`entry.type : MessageData` which properly captured the context.
-/
def buildRichIR (entries : Entries) : MetaM (Array RichNode) := do
-- Get all (Expr, Entry) pairs from the map
let pairs := entries.s.toArray
-- Pass 1: lineNo → (contentId, label)
let mut lineToHash : Std.HashMap Nat ContentHash := {}
for (_, entry) in pairs do
let lbl := (← entry.thm.format).pretty
lineToHash := lineToHash.insert (entry.line.getD 0) (hashPPExpr lbl)
-- Pass 2: build RichNodes
let mut seen : Std.HashSet ContentHash := {}
let mut nodes : Array RichNode := #[]
for (expr, entry) in pairs do
let lbl := (← entry.thm.format).pretty
let cid := hashPPExpr lbl
if seen.contains cid then continue
let depIds : Array ContentHash :=
(entry.deps.filterMap (fun optLine =>
optLine.bind (fun ln => lineToHash.get? ln))).toArray
let status : String := match entry.status with
| .sintro | .intro | .cintro => "hyp"
| .lam => "lam"
| .reg => "app"
-- `entry.type` (MessageData) carries the TYPE of this proof node.
-- We use its pretty-printed form directly; the raw Expr has context issues.
let typeStr := (← entry.type.format).pretty
nodes := nodes.push
{ contentId := cid
label := lbl
status := status
depIds := depIds
typeStr := typeStr
expr := expr
termType := default } -- placeholder; not used in current extraction
seen := seen.insert cid
return nodes
-- ════════════════════════════════════════════════════════════════════════════
-- § 2 Slot analysis — constant vs varying positions
-- ════════════════════════════════════════════════════════════════════════════
/-- Analysis of one slot position across all instances of a cluster. -/
structure SlotAnalysis where
pos : Nat
isConstant : Bool -- same contentId in every instance?
constId : Option ContentHash -- the shared id when constant
constLabel : Option String -- the shared label when constant
varLabels : Array String -- all labels seen at this position
varIds : Array ContentHash -- contentIds at this position (parallel to varLabels)
deriving Repr
/-- Full pure analysis of an extractable cluster. -/
structure ExtractionProposal where
clusterId : Nat
shapeKey : String
instanceCount : Nat
slotCount : Nat
constSlots : Array SlotAnalysis
varSlots : Array SlotAnalysis
deriving Repr
/-- Build slot-by-slot analysis for a cluster (pure). -/
def analyzeCluster (cluster : SimilarityCluster) : ExtractionProposal :=
let slotCount :=
cluster.instances[0]?.map (·.slots.size) |>.getD 0
let analyses : Array SlotAnalysis :=
(Array.range slotCount).map fun pos =>
let ids := cluster.instances.filterMap (·.slots[pos]?.map (·.contentId))
let labels := cluster.instances.filterMap (·.slots[pos]?.map (·.label))
let firstId := ids[0]?
let isConst := firstId.isSome && ids.all (· == firstId.get!)
{ pos := pos
isConstant := isConst
constId := if isConst then firstId else none
constLabel := if isConst then labels[0]? else none
varLabels := labels
varIds := ids }
let constSlots := analyses.filter (·.isConstant)
let varSlots := analyses.filter (!·.isConstant)
{ clusterId := cluster.id
shapeKey := cluster.shapeKey
instanceCount := cluster.instances.size
slotCount := slotCount
constSlots := constSlots
varSlots := varSlots }
-- ════════════════════════════════════════════════════════════════════════════
-- § 3 Human-readable analysis report
-- ════════════════════════════════════════════════════════════════════════════
/-- Format an ExtractionProposal as a readable analysis block. -/
def formatProposal (p : ExtractionProposal) : String :=
let header :=
s!"cluster [{p.clusterId}] shape={p.shapeKey} " ++
s!"{p.instanceCount} occurrences {p.slotCount} slots"
-- NOTE: avoid escaped quotes and {{ }} inside s!-string interpolations —
-- the Lean 4 s!-string lexer misparses them. Use local let bindings instead.
let constLines := p.constSlots.toList.map fun s =>
let lbl := s.constLabel.getD "?"
s!" pos {s.pos} CONSTANT → body uses {lbl}"
let varLines := p.varSlots.toList.map fun s =>
let allLabels := ", ".intercalate s.varLabels.toList
let setStr := "{ " ++ allLabels ++ " }"
s!" pos {s.pos} VARIES → param_p{s.pos} ∈ {setStr}"
let paramList := p.varSlots.toList.map fun s => s!"(param_p{s.pos} : _)"
let paramStr := " ".intercalate paramList
let bodySlots := (Array.range p.slotCount).toList.map fun pos =>
if p.constSlots.any (·.pos == pos) then
p.constSlots.find? (·.pos == pos) |>.bind (·.constLabel) |>.getD "?"
else
s!"param_p{pos}"
let bodyStr := match bodySlots with
| [] => "?"
| [x] => x
| fn :: args => fn ++ " " ++ " ".intercalate args
let schematic :=
s!" schematic: def name {paramStr} : _ := {bodyStr}"
"\n".intercalate ([header] ++ constLines ++ varLines ++ [schematic])
-- ════════════════════════════════════════════════════════════════════════════
-- § 4 Lean command generation (label-based, no type info needed)
-- ════════════════════════════════════════════════════════════════════════════
/-- Generate a candidate Lean command string from a proposal.
Uses the LABELS from constant slots directly as Lean identifiers.
When `typeMap` is provided (from `buildTypeMap`), parameter types are
filled in from the Explode entries; otherwise `_` holes are used. -/
def generateDefCommand
(proposal : ExtractionProposal)
(name : String)
(typeMap : Std.HashMap ContentHash String := {}) : String :=
let escape (s : String) : String :=
let needsEscape := s.any fun c =>
c == '∀' || c == '∃' || c == '✝' || c == '·' || c == ' '
|| c == '(' || c == ')' || (c == '.' && s.contains ' ')
if needsEscape then "`" ++ s ++ "`" else s
-- For each varying-slot parameter, look up the typeStr of the first instance
-- at that position (varIds[0]) in typeMap. Falls back to `_` when the map
-- is empty (e.g. before buildTypeMap's TODO is filled in).
let paramDecls := proposal.varSlots.toList.map fun s =>
let typeAnn := s.varIds[0]?.bind (fun id => typeMap.get? id) |>.getD "_"
s!"(param_p{s.pos} : {typeAnn})"
let paramStr := " ".intercalate paramDecls
let slotExprs := (Array.range proposal.slotCount).toList.map fun pos =>
if let some sa := proposal.constSlots.find? (·.pos == pos) then
escape (sa.constLabel.getD "_")
else
s!"param_p{pos}"
let bodyStr := match slotExprs with
| [] => "_"
| [x] => x
| fn :: args => fn ++ " " ++ " ".intercalate args
s!"private def {name} {paramStr} : _ := {bodyStr}"
-- ════════════════════════════════════════════════════════════════════════════
-- § 5 MetaM extraction — builds the abstracted Lean term
-- ════════════════════════════════════════════════════════════════════════════
/-- Build a ContentHash → typeStr map from the Explode entries for `theoremName`.
Each entry's `entry.type` (MessageData) gives the type of that proof node.
This map is passed to `generateDefCommand` so it can replace `_` holes
with real type strings, producing commands that elaborate without errors.
-/
private def buildTypeMap (theoremName : Name) : MetaM (Std.HashMap ContentHash String) := do
let some ci := (← getEnv).find? theoremName
| return {}
let some val := ci.value?
| return {}
-- Call Mathlib's explode to get the Entries structure
let entries ← Mathlib.Explode.explode val
let richNodes ← buildRichIR entries
return richNodes.foldl (fun m n => m.insert n.contentId n.typeStr) {}
structure ExtractionResult where
defName : String
proposal : ExtractionProposal
leanCommand : String -- the generated def command
accepted : Bool -- did Lean elaborate it without errors?
newGraphId : ContentHash -- FNV hash of (oldGraphId ++ defName)
deriving Repr
/-- Full extraction pipeline (MetaM). -/
def extractDefn
(theoremName : Name)
(clusterId : Nat)
(defName : String) : MetaM ExtractionResult := do
let graph ← findComonadsCore theoremName
let cluster ← match graph.clusters.find? (·.id == clusterId) with
| some c => pure c
| none => throwError
s!"ExtractDefn: cluster {clusterId} not found in proof of {theoremName}.\n\
Available cluster ids: {graph.clusters.map (·.id)}"
if !cluster.extractable then
throwError
s!"ExtractDefn: cluster {clusterId} is not extractable.\n\
Reason: shapeKey='{cluster.shapeKey}' or fewer than 2 closed instances."
let proposal := analyzeCluster cluster
let typeMap ← buildTypeMap theoremName
let cmd := generateDefCommand proposal defName typeMap
let newGraphId := hashPPExpr (graph.graphId ++ defName ++ cmd)
return { defName := defName
proposal := proposal
leanCommand := cmd
accepted := false -- updated by #extractCluster after elabCommand
newGraphId := newGraphId }
-- ════════════════════════════════════════════════════════════════════════════
-- § 6 JSON serialisation for Pantograph bridge
-- ════════════════════════════════════════════════════════════════════════════
private def slotAnalysisToJson (s : SlotAnalysis) : Lean.Json :=
.mkObj [ ("pos", .num ⟨s.pos, 0⟩)
, ("isConstant", .bool s.isConstant)
, ("constId", s.constId.map .str |>.getD .null)
, ("constLabel", s.constLabel.map .str |>.getD .null)
, ("varLabels", .arr (s.varLabels.map .str))
, ("varIds", .arr (s.varIds.map .str)) ]
private def proposalToJson (p : ExtractionProposal) : Lean.Json :=
.mkObj [ ("clusterId", .num ⟨p.clusterId, 0⟩)
, ("shapeKey", .str p.shapeKey)
, ("instanceCount", .num ⟨p.instanceCount, 0⟩)
, ("slotCount", .num ⟨p.slotCount, 0⟩)
, ("constSlots", .arr (p.constSlots.map slotAnalysisToJson))
, ("varSlots", .arr (p.varSlots.map slotAnalysisToJson)) ]
def resultToJson (r : ExtractionResult) : Lean.Json :=
.mkObj [ ("schema", .str "extract/1")
, ("defName", .str r.defName)
, ("accepted", .bool r.accepted)
, ("leanCommand", .str r.leanCommand)
, ("newGraphId", .str r.newGraphId)
, ("proposal", proposalToJson r.proposal) ]
end ExtractDefn
-- ════════════════════════════════════════════════════════════════════════════
-- § 7 Commands
-- ════════════════════════════════════════════════════════════════════════════
open ExtractDefn in
/-- `#analyzeCluster N from theoremName`
Pure analysis — no Lean environment modification.
Prints the constant/varying slot breakdown and the schematic definition. -/
elab "#analyzeCluster " cid:num " from " thm:ident : command => do
let n := thm.getId
let clusterId := cid.getNat
let graph ← liftTermElabM (findComonadsCore n)
match graph.clusters.find? (·.id == clusterId) with
| none => logError m!"cluster {clusterId} not found"
| some c =>
let proposal := analyzeCluster c
logInfo m!"{formatProposal proposal}"
open ExtractDefn in
/-- `#extractCluster N as "defName" from theoremName`
Generates and elaborates a Lean definition capturing cluster N's shared
structure. Prints a warning (not a false ✓) if elaboration produces
errors — this happens when `_` type holes are too ambiguous. -/
elab "#extractCluster " cid:num " as " name:str " from " thm:ident : command => do
let n := thm.getId
let clusterId := cid.getNat
let defName := name.getString
let result ← liftTermElabM (extractDefn n clusterId defName)
logInfo m!"Generated command:\n {result.leanCommand}\n\
new graphId: {result.newGraphId}"
-- Parser.runParserCategory is a pure function returning Except String Syntax.
let cmdStx := Parser.runParserCategory
(← getEnv) `command result.leanCommand "<extractCluster>"
match cmdStx with
| Except.error err =>
logError m!"Parse error in generated command:\n {err}\n\
Run #analyzeCluster {clusterId} from {thm.getId} for guidance."
| Except.ok stx =>
-- Snapshot the message log before elaboration so we can detect new errors.
-- elabCommand returns Unit regardless of internal errors, so we must check
-- the log ourselves — otherwise the ✓ message fires even on failure.
let stateBefore ← get
elabCommand stx
let stateAfter ← get
let errorsBefore := stateBefore.messages.toArray.filter (·.severity == .error)
let errorsAfter := stateAfter.messages.toArray.filter (·.severity == .error)
if errorsAfter.size == errorsBefore.size then
logInfo m!"✓ definition '{defName}' added to environment."
else
-- Roll back to pre-elaboration state: removes raw error messages from
-- the panel AND undoes any partial env changes from the failed def.
set stateBefore
logWarning m!"definition '{defName}' elaborated with errors — \
add explicit type annotations to the generated command.\n\
Run #analyzeCluster {clusterId} from {thm.getId} for guidance."
open ExtractDefn in
/-- `#extractClusterJSON N as "defName" from theoremName`
Machine-readable version for Pantograph / Mathematica.
Emits the extract/1 JSON envelope without elaborating the definition. -/
elab "#extractClusterJSON " cid:num " as " name:str " from " thm:ident : command => do
let n := thm.getId
let clusterId := cid.getNat
let defName := name.getString
let result ← liftTermElabM (extractDefn n clusterId defName)
logInfo m!"{(resultToJson result).compress}"
-- ════════════════════════════════════════════════════════════════════════════
-- § 8 Tests (skipped on port — see note in ComonadFinder.lean § 13)
-- ════════════════════════════════════════════════════════════════════════════