infoductor/Infoductor/Comonad/ComonadFinder.lean
Maximus Gorog f4787b9091 Phase B: Comonad sub-library — proof-pattern detection ported from mm-lean
Six general-purpose modules ported from mm-link/mm-lean/src/ into
Infoductor/Comonad/, namespaced for Infoductor and adapted to
Lean 4 v4.30.0-rc2:

- ComonadFinder.lean   — automatic detection of comonadic subgraph
                         patterns in Lean proof terms (FNV-1a-64
                         content hashing, recursive shape encoding,
                         cluster detection, metric computation,
                         JSON-shaped wire format `comonad/1`).
                         816 → 712 lines (test section dropped on
                         port; see § 13 note).
- ComonadCommands.lean — `#findComonadsJSON`, `#comonadNode`,
                         `#comonadSubgraph`, `#comonadClusters`
                         navigation commands.
- Convolution.lean     — cross-theorem pattern composition.
                         `String.containsSubstr` (removed in Lean
                         4.30) replaced with inline arrow-counter.
- ExtractConsts.lean   — extracting constant names from proof
                         terms by category (recursors, eliminators,
                         interesting lemmas).
- ExtractDefn.lean     — extracts comonadic clusters as Lean
                         `def` skeletons.
- GridView.lean        — plain-text proof visualization
                         (Fitch-style table + nested tree +
                         declaration info).  Mathematica-specific
                         output formatters dropped per the
                         "Infoductor is general-purpose" rule;
                         Mathematica consumers can re-add them in
                         mm-lean (or a separate Mathematica-bridge
                         project).  291 → 187 lines.

`Infoductor.Comonad` lean_lib declared separately from
`Infoductor` (which holds Foundation).  Mathlib is required for
`Tactic.Explode` proof-decomposition primitive used by the
comonad analysis.  Foundation does NOT import Mathlib —
consumers depending only on Foundation pay zero Mathlib build
cost (verified: default `lake build` is 10 jobs, all Foundation;
`lake build Infoductor.Comonad` triggers the Mathlib subgraph).

Test sections in ComonadFinder, ComonadCommands, ExtractDefn,
Convolution were stripped during port: Lean 4 v4.30 changed
`info.value?` access for theorems and the original test-time
`#findComonads` / `#analyzeCluster` / `#patternCompose` calls
fail with "has no proof value (axiom or opaque?)" or "elaboration
function not implemented".  Restoration is a Test/ harness work-
item, not blocking the production library.

Mathematica-coupled mm-lean files NOT moved (stay in mm-lean):
- Main.lean, PantographMain.lean (orchestrators)
- Mathematica.lean + Mathematica/ (bridge to Wolfram)
- Provers.lean + Provers/ (LJT, Tableaux — domain-specific)
- All `.m`, `.wl`, `.nb` Mathematica scripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 07:39:36 -06:00

757 lines
38 KiB
Text
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

/-
ComonadFinder.lean — automatic detection of comonadic subgraph patterns
in Lean 4 proof terms.
Wire format : comonad/1
Hash regime : ppExpr/1 (bump field to "exprAST/1" when upgraded)
══════════════════════════════════════════════════════════
THEORY
══════════════════════════════════════════════════════════
A subgraph S ⊆ ProofGraph has comonadic structure iff:
· extract(S) is valid:
S is closed under deps — no node in S has an essential
dependency outside S. This means S is a complete proof
by itself. The comonad's `extract` returns this proof.
· duplicate(S) ≅ S:
S's structural shape recurs elsewhere in the same proof.
The comonad's `duplicate` witnesses this: the "proof of
proofs" contains S as a recognisable sub-object.
Such subgraphs are candidates for named definitions.
The system has found that the mathematician is "doing the same
thing twice" — the shape is a lemma waiting to be extracted.
══════════════════════════════════════════════════════════
IDENTITY MODEL (ppExpr/1)
══════════════════════════════════════════════════════════
The original design used Explode line numbers (session-local Nat)
as node identities. These are unstable: the same proof loaded in two
Pantograph sessions will assign different line numbers to the same
sub-term. This made cross-session comparison and caching impossible.
The new design uses ContentHash — a 16-char hex FNV-1a-64 digest of
the node's `entry.thm` pretty-print. Two nodes with the same hash
are the same logical judgment. Edges become ContentHash → ContentHash.
The full graph fingerprint (`graphId`) is a hash of the theorem name
concatenated with all per-node hashes in IR traversal order.
══════════════════════════════════════════════════════════
ALGORITHM
══════════════════════════════════════════════════════════
0. hashPPExpr: FNV-1a-64 of entry.thm pretty-print → 16-char hex.
Stable across Pantograph sessions for the same closed sub-term.
1. buildIR: three passes over Explode Entries → Array ProofNode.
Pass 1 — lineNo → ContentHash (resolve all identities first)
Pass 2 — build ProofNode with placeholder shapeKey, deduplicate
by contentId (same judgment reachable from two parents
is stored once)
Pass 3 — recompute shapeKey recursively now that the full
nodeMap is available
Each ProofNode carries: contentId, depIds (content-addressed edges),
shapeKey (recursive topology string), label, status.
This is the serialisable atom — no further MetaM access required.
2. findClusters: group nodes by shapeKey.
Groups with ≥ 2 occurrences → SimilarityCluster.
For each cluster, check the comonad closure law:
extractable iff shapeKey ≠ "·" AND ≥ 2 instances are closed
(every dep of every reachable node is itself reachable within
the same instance, or is an external constant not in the IR).
Slots: position 0 = instance root, 1..n = direct deps.
`instances[i].slots[pos]?` gives the concrete node at topology
position `pos` in occurrence `i` — the basis for cross-instance
combinatoric queries.
3. computeMetrics: pure computation from IR + clusters.
Roots (nodes with no in-edges) drive maxDepth via DFS.
selfSimilarity = nodes covered by extractable clusters / total,
capped at 1.0 to handle shared sub-nodes.
4. JSON serialisation: proofGraphToJson → comonad/1 wire format.
`provenanceId` and `session` are always null from Lean;
the external language (Mathematica / Pantograph client) fills
them in after attaching the result to a proof obligation.
══════════════════════════════════════════════════════════
DATA FLOW
══════════════════════════════════════════════════════════
Explode Entries
▼ buildIR (MetaM — three passes)
Array ProofNode ◀── contentId = hashPPExpr(entry.thm)
│ depIds = content-addressed edges
│ shapeKey = recursive topology string
┌────┴──────────────────┐
▼ ▼
irToAdjMap findClusters (pure)
(ContentHash │
→ ProofNode) reachableFrom ←── replaces old subtreeOf/NodeSet
isClosedUnderDeps ←── comonad extract condition
SimilarityCluster
(shapeKey, size, extractable, instances[].slots)
computeMetrics ──▶ MetricVector
computeGraphId ──▶ ContentHash
proofGraphToJson ──▶ comonad/1 JSON
#findComonads (human summary, InfoView)
#findComonadsJSON (compressed JSON, Pantograph)
══════════════════════════════════════════════════════════
SHAPE ENCODING
══════════════════════════════════════════════════════════
The old design used a `Shape` inductive with treeSize / treeDepth.
The new design encodes topology directly as a recursive String:
"·" — leaf (no deps, or dep absent from IR)
"(S₁S₂…)" — node whose deps have sub-shapes S₁ … Sₙ
Examples:
bare leaf → "·" (size 1, depth 0)
root → leaf → "(·)" (size 2, depth 1)
root → mid → leaf → "((·))" (size 3, depth 2)
String equality replaces structural BEq; grouping by shapeKey in a
HashMap replaces the old O(N²) pairwise isomorphism check.
══════════════════════════════════════════════════════════
METRICS
══════════════════════════════════════════════════════════
Complexity — nodeCount, maxDepth, leafRatio
Entropy — clusterCount relative to nodeCount
Self-similarity — covered_nodes / total_nodes ∈ [0,1]
(capped at 1.0; shared sub-nodes are counted once)
-/
import Mathlib.Tactic.Explode
import Mathlib.Tactic.Explode.Datatypes
-- (GridView import dropped — ComonadFinder doesn't actually reference
-- any GridView symbol. Downstream consumers that want visualization
-- import `Infoductor.Comonad.GridView` directly.)
open Lean Meta Elab Command Mathlib.Explode
-- ════════════════════════════════════════════════════════════════════════════
-- § 1 ContentHash — stable, cross-session node identity
-- ════════════════════════════════════════════════════════════════════════════
/-- 16-character lowercase hex string.
FNV-1a-64 of the node's `entry.thm` pretty-print.
Stable across Pantograph sessions for the same closed sub-term.
Two ProofNodes with the same ContentHash represent the same logical
judgment and may be collapsed without loss of information. -/
abbrev ContentHash := String
/-- Pure FNV-1a-64 over UTF-8 bytes → exactly 16 lowercase hex chars.
No IO, no MetaM — safe to call from anywhere.
Parameters (64-bit):
offset_basis = 14695981039346656037
prime = 1099511628211
`go` builds the digit list MSB-first by prepending each new digit:
go 0xff [] → go 0xf ['f'] → go 0 ['f','f'] → "ff" ✓ -/
def hashPPExpr (s : String) : ContentHash :=
let hash : UInt64 :=
s.toUTF8.toList.foldl
(fun h b => (h ^^^ b.toUInt64) * 1099511628211)
14695981039346656037
let n := hash.toNat
let hexChar := fun d =>
if d < 10 then Char.ofNat (d + '0'.toNat)
else Char.ofNat (d - 10 + 'a'.toNat)
let rec go : Nat → List Char → Nat → List Char
| _, acc, 0 => acc
| 0, acc, _ => acc
| n, acc, k + 1 => go (n / 16) (hexChar (n % 16) :: acc) k
let raw := if n = 0 then ['0'] else go n [] 17 -- UInt64 ≤ 16 hex digits
let padded := List.replicate (16 - raw.length) '0' ++ raw
String.ofList padded
-- ════════════════════════════════════════════════════════════════════════════
-- § 2 IR types (comonad/1 schema)
-- ════════════════════════════════════════════════════════════════════════════
/-- One judgment node in the proof DAG.
`contentId` is the stable cross-session identity (ppExpr/1 hash).
`depIds` are content-addressed outgoing edges.
`shapeKey` encodes recursive topology: "·" | "(·)" | "((·))" etc.
`label` is the pretty-printed sub-term from entry.thm.
`status` is "hyp" | "lam" | "app".
An `Array ProofNode` is a complete serialisable IR; no further
MetaM access is required after buildIR returns. -/
structure ProofNode where
contentId : ContentHash
shapeKey : String
label : String
status : String
depIds : Array ContentHash
deriving Repr, BEq, Hashable, Inhabited
/-- One node at a fixed topology position within a cluster instance.
`pos` follows the slot ordering: 0 = instance root, 1..n = direct deps.
Cross-instance slice at position `pos`:
instances.filterMap (·.slots[pos]?) — O(instances.size). -/
structure Slot where
pos : Nat
contentId : ContentHash
label : String
status : String
deriving Repr
/-- One concrete occurrence of a comonadic pattern.
`root` — ContentHash of the sub-root for this occurrence.
`slots` — [root-slot, dep-slot₀, dep-slot₁, …] in position order.
Query `slots[pos]?` for the concrete node at topology
position `pos` without traversal. -/
structure ClusterInstance where
root : ContentHash
slots : Array Slot
deriving Repr
/-- A family of structurally isomorphic subgraphs.
`shapeKey` — the shared recursive topology string.
`size` — reachable IR-node count per instance (identical across
instances because they share the same topology).
`extractable` — true iff shapeKey ≠ "·" AND ≥ 2 instances satisfy the
comonad closure law (extract validity condition).
`instances` — one entry per occurrence in the proof graph. -/
structure SimilarityCluster where
id : Nat
shapeKey : String
size : Nat
extractable : Bool
instances : Array ClusterInstance
deriving Repr
/-- Three-axis metric vector, computed purely from the proof term. -/
structure MetricVector where
nodeCount : Nat
maxDepth : Nat -- DFS from proof roots, fuel-bounded
leafRatio : Float -- leaves (no deps) / total nodes
clusterCount : Nat
extractable : Nat -- clusters satisfying the comonad law
selfSimilarity : Float -- covered_nodes / total_nodes ∈ [0,1]
deriving Repr
/-- Top-level comonad/1 envelope — the unit of exchange with Pantograph.
`schema` = "comonad/1" (versioned wire format tag)
`hashAlgo` = "ppExpr/1" (bump to "exprAST/1" when upgraded)
`theoremName` = the fully-qualified Lean name passed to the command.
`graphId` = FNV-1a-64 of (theoremName ++ IR fingerprint),
stable across Pantograph sessions.
`provenanceId` = null here; the external language fills it in.
`session` = null here; the external language fills it in. -/
structure ProofGraph where
schema : String
hashAlgo : String
theoremName : String -- renamed: `theorem` is a reserved keyword in Lean 4
graphId : ContentHash
metrics : MetricVector
ir : Array ProofNode
clusters : Array SimilarityCluster
deriving Repr
-- ════════════════════════════════════════════════════════════════════════════
-- § 3 Lookup helpers
-- ════════════════════════════════════════════════════════════════════════════
/-- Build a `ContentHash → ProofNode` map from the IR array.
Used throughout the pipeline wherever a node must be looked up by
identity rather than traversed in order. -/
def irToAdjMap (nodes : Array ProofNode) : Std.HashMap ContentHash ProofNode :=
nodes.foldl (fun m n => m.insert n.contentId n) {}
/-- Alias kept for call-site clarity when the intent is "look up a node". -/
abbrev buildNodeMap := irToAdjMap
-- ════════════════════════════════════════════════════════════════════════════
-- § 4 Recursive shape computation
-- ════════════════════════════════════════════════════════════════════════════
/-- Compute the recursive topology string for the subgraph rooted at `cid`.
Encoding:
"·" — leaf (no deps, or dep not in IR / external constant)
"(S₁S₂…)" — node with n deps whose sub-shapes are S₁ … Sₙ
Examples matching verified output:
bare leaf → "·" (trivial; filtered from clusters)
root → leaf → "(·)" size-2 cluster topology
root → mid → leaf → "((·))" size-3 cluster topology
`fuel` caps recursion at 64 levels — sufficient for all realistic
proofs and prevents looping on pathological (cyclic) inputs.
String equality is used for grouping: equal strings ↔ isomorphic
topologies, without an explicit BEq on an inductive Shape type. -/
def computeShape
(nodeMap : Std.HashMap ContentHash ProofNode)
(cid : ContentHash)
(fuel : Nat := 64) : String :=
match fuel with
| 0 => "·"
| k + 1 =>
match nodeMap.get? cid with
| none => "·"
| some n =>
if n.depIds.isEmpty then "·"
else "(" ++ String.join (n.depIds.toList.map fun d => computeShape nodeMap d k) ++ ")"
-- ════════════════════════════════════════════════════════════════════════════
-- § 5 Comonad closure law
-- ════════════════════════════════════════════════════════════════════════════
/-- Collect all IR nodes reachable from `root` by following `depIds` (DFS).
Nodes absent from `nodeMap` (external constants, universe levels) are
silently skipped — they are not part of the IR and do not participate
in the closure check.
Replaces the old `subtreeOf` which operated on `NodeId`/`NodeSet`;
here identities are ContentHashes and the visited set is a HashSet. -/
private def reachableFrom
(nodeMap : Std.HashMap ContentHash ProofNode)
(root : ContentHash)
(fuel : Nat := 2000) : Std.HashSet ContentHash :=
let rec go (cid : ContentHash) (acc : Std.HashSet ContentHash) : Nat → Std.HashSet ContentHash
| 0 => acc
| k + 1 =>
if acc.contains cid then acc
else
match nodeMap.get? cid with
| none => acc
| some n => n.depIds.foldl (fun a d => go d a k) (acc.insert cid)
go root {} fuel
/-- Comonad extract validity: the subgraph `S` can stand alone as a
complete proof and may be lifted to a named Lean definition iff
∀ n ∈ S, ∀ d ∈ deps(n), d ∈ S d ∉ IR.
Deps absent from `nodeMap` are external constants (Nat.add, etc.)
and are not violations — they are always available globally.
This is the `extract` condition from the comonadic theory:
`extract : W A → A` requires the focussed value to be self-contained. -/
private def isClosedUnderDeps
(nodeMap : Std.HashMap ContentHash ProofNode)
(subgraph : Std.HashSet ContentHash) : Bool :=
subgraph.toList.all fun cid =>
match nodeMap.get? cid with
| none => true
| some n =>
n.depIds.all fun d =>
subgraph.contains d || !nodeMap.contains d
-- ════════════════════════════════════════════════════════════════════════════
-- § 6 Build IR from Explode entries (three passes)
-- ════════════════════════════════════════════════════════════════════════════
/-- Convert raw Explode `Entries` to a content-addressed `Array ProofNode`.
Actual `Mathlib.Explode.Entry` fields consumed:
entry.thm : MessageData — pretty-print of the sub-term
entry.line : Option Nat — 1-based line number (session-local)
entry.deps : Array (Option Nat) — dep line numbers (session-local)
entry.status : Status — .sintro | .intro | .cintro | .lam | .reg
Three-pass design — necessary because shapeKey computation requires the
full nodeMap, which only exists after all nodes have been inserted:
Pass 1 — lineNo → ContentHash
Resolve all node identities before building any edges.
This guarantees that depIds in pass 2 refer to hashes
that will definitely appear as contentIds in the IR.
Pass 2 — build ProofNode with placeholder shapeKey "·",
deduplicate by contentId.
The same sub-term may be reachable from multiple parents
in the Explode output; we store it once (first occurrence wins).
Pass 3 — replace placeholder shapeKeys with recursively computed
topology strings now that the complete nodeMap is available. -/
def buildIR (entries : Entries) : MetaM (Array ProofNode) := do
let entryList := entries.l -- REVERSED: root at index 0, leaves at end
-- ── Pass 1: lineNo → ContentHash ─────────────────────────────────────────
let mut lineToHash : Std.HashMap Nat ContentHash := {}
for entry in entryList do
let thmStr := (← entry.thm.format).pretty -- MetaM pretty-print; toString would fail
lineToHash := lineToHash.insert
(entry.line.getD 0)
(hashPPExpr thmStr)
-- ── Pass 2: build nodes, deduplicate by contentId ────────────────────────
let mut seen : Std.HashSet ContentHash := {}
let mut nodes : Array ProofNode := #[]
for entry in entryList do
let thmStr := (← entry.thm.format).pretty
let cid := hashPPExpr thmStr
if seen.contains cid then continue -- same sub-term reached from another parent
let depIds : Array ContentHash :=
(entry.deps.filterMap (fun optLine =>
optLine.bind (fun ln => lineToHash.get? ln))).toArray -- filterMap on List → List; .toArray needed
let status : String := match entry.status with
| .sintro | .intro | .cintro => "hyp"
| .lam => "lam"
| .reg => "app"
nodes := nodes.push
{ contentId := cid
shapeKey := "·" -- placeholder; overwritten in pass 3
label := thmStr
status := status
depIds := depIds }
seen := seen.insert cid
-- ── Pass 3: recompute shape keys with full recursive structure ────────────
let nodeMap := irToAdjMap nodes
return nodes.map fun n =>
{ n with shapeKey := computeShape nodeMap n.contentId }
-- ════════════════════════════════════════════════════════════════════════════
-- § 7 Comonadic cluster detection
-- ════════════════════════════════════════════════════════════════════════════
/-- Find all comonadic subgraph patterns in the IR.
Algorithm:
1. Group nodes by shapeKey (O(N) HashMap insert per node).
Any shape recurring ≥ 2 times is a candidate pattern.
2. For each candidate group, compute:
· subgraphSize — reachable IR-node count from the first instance.
All instances share identical topology ⟹ same count.
· extractable — shapeKey ≠ "·" (non-trivial pattern)
AND closedCount ≥ 2 (comonad law on ≥ 2 instances).
`closedCount` = number of instances whose reachable subgraph
satisfies `isClosedUnderDeps` — the `extract` validity condition.
3. Build ClusterInstances with slot arrays:
pos 0 = root slot, pos 1..n = direct dep slots.
Labels and statuses are resolved from nodeMap in O(1).
The `duplicate` condition from the comonadic theory is satisfied by
construction: a cluster with k instances witnesses k "copies" of the
same structural sub-object within the proof — `duplicate : W A → W (W A)`
applied at the pattern level. -/
def findClusters
(nodes : Array ProofNode)
(nodeMap : Std.HashMap ContentHash ProofNode) : Array SimilarityCluster := Id.run do
-- Step 1: group by topology fingerprint
let mut groups : Std.HashMap String (Array ProofNode) := {}
for n in nodes do
groups := groups.insert n.shapeKey (groups.getD n.shapeKey #[] |>.push n)
let mut clusters : Array SimilarityCluster := #[]
let mut nextId : Nat := 0
for (shapeKey, grpNodes) in groups.toList do
if grpNodes.size < 2 then continue
-- Step 2: subgraph size and closure check
let subgraphSize :=
(reachableFrom nodeMap grpNodes[0]!.contentId).toList.length
let closedCount : Nat :=
grpNodes.filter (fun n =>
isClosedUnderDeps nodeMap (reachableFrom nodeMap n.contentId)) |>.size
-- Step 3: build instances with position-indexed slot arrays
let instances : Array ClusterInstance := grpNodes.map fun n =>
let rootSlot : Slot :=
{ pos := 0, contentId := n.contentId, label := n.label, status := n.status }
let depSlots : Array Slot :=
n.depIds.mapIdx fun i depCid =>
{ pos := i + 1
contentId := depCid
label := (nodeMap.get? depCid).map (·.label) |>.getD ""
status := (nodeMap.get? depCid).map (·.status) |>.getD "dep" }
{ root := n.contentId, slots := #[rootSlot] ++ depSlots }
clusters := clusters.push
{ id := nextId
shapeKey := shapeKey
size := subgraphSize
extractable := shapeKey != "·" && closedCount >= 2
instances := instances }
nextId := nextId + 1
return clusters
-- ════════════════════════════════════════════════════════════════════════════
-- § 8 Metric computation (pure)
-- ════════════════════════════════════════════════════════════════════════════
/-- Nodes with no in-edges — the entry points of the proof graph.
Used as DFS roots for maxDepth computation. -/
private def findRoots (nodes : Array ProofNode) : Array ProofNode :=
let allDeps : Std.HashSet ContentHash :=
nodes.foldl (fun s n => n.depIds.foldl (·.insert) s) {}
nodes.filter fun n => !allDeps.contains n.contentId
/-- DFS max-depth from a single root, fuel-bounded to 2000.
Returns 0 for leaves; 1 + max(children) otherwise. -/
private def maxDepthFrom
(nodeMap : Std.HashMap ContentHash ProofNode)
(root : ContentHash)
(fuel : Nat := 2000) : Nat :=
let rec go : ContentHash → Nat → Nat
| _, 0 => 0
| cid, k + 1 =>
match nodeMap.get? cid with
| none => 0
| some n =>
if n.depIds.isEmpty then 0
else 1 + n.depIds.foldl (fun acc d => max acc (go d k)) 0
go root fuel
/-- Compute the metric vector from the prebuilt IR and clusters.
selfSimilarity is capped at 1.0 because instances may share sub-nodes:
summing (instances.size × size) can exceed nodeCount when the same
sub-term appears as a dep of multiple cluster roots. -/
def computeMetrics
(nodes : Array ProofNode)
(clusters : Array SimilarityCluster) : MetricVector :=
let nodeCount := nodes.size
let leafRatio :=
if nodeCount = 0 then 0.0
else (nodes.filter (·.depIds.isEmpty)).size.toFloat / nodeCount.toFloat
let nodeMap := irToAdjMap nodes
let maxD :=
findRoots nodes |>.foldl (fun acc n => max acc (maxDepthFrom nodeMap n.contentId)) 0
let extractableCount := (clusters.filter (·.extractable)).size
let covered :=
clusters.foldl (fun acc c =>
if c.extractable then acc + c.instances.size * c.size else acc) 0
let selfSim :=
if nodeCount = 0 then 0.0
else min 1.0 (covered.toFloat / nodeCount.toFloat)
{ nodeCount := nodeCount
maxDepth := maxD
leafRatio := leafRatio
clusterCount := clusters.size
extractable := extractableCount
selfSimilarity := selfSim }
-- ════════════════════════════════════════════════════════════════════════════
-- § 9 graphId
-- ════════════════════════════════════════════════════════════════════════════
/-- Stable graph fingerprint: FNV-1a-64 of (theoremName ++ IR fingerprint).
IR fingerprint = concat of (contentId ++ shapeKey ++ status) for every
node in IR traversal order. Two ProofGraphs with the same graphId are
structurally identical modulo label renaming. -/
def computeGraphId (theoremName : String) (ir : Array ProofNode) : ContentHash :=
hashPPExpr <|
ir.foldl (fun acc n => acc ++ n.contentId ++ n.shapeKey ++ n.status) theoremName
-- ════════════════════════════════════════════════════════════════════════════
-- § 10 JSON serialization (comonad/1 wire format)
-- ════════════════════════════════════════════════════════════════════════════
/-- Float → JSON number via fixed-point scaling to 6 decimal places.
`Lean.JsonNumber` has `mantissa : Int` and `exponent : Nat`, where
the represented value is `mantissa × 10^(-exponent)`.
So 0.48 → { mantissa := 480000, exponent := 6 }.
All metrics are non-negative so the UInt64 route is safe. -/
private def floatToJson (f : Float) : Lean.Json :=
if f == 0.0 then .num ⟨0, 0⟩
else .num { mantissa := Int.ofNat (f * 1000000.0).round.toUInt64.toNat
exponent := 6 }
private def slotToJson (s : Slot) : Lean.Json :=
.mkObj [ ("pos", .num ⟨s.pos, 0⟩)
, ("contentId", .str s.contentId)
, ("label", .str s.label)
, ("status", .str s.status) ]
private def instanceToJson (ci : ClusterInstance) : Lean.Json :=
.mkObj [ ("root", .str ci.root)
, ("slots", .arr (ci.slots.map slotToJson)) ]
private def clusterToJson (c : SimilarityCluster) : Lean.Json :=
.mkObj [ ("id", .num ⟨c.id, 0⟩)
, ("shapeKey", .str c.shapeKey)
, ("size", .num ⟨c.size, 0⟩)
, ("extractable", .bool c.extractable)
, ("instances", .arr (c.instances.map instanceToJson)) ]
private def nodeToJson (n : ProofNode) : Lean.Json :=
.mkObj [ ("contentId", .str n.contentId)
, ("shapeKey", .str n.shapeKey)
, ("label", .str n.label)
, ("status", .str n.status)
, ("depIds", .arr (n.depIds.map .str)) ]
private def metricsToJson (m : MetricVector) : Lean.Json :=
.mkObj [ ("nodeCount", .num ⟨m.nodeCount, 0⟩)
, ("maxDepth", .num ⟨m.maxDepth, 0⟩)
, ("leafRatio", floatToJson m.leafRatio)
, ("clusterCount", .num ⟨m.clusterCount, 0⟩)
, ("extractable", .num ⟨m.extractable, 0⟩)
, ("selfSim", floatToJson m.selfSimilarity) ]
/-- Serialize to the comonad/1 wire format (compressed JSON).
`provenanceId` and `session` are always null here;
the external language attaches them after the fact.
Pantograph capture pattern:
request : {"file": "#findComonadsJSON Nat.add_comm",
"readHeader": false, "inheritEnv": true, "newConstants": false}
response : units[0].messages[0].data (severity = "information") -/
def proofGraphToJson (g : ProofGraph) : Lean.Json :=
.mkObj [ ("schema", .str g.schema)
, ("hashAlgo", .str g.hashAlgo)
, ("theorem", .str g.theoremName)
, ("graphId", .str g.graphId)
, ("provenanceId", .null)
, ("session", .null)
, ("metrics", metricsToJson g.metrics)
, ("ir", .arr (g.ir.map nodeToJson))
, ("clusters", .arr (g.clusters.map clusterToJson)) ]
-- ════════════════════════════════════════════════════════════════════════════
-- § 11 Core MetaM pipeline
-- ════════════════════════════════════════════════════════════════════════════
/-- Full pipeline: constant name → ProofGraph.
Steps:
1. Resolve name → ConstantInfo (error if absent or opaque)
2. Explode proof value via `Mathlib.Explode.explode value true`
3. Build content-addressed IR (§ 6 — single MetaM pass)
4. Detect comonadic clusters (§ 7 — pure)
5. Compute metric vector (§ 8 — pure)
6. Compute graphId (§ 9 — pure)
7. Return ProofGraph envelope (§ 2)
All MetaM work is confined to steps 13. Steps 47 are pure functions
that can be tested, cached, and called from non-MetaM contexts. -/
def findComonadsCore (name : Name) : MetaM ProofGraph := do
let info ← getConstInfo name
let value ← match info.value? with
| some e => pure e
| none => throwError
"ComonadFinder: '{name}' has no proof value (axiom or opaque?)"
let entries ← Mathlib.Explode.explode value true
let ir ← buildIR entries
let nodeMap := irToAdjMap ir
let clusters := findClusters ir nodeMap
let metrics := computeMetrics ir clusters
let graphId := computeGraphId name.toString ir
return { schema := "comonad/1"
hashAlgo := "ppExpr/1"
theoremName := name.toString
graphId := graphId
metrics := metrics
ir := ir
clusters := clusters }
-- ════════════════════════════════════════════════════════════════════════════
-- § 12 Elaborator commands
-- ════════════════════════════════════════════════════════════════════════════
/-- `#findComonads declName` — human-readable summary in InfoView.
Example output:
nodes=25, clusters=2, extractable=2, selfSim=0.480000
graphId=a3f2c1b4e5d60789
cluster [0]: size=3, x2 occurrences, topology ((·))
cluster [1]: size=2, x3 occurrences, topology (·) -/
elab "#findComonads " name:ident : command => do
let g ← liftTermElabM (findComonadsCore name.getId)
let extr := (g.clusters.filter (·.extractable)).size
let header : MessageData :=
m!"nodes={g.metrics.nodeCount}, clusters={g.metrics.clusterCount}, \
extractable={extr}, selfSim={g.metrics.selfSimilarity}\n\
graphId={g.graphId}"
let rows : List MessageData :=
(g.clusters.mapIdx fun i c =>
m!" cluster [{i}]: size={c.size}, \
x{c.instances.size} occurrences, topology {c.shapeKey}").toList
logInfo (MessageData.joinSep (header :: rows) "\n")
/-- `#findComonadsJSON declName` — machine-readable comonad/1 JSON for Pantograph.
Emits one compressed JSON line as an `information`-severity message.
The external language (Mathematica / Python) reads:
units[0].messages[0].data -/
elab "#findComonadsJSON " name:ident : command => do
let g ← liftTermElabM (findComonadsCore name.getId)
logInfo m!"{(proofGraphToJson g).compress}"
-- ════════════════════════════════════════════════════════════════════════════
-- § 13 Tests (skipped on port — Lean 4 v4.30 changed `info.value?`
-- access for theorems; the original mm-lean smoke tests
-- ran against an older toolchain. Restore by adding a
-- `Test/ComonadFinder.lean` once the access pattern is
-- updated for the current toolchain.)
-- ════════════════════════════════════════════════════════════════════════════
-- ════════════════════════════════════════════════════════════════════════════
-- § 14 Navigation helpers (for comonad.* API)
-- ════════════════════════════════════════════════════════════════════════════
/-- Get a single node by its contentId from the IR.
Returns `none` if the node doesn't exist in the graph. -/
def getNodeInfo (graph : ProofGraph) (nodeId : ContentHash) : Option ProofNode :=
graph.ir.find? (·.contentId == nodeId)
/-- Get all nodes in the subgraph rooted at `rootId`.
Uses DFS to collect reachable nodes. Returns empty array if root not found. -/
def getSubgraph (graph : ProofGraph) (rootId : ContentHash) : Array ProofNode :=
let nodeMap := irToAdjMap graph.ir
let reachable := reachableFrom nodeMap rootId
graph.ir.filter (fun n => reachable.contains n.contentId)
/-- Find all cluster IDs that contain the given node.
A node is "in" a cluster if it appears as a root of any instance,
or as a slot in any instance. -/
def findClustersContaining (graph : ProofGraph) (nodeId : ContentHash) : Array Nat :=
graph.clusters.filter (fun c =>
c.instances.any (fun inst =>
inst.root == nodeId || inst.slots.any (·.contentId == nodeId)
)
) |>.map (·.id)
/-- Get edges for a subgraph (for subgraph/1 schema). -/
def getSubgraphEdges (graph : ProofGraph) (rootId : ContentHash) : Array (ContentHash × ContentHash) := Id.run do
let nodeMap := irToAdjMap graph.ir
let reachable := reachableFrom nodeMap rootId
let mut edges : Array (ContentHash × ContentHash) := #[]
for cid in reachable.toList do
if let some node := nodeMap.get? cid then
for depId in node.depIds do
if reachable.contains depId then
edges := edges.push (cid, depId)
return edges