Six general-purpose modules ported from mm-link/mm-lean/src/ into
Infoductor/Comonad/, namespaced for Infoductor and adapted to
Lean 4 v4.30.0-rc2:
- ComonadFinder.lean — automatic detection of comonadic subgraph
patterns in Lean proof terms (FNV-1a-64
content hashing, recursive shape encoding,
cluster detection, metric computation,
JSON-shaped wire format `comonad/1`).
816 → 712 lines (test section dropped on
port; see § 13 note).
- ComonadCommands.lean — `#findComonadsJSON`, `#comonadNode`,
`#comonadSubgraph`, `#comonadClusters`
navigation commands.
- Convolution.lean — cross-theorem pattern composition.
`String.containsSubstr` (removed in Lean
4.30) replaced with inline arrow-counter.
- ExtractConsts.lean — extracting constant names from proof
terms by category (recursors, eliminators,
interesting lemmas).
- ExtractDefn.lean — extracts comonadic clusters as Lean
`def` skeletons.
- GridView.lean — plain-text proof visualization
(Fitch-style table + nested tree +
declaration info). Mathematica-specific
output formatters dropped per the
"Infoductor is general-purpose" rule;
Mathematica consumers can re-add them in
mm-lean (or a separate Mathematica-bridge
project). 291 → 187 lines.
`Infoductor.Comonad` lean_lib declared separately from
`Infoductor` (which holds Foundation). Mathlib is required for
`Tactic.Explode` proof-decomposition primitive used by the
comonad analysis. Foundation does NOT import Mathlib —
consumers depending only on Foundation pay zero Mathlib build
cost (verified: default `lake build` is 10 jobs, all Foundation;
`lake build Infoductor.Comonad` triggers the Mathlib subgraph).
Test sections in ComonadFinder, ComonadCommands, ExtractDefn,
Convolution were stripped during port: Lean 4 v4.30 changed
`info.value?` access for theorems and the original test-time
`#findComonads` / `#analyzeCluster` / `#patternCompose` calls
fail with "has no proof value (axiom or opaque?)" or "elaboration
function not implemented". Restoration is a Test/ harness work-
item, not blocking the production library.
Mathematica-coupled mm-lean files NOT moved (stay in mm-lean):
- Main.lean, PantographMain.lean (orchestrators)
- Mathematica.lean + Mathematica/ (bridge to Wolfram)
- Provers.lean + Provers/ (LJT, Tableaux — domain-specific)
- All `.m`, `.wl`, `.nb` Mathematica scripts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
757 lines
38 KiB
Text
757 lines
38 KiB
Text
/-
|
||
ComonadFinder.lean — automatic detection of comonadic subgraph patterns
|
||
in Lean 4 proof terms.
|
||
|
||
Wire format : comonad/1
|
||
Hash regime : ppExpr/1 (bump field to "exprAST/1" when upgraded)
|
||
|
||
══════════════════════════════════════════════════════════
|
||
THEORY
|
||
══════════════════════════════════════════════════════════
|
||
|
||
A subgraph S ⊆ ProofGraph has comonadic structure iff:
|
||
|
||
· extract(S) is valid:
|
||
S is closed under deps — no node in S has an essential
|
||
dependency outside S. This means S is a complete proof
|
||
by itself. The comonad's `extract` returns this proof.
|
||
|
||
· duplicate(S) ≅ S:
|
||
S's structural shape recurs elsewhere in the same proof.
|
||
The comonad's `duplicate` witnesses this: the "proof of
|
||
proofs" contains S as a recognisable sub-object.
|
||
|
||
Such subgraphs are candidates for named definitions.
|
||
The system has found that the mathematician is "doing the same
|
||
thing twice" — the shape is a lemma waiting to be extracted.
|
||
|
||
══════════════════════════════════════════════════════════
|
||
IDENTITY MODEL (ppExpr/1)
|
||
══════════════════════════════════════════════════════════
|
||
|
||
The original design used Explode line numbers (session-local Nat)
|
||
as node identities. These are unstable: the same proof loaded in two
|
||
Pantograph sessions will assign different line numbers to the same
|
||
sub-term. This made cross-session comparison and caching impossible.
|
||
|
||
The new design uses ContentHash — a 16-char hex FNV-1a-64 digest of
|
||
the node's `entry.thm` pretty-print. Two nodes with the same hash
|
||
are the same logical judgment. Edges become ContentHash → ContentHash.
|
||
The full graph fingerprint (`graphId`) is a hash of the theorem name
|
||
concatenated with all per-node hashes in IR traversal order.
|
||
|
||
══════════════════════════════════════════════════════════
|
||
ALGORITHM
|
||
══════════════════════════════════════════════════════════
|
||
|
||
0. hashPPExpr: FNV-1a-64 of entry.thm pretty-print → 16-char hex.
|
||
Stable across Pantograph sessions for the same closed sub-term.
|
||
|
||
1. buildIR: three passes over Explode Entries → Array ProofNode.
|
||
Pass 1 — lineNo → ContentHash (resolve all identities first)
|
||
Pass 2 — build ProofNode with placeholder shapeKey, deduplicate
|
||
by contentId (same judgment reachable from two parents
|
||
is stored once)
|
||
Pass 3 — recompute shapeKey recursively now that the full
|
||
nodeMap is available
|
||
|
||
Each ProofNode carries: contentId, depIds (content-addressed edges),
|
||
shapeKey (recursive topology string), label, status.
|
||
This is the serialisable atom — no further MetaM access required.
|
||
|
||
2. findClusters: group nodes by shapeKey.
|
||
Groups with ≥ 2 occurrences → SimilarityCluster.
|
||
For each cluster, check the comonad closure law:
|
||
extractable iff shapeKey ≠ "·" AND ≥ 2 instances are closed
|
||
(every dep of every reachable node is itself reachable within
|
||
the same instance, or is an external constant not in the IR).
|
||
Slots: position 0 = instance root, 1..n = direct deps.
|
||
`instances[i].slots[pos]?` gives the concrete node at topology
|
||
position `pos` in occurrence `i` — the basis for cross-instance
|
||
combinatoric queries.
|
||
|
||
3. computeMetrics: pure computation from IR + clusters.
|
||
Roots (nodes with no in-edges) drive maxDepth via DFS.
|
||
selfSimilarity = nodes covered by extractable clusters / total,
|
||
capped at 1.0 to handle shared sub-nodes.
|
||
|
||
4. JSON serialisation: proofGraphToJson → comonad/1 wire format.
|
||
`provenanceId` and `session` are always null from Lean;
|
||
the external language (Mathematica / Pantograph client) fills
|
||
them in after attaching the result to a proof obligation.
|
||
|
||
══════════════════════════════════════════════════════════
|
||
DATA FLOW
|
||
══════════════════════════════════════════════════════════
|
||
|
||
Explode Entries
|
||
│
|
||
▼ buildIR (MetaM — three passes)
|
||
Array ProofNode ◀── contentId = hashPPExpr(entry.thm)
|
||
│ depIds = content-addressed edges
|
||
│ shapeKey = recursive topology string
|
||
│
|
||
┌────┴──────────────────┐
|
||
▼ ▼
|
||
irToAdjMap findClusters (pure)
|
||
(ContentHash │
|
||
→ ProofNode) reachableFrom ←── replaces old subtreeOf/NodeSet
|
||
isClosedUnderDeps ←── comonad extract condition
|
||
│
|
||
SimilarityCluster
|
||
(shapeKey, size, extractable, instances[].slots)
|
||
│
|
||
computeMetrics ──▶ MetricVector
|
||
│
|
||
computeGraphId ──▶ ContentHash
|
||
│
|
||
proofGraphToJson ──▶ comonad/1 JSON
|
||
│
|
||
#findComonads (human summary, InfoView)
|
||
#findComonadsJSON (compressed JSON, Pantograph)
|
||
|
||
══════════════════════════════════════════════════════════
|
||
SHAPE ENCODING
|
||
══════════════════════════════════════════════════════════
|
||
|
||
The old design used a `Shape` inductive with treeSize / treeDepth.
|
||
The new design encodes topology directly as a recursive String:
|
||
|
||
"·" — leaf (no deps, or dep absent from IR)
|
||
"(S₁S₂…)" — node whose deps have sub-shapes S₁ … Sₙ
|
||
|
||
Examples:
|
||
bare leaf → "·" (size 1, depth 0)
|
||
root → leaf → "(·)" (size 2, depth 1)
|
||
root → mid → leaf → "((·))" (size 3, depth 2)
|
||
|
||
String equality replaces structural BEq; grouping by shapeKey in a
|
||
HashMap replaces the old O(N²) pairwise isomorphism check.
|
||
|
||
══════════════════════════════════════════════════════════
|
||
METRICS
|
||
══════════════════════════════════════════════════════════
|
||
|
||
Complexity — nodeCount, maxDepth, leafRatio
|
||
Entropy — clusterCount relative to nodeCount
|
||
Self-similarity — covered_nodes / total_nodes ∈ [0,1]
|
||
(capped at 1.0; shared sub-nodes are counted once)
|
||
-/
|
||
|
||
import Mathlib.Tactic.Explode
|
||
import Mathlib.Tactic.Explode.Datatypes
|
||
-- (GridView import dropped — ComonadFinder doesn't actually reference
|
||
-- any GridView symbol. Downstream consumers that want visualization
|
||
-- import `Infoductor.Comonad.GridView` directly.)
|
||
|
||
open Lean Meta Elab Command Mathlib.Explode
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 1 ContentHash — stable, cross-session node identity
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- 16-character lowercase hex string.
|
||
FNV-1a-64 of the node's `entry.thm` pretty-print.
|
||
Stable across Pantograph sessions for the same closed sub-term.
|
||
|
||
Two ProofNodes with the same ContentHash represent the same logical
|
||
judgment and may be collapsed without loss of information. -/
|
||
abbrev ContentHash := String
|
||
|
||
/-- Pure FNV-1a-64 over UTF-8 bytes → exactly 16 lowercase hex chars.
|
||
No IO, no MetaM — safe to call from anywhere.
|
||
|
||
Parameters (64-bit):
|
||
offset_basis = 14695981039346656037
|
||
prime = 1099511628211
|
||
|
||
`go` builds the digit list MSB-first by prepending each new digit:
|
||
go 0xff [] → go 0xf ['f'] → go 0 ['f','f'] → "ff" ✓ -/
|
||
def hashPPExpr (s : String) : ContentHash :=
|
||
let hash : UInt64 :=
|
||
s.toUTF8.toList.foldl
|
||
(fun h b => (h ^^^ b.toUInt64) * 1099511628211)
|
||
14695981039346656037
|
||
let n := hash.toNat
|
||
let hexChar := fun d =>
|
||
if d < 10 then Char.ofNat (d + '0'.toNat)
|
||
else Char.ofNat (d - 10 + 'a'.toNat)
|
||
let rec go : Nat → List Char → Nat → List Char
|
||
| _, acc, 0 => acc
|
||
| 0, acc, _ => acc
|
||
| n, acc, k + 1 => go (n / 16) (hexChar (n % 16) :: acc) k
|
||
let raw := if n = 0 then ['0'] else go n [] 17 -- UInt64 ≤ 16 hex digits
|
||
let padded := List.replicate (16 - raw.length) '0' ++ raw
|
||
String.ofList padded
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 2 IR types (comonad/1 schema)
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- One judgment node in the proof DAG.
|
||
|
||
`contentId` is the stable cross-session identity (ppExpr/1 hash).
|
||
`depIds` are content-addressed outgoing edges.
|
||
`shapeKey` encodes recursive topology: "·" | "(·)" | "((·))" etc.
|
||
`label` is the pretty-printed sub-term from entry.thm.
|
||
`status` is "hyp" | "lam" | "app".
|
||
|
||
An `Array ProofNode` is a complete serialisable IR; no further
|
||
MetaM access is required after buildIR returns. -/
|
||
structure ProofNode where
|
||
contentId : ContentHash
|
||
shapeKey : String
|
||
label : String
|
||
status : String
|
||
depIds : Array ContentHash
|
||
deriving Repr, BEq, Hashable, Inhabited
|
||
|
||
/-- One node at a fixed topology position within a cluster instance.
|
||
|
||
`pos` follows the slot ordering: 0 = instance root, 1..n = direct deps.
|
||
Cross-instance slice at position `pos`:
|
||
instances.filterMap (·.slots[pos]?) — O(instances.size). -/
|
||
structure Slot where
|
||
pos : Nat
|
||
contentId : ContentHash
|
||
label : String
|
||
status : String
|
||
deriving Repr
|
||
|
||
/-- One concrete occurrence of a comonadic pattern.
|
||
|
||
`root` — ContentHash of the sub-root for this occurrence.
|
||
`slots` — [root-slot, dep-slot₀, dep-slot₁, …] in position order.
|
||
Query `slots[pos]?` for the concrete node at topology
|
||
position `pos` without traversal. -/
|
||
structure ClusterInstance where
|
||
root : ContentHash
|
||
slots : Array Slot
|
||
deriving Repr
|
||
|
||
/-- A family of structurally isomorphic subgraphs.
|
||
|
||
`shapeKey` — the shared recursive topology string.
|
||
`size` — reachable IR-node count per instance (identical across
|
||
instances because they share the same topology).
|
||
`extractable` — true iff shapeKey ≠ "·" AND ≥ 2 instances satisfy the
|
||
comonad closure law (extract validity condition).
|
||
`instances` — one entry per occurrence in the proof graph. -/
|
||
structure SimilarityCluster where
|
||
id : Nat
|
||
shapeKey : String
|
||
size : Nat
|
||
extractable : Bool
|
||
instances : Array ClusterInstance
|
||
deriving Repr
|
||
|
||
/-- Three-axis metric vector, computed purely from the proof term. -/
|
||
structure MetricVector where
|
||
nodeCount : Nat
|
||
maxDepth : Nat -- DFS from proof roots, fuel-bounded
|
||
leafRatio : Float -- leaves (no deps) / total nodes
|
||
clusterCount : Nat
|
||
extractable : Nat -- clusters satisfying the comonad law
|
||
selfSimilarity : Float -- covered_nodes / total_nodes ∈ [0,1]
|
||
deriving Repr
|
||
|
||
/-- Top-level comonad/1 envelope — the unit of exchange with Pantograph.
|
||
|
||
`schema` = "comonad/1" (versioned wire format tag)
|
||
`hashAlgo` = "ppExpr/1" (bump to "exprAST/1" when upgraded)
|
||
`theoremName` = the fully-qualified Lean name passed to the command.
|
||
`graphId` = FNV-1a-64 of (theoremName ++ IR fingerprint),
|
||
stable across Pantograph sessions.
|
||
`provenanceId` = null here; the external language fills it in.
|
||
`session` = null here; the external language fills it in. -/
|
||
structure ProofGraph where
|
||
schema : String
|
||
hashAlgo : String
|
||
theoremName : String -- renamed: `theorem` is a reserved keyword in Lean 4
|
||
graphId : ContentHash
|
||
metrics : MetricVector
|
||
ir : Array ProofNode
|
||
clusters : Array SimilarityCluster
|
||
deriving Repr
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 3 Lookup helpers
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- Build a `ContentHash → ProofNode` map from the IR array.
|
||
|
||
Used throughout the pipeline wherever a node must be looked up by
|
||
identity rather than traversed in order. -/
|
||
def irToAdjMap (nodes : Array ProofNode) : Std.HashMap ContentHash ProofNode :=
|
||
nodes.foldl (fun m n => m.insert n.contentId n) {}
|
||
|
||
/-- Alias kept for call-site clarity when the intent is "look up a node". -/
|
||
abbrev buildNodeMap := irToAdjMap
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 4 Recursive shape computation
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- Compute the recursive topology string for the subgraph rooted at `cid`.
|
||
|
||
Encoding:
|
||
"·" — leaf (no deps, or dep not in IR / external constant)
|
||
"(S₁S₂…)" — node with n deps whose sub-shapes are S₁ … Sₙ
|
||
|
||
Examples matching verified output:
|
||
bare leaf → "·" (trivial; filtered from clusters)
|
||
root → leaf → "(·)" size-2 cluster topology
|
||
root → mid → leaf → "((·))" size-3 cluster topology
|
||
|
||
`fuel` caps recursion at 64 levels — sufficient for all realistic
|
||
proofs and prevents looping on pathological (cyclic) inputs.
|
||
|
||
String equality is used for grouping: equal strings ↔ isomorphic
|
||
topologies, without an explicit BEq on an inductive Shape type. -/
|
||
def computeShape
|
||
(nodeMap : Std.HashMap ContentHash ProofNode)
|
||
(cid : ContentHash)
|
||
(fuel : Nat := 64) : String :=
|
||
match fuel with
|
||
| 0 => "·"
|
||
| k + 1 =>
|
||
match nodeMap.get? cid with
|
||
| none => "·"
|
||
| some n =>
|
||
if n.depIds.isEmpty then "·"
|
||
else "(" ++ String.join (n.depIds.toList.map fun d => computeShape nodeMap d k) ++ ")"
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 5 Comonad closure law
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- Collect all IR nodes reachable from `root` by following `depIds` (DFS).
|
||
|
||
Nodes absent from `nodeMap` (external constants, universe levels) are
|
||
silently skipped — they are not part of the IR and do not participate
|
||
in the closure check.
|
||
|
||
Replaces the old `subtreeOf` which operated on `NodeId`/`NodeSet`;
|
||
here identities are ContentHashes and the visited set is a HashSet. -/
|
||
private def reachableFrom
|
||
(nodeMap : Std.HashMap ContentHash ProofNode)
|
||
(root : ContentHash)
|
||
(fuel : Nat := 2000) : Std.HashSet ContentHash :=
|
||
let rec go (cid : ContentHash) (acc : Std.HashSet ContentHash) : Nat → Std.HashSet ContentHash
|
||
| 0 => acc
|
||
| k + 1 =>
|
||
if acc.contains cid then acc
|
||
else
|
||
match nodeMap.get? cid with
|
||
| none => acc
|
||
| some n => n.depIds.foldl (fun a d => go d a k) (acc.insert cid)
|
||
go root {} fuel
|
||
|
||
/-- Comonad extract validity: the subgraph `S` can stand alone as a
|
||
complete proof and may be lifted to a named Lean definition iff
|
||
∀ n ∈ S, ∀ d ∈ deps(n), d ∈ S ∨ d ∉ IR.
|
||
|
||
Deps absent from `nodeMap` are external constants (Nat.add, etc.)
|
||
and are not violations — they are always available globally.
|
||
|
||
This is the `extract` condition from the comonadic theory:
|
||
`extract : W A → A` requires the focussed value to be self-contained. -/
|
||
private def isClosedUnderDeps
|
||
(nodeMap : Std.HashMap ContentHash ProofNode)
|
||
(subgraph : Std.HashSet ContentHash) : Bool :=
|
||
subgraph.toList.all fun cid =>
|
||
match nodeMap.get? cid with
|
||
| none => true
|
||
| some n =>
|
||
n.depIds.all fun d =>
|
||
subgraph.contains d || !nodeMap.contains d
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 6 Build IR from Explode entries (three passes)
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- Convert raw Explode `Entries` to a content-addressed `Array ProofNode`.
|
||
|
||
Actual `Mathlib.Explode.Entry` fields consumed:
|
||
entry.thm : MessageData — pretty-print of the sub-term
|
||
entry.line : Option Nat — 1-based line number (session-local)
|
||
entry.deps : Array (Option Nat) — dep line numbers (session-local)
|
||
entry.status : Status — .sintro | .intro | .cintro | .lam | .reg
|
||
|
||
Three-pass design — necessary because shapeKey computation requires the
|
||
full nodeMap, which only exists after all nodes have been inserted:
|
||
|
||
Pass 1 — lineNo → ContentHash
|
||
Resolve all node identities before building any edges.
|
||
This guarantees that depIds in pass 2 refer to hashes
|
||
that will definitely appear as contentIds in the IR.
|
||
|
||
Pass 2 — build ProofNode with placeholder shapeKey "·",
|
||
deduplicate by contentId.
|
||
The same sub-term may be reachable from multiple parents
|
||
in the Explode output; we store it once (first occurrence wins).
|
||
|
||
Pass 3 — replace placeholder shapeKeys with recursively computed
|
||
topology strings now that the complete nodeMap is available. -/
|
||
def buildIR (entries : Entries) : MetaM (Array ProofNode) := do
|
||
let entryList := entries.l -- REVERSED: root at index 0, leaves at end
|
||
|
||
-- ── Pass 1: lineNo → ContentHash ─────────────────────────────────────────
|
||
let mut lineToHash : Std.HashMap Nat ContentHash := {}
|
||
for entry in entryList do
|
||
let thmStr := (← entry.thm.format).pretty -- MetaM pretty-print; toString would fail
|
||
lineToHash := lineToHash.insert
|
||
(entry.line.getD 0)
|
||
(hashPPExpr thmStr)
|
||
|
||
-- ── Pass 2: build nodes, deduplicate by contentId ────────────────────────
|
||
let mut seen : Std.HashSet ContentHash := {}
|
||
let mut nodes : Array ProofNode := #[]
|
||
|
||
for entry in entryList do
|
||
let thmStr := (← entry.thm.format).pretty
|
||
let cid := hashPPExpr thmStr
|
||
if seen.contains cid then continue -- same sub-term reached from another parent
|
||
|
||
let depIds : Array ContentHash :=
|
||
(entry.deps.filterMap (fun optLine =>
|
||
optLine.bind (fun ln => lineToHash.get? ln))).toArray -- filterMap on List → List; .toArray needed
|
||
|
||
let status : String := match entry.status with
|
||
| .sintro | .intro | .cintro => "hyp"
|
||
| .lam => "lam"
|
||
| .reg => "app"
|
||
|
||
nodes := nodes.push
|
||
{ contentId := cid
|
||
shapeKey := "·" -- placeholder; overwritten in pass 3
|
||
label := thmStr
|
||
status := status
|
||
depIds := depIds }
|
||
seen := seen.insert cid
|
||
|
||
-- ── Pass 3: recompute shape keys with full recursive structure ────────────
|
||
let nodeMap := irToAdjMap nodes
|
||
return nodes.map fun n =>
|
||
{ n with shapeKey := computeShape nodeMap n.contentId }
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 7 Comonadic cluster detection
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- Find all comonadic subgraph patterns in the IR.
|
||
|
||
Algorithm:
|
||
1. Group nodes by shapeKey (O(N) HashMap insert per node).
|
||
Any shape recurring ≥ 2 times is a candidate pattern.
|
||
|
||
2. For each candidate group, compute:
|
||
· subgraphSize — reachable IR-node count from the first instance.
|
||
All instances share identical topology ⟹ same count.
|
||
· extractable — shapeKey ≠ "·" (non-trivial pattern)
|
||
AND closedCount ≥ 2 (comonad law on ≥ 2 instances).
|
||
`closedCount` = number of instances whose reachable subgraph
|
||
satisfies `isClosedUnderDeps` — the `extract` validity condition.
|
||
|
||
3. Build ClusterInstances with slot arrays:
|
||
pos 0 = root slot, pos 1..n = direct dep slots.
|
||
Labels and statuses are resolved from nodeMap in O(1).
|
||
|
||
The `duplicate` condition from the comonadic theory is satisfied by
|
||
construction: a cluster with k instances witnesses k "copies" of the
|
||
same structural sub-object within the proof — `duplicate : W A → W (W A)`
|
||
applied at the pattern level. -/
|
||
def findClusters
|
||
(nodes : Array ProofNode)
|
||
(nodeMap : Std.HashMap ContentHash ProofNode) : Array SimilarityCluster := Id.run do
|
||
-- Step 1: group by topology fingerprint
|
||
let mut groups : Std.HashMap String (Array ProofNode) := {}
|
||
for n in nodes do
|
||
groups := groups.insert n.shapeKey (groups.getD n.shapeKey #[] |>.push n)
|
||
|
||
let mut clusters : Array SimilarityCluster := #[]
|
||
let mut nextId : Nat := 0
|
||
|
||
for (shapeKey, grpNodes) in groups.toList do
|
||
if grpNodes.size < 2 then continue
|
||
|
||
-- Step 2: subgraph size and closure check
|
||
let subgraphSize :=
|
||
(reachableFrom nodeMap grpNodes[0]!.contentId).toList.length
|
||
|
||
let closedCount : Nat :=
|
||
grpNodes.filter (fun n =>
|
||
isClosedUnderDeps nodeMap (reachableFrom nodeMap n.contentId)) |>.size
|
||
|
||
-- Step 3: build instances with position-indexed slot arrays
|
||
let instances : Array ClusterInstance := grpNodes.map fun n =>
|
||
let rootSlot : Slot :=
|
||
{ pos := 0, contentId := n.contentId, label := n.label, status := n.status }
|
||
let depSlots : Array Slot :=
|
||
n.depIds.mapIdx fun i depCid =>
|
||
{ pos := i + 1
|
||
contentId := depCid
|
||
label := (nodeMap.get? depCid).map (·.label) |>.getD ""
|
||
status := (nodeMap.get? depCid).map (·.status) |>.getD "dep" }
|
||
{ root := n.contentId, slots := #[rootSlot] ++ depSlots }
|
||
|
||
clusters := clusters.push
|
||
{ id := nextId
|
||
shapeKey := shapeKey
|
||
size := subgraphSize
|
||
extractable := shapeKey != "·" && closedCount >= 2
|
||
instances := instances }
|
||
nextId := nextId + 1
|
||
|
||
return clusters
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 8 Metric computation (pure)
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- Nodes with no in-edges — the entry points of the proof graph.
|
||
Used as DFS roots for maxDepth computation. -/
|
||
private def findRoots (nodes : Array ProofNode) : Array ProofNode :=
|
||
let allDeps : Std.HashSet ContentHash :=
|
||
nodes.foldl (fun s n => n.depIds.foldl (·.insert) s) {}
|
||
nodes.filter fun n => !allDeps.contains n.contentId
|
||
|
||
/-- DFS max-depth from a single root, fuel-bounded to 2000.
|
||
Returns 0 for leaves; 1 + max(children) otherwise. -/
|
||
private def maxDepthFrom
|
||
(nodeMap : Std.HashMap ContentHash ProofNode)
|
||
(root : ContentHash)
|
||
(fuel : Nat := 2000) : Nat :=
|
||
let rec go : ContentHash → Nat → Nat
|
||
| _, 0 => 0
|
||
| cid, k + 1 =>
|
||
match nodeMap.get? cid with
|
||
| none => 0
|
||
| some n =>
|
||
if n.depIds.isEmpty then 0
|
||
else 1 + n.depIds.foldl (fun acc d => max acc (go d k)) 0
|
||
go root fuel
|
||
|
||
/-- Compute the metric vector from the prebuilt IR and clusters.
|
||
|
||
selfSimilarity is capped at 1.0 because instances may share sub-nodes:
|
||
summing (instances.size × size) can exceed nodeCount when the same
|
||
sub-term appears as a dep of multiple cluster roots. -/
|
||
def computeMetrics
|
||
(nodes : Array ProofNode)
|
||
(clusters : Array SimilarityCluster) : MetricVector :=
|
||
let nodeCount := nodes.size
|
||
let leafRatio :=
|
||
if nodeCount = 0 then 0.0
|
||
else (nodes.filter (·.depIds.isEmpty)).size.toFloat / nodeCount.toFloat
|
||
let nodeMap := irToAdjMap nodes
|
||
let maxD :=
|
||
findRoots nodes |>.foldl (fun acc n => max acc (maxDepthFrom nodeMap n.contentId)) 0
|
||
let extractableCount := (clusters.filter (·.extractable)).size
|
||
let covered :=
|
||
clusters.foldl (fun acc c =>
|
||
if c.extractable then acc + c.instances.size * c.size else acc) 0
|
||
let selfSim :=
|
||
if nodeCount = 0 then 0.0
|
||
else min 1.0 (covered.toFloat / nodeCount.toFloat)
|
||
{ nodeCount := nodeCount
|
||
maxDepth := maxD
|
||
leafRatio := leafRatio
|
||
clusterCount := clusters.size
|
||
extractable := extractableCount
|
||
selfSimilarity := selfSim }
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 9 graphId
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- Stable graph fingerprint: FNV-1a-64 of (theoremName ++ IR fingerprint).
|
||
|
||
IR fingerprint = concat of (contentId ++ shapeKey ++ status) for every
|
||
node in IR traversal order. Two ProofGraphs with the same graphId are
|
||
structurally identical modulo label renaming. -/
|
||
def computeGraphId (theoremName : String) (ir : Array ProofNode) : ContentHash :=
|
||
hashPPExpr <|
|
||
ir.foldl (fun acc n => acc ++ n.contentId ++ n.shapeKey ++ n.status) theoremName
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 10 JSON serialization (comonad/1 wire format)
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- Float → JSON number via fixed-point scaling to 6 decimal places.
|
||
`Lean.JsonNumber` has `mantissa : Int` and `exponent : Nat`, where
|
||
the represented value is `mantissa × 10^(-exponent)`.
|
||
So 0.48 → { mantissa := 480000, exponent := 6 }.
|
||
All metrics are non-negative so the UInt64 route is safe. -/
|
||
private def floatToJson (f : Float) : Lean.Json :=
|
||
if f == 0.0 then .num ⟨0, 0⟩
|
||
else .num { mantissa := Int.ofNat (f * 1000000.0).round.toUInt64.toNat
|
||
exponent := 6 }
|
||
|
||
private def slotToJson (s : Slot) : Lean.Json :=
|
||
.mkObj [ ("pos", .num ⟨s.pos, 0⟩)
|
||
, ("contentId", .str s.contentId)
|
||
, ("label", .str s.label)
|
||
, ("status", .str s.status) ]
|
||
|
||
private def instanceToJson (ci : ClusterInstance) : Lean.Json :=
|
||
.mkObj [ ("root", .str ci.root)
|
||
, ("slots", .arr (ci.slots.map slotToJson)) ]
|
||
|
||
private def clusterToJson (c : SimilarityCluster) : Lean.Json :=
|
||
.mkObj [ ("id", .num ⟨c.id, 0⟩)
|
||
, ("shapeKey", .str c.shapeKey)
|
||
, ("size", .num ⟨c.size, 0⟩)
|
||
, ("extractable", .bool c.extractable)
|
||
, ("instances", .arr (c.instances.map instanceToJson)) ]
|
||
|
||
private def nodeToJson (n : ProofNode) : Lean.Json :=
|
||
.mkObj [ ("contentId", .str n.contentId)
|
||
, ("shapeKey", .str n.shapeKey)
|
||
, ("label", .str n.label)
|
||
, ("status", .str n.status)
|
||
, ("depIds", .arr (n.depIds.map .str)) ]
|
||
|
||
private def metricsToJson (m : MetricVector) : Lean.Json :=
|
||
.mkObj [ ("nodeCount", .num ⟨m.nodeCount, 0⟩)
|
||
, ("maxDepth", .num ⟨m.maxDepth, 0⟩)
|
||
, ("leafRatio", floatToJson m.leafRatio)
|
||
, ("clusterCount", .num ⟨m.clusterCount, 0⟩)
|
||
, ("extractable", .num ⟨m.extractable, 0⟩)
|
||
, ("selfSim", floatToJson m.selfSimilarity) ]
|
||
|
||
/-- Serialize to the comonad/1 wire format (compressed JSON).
|
||
|
||
`provenanceId` and `session` are always null here;
|
||
the external language attaches them after the fact.
|
||
|
||
Pantograph capture pattern:
|
||
request : {"file": "#findComonadsJSON Nat.add_comm",
|
||
"readHeader": false, "inheritEnv": true, "newConstants": false}
|
||
response : units[0].messages[0].data (severity = "information") -/
|
||
def proofGraphToJson (g : ProofGraph) : Lean.Json :=
|
||
.mkObj [ ("schema", .str g.schema)
|
||
, ("hashAlgo", .str g.hashAlgo)
|
||
, ("theorem", .str g.theoremName)
|
||
, ("graphId", .str g.graphId)
|
||
, ("provenanceId", .null)
|
||
, ("session", .null)
|
||
, ("metrics", metricsToJson g.metrics)
|
||
, ("ir", .arr (g.ir.map nodeToJson))
|
||
, ("clusters", .arr (g.clusters.map clusterToJson)) ]
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 11 Core MetaM pipeline
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- Full pipeline: constant name → ProofGraph.
|
||
|
||
Steps:
|
||
1. Resolve name → ConstantInfo (error if absent or opaque)
|
||
2. Explode proof value via `Mathlib.Explode.explode value true`
|
||
3. Build content-addressed IR (§ 6 — single MetaM pass)
|
||
4. Detect comonadic clusters (§ 7 — pure)
|
||
5. Compute metric vector (§ 8 — pure)
|
||
6. Compute graphId (§ 9 — pure)
|
||
7. Return ProofGraph envelope (§ 2)
|
||
|
||
All MetaM work is confined to steps 1–3. Steps 4–7 are pure functions
|
||
that can be tested, cached, and called from non-MetaM contexts. -/
|
||
def findComonadsCore (name : Name) : MetaM ProofGraph := do
|
||
let info ← getConstInfo name
|
||
let value ← match info.value? with
|
||
| some e => pure e
|
||
| none => throwError
|
||
"ComonadFinder: '{name}' has no proof value (axiom or opaque?)"
|
||
let entries ← Mathlib.Explode.explode value true
|
||
let ir ← buildIR entries
|
||
let nodeMap := irToAdjMap ir
|
||
let clusters := findClusters ir nodeMap
|
||
let metrics := computeMetrics ir clusters
|
||
let graphId := computeGraphId name.toString ir
|
||
return { schema := "comonad/1"
|
||
hashAlgo := "ppExpr/1"
|
||
theoremName := name.toString
|
||
graphId := graphId
|
||
metrics := metrics
|
||
ir := ir
|
||
clusters := clusters }
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 12 Elaborator commands
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- `#findComonads declName` — human-readable summary in InfoView.
|
||
|
||
Example output:
|
||
nodes=25, clusters=2, extractable=2, selfSim=0.480000
|
||
graphId=a3f2c1b4e5d60789
|
||
cluster [0]: size=3, x2 occurrences, topology ((·))
|
||
cluster [1]: size=2, x3 occurrences, topology (·) -/
|
||
elab "#findComonads " name:ident : command => do
|
||
let g ← liftTermElabM (findComonadsCore name.getId)
|
||
let extr := (g.clusters.filter (·.extractable)).size
|
||
let header : MessageData :=
|
||
m!"nodes={g.metrics.nodeCount}, clusters={g.metrics.clusterCount}, \
|
||
extractable={extr}, selfSim={g.metrics.selfSimilarity}\n\
|
||
graphId={g.graphId}"
|
||
let rows : List MessageData :=
|
||
(g.clusters.mapIdx fun i c =>
|
||
m!" cluster [{i}]: size={c.size}, \
|
||
x{c.instances.size} occurrences, topology {c.shapeKey}").toList
|
||
logInfo (MessageData.joinSep (header :: rows) "\n")
|
||
|
||
/-- `#findComonadsJSON declName` — machine-readable comonad/1 JSON for Pantograph.
|
||
|
||
Emits one compressed JSON line as an `information`-severity message.
|
||
The external language (Mathematica / Python) reads:
|
||
units[0].messages[0].data -/
|
||
elab "#findComonadsJSON " name:ident : command => do
|
||
let g ← liftTermElabM (findComonadsCore name.getId)
|
||
logInfo m!"{(proofGraphToJson g).compress}"
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 13 Tests (skipped on port — Lean 4 v4.30 changed `info.value?`
|
||
-- access for theorems; the original mm-lean smoke tests
|
||
-- ran against an older toolchain. Restore by adding a
|
||
-- `Test/ComonadFinder.lean` once the access pattern is
|
||
-- updated for the current toolchain.)
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
-- § 14 Navigation helpers (for comonad.* API)
|
||
-- ════════════════════════════════════════════════════════════════════════════
|
||
|
||
/-- Get a single node by its contentId from the IR.
|
||
Returns `none` if the node doesn't exist in the graph. -/
|
||
def getNodeInfo (graph : ProofGraph) (nodeId : ContentHash) : Option ProofNode :=
|
||
graph.ir.find? (·.contentId == nodeId)
|
||
|
||
/-- Get all nodes in the subgraph rooted at `rootId`.
|
||
Uses DFS to collect reachable nodes. Returns empty array if root not found. -/
|
||
def getSubgraph (graph : ProofGraph) (rootId : ContentHash) : Array ProofNode :=
|
||
let nodeMap := irToAdjMap graph.ir
|
||
let reachable := reachableFrom nodeMap rootId
|
||
graph.ir.filter (fun n => reachable.contains n.contentId)
|
||
|
||
/-- Find all cluster IDs that contain the given node.
|
||
A node is "in" a cluster if it appears as a root of any instance,
|
||
or as a slot in any instance. -/
|
||
def findClustersContaining (graph : ProofGraph) (nodeId : ContentHash) : Array Nat :=
|
||
graph.clusters.filter (fun c =>
|
||
c.instances.any (fun inst =>
|
||
inst.root == nodeId || inst.slots.any (·.contentId == nodeId)
|
||
)
|
||
) |>.map (·.id)
|
||
|
||
/-- Get edges for a subgraph (for subgraph/1 schema). -/
|
||
def getSubgraphEdges (graph : ProofGraph) (rootId : ContentHash) : Array (ContentHash × ContentHash) := Id.run do
|
||
let nodeMap := irToAdjMap graph.ir
|
||
let reachable := reachableFrom nodeMap rootId
|
||
let mut edges : Array (ContentHash × ContentHash) := #[]
|
||
for cid in reachable.toList do
|
||
if let some node := nodeMap.get? cid then
|
||
for depId in node.depIds do
|
||
if reachable.contains depId then
|
||
edges := edges.push (cid, depId)
|
||
return edges
|