infoductor/Infoductor/Comonad/ComonadFinder.lean

/-
  ComonadFinder.lean — automatic detection of comonadic subgraph patterns
  in Lean 4 proof terms.

  Wire format  : comonad/1
  Hash regime  : ppExpr/1   (bump field to "exprAST/1" when upgraded)

  ══════════════════════════════════════════════════════════
  THEORY
  ══════════════════════════════════════════════════════════

  A subgraph S ⊆ ProofGraph has comonadic structure iff:

    · extract(S) is valid:
        S is closed under deps — no node in S has an essential
        dependency outside S. This means S is a complete proof
        by itself. The comonad's `extract` returns this proof.

    · duplicate(S) ≅ S:
        S's structural shape recurs elsewhere in the same proof.
        The comonad's `duplicate` witnesses this: the "proof of
        proofs" contains S as a recognisable sub-object.

  Such subgraphs are candidates for named definitions.
  The system has found that the mathematician is "doing the same
  thing twice" — the shape is a lemma waiting to be extracted.

  ══════════════════════════════════════════════════════════
  IDENTITY MODEL  (ppExpr/1)
  ══════════════════════════════════════════════════════════

  The original design used Explode line numbers (session-local Nat)
  as node identities. These are unstable: the same proof loaded in two
  Pantograph sessions will assign different line numbers to the same
  sub-term. This made cross-session comparison and caching impossible.

  The new design uses ContentHash — a 16-char hex FNV-1a-64 digest of
  the node's `entry.thm` pretty-print. Two nodes with the same hash
  are the same logical judgment. Edges become ContentHash → ContentHash.
  The full graph fingerprint (`graphId`) is a hash of the theorem name
  concatenated with all per-node hashes in IR traversal order.

  ══════════════════════════════════════════════════════════
  ALGORITHM
  ══════════════════════════════════════════════════════════

  0. hashPPExpr: FNV-1a-64 of entry.thm pretty-print → 16-char hex.
       Stable across Pantograph sessions for the same closed sub-term.

  1. buildIR: three passes over Explode Entries → Array ProofNode.
       Pass 1 — lineNo → ContentHash  (resolve all identities first)
       Pass 2 — build ProofNode with placeholder shapeKey, deduplicate
                by contentId (same judgment reachable from two parents
                is stored once)
       Pass 3 — recompute shapeKey recursively now that the full
                nodeMap is available

     Each ProofNode carries: contentId, depIds (content-addressed edges),
     shapeKey (recursive topology string), label, status.
     This is the serialisable atom — no further MetaM access required.

  2. findClusters: group nodes by shapeKey.
       Groups with ≥ 2 occurrences → SimilarityCluster.
       For each cluster, check the comonad closure law:
         extractable iff shapeKey ≠ "·" AND ≥ 2 instances are closed
         (every dep of every reachable node is itself reachable within
          the same instance, or is an external constant not in the IR).
       Slots: position 0 = instance root, 1..n = direct deps.
       `instances[i].slots[pos]?` gives the concrete node at topology
       position `pos` in occurrence `i` — the basis for cross-instance
       combinatoric queries.

  3. computeMetrics: pure computation from IR + clusters.
       Roots (nodes with no in-edges) drive maxDepth via DFS.
       selfSimilarity = nodes covered by extractable clusters / total,
       capped at 1.0 to handle shared sub-nodes.

  4. JSON serialisation: proofGraphToJson → comonad/1 wire format.
       `provenanceId` and `session` are always null from Lean;
       the external language (Mathematica / Pantograph client) fills
       them in after attaching the result to a proof obligation.

  ══════════════════════════════════════════════════════════
  DATA FLOW
  ══════════════════════════════════════════════════════════

    Explode Entries
          │
          ▼  buildIR (MetaM — three passes)
    Array ProofNode   ◀── contentId = hashPPExpr(entry.thm)
          │                depIds   = content-addressed edges
          │                shapeKey = recursive topology string
          │
     ┌────┴──────────────────┐
     ▼                       ▼
  irToAdjMap           findClusters (pure)
  (ContentHash          │
   → ProofNode)    reachableFrom       ←── replaces old subtreeOf/NodeSet
                   isClosedUnderDeps   ←── comonad extract condition
                        │
                   SimilarityCluster
                   (shapeKey, size, extractable, instances[].slots)
                        │
                   computeMetrics  ──▶  MetricVector
                        │
                   computeGraphId  ──▶  ContentHash
                        │
                   proofGraphToJson ──▶  comonad/1 JSON
                        │
               #findComonads       (human summary, InfoView)
               #findComonadsJSON   (compressed JSON, Pantograph)

  ══════════════════════════════════════════════════════════
  SHAPE ENCODING
  ══════════════════════════════════════════════════════════

  The old design used a `Shape` inductive with treeSize / treeDepth.
  The new design encodes topology directly as a recursive String:

    "·"          — leaf (no deps, or dep absent from IR)
    "(S₁S₂…)"   — node whose deps have sub-shapes S₁ … Sₙ

  Examples:
    bare leaf              → "·"           (size  1, depth 0)
    root → leaf            → "(·)"         (size  2, depth 1)
    root → mid → leaf      → "((·))"       (size  3, depth 2)

  String equality replaces structural BEq; grouping by shapeKey in a
  HashMap replaces the old O(N²) pairwise isomorphism check.

  ══════════════════════════════════════════════════════════
  METRICS
  ══════════════════════════════════════════════════════════

    Complexity     — nodeCount, maxDepth, leafRatio
    Entropy        — clusterCount relative to nodeCount
    Self-similarity — covered_nodes / total_nodes ∈ [0,1]
                      (capped at 1.0; shared sub-nodes are counted once)
-/

import Mathlib.Tactic.Explode
import Mathlib.Tactic.Explode.Datatypes
-- (GridView import dropped — ComonadFinder doesn't actually reference
-- any GridView symbol.  Downstream consumers that want visualization
-- import `Infoductor.Comonad.GridView` directly.)

open Lean Meta Elab Command Mathlib.Explode

-- ════════════════════════════════════════════════════════════════════════════
-- § 1  ContentHash — stable, cross-session node identity
-- ════════════════════════════════════════════════════════════════════════════

/-- 16-character lowercase hex string.
    FNV-1a-64 of the node's `entry.thm` pretty-print.
    Stable across Pantograph sessions for the same closed sub-term.

    Two ProofNodes with the same ContentHash represent the same logical
    judgment and may be collapsed without loss of information. -/
abbrev ContentHash := String

/-- Pure FNV-1a-64 over UTF-8 bytes → exactly 16 lowercase hex chars.
    No IO, no MetaM — safe to call from anywhere.

    Parameters (64-bit):
      offset_basis = 14695981039346656037
      prime        = 1099511628211

    `go` builds the digit list MSB-first by prepending each new digit:
      go 0xff [] → go 0xf ['f'] → go 0 ['f','f'] → "ff"  ✓  -/
def hashPPExpr (s : String) : ContentHash :=
  let hash : UInt64 :=
    s.toUTF8.toList.foldl
      (fun h b => (h ^^^ b.toUInt64) * 1099511628211)
      14695981039346656037
  let n       := hash.toNat
  let hexChar := fun d =>
    if d < 10 then Char.ofNat (d + '0'.toNat)
    else        Char.ofNat (d - 10 + 'a'.toNat)
  let rec go : Nat → List Char → Nat → List Char
    | _,  acc, 0     => acc
    | 0,  acc, _     => acc
    | n,  acc, k + 1 => go (n / 16) (hexChar (n % 16) :: acc) k
  let raw    := if n = 0 then ['0'] else go n [] 17  -- UInt64 ≤ 16 hex digits
  let padded := List.replicate (16 - raw.length) '0' ++ raw
  String.ofList padded

-- ════════════════════════════════════════════════════════════════════════════
-- § 2  IR types  (comonad/1 schema)
-- ════════════════════════════════════════════════════════════════════════════

/-- One judgment node in the proof DAG.

    `contentId` is the stable cross-session identity (ppExpr/1 hash).
    `depIds`    are content-addressed outgoing edges.
    `shapeKey`  encodes recursive topology: "·" | "(·)" | "((·))" etc.
    `label`     is the pretty-printed sub-term from entry.thm.
    `status`    is "hyp" | "lam" | "app".

    An `Array ProofNode` is a complete serialisable IR; no further
    MetaM access is required after buildIR returns. -/
structure ProofNode where
  contentId : ContentHash
  shapeKey  : String
  label     : String
  status    : String
  depIds    : Array ContentHash
  deriving Repr, BEq, Hashable, Inhabited

/-- One node at a fixed topology position within a cluster instance.

    `pos` follows the slot ordering: 0 = instance root, 1..n = direct deps.
    Cross-instance slice at position `pos`:
      instances.filterMap (·.slots[pos]?)  — O(instances.size). -/
structure Slot where
  pos       : Nat
  contentId : ContentHash
  label     : String
  status    : String
  deriving Repr

/-- One concrete occurrence of a comonadic pattern.

    `root`  — ContentHash of the sub-root for this occurrence.
    `slots` — [root-slot, dep-slot₀, dep-slot₁, …] in position order.
              Query `slots[pos]?` for the concrete node at topology
              position `pos` without traversal. -/
structure ClusterInstance where
  root  : ContentHash
  slots : Array Slot
  deriving Repr

/-- A family of structurally isomorphic subgraphs.

    `shapeKey`    — the shared recursive topology string.
    `size`        — reachable IR-node count per instance (identical across
                    instances because they share the same topology).
    `extractable` — true iff shapeKey ≠ "·" AND ≥ 2 instances satisfy the
                    comonad closure law (extract validity condition).
    `instances`   — one entry per occurrence in the proof graph. -/
structure SimilarityCluster where
  id          : Nat
  shapeKey    : String
  size        : Nat
  extractable : Bool
  instances   : Array ClusterInstance
  deriving Repr

/-- Three-axis metric vector, computed purely from the proof term. -/
structure MetricVector where
  nodeCount      : Nat
  maxDepth       : Nat    -- DFS from proof roots, fuel-bounded
  leafRatio      : Float  -- leaves (no deps) / total nodes
  clusterCount   : Nat
  extractable    : Nat    -- clusters satisfying the comonad law
  selfSimilarity : Float  -- covered_nodes / total_nodes ∈ [0,1]
  deriving Repr

/-- Top-level comonad/1 envelope — the unit of exchange with Pantograph.

    `schema`       = "comonad/1"  (versioned wire format tag)
    `hashAlgo`     = "ppExpr/1"   (bump to "exprAST/1" when upgraded)
    `theoremName`  = the fully-qualified Lean name passed to the command.
    `graphId`      = FNV-1a-64 of (theoremName ++ IR fingerprint),
                     stable across Pantograph sessions.
    `provenanceId` = null here; the external language fills it in.
    `session`      = null here; the external language fills it in. -/
structure ProofGraph where
  schema       : String
  hashAlgo     : String
  theoremName  : String   -- renamed: `theorem` is a reserved keyword in Lean 4
  graphId      : ContentHash
  metrics      : MetricVector
  ir           : Array ProofNode
  clusters     : Array SimilarityCluster
  deriving Repr

-- ════════════════════════════════════════════════════════════════════════════
-- § 3  Lookup helpers
-- ════════════════════════════════════════════════════════════════════════════

/-- Build a `ContentHash → ProofNode` map from the IR array.

    Used throughout the pipeline wherever a node must be looked up by
    identity rather than traversed in order. -/
def irToAdjMap (nodes : Array ProofNode) : Std.HashMap ContentHash ProofNode :=
  nodes.foldl (fun m n => m.insert n.contentId n) {}

/-- Alias kept for call-site clarity when the intent is "look up a node". -/
abbrev buildNodeMap := irToAdjMap

-- ════════════════════════════════════════════════════════════════════════════
-- § 4  Recursive shape computation
-- ════════════════════════════════════════════════════════════════════════════

/-- Compute the recursive topology string for the subgraph rooted at `cid`.

    Encoding:
      "·"        — leaf (no deps, or dep not in IR / external constant)
      "(S₁S₂…)" — node with n deps whose sub-shapes are S₁ … Sₙ

    Examples matching verified output:
      bare leaf              → "·"       (trivial; filtered from clusters)
      root → leaf            → "(·)"     size-2 cluster topology
      root → mid → leaf      → "((·))"   size-3 cluster topology

    `fuel` caps recursion at 64 levels — sufficient for all realistic
    proofs and prevents looping on pathological (cyclic) inputs.

    String equality is used for grouping: equal strings ↔ isomorphic
    topologies, without an explicit BEq on an inductive Shape type. -/
def computeShape
    (nodeMap : Std.HashMap ContentHash ProofNode)
    (cid     : ContentHash)
    (fuel    : Nat := 64) : String :=
  match fuel with
  | 0     => "·"
  | k + 1 =>
    match nodeMap.get? cid with
    | none   => "·"
    | some n =>
      if n.depIds.isEmpty then "·"
      else "(" ++ String.join (n.depIds.toList.map fun d => computeShape nodeMap d k) ++ ")"

-- ════════════════════════════════════════════════════════════════════════════
-- § 5  Comonad closure law
-- ════════════════════════════════════════════════════════════════════════════

/-- Collect all IR nodes reachable from `root` by following `depIds` (DFS).

    Nodes absent from `nodeMap` (external constants, universe levels) are
    silently skipped — they are not part of the IR and do not participate
    in the closure check.

    Replaces the old `subtreeOf` which operated on `NodeId`/`NodeSet`;
    here identities are ContentHashes and the visited set is a HashSet. -/
private def reachableFrom
    (nodeMap : Std.HashMap ContentHash ProofNode)
    (root    : ContentHash)
    (fuel    : Nat := 2000) : Std.HashSet ContentHash :=
  let rec go (cid : ContentHash) (acc : Std.HashSet ContentHash) : Nat → Std.HashSet ContentHash
    | 0     => acc
    | k + 1 =>
      if acc.contains cid then acc
      else
        match nodeMap.get? cid with
        | none   => acc
        | some n => n.depIds.foldl (fun a d => go d a k) (acc.insert cid)
  go root {} fuel

/-- Comonad extract validity: the subgraph `S` can stand alone as a
    complete proof and may be lifted to a named Lean definition iff
    ∀ n ∈ S, ∀ d ∈ deps(n), d ∈ S ∨ d ∉ IR.

    Deps absent from `nodeMap` are external constants (Nat.add, etc.)
    and are not violations — they are always available globally.

    This is the `extract` condition from the comonadic theory:
    `extract : W A → A` requires the focussed value to be self-contained. -/
private def isClosedUnderDeps
    (nodeMap  : Std.HashMap ContentHash ProofNode)
    (subgraph : Std.HashSet ContentHash) : Bool :=
  subgraph.toList.all fun cid =>
    match nodeMap.get? cid with
    | none   => true
    | some n =>
      n.depIds.all fun d =>
        subgraph.contains d || !nodeMap.contains d

-- ════════════════════════════════════════════════════════════════════════════
-- § 6  Build IR from Explode entries  (three passes)
-- ════════════════════════════════════════════════════════════════════════════

/-- Convert raw Explode `Entries` to a content-addressed `Array ProofNode`.

    Actual `Mathlib.Explode.Entry` fields consumed:
      entry.thm    : MessageData          — pretty-print of the sub-term
      entry.line   : Option Nat           — 1-based line number (session-local)
      entry.deps   : Array (Option Nat)   — dep line numbers (session-local)
      entry.status : Status               — .sintro | .intro | .cintro | .lam | .reg

    Three-pass design — necessary because shapeKey computation requires the
    full nodeMap, which only exists after all nodes have been inserted:

      Pass 1 — lineNo → ContentHash
               Resolve all node identities before building any edges.
               This guarantees that depIds in pass 2 refer to hashes
               that will definitely appear as contentIds in the IR.

      Pass 2 — build ProofNode with placeholder shapeKey "·",
               deduplicate by contentId.
               The same sub-term may be reachable from multiple parents
               in the Explode output; we store it once (first occurrence wins).

      Pass 3 — replace placeholder shapeKeys with recursively computed
               topology strings now that the complete nodeMap is available. -/
def buildIR (entries : Entries) : MetaM (Array ProofNode) := do
  let entryList := entries.l   -- REVERSED: root at index 0, leaves at end

  -- ── Pass 1: lineNo → ContentHash ─────────────────────────────────────────
  let mut lineToHash : Std.HashMap Nat ContentHash := {}
  for entry in entryList do
    let thmStr := (← entry.thm.format).pretty   -- MetaM pretty-print; toString would fail
    lineToHash := lineToHash.insert
      (entry.line.getD 0)
      (hashPPExpr thmStr)

  -- ── Pass 2: build nodes, deduplicate by contentId ────────────────────────
  let mut seen  : Std.HashSet ContentHash := {}
  let mut nodes : Array ProofNode         := #[]

  for entry in entryList do
    let thmStr := (← entry.thm.format).pretty
    let cid := hashPPExpr thmStr
    if seen.contains cid then continue    -- same sub-term reached from another parent

    let depIds : Array ContentHash :=
      (entry.deps.filterMap (fun optLine =>
        optLine.bind (fun ln => lineToHash.get? ln))).toArray   -- filterMap on List → List; .toArray needed

    let status : String := match entry.status with
      | .sintro | .intro | .cintro => "hyp"
      | .lam                       => "lam"
      | .reg                       => "app"

    nodes := nodes.push
      { contentId := cid
        shapeKey  := "·"              -- placeholder; overwritten in pass 3
        label     := thmStr
        status    := status
        depIds    := depIds }
    seen := seen.insert cid

  -- ── Pass 3: recompute shape keys with full recursive structure ────────────
  let nodeMap := irToAdjMap nodes
  return nodes.map fun n =>
    { n with shapeKey := computeShape nodeMap n.contentId }

-- ════════════════════════════════════════════════════════════════════════════
-- § 7  Comonadic cluster detection
-- ════════════════════════════════════════════════════════════════════════════

/-- Find all comonadic subgraph patterns in the IR.

    Algorithm:
      1. Group nodes by shapeKey (O(N) HashMap insert per node).
         Any shape recurring ≥ 2 times is a candidate pattern.

      2. For each candidate group, compute:
           · subgraphSize   — reachable IR-node count from the first instance.
                              All instances share identical topology ⟹ same count.
           · extractable    — shapeKey ≠ "·"  (non-trivial pattern)
                              AND closedCount ≥ 2  (comonad law on ≥ 2 instances).
             `closedCount` = number of instances whose reachable subgraph
             satisfies `isClosedUnderDeps` — the `extract` validity condition.

      3. Build ClusterInstances with slot arrays:
           pos 0 = root slot, pos 1..n = direct dep slots.
           Labels and statuses are resolved from nodeMap in O(1).

    The `duplicate` condition from the comonadic theory is satisfied by
    construction: a cluster with k instances witnesses k "copies" of the
    same structural sub-object within the proof — `duplicate : W A → W (W A)`
    applied at the pattern level. -/
def findClusters
    (nodes   : Array ProofNode)
    (nodeMap : Std.HashMap ContentHash ProofNode) : Array SimilarityCluster := Id.run do
  -- Step 1: group by topology fingerprint
  let mut groups : Std.HashMap String (Array ProofNode) := {}
  for n in nodes do
    groups := groups.insert n.shapeKey (groups.getD n.shapeKey #[] |>.push n)

  let mut clusters : Array SimilarityCluster := #[]
  let mut nextId   : Nat                     := 0

  for (shapeKey, grpNodes) in groups.toList do
    if grpNodes.size < 2 then continue

    -- Step 2: subgraph size and closure check
    let subgraphSize :=
      (reachableFrom nodeMap grpNodes[0]!.contentId).toList.length

    let closedCount : Nat :=
      grpNodes.filter (fun n =>
        isClosedUnderDeps nodeMap (reachableFrom nodeMap n.contentId)) |>.size

    -- Step 3: build instances with position-indexed slot arrays
    let instances : Array ClusterInstance := grpNodes.map fun n =>
      let rootSlot : Slot :=
        { pos := 0, contentId := n.contentId, label := n.label, status := n.status }
      let depSlots : Array Slot :=
        n.depIds.mapIdx fun i depCid =>
          { pos       := i + 1
            contentId := depCid
            label     := (nodeMap.get? depCid).map (·.label)  |>.getD ""
            status    := (nodeMap.get? depCid).map (·.status) |>.getD "dep" }
      { root := n.contentId, slots := #[rootSlot] ++ depSlots }

    clusters := clusters.push
      { id          := nextId
        shapeKey    := shapeKey
        size        := subgraphSize
        extractable := shapeKey != "·" && closedCount >= 2
        instances   := instances }
    nextId := nextId + 1

  return clusters

-- ════════════════════════════════════════════════════════════════════════════
-- § 8  Metric computation  (pure)
-- ════════════════════════════════════════════════════════════════════════════

/-- Nodes with no in-edges — the entry points of the proof graph.
    Used as DFS roots for maxDepth computation. -/
private def findRoots (nodes : Array ProofNode) : Array ProofNode :=
  let allDeps : Std.HashSet ContentHash :=
    nodes.foldl (fun s n => n.depIds.foldl (·.insert) s) {}
  nodes.filter fun n => !allDeps.contains n.contentId

/-- DFS max-depth from a single root, fuel-bounded to 2000.
    Returns 0 for leaves; 1 + max(children) otherwise. -/
private def maxDepthFrom
    (nodeMap : Std.HashMap ContentHash ProofNode)
    (root    : ContentHash)
    (fuel    : Nat := 2000) : Nat :=
  let rec go : ContentHash → Nat → Nat
    | _,   0     => 0
    | cid, k + 1 =>
      match nodeMap.get? cid with
      | none   => 0
      | some n =>
        if n.depIds.isEmpty then 0
        else 1 + n.depIds.foldl (fun acc d => max acc (go d k)) 0
  go root fuel

/-- Compute the metric vector from the prebuilt IR and clusters.

    selfSimilarity is capped at 1.0 because instances may share sub-nodes:
    summing (instances.size × size) can exceed nodeCount when the same
    sub-term appears as a dep of multiple cluster roots. -/
def computeMetrics
    (nodes    : Array ProofNode)
    (clusters : Array SimilarityCluster) : MetricVector :=
  let nodeCount := nodes.size
  let leafRatio :=
    if nodeCount = 0 then 0.0
    else (nodes.filter (·.depIds.isEmpty)).size.toFloat / nodeCount.toFloat
  let nodeMap := irToAdjMap nodes
  let maxD    :=
    findRoots nodes |>.foldl (fun acc n => max acc (maxDepthFrom nodeMap n.contentId)) 0
  let extractableCount := (clusters.filter (·.extractable)).size
  let covered :=
    clusters.foldl (fun acc c =>
      if c.extractable then acc + c.instances.size * c.size else acc) 0
  let selfSim :=
    if nodeCount = 0 then 0.0
    else min 1.0 (covered.toFloat / nodeCount.toFloat)
  { nodeCount      := nodeCount
    maxDepth       := maxD
    leafRatio      := leafRatio
    clusterCount   := clusters.size
    extractable    := extractableCount
    selfSimilarity := selfSim }

-- ════════════════════════════════════════════════════════════════════════════
-- § 9  graphId
-- ════════════════════════════════════════════════════════════════════════════

/-- Stable graph fingerprint: FNV-1a-64 of (theoremName ++ IR fingerprint).

    IR fingerprint = concat of (contentId ++ shapeKey ++ status) for every
    node in IR traversal order.  Two ProofGraphs with the same graphId are
    structurally identical modulo label renaming. -/
def computeGraphId (theoremName : String) (ir : Array ProofNode) : ContentHash :=
  hashPPExpr <|
    ir.foldl (fun acc n => acc ++ n.contentId ++ n.shapeKey ++ n.status) theoremName

-- ════════════════════════════════════════════════════════════════════════════
-- § 10  JSON serialization  (comonad/1 wire format)
-- ════════════════════════════════════════════════════════════════════════════

/-- Float → JSON number via fixed-point scaling to 6 decimal places.
    `Lean.JsonNumber` has `mantissa : Int` and `exponent : Nat`, where
    the represented value is `mantissa × 10^(-exponent)`.
    So 0.48 → { mantissa := 480000, exponent := 6 }.
    All metrics are non-negative so the UInt64 route is safe. -/
private def floatToJson (f : Float) : Lean.Json :=
  if f == 0.0 then .num ⟨0, 0⟩
  else .num { mantissa := Int.ofNat (f * 1000000.0).round.toUInt64.toNat
              exponent := 6 }

private def slotToJson (s : Slot) : Lean.Json :=
  .mkObj [ ("pos",       .num ⟨s.pos, 0⟩)
         , ("contentId", .str s.contentId)
         , ("label",     .str s.label)
         , ("status",    .str s.status) ]

private def instanceToJson (ci : ClusterInstance) : Lean.Json :=
  .mkObj [ ("root",  .str ci.root)
         , ("slots", .arr (ci.slots.map slotToJson)) ]

private def clusterToJson (c : SimilarityCluster) : Lean.Json :=
  .mkObj [ ("id",          .num ⟨c.id, 0⟩)
         , ("shapeKey",    .str c.shapeKey)
         , ("size",        .num ⟨c.size, 0⟩)
         , ("extractable", .bool c.extractable)
         , ("instances",   .arr (c.instances.map instanceToJson)) ]

private def nodeToJson (n : ProofNode) : Lean.Json :=
  .mkObj [ ("contentId", .str n.contentId)
         , ("shapeKey",  .str n.shapeKey)
         , ("label",     .str n.label)
         , ("status",    .str n.status)
         , ("depIds",    .arr (n.depIds.map .str)) ]

private def metricsToJson (m : MetricVector) : Lean.Json :=
  .mkObj [ ("nodeCount",    .num ⟨m.nodeCount, 0⟩)
         , ("maxDepth",     .num ⟨m.maxDepth, 0⟩)
         , ("leafRatio",    floatToJson m.leafRatio)
         , ("clusterCount", .num ⟨m.clusterCount, 0⟩)
         , ("extractable",  .num ⟨m.extractable, 0⟩)
         , ("selfSim",      floatToJson m.selfSimilarity) ]

/-- Serialize to the comonad/1 wire format (compressed JSON).

    `provenanceId` and `session` are always null here;
    the external language attaches them after the fact.

    Pantograph capture pattern:
      request  : {"file": "#findComonadsJSON Nat.add_comm",
                  "readHeader": false, "inheritEnv": true, "newConstants": false}
      response : units[0].messages[0].data  (severity = "information") -/
def proofGraphToJson (g : ProofGraph) : Lean.Json :=
  .mkObj [ ("schema",       .str g.schema)
         , ("hashAlgo",     .str g.hashAlgo)
         , ("theorem",      .str g.theoremName)
         , ("graphId",      .str g.graphId)
         , ("provenanceId", .null)
         , ("session",      .null)
         , ("metrics",      metricsToJson g.metrics)
         , ("ir",           .arr (g.ir.map nodeToJson))
         , ("clusters",     .arr (g.clusters.map clusterToJson)) ]

-- ════════════════════════════════════════════════════════════════════════════
-- § 11  Core MetaM pipeline
-- ════════════════════════════════════════════════════════════════════════════

/-- Full pipeline: constant name → ProofGraph.

    Steps:
      1. Resolve name → ConstantInfo  (error if absent or opaque)
      2. Explode proof value via `Mathlib.Explode.explode value true`
      3. Build content-addressed IR                (§ 6  — single MetaM pass)
      4. Detect comonadic clusters                 (§ 7  — pure)
      5. Compute metric vector                     (§ 8  — pure)
      6. Compute graphId                           (§ 9  — pure)
      7. Return ProofGraph envelope                (§ 2)

    All MetaM work is confined to steps 1–3.  Steps 4–7 are pure functions
    that can be tested, cached, and called from non-MetaM contexts. -/
def findComonadsCore (name : Name) : MetaM ProofGraph := do
  let info  ← getConstInfo name
  let value ← match info.value? with
    | some e => pure e
    | none   => throwError
        "ComonadFinder: '{name}' has no proof value (axiom or opaque?)"
  let entries  ← Mathlib.Explode.explode value true
  let ir       ← buildIR entries
  let nodeMap  := irToAdjMap ir
  let clusters := findClusters ir nodeMap
  let metrics  := computeMetrics ir clusters
  let graphId  := computeGraphId name.toString ir
  return { schema      := "comonad/1"
           hashAlgo    := "ppExpr/1"
           theoremName := name.toString
           graphId     := graphId
           metrics     := metrics
           ir          := ir
           clusters    := clusters }

-- ════════════════════════════════════════════════════════════════════════════
-- § 12  Elaborator commands
-- ════════════════════════════════════════════════════════════════════════════

/-- `#findComonads declName` — human-readable summary in InfoView.

    Example output:
      nodes=25, clusters=2, extractable=2, selfSim=0.480000
      graphId=a3f2c1b4e5d60789
        cluster [0]: size=3, x2 occurrences, topology ((·))
        cluster [1]: size=2, x3 occurrences, topology (·)          -/
elab "#findComonads " name:ident : command => do
  let g    ← liftTermElabM (findComonadsCore name.getId)
  let extr := (g.clusters.filter (·.extractable)).size
  let header : MessageData :=
    m!"nodes={g.metrics.nodeCount}, clusters={g.metrics.clusterCount}, \
       extractable={extr}, selfSim={g.metrics.selfSimilarity}\n\
       graphId={g.graphId}"
  let rows : List MessageData :=
    (g.clusters.mapIdx fun i c =>
      m!"  cluster [{i}]: size={c.size}, \
         x{c.instances.size} occurrences, topology {c.shapeKey}").toList
  logInfo (MessageData.joinSep (header :: rows) "\n")

/-- `#findComonadsJSON declName` — machine-readable comonad/1 JSON for Pantograph.

    Emits one compressed JSON line as an `information`-severity message.
    The external language (Mathematica / Python) reads:
      units[0].messages[0].data  -/
elab "#findComonadsJSON " name:ident : command => do
  let g ← liftTermElabM (findComonadsCore name.getId)
  logInfo m!"{(proofGraphToJson g).compress}"

-- ════════════════════════════════════════════════════════════════════════════
-- § 13  Tests (skipped on port — Lean 4 v4.30 changed `info.value?`
--             access for theorems; the original mm-lean smoke tests
--             ran against an older toolchain.  Restore by adding a
--             `Test/ComonadFinder.lean` once the access pattern is
--             updated for the current toolchain.)
-- ════════════════════════════════════════════════════════════════════════════

-- ════════════════════════════════════════════════════════════════════════════
-- § 14  Navigation helpers (for comonad.* API)
-- ════════════════════════════════════════════════════════════════════════════

/-- Get a single node by its contentId from the IR.
    Returns `none` if the node doesn't exist in the graph. -/
def getNodeInfo (graph : ProofGraph) (nodeId : ContentHash) : Option ProofNode :=
  graph.ir.find? (·.contentId == nodeId)

/-- Get all nodes in the subgraph rooted at `rootId`.
    Uses DFS to collect reachable nodes. Returns empty array if root not found. -/
def getSubgraph (graph : ProofGraph) (rootId : ContentHash) : Array ProofNode :=
  let nodeMap := irToAdjMap graph.ir
  let reachable := reachableFrom nodeMap rootId
  graph.ir.filter (fun n => reachable.contains n.contentId)

/-- Find all cluster IDs that contain the given node.
    A node is "in" a cluster if it appears as a root of any instance,
    or as a slot in any instance. -/
def findClustersContaining (graph : ProofGraph) (nodeId : ContentHash) : Array Nat :=
  graph.clusters.filter (fun c =>
    c.instances.any (fun inst =>
      inst.root == nodeId || inst.slots.any (·.contentId == nodeId)
    )
  ) |>.map (·.id)

/-- Get edges for a subgraph (for subgraph/1 schema). -/
def getSubgraphEdges (graph : ProofGraph) (rootId : ContentHash) : Array (ContentHash × ContentHash) := Id.run do
  let nodeMap := irToAdjMap graph.ir
  let reachable := reachableFrom nodeMap rootId
  let mut edges : Array (ContentHash × ContentHash) := #[]
  for cid in reachable.toList do
    if let some node := nodeMap.get? cid then
      for depId in node.depIds do
        if reachable.contains depId then
          edges := edges.push (cid, depId)
  return edges