cubical-transport-hott-lean4/docs/NUMERICAL.md

# Topolei — Numerical Implementation Principles

*A reference for how numerical schemes, contracts, and realizations are
structured so that **mathematical content is never fused with
execution context**.  Establishes the invariants a scheme must satisfy
to be first-class in the registry.  Read this before writing any
numerical code in topolei — the separation discipline is the main
thing that prevents construction faults downstream.*

---

## 0. The Principle

> **A numerical implementation is a pair**: a mathematical content
> (context-free, transport-stable, provably typed) and an execution
> context (precision regime, parallelism, memory layout, hardware
> calibration).  **These must be separated syntactically** — by the
> FFI boundary — so they cannot leak into each other.

Corollary: any code that looks like numerics but mixes the two halves
is **not a scheme**.  It is an artifact that happens to produce
numbers.  Artifacts are allowed at the edge (e.g., calibration
benchmarks), but they live outside the scheme registry.

The justification is operational, not aesthetic: numerical bugs that
cost time to find are almost always *context leaks* — an `fp16`-
specific stability assumption baked into a "generic" solver, a
parallel-sum scheme that quietly depends on reduction order, a
normalization step whose error budget is expressed in ULPs rather
than in the target's structural norm.  Keeping the two halves
separated at the type level prevents these at the boundary, not after
the fact in code review.

---

## 1. Mathematical Content

**Definition.** The *mathematical content* of a numerical scheme is
everything that can be stated without reference to:

- a specific floating-point format (`fp16`, `fp32`, `fp64`, `bf16`),
- a specific reduction order or parallelism model,
- a specific memory layout (row-major, column-major, blocked, strided),
- a specific hardware cost (GFLOPS, bandwidth, latency),
- a specific backend (CUDA, wgpu, CPU reference, vectorised SIMD),
- a specific realization of its source/target types (dense array,
  sparse matrix, functional list, mesh data structure).

**Positively:** mathematical content is a `CTerm` with a `HasType`
derivation, together with the *propositions* about its structural
behavior (naturality, error bounds, Lipschitz constants, condition
numbers).  These propositions are stated as `Path`, `Prop`, or
proposition-valued `CTerm`s, never as tolerances-in-units.

**Worked example.** The mathematical content of "trapezoidal rule
quadrature" is:

```
scheme : (A : CType) → (f : A → ℝ) → (a b : A) → ℝ
         + naturality witness w.r.t. affine reparametrisation of A
         + error bound: |scheme(f, a, b) − ∫ f| ≤ (b−a)³/12 · sup |f″|
         + Lipschitz-in-f: |scheme(f, a, b) − scheme(g, a, b)|
                           ≤ (b−a) · ‖f − g‖_∞
```

Not mathematical content: "trapezoidal with `fp32` accumulator, SIMD-
4 wide, row-major input".  That's a realization.

---

## 2. Execution Context

**Definition.** The *execution context* is the residual irreducible
data that a realization must commit to in order to actually run.  It
has five axes, each of which is a separate concern:

| Axis | Examples |
|------|----------|
| **Precision regime** | `fp16`/`fp32`/`fp64`/`bf16`/posits/exact rationals |
| **Parallelism model** | serial, SIMD-n, multi-threaded, SIMT/CUDA-like, distributed |
| **Memory layout** | row/col-major, blocked, strided, compressed-sparse, mesh topology |
| **Hardware cost calibration** | GFLOPS per backend, bandwidth per memory tier, wall-clock per abstract cost unit |
| **Backend binding** | CPU reference, SPIR-V shader, WGSL shader, CUDA kernel, wgpu compute |

**Where it lives:** entirely in the Rust FFI module and/or as Lean-side
opaque types (`NonemptyType` wrappers around `@[extern]`
declarations).  The execution context never appears in a `CTerm`; the
type system cannot mention `fp16`.

**Why the split holds at the type level:** `CType` has constructors
`univ | pi | path | glue` — none of them can carry a precision label,
a parallelism tag, or a hardware handle.  The impossibility is
structural, not by discipline.  A scheme that needs to talk about
precision must do so in the Rust side's `ExecContext` struct, and
communicate with Lean only through the axiom interface.

---

## 3. The FFI Boundary as Firewall

The two halves are separated by a **C ABI boundary**.  The boundary
enforces three properties:

1. **Opacity.** Execution-context types are `opaque` in Lean
   (`opaque GPUContextPointer : NonemptyType`), so Lean code cannot
   inspect them structurally.

2. **Axiomatic discharge.** The Rust side provides `@[extern]`
   implementations for axiom-stated behaviors
   (`@[extern "topolei_cubical_eval"] opaque cubicalEval : CEnv → CTerm → CVal`).
   Each Rust function's behavior is specified by a Lean axiom it
   must satisfy (e.g., `eval_var` / `eval_lam` / … in
   `CubicalTransport/Eval.lean`, or `readback_vneu` / … in
   `CubicalTransport/Readback.lean`).  Step-level axioms are derived
   from the eval/readback contract via the Week 7 bridge — see
   `STATUS.md` in the topolei repo (`../topolei/docs/STATUS.md`).

3. **IO tagging.** Any effect (GPU dispatch, memory allocation,
   timing, parallelism) is wrapped in `IO` so it cannot leak into
   pure mathematical content.  A pure `CTerm` is guaranteed to have
   no execution-context dependency.

The boundary pattern in Lean:

```lean
-- Mathematical content: pure, axiomatic, Lean-only
axiom scheme_error_bound (s : SchemeRecord) (x : s.source) :
    ‖ s.apply x − s.exact x ‖ ≤ s.error_bound x

-- Execution context: opaque, IO-wrapped, Rust-backed
opaque RealizationHandle : NonemptyType
@[extern "topolei_realize_scheme"]
opaque realize (s : @& SchemeRecord) (ctx : @& ExecContext) :
    IO RealizationHandle
```

The axiom speaks only of mathematical content; the opaque + extern
speaks only of execution context; they meet nowhere structurally.

---

## 4. Schemes as Typed Morphisms

**A scheme is a record with four required components:**

```lean
structure Scheme where
  source   : CType                                -- structural input type
  target   : CType                                -- structural output type
  body     : CTerm                                -- the morphism itself
  body_ty  : HasType [] body (.pi source target)  -- typing witness
```

**Mandatory additional fields for first-class status (see §10):**

```lean
  nat_witness   : NaturalityWitness body source target
  error_bound   : ErrorContract source target
  cost_estimate : CostContract
  lipschitz     : Option LipschitzBound    -- if declared
```

**Schemes compose by term-level application composition:**

```lean
def Scheme.compose (g f : Scheme) (h : g.source = f.target) : Scheme := {
  source  := f.source,
  target  := g.target,
  body    := .lam "x" (.app g.body (.app f.body (.var "x"))),
  body_ty := ⟨... derived via HasType.lam + HasType.app ... ⟩,
  nat_witness   := nat_witness_compose g.nat_witness f.nat_witness,
  error_bound   := error_compose g.error_bound f.error_bound g.lipschitz,
  cost_estimate := cost_compose g.cost_estimate f.cost_estimate,
  ...
}
```

The non-trivial compositions are `error_compose` (needs Lipschitz /
condition-number data) and `nat_witness_compose` (path concatenation
via `.comp` on path types).  These live in `Cell/Contract.lean` once
that module is written.

---

## 5. Contracts

A contract is *metadata* attached to a scheme that declares its
structural behavior.  Four contract kinds exist:

### 5.1 Error contract

```lean
structure ErrorContract (src tgt : CType) where
  norm_on_target : CTerm                        -- norm inducing metric on tgt
  bound          : CTerm → CTerm                -- bound on ‖scheme(x) − exact(x)‖
  bound_proof    : ∀ x, WellTyped (bound x) ...
```

**Critical rule:** `norm_on_target` is a *structural* object, expressed
in terms of the target `CType` — not in ULPs, `fp32`-epsilons, or any
implementation-unit quantity.  A realization may report *additional*
implementation-specific error (rounding error ≤ n · ε_machine), but
this is separate from the scheme's declared error bound.

**Composition law:** for `g ∘ f`,
`error(g ∘ f)(x) ≤ Lip(g, y=f(x)) · error(f)(x) + error(g)(f(x))`.

### 5.2 Cost contract

```lean
structure CostContract where
  abstract_cost : CTerm → Nat                   -- in abstract cost units
  unit          : AbstractCostUnit              -- flops | memops | roundtrips
```

**Critical rule:** cost is expressed in *abstract* units.  Wall-clock
calibration (`AbstractCostUnit → Float seconds`) is a per-backend axiom
discharged by the execution-context layer, **never** stored on the
scheme itself.  Same scheme, two backends, different wall-clock; the
scheme doesn't care.

### 5.3 Lipschitz / condition-number contract

```lean
structure LipschitzBound (src tgt : CType) where
  constant  : CTerm                             -- K such that ‖s(x)−s(y)‖ ≤ K·‖x−y‖
  domain    : CTerm                             -- domain restriction, if any
```

Declared when relevant (continuous schemes) and used in error
composition.  Omitted for discrete / non-Lipschitz schemes; those
compose via weaker bounds (e.g., `sup`-norm worst-case).

### 5.4 Naturality witness

```lean
structure NaturalityWitness (body : CTerm) (src tgt : CType) where
  on_representation_changes :
    (e_src : EquivData) → (e_tgt : EquivData) →
    PathWitness ... body (transport-adjusted body)
```

**This is the transport-stability evidence.**  A scheme with a
non-trivial `on_representation_changes` is natural (applies to any
representative of its source); without one, the scheme is
representation-specific and must be wrapped by a transport adapter
before being used in a transport-stable context.

---

## 6. Realizations

A *realization* is a specific implementation of a scheme, pinned to
an execution context.  Multiple realizations of the same scheme
coexist and are provably interchangeable up to declared precision
differences.

```lean
structure Realization (s : Scheme) where
  backend       : Backend                       -- CPU ref | SPIR-V | WGSL | CUDA
  precision     : PrecisionRegime               -- fp16 | fp32 | fp64 | exact
  parallelism   : ParallelismModel
  memory_layout : MemoryLayout

  handle        : IO RealizationHandle          -- opaque, FFI-backed

  -- Specification of extra error beyond the scheme's own bound:
  impl_extra_error : CTerm → CTerm              -- e.g., n · ε_machine
  impl_extra_proof : ∀ x, ‖handle.run(x) − scheme.body(x)‖
                          ≤ impl_extra_error x
```

**Interchangeability:** two realizations `r1, r2` of the same scheme
satisfy:

```
‖r1.run(x) − r2.run(x)‖ ≤ r1.impl_extra_error x + r2.impl_extra_error x
```

This is a **theorem**, not an assumption: it falls out of the two
`impl_extra_proof` axioms by triangle inequality.  The scheme itself
is *the same object* in both cases — only the realization varies.

---

## 7. Naturality and Transport-Stability

A scheme `s : A → B` is **natural with respect to an equivalence
`e : A ≃ A'`** iff there exists a path-witness:

```
path : Path (A' → B) (transp (uaLine e A A') s) s'
```

where `s'` is the corresponding scheme on `A'`.  When the witness is
`refl` (the identity path), `s` commutes definitionally with
transport; when it's a non-trivial path, `s` commutes up to a
specified deformation, which is still enough for the registry's
composition rules to work.

**How to test naturality:** pick a non-trivial `e : A ≃ A'` (e.g., a
basis change on a vector type), transport `s` along `uaLine e`, and
check the result equals `s'` (the manually-written scheme on `A'`) up
to the path witness.  In Lean:

```lean
theorem my_scheme_natural (e : EquivData) :
    transp (uaLine e A A') s = s' :=
  ...
```

If no such theorem exists, the scheme is *not* natural and must be
flagged in the registry as representation-specific.

**Wrapping a representation-specific scheme:** provide an adapter
`s_wrapped := s ∘ transp (uaLine e_R R R')` where `e_R` is the
representation equivalence.  The wrapper is natural; the original
isn't.  Both coexist in the registry with different flags.

---

## 8. Preconditions as Propositions

Preconditions (regularity, topological constraints, algebraic
properties) are **propositions in the type system**, not implicit
runtime assumptions.  At the Lean level this is native:

```lean
def divide (x y : ℝ) (h : y ≠ 0) : ℝ := x / y
```

**At the embedded-cubical `CTerm` level**, our current non-dependent Π
limits direct encoding.  Preconditions attach externally as a `Prop`
parameter on the scheme record:

```lean
structure Scheme where
  ...
  precondition : CTerm → Prop
  apply_when_precondition :
    ∀ x, precondition x → WellTyped (.app body x) target
```

**Anti-pattern:** an unstated precondition that shows up as a runtime
check inside the scheme body.  Every precondition must be reflected
in the scheme's type or its `precondition` field.  Runtime checks are
allowed in the realization (defense-in-depth), not in the scheme.

---

## 9. Order-Dependent Operations

Reductions, floating-point summation, parallel sums, and other
operations whose result depends on execution order **must declare
their non-determinism in their contract**:

```lean
structure OrderSensitivity where
  orders_considered : List ExecutionOrder
  max_spread        : CTerm                     -- ‖result(o₁) − result(o₂)‖ ≤ max_spread
  max_spread_proof  : ∀ o₁ o₂ ∈ orders_considered,
                      ‖run_with_order o₁ − run_with_order o₂‖ ≤ max_spread
```

A scheme that touches parallel reductions without declaring order
sensitivity is **not first-class** — it violates naturality silently
by appearing order-invariant when it isn't.  The registry rejects it
on type-check.

---

## 10. Minimum Contract for First-Class Status

A scheme is **first-class** in the registry iff it declares:

1. `source : CType`
2. `target : CType`
3. `body : CTerm` with `HasType [] body (.pi source target)`
4. `nat_witness : NaturalityWitness body source target`
5. `error_bound : ErrorContract source target`
6. `cost_estimate : CostContract`
7. `precondition : CTerm → Prop` (may be `fun _ => True`)
8. `order_sensitivity : Option OrderSensitivity`
   (required non-none iff the scheme touches parallel reductions)

A scheme missing **any** of 1–7 is **second-class** — usable but not
transport-stable and not composable in the registry's planner.

Any `impl_extra_*` / realization-specific information lives on
`Realization`, not `Scheme`.  A scheme with a `PrecisionRegime` field
is **wrongly constructed** — it has leaked context into content.

---

## 11. Registry and Planner

**Registry shape:**

```lean
structure SchemeRegistry where
  schemes : HashMap (CType × CType) (List Scheme)
  -- plus: for each scheme, its declared contracts
```

**Graph view:** nodes = `(CType, property-proof bundle)`; edges =
schemes with contracts.  A missing mapping from `A` to `B` is a
disconnected subgraph — the planner will report it structurally.

**Planner API:**

```lean
inductive PlannerResult where
  | success (path : List Scheme) (total_cost : Nat) (total_error : CTerm)
  | gap (from_ : CType) (to : CType) (best_partial : List Scheme)
        (missing_edge_description : String)
```

The planner **never** returns `none` or `throw`.  Either it finds a
path satisfying the (cost, error) budget, or it returns a structured
gap report naming the specific unreachable node and the last
reachable node.  Silent degradation is a construction fault; the
planner's type signature forbids it.

**Dual objective:** minimise-cost-subject-to-error and minimise-
error-subject-to-cost are the primal and dual of the same Lagrangian
search.  The planner takes a composite objective `cost + λ · error`
and the caller picks `λ` from their budget; the same code path serves
both modes.

---

## 12. Construction Faults to Avoid

The anti-patterns that the principle prevents — explicit so reviewers
can name them:

### F1. Context leak into content

A scheme whose type mentions `fp16`, `CUDA`, `SIMD`, `row-major`, or
any execution-context term.  *Fix:* move it to `Realization`.

### F2. Implicit precondition

A scheme that works only for positive / non-zero / regular / convex /
strictly-feasible inputs without saying so in its type or
`precondition` field.  *Fix:* state it.

### F3. Error bound in wrong units

`error ≤ 5 ULPs` instead of `error ≤ K · ‖x‖`.  *Fix:* express in the
target's structural norm; the implementation's rounding error is
`impl_extra_error`.

### F4. Pseudo-natural scheme

A scheme that happens to be definable generically but whose
`NaturalityWitness` was never constructed because "it's obviously
natural".  Obviousness is not a witness; write the path.  *Fix:* if
natural, write the proof; if not, mark representation-specific.

### F5. Undeclared order-dependence

A scheme that sums in parallel without declaring `OrderSensitivity`.
*Fix:* declare it or use a deterministic-order variant.

### F6. Wall-clock cost

Cost contract in seconds instead of abstract cost units.  *Fix:*
abstract units in the scheme, wall-clock in the backend's calibration
axiom.

### F7. Silent planner degradation

Planner returning a "best effort" path when the budget isn't met,
without reporting the gap.  *Fix:* return a `PlannerResult.gap`;
making the caller choose to proceed.

### F8. Representation pretending to be math

A matrix type whose `CType` is really `ArrayFP32` — the layout has
crept into the structural level.  *Fix:* factor as `(Matrix_α : univ)
× (rep : Rep Matrix_α)`, where `rep` is a `Realization`-level choice.

### F9. Realization's `impl_extra_error` treated as scheme's bound

Using the realization's extra error budget as if it were the scheme's
declared bound.  *Fix:* separate them; reports must distinguish.

### F10. Missing interchangeability proof

Two realizations that aren't provably interchangeable — they may not
in fact compute the same function.  *Fix:* write the
`impl_extra_proof` for each; interchangeability is a theorem by
triangle inequality, and you can't skip the hypothesis.

---

## 13. Where This Sits in Topolei's Phase Roadmap

**Current state (2026-04-22):** Phase 1 Cubical Core closed.  The
machinery required to build the numerical layer — `CType`, `CTerm`,
`HasType`, `uaLine`, `EquivData`, transport axioms — is in place.

**This document is the specification** for the numerical layer that
will sit in `Cell/Numerical.lean` (and satellites: `Cell/Contract.lean`,
`Cell/Registry.lean`, `Cell/Planner.lean`).  None of those modules
exist yet; when written, they must satisfy §10's minimum contract and
avoid §12's construction faults.

**Blocking dependencies for first-class scheme support:**

1. **Σ-types on `CType`** (STATUS.md priority #5) — enables
   `(Matrix_α : univ) × (rep : Rep α)` factorization for F8, and the
   `(B : univ) × (norm : B → B → Float)` factorization for §5.1.
2. **Dependent Π on `HasType`** — enables preconditions as native
   types (F2) rather than external `Prop` fields.
3. **Glue-transport rule** — discharges `transp_ua` from axiom to
   theorem, which is needed for §7 naturality proofs to be
   computationally meaningful rather than axiom-level.

None of these require the Rust FFI.  All three are pure-Lean
extensions of the current axiom base.

**Rust FFI dependencies** (for execution, not formalization):

1. **Cubical evaluator backend** — lets the planner actually run the
   schemes on real inputs.
2. **Backend-specific realizations** — SPIR-V / WGSL / CPU-ref
   variants with calibrated wall-clock costs.

---

## 14. Checklist for a New Numerical Scheme

Before merging a scheme into the registry, verify:

- [ ] `source`, `target`, `body`, `body_ty` all typed without
  reference to any execution-context axis (§1).
- [ ] `nat_witness` provided, or scheme explicitly tagged
  representation-specific (§7).
- [ ] `error_bound` in terms of target's structural norm, not
  implementation units (§5.1, F3).
- [ ] `cost_estimate` in abstract cost units, not wall-clock
  (§5.2, F6).
- [ ] `precondition` explicit for any non-total input (§8, F2).
- [ ] `order_sensitivity` declared if reductions / parallelism
  involved (§9, F5).
- [ ] No field of the scheme mentions precision, backend, layout,
  parallelism, or hardware (§1, F1, F8).
- [ ] At least one `Realization` committed with `impl_extra_proof`
  (§6, F10).
- [ ] If multiple realizations, interchangeability theorem derived
  (§6).
- [ ] Planner integration: scheme appears as an edge in the
  registry graph (§11); if a planner search fails involving this
  scheme, the failure is a `PlannerResult.gap` (§11, F7).

If any checkbox is unchecked, the scheme is second-class.  A second-
class scheme may still be merged, but is not transport-stable and
will not be used automatically by the planner.

---

*This document is a principle, not a suggestion.  Any future
numerical work that deviates from it deviates deliberately, with a
written argument, recorded as an exception in STATUS.md.*