Design notes for Lean4
----------------------

# Goals

- Move more code from C++ to Lean.
- New compiler and C++ code generator.
- New runtime (support for unboxed values and FFI).
- New parser and macro expander (in Lean).
- New monad for accessing primitives that are only available in C++ (e.g., `type_context`).
- Fix critical issues (e.g., issue #1601).
- Fix language design issues.
- Reduce clutter in the core lib and code base.

# Plan

- Create Lean4 branch
- Disable most tests (they will be incrementally added back as we make progress on Lean4).
- Dramatically reduce the size of core lib. We should only keep the basics that
are needed to execute Lean programs. Remove most theorems and lemmas,
algebraic hierarchy, all non basic tactics, etc. Motivations: reduce clutter,
and increase agility. We will copy `library` into `old_library` and incrementally
rebuild core lib.
- Remove dead C++ code, and disable all but the most basic tactics.
- Add abstraction layers to isolate modules. Example: module manager should not depend on the parser;
  equation compiler should not depend on the elaborator.
- Make Lean object thread safe (see Memory management section for performance issues),
and remove related clutter. Example: we will not need ts_vm_obj anymore.
Remove unnecessary closure objects that were added for the previous C++ code generator
that was discontinued.
- Split tactic state into backtrackable and non-backtrackable parts using the
new monad transformers. The plan is to have the following monads for meta programming:
  a) `elab`: it is morally state_t and except_t where the state is `elab_state`.
     `elab_state` is defined in C++, and contains: `environment`, cache data structures,
     `name_generator`, `options` and `metavar_context`. Only `metavar_context` is backtrackable, and
     we define `<|>` as
     ```
     meta def elab.orelse {α : Type} (e₁ : elab α) (e₂ : elab α) : elab α :=
     ⟨λ s, let mctx := s.get_mvar_ctx in
        match e₁.run s with
        | elab_result.exception α ex s' := e₂.run (s'.set_mvar_ctx mctx)
        | r                             := r
        end⟩
     ```
     Note that, `elab_state` is used linearly in the definition above, and we just save
     `metavar_context`.

  b) `tactic`: it built on top of `elab` using `state_t`. In the new state we store
      the list of goals (i.e., metavariables) and the main metavariable.

  We will implement (most of) the C++ primitive tactics in the `elab` monad.
  One of the motivations for the approach above is that C++ and Lean code use `elab` in very similar ways.
  For example, the `type_context` object in C++ just gets a reference for the `elab_state` object.

  Remark: `elab` monad will be defined in Lean, but we will mark it private. All tactics that
  have access to the private definition will use `elab_state` linearly. The idea is to make sure
  we can use destructive updates when implementing C++ primitive tactics. This is a big
  advantage with respect to the approach used in Lean3 where we keep creating (and deleting) `tactic_state` objects
  all over the place.

  We have considered splitting `elab` into two monads. The first one `elab_core` would have a state
  without `metavar_context`. `elab_core` would morally be `except_t ex (state elab_core_state)`. And then,
  `elab` would be defined as `state_t metavar_context elab_core`. So, `metavar_context` would be backtrackable
  for free, and we would not need to define a custom `orelse`. This change is fine on the Lean side, but it would
  be messy when programming in C++ because we would have to continue to propagate the `metavar_context` manually as
  we do in Lean3. Recall that a few bugs in Lean3 are due to incorrect propagation of the `metavar_context`. These bugs
  will all disappear with the new design.
  We also have additional performance overhead with the `elab`+`elab_core` approach.
  For example, we want to support the following API:
  ```
  mk_type_context : elab type_context
  infer_type : type_context -> expr -> elab expr
  whnf : type_context -> expr -> elab expr
  ...
  ```
  If we use `elab`+`elab_core` approach, the `infer_type` primitive would have to be implemented as:
  ```
  vm_obj infer_type(vm_obj const & ctx, vm_obj const & e, vm_obj const & mctx, vm_obj const & s) {
  try {
    ctx.set_mctx(mctx); // copy current metavar_context to ctx
    auto r = ctx.infer(e);
    return mk_success_result(r, ctx.mctx(), s); // build result using updated metavar_context
  } catch (exception & ex) {
    ....
  } }
  ```
  In the `elab` only approach, we save one argument `mctx`, and we don't have to copy the `metavar_context`.

- Implement support objects in Lean: options, format, structure trace messages, syntax object, etc.
- Add parser infrastructure in Lean.
- Compiler and C++ code generator. The C++ code generator will avoid many bootstrapping
problems we have. The idea is to write several Lean modules in Lean, emit C++ code, save
the generated C++ code in our repo.
- New IR with support for non uniform memory layout for Lean objects (see details on the #backend
Slack channel).
- Develop a tool in Lean that given a Lean inductive datatype (or structure) generates C++
code for retrieving fields and creating Lean objects. The goal is to isolate primitives implemented
in C++ from the way we represent Lean objects in memory. For example, most Lean functions implemented
in C++ use the C++ function `cfield` which assumes objects have a uniform memory layout.
- Develop a tool for generating glue code for interfacing Lean and C++ code. Again, the goal
is to isolate the primitives from the way we handle boxed/unboxed values.
For example, suppose we have a builtin function `foo` that takes two Lean `bool` values.
Right now, this function takes boxed values `c_foo(vm_obj const & a, vm_obj const & b)`.
It feels weird to have to box a Lean value to be able to invoke the builtin implementation for `foo`.
After we have the tool, we would write `c_foo(bool a, bool b)` and describe its signature in a Lean file.
The tool then generates the wrappers for invoking `c_foo` from the interpreter and generated C++ code.

# Language and library issues

- `private` declarations are not reliable. Users can easily subvert them
using meta programming. This is problematic for several optimizations
we want to use. For example, suppose we define a state-like monad
where every primitive uses the state linearly. The code
generator cannot rely on that since users can currently access
the internal implementation, and use the state in a non linear way.
This is just an example. We have many more.

- `parameter`s are currently simulated in Lean. For example,
when we declare `foo` in a section with a parameter `A`,
`foo` is automatically abstracted and an alias `foo => @foo A` is created.
This creates many problems, most of them are documented in the issue tracker.
This approach has one advantage: users can use the abstracted (`_root_.foo`)
and non-abstracted (`foo`) version simultaneously. Another advantage is that
we don't have to type check `foo` more than once in the kernel.
That being said, the disadvantages far outweigh the advantages.
We plan to go back to the approach used in Lean1 and Coq.

- Coercion resolution (see issue #1402).

- Name resolution for `[ ... ] tactic blocks.
The `[ ... ] notation allows us to use interactive tactic notation
when writing reusable tactics. This is very convenient, but the current
implementation uses dynamic name resolution, and is a source of many
bugs.

- `if-then-else` using `bool` instead of `Prop`.
As soon as we started programming with Lean (version 3), it became clear
that `if-then-else` with `Prop` creates more problems that it solves.
The elaborator already has support for a coercion from `Prop` to `bool`
(for decidable propositions). The dependent `if H : p then t else e`
may look cute, but it is unnecessary now that we have `match`.

- `decidable` type class. A recurrent problem in Lean occurs when
users perform dependent elimination on `decidable` instances.
The problem occurs when we have `[h : decidable p]` in the context
and a goal `G[@f p h]`, that is, a goal containing the term `@f p h`.
Then, we perform `cases h`, and obtain `G[@f p (dedicable.is_true h')]` in
one branch and `G[@f p (decidable.is_false h)]` in another. Then, we apply a
lemma that gives us `G[@f p h]` where `h` is the synthesized instance,
but we cannot use it to close the goal because we get a type error.
We have discussed this problem with Tahina and he pointed out that
we should never perform dependent elimination on type class instances in proofs,
and that this is an anti-idiom. He told us that every type class is a
structure in Coq. That is, everything is wrapped in a structure.
In Coq, they would use a custom eliminator for performing case analysis
on decidable instances. They don't face this problem because the custom eliminator
is more convenient to use them manually destructing the wrapper structure,
and then the actual data. He also strongly suggested that we
should decouple the program that computes whether a proposition is true or false
from the proof the result is correct. The Lean type class combines both in one single definition.
He said this will be a problem in the future for users that want to compute in the kernel.
 The kernel computations will have to deal with these proof terms. In Coq, `decidable` is now defined as:
```
Class Decidable (P : Prop) := {
  Decidable_witness : bool;
  Decidable_spec : Decidable_witness = true <-> P
}.
```
He strongly recommended we define `decidable` using this approach.
He said this is not the original definition used in Coq. The first one was a structure wrapping a
sum type (which is closer to our definition), and the Coq developers had to change it
because of performance problems in proofs by reflection.

- Interactive tactics. In Lean4 we will have a much more extensible and flexible
parser, and it will be written in Lean. We will be able to write a custom parser for the tactic
interactive mode. So, the current argument-type driven parser we use will not
be needed anymore.

- Quotations. Do we need all of them? I find the one with three backticks very inconvenient to use.
Moreover, as soon as we implement the new parser, we will want quotations for building
the new syntactic object and Lean expressions.

# Compiler

- The new compiler will use a System-F like intermediate representation.
It will be similar to Haskell core language. Inductive datatypes will be represented
using a constant for each constructor and a `cases` eliminator. If `cases` is encoded
using a expr-macro, we can easily support `default/other` case.

- Code inlining will occur at the System-F level after we have applied
simplifications.  This is relevant for the performance issues we have
observed when a long chain of functions need to be unfolded (e.g., new
monad transformer library).

- The first compilation step applies compiler specific simplification rules provided by users.
For example, we will be able to mark `map g (map f l) = map (g o f) l` as an optimization
rule for the compiler.

Issue: many opportunities for applying simplification rules only appear after we have inlined
definitions. However, we want to inline after we have converted into System-F and have
erased computationally irrelevant code and applied basic simplifications (e.g., erase trivial structures).
This is a problem since user provided simplification rules are not applicable here since they have
been described at the Lean level.

- Basic types (scalars, bool, char, uint32, uint64, int64, int32, ...) and C++ types
can be stored in unboxed form. The unboxed version are prefixed with `#` as they do
in Haskell.

- When we convert a Lean function to System-F, we will generate two versions: boxed and unboxed.
The boxed version is needed when passing this function to polymorphic higher-order functions.
As in Haskell, polymorphic functions always take boxed values.
Both versions are stored in the Lean environment as `meta` functions.
Example: the function `def inc (a : int32) := a + 1` is converted into two versions
`meta def _SystemF.boxed.inc (a : int32) := a + 1`, and `meta def _SystemF.unboxed.inc (a : #int32) := a #+ #1`.
Now, suppose we want to compile `twice inc a` where `def twice {A : Type} (f : A -> A) (a : A) := f (f a)`.
Then, since `twice` is polymorphic, we need to pass `inc` boxed version, and we generate
`@_SystemF.boxed.twice int32 _SystemF.boxed.inc a`.

- We want to implement monomorphisation as an additional optimization step. The idea is specialize functions like `twice`.
In the previous example, monomorphisation would generate `@_SystemF.unboxed.twice_int32 (f : #int32 -> #int32) (a : #int32)`.
We are considering caching monomorphised functions into the .olean files. If we do this, we have to consider the situation
where more than one .olean contains the same monomorphised function. We see two options: we have a canonical way to generate
names for monomorphised functions; we generate unique names, and accept the fact the environment will contain duplicates.
It is just a space issue.
Remark: see comment `closure` bullet point below. We may not use unboxed values in closures.

- In Lean3, `name`, `level` and `expr` are all implemented in C++. To expose these objects in Lean, we have to wrap them
using a subclass of `vm_external`. This generates a significant performance overhead. For example, suppose we have a lean
function that traverses a big expression; for each visited expression `e`, we need to create a `vm_expr` object for wrapping `e`.
Moreover, we have two layers of reference counting: one at `expr` and another at `vm_obj`.
This design decision is fine in Lean3 because the most expensive tasks are implemented in C++, and Lean code is only
used to "glue" together existing procedures implemented in C++. This is not the case in Lean4.
So, in Lean4, all these objects will be implemented directly in Lean.
As described above, we will have a tool that will generate C++ functions for creating and accessing these objects.
We believe this will not affect much how we code in C++, and it will eliminate a lot of boilerplate code we currently use.

# Object memory layout

- Constructor objects: they will contain pointers and unboxed data. We will use all-pointers first approach, and the header will contain 8 bytes:
  a) reference counter: 4 bytes
  b) kind: 1 byte
  c) tag (aka constructor index): 1 byte
  d) size (aka number of pointer objects): 2 bytes
  After the header we have `size` pointers and then all unboxed data. In debug mode, we want to store the size of the space used for
  unboxed data, and use this information for implementing sanity checks.
  Remark: the header of all composite Lean objects start with the reference counter and `kind`.
  Note that this representation supports only inductive datatypes with at most 256 constructors and 2^16 pointer fields.
  This is sufficient for our needs.
  If one day we want to support inductive datatypes with more than 256 constructors and/or 2^16 pointer fields, we can add
  a new kind of constructor object, and add new opcodes for manipulating them. We say this new kind is a "fat" constructor object
  since its header is bigger.

- Array: we support arrays of pointers and arrays of unboxed data. We will have opcodes for reading and updating arrays.
The updates are destructive when the array reference counter is 1.

- Closures: we need two kinds of closures: one that stores the bytecode id (for interpreted code); and another that stores a function pointer (for compiled code).
We need to decide whether we will support closures that store unboxed data or not. The simple solution is to support only boxed data
in closures. This may not be a huge performance overhead since many higher order functions such as `map` and `fold` will
be specialized during the monomorphisaton step, and we will not even create the closures.
A possible compromise is to support unboxed data only for closures that store function pointers. The idea is the following, whenever we emit C++ code for a function,
we also generate a `run` function for it s.t. given the closure data, it invokes the function. Then, we store the pointer to the `run` function in the closure.
For example, suppose we have generated
```
vm_obj foo(bool b1, vm_obj o, bool b2) { ... }
```
Then, we would also generate
```
vm_obj run_foo(closure_data d) {
  return foo(get_closure_bool(8, d), get_closure_obj(0, d), get_closure_bool(9, d));
}
```
`get_closure_bool(offset, d)` retrieves the Boolean stored at the given offset in `d`.
Note that, as in constructor objects, unboxed data is stored after pointers at `closure_data`.
So, we use `get_closure_bool(8, d)` to retrieve `b` which is stored after `o` (the pointer to `o` consumes 8 bytes) in `d`.
It is not clear this approach is a good one since we would always need to create a `closure_data` object that contains all arguments
before executing a closure. It is not clear how to support the optimization we use in Lean3 that avoids the allocation of the last `closure_data`
in most cases (https://github.com/leanprover/lean/blob/master/src/library/vm/vm.cpp#L1698).

- MPZ (multiprecision integers)

- String. Remark: internally it is not an array of char, but an array of bytes encoded in UTF8. The key difference with respect to Lean3 is that it will
not use the `vm_external` wrapper approach, but have an object kind for strings.

# IR

- Register based.
- Explicit reference counting instructions. We use reference counting for: composite objects, closures and
  C++ unboxed data (e.g., `expr`). Remark: for each C++ primitive the VM needs to know how to increase/decrease the reference counter.
- Support for unboxed values.
- Instructions for accessing unboxed values in non-uniform structures.
  We will have instructions for operations such as `get_scalar_<sz>(obj, offset)`,
  where `obj` is a (potentially non-uniform) Lean object, `offset` is the offset
  inside of this object, and `sz` is the number of bytes needed to store the object.
  In practice, we would have `get_scalar_1`, `get_scalar_2`, `get_scalar_4` and `get_scalar_8`.
  These scalars are unboxed. We would have registers for storing the different kinds of scalars,
  and basic operations on them (e.g., comparison, arithmetic, etc). So, we will
  have instructions such as `GETS_<sz> r_o r_i offset` where `r_o` and `r_i` are registers, and
  it corresponds to `r_o := get_scalar_<sz>(r_i, offset)`. Moreover, `r_o` must be a scalar
  register of size `sz` and `r_i` is a register for storing Lean objects.

Remark: for each constructor datatype, we will have a table that maps fields to the
operation needed to retrieve them. We will use this table when converting the SystemF
representation into the IR.

Open issue: should we use SSA or SIL?
BTW, most of the benefits of SSA/SIL seem to be irrelevant for Lean.
The paper https://arxiv.org/pdf/1507.05762.pdf send by Nuno describes the pros/cons for SSA.
Most of them seem to be related to static analysis procedures. I think we will need very few
static analysis steps, most optimizations will be implemented at System-F before we convert the code into IR.

# VM

- We need a new VM for the new IR.

- The VM should be able to invoke primitives hand written in C++ and
C++ code emitted by the Lean compiler.

- A few hand written C++ primitives and C++ code emitted by the Lean compiler
need to invoke Lean functions. We should be careful to avoid a mismatch here where
a C++ function F for Lean version `X` is trying to invoke bytecode for
a Lean function G for version `X+1`. If we allow this to happen the system
may crash because the data representation for version `X+1` may be
different from version `X`.

We can try to address this issue by breaking core lib into two parts.
The first part (bootstrapping) contains all the infrastructure needed by the parser, compiler,
tactic framework, and Lean runtime. If we make a change here, we should
compile it again using the previously emitted C++ code, and then generate
a new version of the C++ code, compile it, and check whether it works or not.
In principle, it is not safe to invoke bytecode generated during the current compilation
from previously emitted C++ code since they may be using different representations.
Of course, the changes may be harmless, but to avoid problems we should minimize
the number of tactics used in this part of the core lib. Ideally, tactics should
not be used in the bootstrapping part.

We may also emit C++ code for non essential functionality that is implemented in Lean,
and then link it with the Lean executable. Example: a decision procedure, a parser extension.
The idea is to provide a more efficient version to users. Here we can use a more
relaxed approach since this functionality is not part of the compiler. We can store a hash code
for each of these functions. When we import the .olean file that generated the function
we compare whether the hash code there matches the one in the emitted C++ code.
If it does, we use the C++ version, otherwise we use the bytecode.

# Explicit reference counting

As described above, we will have opcodes for increasing/decreasing `vm_obj`s reference counters.
In Lean3, we perform too many unnecessary increments/decrements.
A function is responsible for decreasing the reference counter of each argument (i.e., it consumes the argument), and
for increasing the reference counter of the result. Each argument should be viewed as a resource that is consumed by the function.
Many optimizations are possible. For example, consider the function
```
def proj2 (a : A) (b : B) := b
```
A naive compilation into bytecode would produce:
```
def proj2 a b :=
  r := b;
  inc r;
  dec a;
  dec b;
  return r
```
This can be optimized as
```
def proj2 a b :=
  dec a;
  return b
```
We need to explicitly decrement `a` because it was not use by `proj2`.
The function
```
def ex1 (x : X) (y : Y) :=
  let v1 := f x x,
      v2 := f x y
  in h v1 v2
```
is compiled as
```
def ex1 x y :=
  inc x;
  inc x;
  v1 := f x x;
  v2 := f x y;
  r  := h v1 v2;
  return r
```
The reference counter for `x` is incremented twice because it is used 3 times in this function.
Note that, we don't have to decrement `v1` nor `v2` since they are consumed by `h`.
We can an instruction `inc2 x` for executing `inc x; inc x` in a single step.
Now consider a map-like function
```
def my_map : list A -> list A
| []     := []
| (h::t) := g h :: my_map t
```
This function will be compiled as
```
def my_map t :=
switch (cidx t) {
case 0:
  return t;
case 1:
  h1 := get#0 t;
  t1 := get#1 t;
  dec t;
  h2 := g h1;
  t2 := my_map t2;
  r  := mk_list_cons h2 t2;
  return r
}
```
The instruction `get#<idx> t` returns the field `idx` of the object `t`. It bumps the reference counter of the resulting object, but does not update the reference counter of `t`.
`mk_list_cons h2 t2` is a function that creates a constructor object with tag `#1` and two pointers `h2` and `t2`. We don't need to decrement
the reference counters of `h2` and `t2` since they were "consumed" by `mk_list_cons`. We can inline `mk_list_cons` too.
We can also add a special `dec` instruction for reusing memory cells, and write `my_map` as:
```
def my_map t :=
switch (cidx t) {
case 0:
  return t;
case 1:
  h1 := get#0 t;
  t1 := get#1 t;
  t_cell := dec_core t;
  h2 := g h1;
  t2 := my_map t2;
  r  := mk_list_cons_reusing h2 t2 t_cell;
  return r
}
```
The `dec_core t` instruction decrements the reference counter of `t`, but does not delete the memory cell if it is zero.
Then, `mk_list_cons_reusing` will reuse `t`'s memory cell if the counter is 0, and will create a new cell otherwise.
With this trick, `my_map` will not allocate any constructor object if the input list is not shared.
To avoid memory leaks, we have to make sure that in each path after `dec_core t`, `t_cell` is used in a `mk_*_reusing` function
and/or we explicitly delete it using `del t_cell`.
When the reference counter of `t` is one in `case 1`, the `get#0 t` and `get#1 t` unnecessarily increase the reference counter of the result value just to decrease it again at `dec_core t`. We can avoid this overhead by using the following alternative formulation
```
def my_map t :=
switch (cidx t) {
case 0:
  return t;
case 1:
  if (ref_count t == 1) {
    h1     := steal#0 t;
    t1     := steal#1 t;
    t_cell := t;
  } else {
    h1     := get#0 t;
    t1     := get#1 t;
    dec t;
    t_cell := 0;
  }
  h2 := g h1;
  t2 := my_map t2;
  r  := mk_constructor_reusing #1 2 t_cell;
  set#0 r h2;
  set#1 r t2;
  return r
}
```
The instruction `steal#<idx> t` is similar to `get#<idx> t`, but it does not increase the reference counter of the resulting object.
`mk_constructor_reusing #1 2 t_cell` creates a constructor object with tag `#1` and size 2 and reusing `t_cell` if different from 0. The instructions `set#<idx>` are used to initialize the resulting fields.
The optimization above can be used whenever `t` is dead after the `case`, and an object of same size is created.
Note that if `t` is a list of arrays and it is not shared, then `g h1` will also be able to perform destructive updates.
Remark: suppose the `i`-th field in `case` branch is not used, then instead of using `steal#i t`, we use `dec#i t` to decrement the reference counter of the `i`-th field.

TODO: create experiments to check whether the optimization above is relevant or not.

# Tail recursion

TODO

# Unboxed products and sums

TODO

# Memory management

Lean3 VM objects are not thread safe: they do not use atomic
operations for updating the reference counter, and we use a small
object memory allocator. The main motivation was performance.
We evaluated these design decisions again, and did not observe any
performance impact when we used atomic operations for updating
the reference counter and removed the small object allocator.
The experiments were conducted using OSX and Linux.
We considered two benchmarks: core lib compilation, and a small Lean
program attached in the end of this section.
In both platforms and benchmarks no significant difference was observed.
Then, we disable the `memory_pool` object, and again no difference
in performance was observed.
We believe the memory allocators in the C++ runtime have been improved.
This is consistent with our observation that building Lean with `tcmalloc`
does not improve the performance significantly anymore.
However, it is not clear why using std::atomic does not impact performance
anymore.

```
def foo (n : nat) : nat :=
(((list.iota n).map (+10)).map (+30)).length

#eval nat.repeat (λ i _, foo i) 4000 0
```