…rators
Right now those constructors result in a copy instead of the desired
move. We've measured that expr copying and assignment by itself uses
around 10% of total runtime on our workloads.
See #4698 for details.
This ports the `.below` and `.brecOn` constructions to lean.
I kept them in the same file, as they were in the C code, because they
are
highly coupled and the constructions are very analogous.
For validation I developed this in a separate repository at
https://github.com/nomeata/lean-constructions/tree/fad715e
and checked that all declarations found in Lean and Mathlib are
equivalent, up to
def canon (e : Expr) : CoreM Expr := do
Core.transform (← Core.betaReduce e) (pre := fun
| .const n ls => return .done (.const n (ls.map (·.normalize)))
| .sort l => return .done (.sort l.normalize)
| _ => return .continue)
It was not feasible to make them completely equal, because the kernel's
type inference code seem to optimize level expressions a bit less
aggressively, and beta-reduces less in inference.
The private helper functions about `PProd` can later move into their own
file, used by these constructions as well as the structural recursion
module.
this is the simplest of the constructions to be ported from C++ to Lean,
so I’ll PR this one first.
This begins to put each construction into its own file, as it was the
case with C++.
For validation I developed this in a separate repository at
https://github.com/nomeata/lean-constructions/tree/fad715e
and checked that all `.recOn` declarations found in Lean and Mathlib are
identical (per `==`) to the ones produced by the C code.
this is a first step towards porting the code `constructions.cpp` to
Lean: It leaves the construction of the `Declaration` untouched, but
moves adding the declarations to the environment, and setting various
attributes, to the Lean world.
This allows the remaining logic (the construction of the `Declaration`)
to be implemented in Lean separately and easily compared to the C++
implementation, before we replace that too.
To that end, `Declaraion` gains an `BEq` instance.
---------
Co-authored-by: Leonardo de Moura <leomoura@amazon.com>
Co-authored-by: Arthur Adjedj <arthur.adjedj@ens-paris-saclay.fr>
This fixes#2901, a bug in the old compiler which causes a segfault. The
issue is that when visiting `noConfusion` applications, it assumes that
each constructor case has `nfields` arguments, e.g. `head1 = head2 ->
tail1 = tail2 -> P` has two arguments because `List.cons` has 2 fields,
but in fact propositional fields are skipped by the noConfusion type
generator, so for example `Subtype.noConfusionType` is:
```lean
@[reducible] protected def Subtype.noConfusionType.{u_1, u} : {α : Sort u} →
{p : α → Prop} → Sort u_1 → Subtype p → Subtype p → Sort u_1 :=
fun {α} {p} P v1 v2 ↦
Subtype.casesOn v1 fun val property ↦ Subtype.casesOn v2 fun val_1 property ↦
(val = val_1 → P) → P
```
where `val = val_1 → P` only has the one argument even though
`Subtype.mk` has two fields, presumably because it is useless to have an
equality of propositions. Unfortunately there isn't any easy cache or
getter to use here to get the number of non-propositional fields, so we
just calculate it on the spot.
Again co-developed with @bollu.
Based on top of: #3225
While hunting down the performance discrepancy on qsort.lean between C
and LLVM we noticed there was a single, trivially optimizeable, alloca
(LLVM's stack memory allocation instruction) that had load/stores in the
hot code path. We then found:
https://groups.google.com/g/llvm-dev/c/e90HiFcFF7Y.
TLDR: `mem2reg`, the pass responsible for getting rid of allocas if
possible, only triggers on an alloca if it is in the first BB. The
allocas of the current implementation get put right at the location
where they are needed -> they are ignored by mem2reg.
Thus we decided to add functionality that allows us to push all allocas
up into the first BB.
We initially wanted to write `buildPrologueAlloca` in a `withReader`
style so:
1. get the current position of the builder
2. jump to first BB and do the thing
3. revert position to the original
However the LLVM C API does not expose an option to obtain the current
position of an IR builder. Thus we ended up at the current
implementation which resets the builder position to the end of the BB
that the function was called from. This is valid because we never
operate anywhere but the end of the current BB in the LLVM emitter.
The numbers on the qsort benchmark got improved by the change as
expected, however we are not fully there yet:
```
C:
Benchmark 1: ./qsort.lean.out 400
Time (mean ± σ): 2.005 s ± 0.013 s [User: 1.996 s, System: 0.003 s]
Range (min … max): 1.993 s … 2.036 s 10 runs
LLVM before aligning the types
Benchmark 1: ./qsort.lean.out 400
Time (mean ± σ): 2.151 s ± 0.007 s [User: 2.146 s, System: 0.001 s]
Range (min … max): 2.142 s … 2.161 s 10 runs
LLVM after aligning the types
Benchmark 1: ./qsort.lean.out 400
Time (mean ± σ): 2.073 s ± 0.011 s [User: 2.067 s, System: 0.002 s]
Range (min … max): 2.060 s … 2.097 s 10 runs
LLVM after this
Benchmark 1: ./qsort.lean.out 400
Time (mean ± σ): 2.038 s ± 0.009 s [User: 2.032 s, System: 0.001 s]
Range (min … max): 2.027 s … 2.052 s 10 runs
```
Note: If you wish to merge this PR independently from its predecessor,
there is no technical dependency between the two, I'm merely stacking
them so we can see the performance impacts of each more clearly.
Debugged and authored in collaboration with @bollu.
This PR fixes several performance regressions of the LLVM backend
compared to the C backend
as described in #3192. We are now at the point where some benchmarks
from `tests/bench` achieve consistently equal and sometimes ever so
slightly better performance when using LLVM instead of C. However there
are still a few testcases where we are lacking behind ever so slightly.
The PR contains two changes:
1. Using the same types for `lean.h` runtime functions in the LLVM
backend as in `lean.h` it turns out that:
a) LLVM does not throw an error if we declare a function with a
different type than it actually has. This happened on multiple occasions
here, in particular when the function used `unsigned`, as it was
wrongfully assumed to be `size_t` sized.
b) Refuses to inline a function to the call site if such a type mismatch
occurs. This means that we did not inline important functionality such
as `lean_ctor_set` and were thus slowed down compared to the C backend
which did this correctly.
2. While developing this change we noticed that LLVM does treat the
following as invalid: Having a function declared with a certain type but
called with integers of a different type. However this will manifest in
completely nonsensical errors upon optimizing the bitcode file through
`leanc` such as:
```
error: Invalid record (Producer: 'LLVM15.0.7' Reader: 'LLVM 15.0.7')
```
Presumably because the generate .bc file is invalid in the first place.
Thus we added a call to `LLVMVerifyModule` before serializing the module
into a bitcode file. This ended producing the expected type errors from
LLVM an aborting the bitcode file generation as expected.
We manually checked each function in `lean.h` that is mentioned in
`EmitLLVM.lean` to make sure that all of their types align correctly
now.
Quick overview of the fast benchmarks as measured on my machine, 2 runs
of LLVM and 2 runs of C to get a feeling for how far the averages move:
- binarytrees: basically equal performance
- binarytrees.st: basically equal performance
- const_fold: equal if not slightly better for LLVM
- deriv: LLVM has 8% more instructions than C but same wall clock time
- liasolver: basically equal performance
- qsort: LLVM is slower by 7% instructions, 4% time. We have identified
why the generated code is slower (there is a store/load in a hot loop in
LLVM that is not in C) but not figured out why that happens/how to
address it.
- rbmap: LLVM has 3% less instructions and 13% less wall-clock time than
C (woop woop)
- rbmap_1 and rbmap_10 show similar behavior
- rbmap_fbip: LLVM has 2% more instructions but 2% better wall time
- rbmap_library: equal if not slightly better for LLVM
- unionfind: LLVM has 5% more instructions but 4% better wall time
Leaving out benchmarks related to the compiler itself as I was too lazy
to keep recompiling it from scratch until we are on a level with C.
Summing things up, it appears that LLVM has now caught up or surpassed
the C backend in the microbenchmarks for the most part. Next steps from
our side are:
- trying to win the qsort benchmark
- figuring out why/how LLVM runs more instructions for less wall-clock
time. My current guesses would be measurement noise and/or better use of
micro architecture?
- measuring the larger benchmarks as well
These additional options are currently implemented in Std in a function
`Format.prettyExtra` (via `open private`), and used to implement the
`simp?` functionality.
This just adds the options to the core function.
Makes the LLVM triple of the current platform available to Lean code
towards a solution for #2754.
Defaults to the empty string if the compiler is not clang, which can
introduce some divergence between CI and local builds but should not be
noticeable in most cases and is not really possible to avoid.
This is an additional safety net on top of #2749: it protects users that
circumvent the build system (e.g. with `lake env`) as well as obviates
the need for TOCTOU-like race condition checks in the build system.
The check is activated by `CHECK_OLEAN_VERSION=ON`, which now defaults
to `OFF` as the sensible default for local development. When activated,
`USE_GITHASH=ON` is also force-enabled for stage 0 in order to make sure
that stage 1 can load its own core library.
* feat: add --target flag for LLVM backend to build objects of a different architecture
* chore: remove dead comment
* Update src/Lean/Compiler/IR/EmitLLVM.lean
Co-authored-by: Sebastian Ullrich <sebasti@nullri.ch>
* chore: normalize indentation in src/util/shell.cpp
* chore: strip trailing whitespace
Co-authored-by: Sebastian Ullrich <sebasti@nullri.ch>