@kha
I am using this little program to generate big lean files to test the
new front end. For the output produced for `gen 5000`, the new frontend
is almost 10x slower than the old one.
I used `valgrind --tool=callgrind` to collect profiling data.
The number of closures is too big. For example, `free_closure_obj` was
invoked 38.5 million times. The total number of deallocated objects is around
49.5 million.
@kha I have disabled this check. It was implemented 2 years ago by
Daniel, and I don't want to fix it. It seems you have already fixed a
bug there. AFAICT, this check is just for improving error messages.
I believe we may not even need it since the kernel now supports nested
inductive types. AFAIR, Daniel implemented this check here because the
inductive compiler was introducing a lot of auxiliary declarations
that were making the kernel error messages unreadable.
@kha I tried to make the examples self contained.
- unionfind1.lean uses the modified StateT and ExceptT
- unionfind2.lean uses the standard StateT and ExceptT
@kha The Lean rbmap is 3x slower than the C++ rb_map.
I tried two different implementations: rbmap2 (Appel's Coq
implementation), and rbmap3 (tries to simulate the C++ version).
I believe rbmap3 is broken since it is too slow.
I have also identified missing reset/reuse opportunities.
The actual implementation misses a case the simple code at reuse.txt gets :(
rbmap3 also exposes the "TODO(Leo): improve this" at library/compiler/reduce_arity.cpp
@kha I implemented a more complex example for the paper. The difference
in performance is awesome. On my Linux desktop:
Lean time: 3.34 secs
OCaml time: 11.72 secs
I believe that the difference in performance is due to destructive
updates happening in transformation functions such as `reassoc` and
`const_folding`. I will add a flag to Lean to disable `reuse/reset`
automatic insertion.
BTW, this test requires `ulimit -s unlimited` to avoid stack overflows.
@kha I wrote a simple test in Lean and OCaml. Right now, the numbers on
my machine are
arith_eval.ml 8.13 secs
arith_eval_nat.lean 10.71 secs
OCaml is computing with machine boxed integers, and we are computing
with `nat`. Our version is more expensive since we have to check
whether the number is small or big, and whether the result needs to be a
mpz value or not.
Almost half of our runtime is spent deallocating the big object returned
by `mk_expr`. The deferred free feature does not help here because
we don't deallocate the object in the end but as we execute `eeval`.
So, we perform many small invocations to `del`. None of them take
long, but the overall cost is super high. I can use a different strategy
where `del(o)` just updates the `g_to_free` list, and we deallocate
at most `LEAN_DEFERRED_FREE_QUOTA` at each allocation. The current
deferred free approach would also work if we could use the borrowed
annotations in `eeval`. In this case, we would not delete the input
expression as we evaluate it.
As an experiment, I manually added a `lean::inc` before invoking
`eeval`. The idea was prevent memory deallocation. With this
modification, the program runs in 5.87 secs.
BTW, I also wrote a version using uint32 (arith_eval_uint32.lean),
but the current compiler generates poor code for it.
I know how to fix the performance problem.
@kha I figured out why we had a long pause in the end of this benchmark
when using `11` instead of `9`. The function `deriv` was computing
`d := d "x" f` (the expensive computation), printing the size of `f` and
returning `d`. So, in the last step we were quickly printing the size
of the input 40230090 (when using `nest deriv 11 f`), and then computing
`d := d "x" f` which returns an object of size 374429936 which is never
used for anything.
That is, the pause had nothing to do with memory deallocation. I found
this issue after I implemented the deferred free feature which did not
fix the pause :)
@kha I have added `timeit` for running experiments for the paper.
We have to be careful because `timeit` may produce incorrect results
due to compiler optimizations (e.g., ground term extraction).
Here are examples that do not produce the result we expect:
```
def main : io uint32 :=
timeit "tst" (io.println' ("result " ++ to_string (tst 1000000))) *>
pure 0
```
```
def main (xs : list string) : io uint32 :=
timeit "tst" (io.println' ("result " ++ to_string (tst xs.head.to_nat))) *>
pure 0
```