lean4-htt


Henrik Böving 50d661610d perf: LLVM backend, put all allocas in the first BB to enable mem2reg (#3244)	2024-02-13 14:54:40 +00:00

Again co-developed with @bollu. Based on top of: #3225

While hunting down the performance discrepancy between C and LLVM on qsort.lean, we noticed a single, trivially optimizable alloca (LLVM's stack memory allocation instruction) with loads/stores in the hot code path. We then found: https://groups.google.com/g/llvm-dev/c/e90HiFcFF7Y.

TL;DR: `mem2reg`, the pass responsible for eliminating allocas where possible, only triggers on an alloca if it is in the first BB. The current implementation places allocas right where they are needed, so mem2reg ignores them. We therefore added functionality that pushes all allocas up into the first BB.

We initially wanted to write `buildPrologueAlloca` in a `withReader` style:
1. get the current position of the builder
2. jump to the first BB and emit the alloca
3. restore the original position

However, the LLVM C API does not expose a way to obtain the current position of an IR builder. We therefore ended up with the current implementation, which resets the builder position to the end of the BB the function was called from. This is valid because the LLVM emitter never operates anywhere but the end of the current BB.

As expected, the change improves the qsort benchmark numbers, but we are not fully there yet:

```
C:
Benchmark 1: ./qsort.lean.out 400
  Time (mean ± σ):      2.005 s ±  0.013 s    [User: 1.996 s, System: 0.003 s]
  Range (min … max):    1.993 s …  2.036 s    10 runs

LLVM before aligning the types:
Benchmark 1: ./qsort.lean.out 400
  Time (mean ± σ):      2.151 s ±  0.007 s    [User: 2.146 s, System: 0.001 s]
  Range (min … max):    2.142 s …  2.161 s    10 runs

LLVM after aligning the types:
Benchmark 1: ./qsort.lean.out 400
  Time (mean ± σ):      2.073 s ±  0.011 s    [User: 2.067 s, System: 0.002 s]
  Range (min … max):    2.060 s …  2.097 s    10 runs

LLVM after this change:
Benchmark 1: ./qsort.lean.out 400
  Time (mean ± σ):      2.038 s ±  0.009 s    [User: 2.032 s, System: 0.001 s]
  Range (min … max):    2.027 s …  2.052 s    10 runs
```

Note: If you wish to merge this PR independently of its predecessor, you can; there is no technical dependency between the two. I'm merely stacking them so the performance impact of each can be seen more clearly.
compiler	perf: LLVM backend, put all allocas in the first BB to enable mem2reg (#3244)	2024-02-13 14:54:40 +00:00
constructions	fix : make `mk_no_confusion_type` handle delta-reduction when generating telescope (#2501)	2023-10-14 17:18:37 +11:00
annotation.cpp
annotation.h	chore: fix more typos in comments	2023-10-08 14:37:34 -07:00
aux_recursors.cpp
aux_recursors.h
bin_app.cpp
bin_app.h
class.cpp
class.h
CMakeLists.txt	chore: remove dead code	2022-06-15 12:43:59 +02:00
constants.cpp	fix: `elim_scalar_array_cases`	2022-07-24 14:46:46 -07:00
constants.h	fix: `elim_scalar_array_cases`	2022-07-24 14:46:46 -07:00
constants.txt	fix: `elim_scalar_array_cases`	2022-07-24 14:46:46 -07:00
expr_lt.cpp
expr_lt.h
expr_pair.h
expr_pair_maps.h
expr_unsigned_map.h
formatter.cpp
formatter.h
init_module.cpp
init_module.h
max_sharing.cpp
max_sharing.h
module.cpp	feat: embed and check githash in .olean (#2766)	2023-11-27 10:24:43 +00:00
module.h
num.cpp
num.h
print.cpp	fix: pp projection indices starting at 1	2023-10-15 14:25:00 -07:00
print.h	chore: fix more typos in comments	2023-10-08 14:37:34 -07:00
profiling.cpp	fix: profiler threshold in C++	2023-04-10 16:57:54 +02:00
profiling.h
projection.cpp
projection.h
protected.cpp
protected.h
reducible.cpp
reducible.h
replace_visitor.cpp
replace_visitor.h
suffixes.h
time_task.cpp	feat: show typeclass and tactic names in profile output	2023-03-27 17:47:52 +02:00
time_task.h
trace.cpp	feat: additional options for Format.pretty (#3264)	2024-02-07 23:25:21 +00:00
trace.h	chore: remove remnants of C++ `format`	2022-11-18 06:11:24 -08:00
util.cpp	feat: System.Platform.target (#3207)	2024-01-24 12:11:00 +00:00
util.h