@kha This one was crazy, the compiler created a join point for the
continuation of the match-expression. Each case of the match was
invoking the join point with a different parser. Two of the branches
were the partially applied `reader_t.bind` and `reader_t.orelse`.
This change did not improve the performance much, but it makes sure we
don't waste time trying to figure out why we have these two partial
applications in the call graph.
@kha This modification saved 150k object allocations on my machine.
BTW, the function
```
def command_parser.run (commands : list command_parser) (p : command_parser)
: parser_t command_parser_config id syntax :=
λ cfg, (p.run cfg).run_parsec $ λ _, any_of $ commands.map (λ p, p.run cfg)
```
is also affected by the problem I described at Zulip. It is another
example where eager eta-expansion is bad. Every time we call it, we
will create approx. 20 closures and 20 cons memory cells. We have at
least 600 commands in core.lean. So, just the `map` nested there will
generate 24k memory allocations. Moreover, the problem will get worse as we add
more commands.
Both `str` amd `raw_str` are used with string literals. This commit
makes sure we don't need to recompute the nested term
`dlist.singleton (repr s)`. This modification saves .2 secs when
parsing `core.lean` on my MacBook.
cc @kha
Now, the nodes in a `rbmap` contain the key and value, and we avoid
one level of indirection. `rbmap`s are more common than `rbtree`.
We implement `rbtree A` as `rbmap A unit`.
A few weeks ago, it was not feasible to inline `bind_mk_res` and
`orelse_mk_res` because the compilation time would increase a lot.
Since then I have improved the heuristics for deciding whether to float
`cases_on` or not.
So, I tried today to mark them with `@[inline]` again.
The corelib build time increased only 1.2 secs, but the `parser1.lean` runtime improved:
Before:
```
num. allocated objects: 18025367
num. allocated closures: 2988476
```
After:
```
num. allocated objects: 15774515
num. allocated closures: 2488695
```
I used my desktop to collect the numbers above.
We want them to be specialized for a given monad stack, but not
inlined. If we inline them, then every occurrence of `whitespace` and
`num` will specialize the nested `take_while?` application.
This is bad since we don't cache them.
Both `alternative` and `monad` implement `applicative`. However,
their default implementations for `seq_right` and `seq_left` are
different. The `alternative` implementation uses the inefficient default
version for `seq_right` available at `applicative`:
```
(seq_right := λ α β a b, const α id <$> a <*> b)
```
instead of the more efficient
```
(seq_right := λ α β x y, x >>= λ _, y)
```
defined at `monad` using the `bind` operator.
This commit makes sure the `applicative` instances for `reader_t`,
`state_t`, `option` and `parsec_t` use the efficient version.
I found the problem when inspecting the generated code for:
```
def symbol (s : string) : parsec' unit :=
(str s *> whitespace) <?> ("'" ++ s ++ "'")
```