Motivation: consider the following example
```
let m := if b then m1 else m2,
state.bind m (fun p, BIG p.1 p.2)
```
We first eta-expand the term, put it in LCNF, and inline `state.bind`:
```
fun s,
let m := decidable.cases_on b (fun h, m1) (fun h, m2) in
let x_1 := m s in
prod.cases_on x_1 (fun a s', BIG a s')
```
Then, we apply `*-of-cases` at `m := ...` and its continuation, but we
need a join point since the continuation is big. We get
```
fun s,
let j_1 := fun m,
let x_1 := m s in
prod.cases_on x_1 (fun a s', BIG a s') in
decidable.cases_on b
(fun h, let y := m1 in j_1 y)
(fun h, let y := m2 in j_1 y)
```
This code is bad when `m1` and `m2` are functions: at runtime, we
need to create a closure and pass it to the join point.
If we apply `app-of-cases` before the other floating `cases` variants,
we avoid this problem.
Here is the sequence of transformations if we apply `app-of-cases`
eagerly. We get
```
fun s,
let m := decidable.cases_on b (fun h, m1) (fun h, m2) in
let x_1 := m s in
prod.cases_on x_1 (fun a s', BIG a s')
```
as before. Then, we apply `app-of-cases` at `m s`, and get
```
fun s,
let x_1 := decidable.cases_on b (fun h, m1 s) (fun h, m2 s) in
prod.cases_on x_1 (fun a s', BIG a s')
```
Then, we apply `cases-of-cases`, and again we create a join point:
```
fun s,
let j_1 := fun x_1, prod.cases_on x_1 (fun a s', BIG a s') in
decidable.cases_on b
(fun h, let y := m1 s in j_1 y)
(fun h, let y := m2 s in j_1 y)
```
However, this time we are passing a value to `j_1` instead of a closure.
`app-of-cases` has two benefits:
1- It never creates new join points, since applications are always small in LCNF.
2- It may reduce a `cases` that returns a closure into a `cases` that
returns a value.
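A minimal sketch of the equation behind `app-of-cases`, in the Lean 3
style used above. It assumes a `bool` scrutinee and uses `cond` as a
stand-in for the `cases` on `b`; the names `σ` and `α` are illustrative
and do not come from the compiler.
```lean
-- Assumption: `cond b x y` plays the role of the `cases` on `b`.
-- `σ` (state type) and `α` (result type) are hypothetical names.
-- Left-hand side: the `cases` returns a function, which is then applied.
-- Right-hand side: the application is pushed into each branch,
-- so the `cases` returns a value.
example {σ α : Type} (b : bool) (m1 m2 : σ → α) (s : σ) :
  (cond b m1 m2) s = cond b (m1 s) (m2 s) :=
by cases b; refl
```
Read left to right, this is the rewrite applied at `m s`: only the
application moves, so no continuation is duplicated and no join point
is needed.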
@kha This one was crazy: the compiler created a join point for the
continuation of the match-expression. Each case of the match was
invoking the join point with a different parser. Two of the branches
were the partially applied `reader_t.bind` and `reader_t.orelse`.
This change did not improve the performance much, but it makes sure we
don't waste time trying to figure out why we have these two partial
applications in the call graph.
@kha This modification saved 150k object allocations on my machine.
BTW, the function
```
def command_parser.run (commands : list command_parser) (p : command_parser)
: parser_t command_parser_config id syntax :=
λ cfg, (p.run cfg).run_parsec $ λ _, any_of $ commands.map (λ p, p.run cfg)
```
is also affected by the problem I described on Zulip. It is another
example where eager eta-expansion is bad. Every time we call it, we
create approx. 20 closures and 20 cons memory cells. We have at
least 600 commands in core.lean, so the `map` nested there alone
generates about 24k memory allocations (600 calls × 40 allocations
each). Moreover, the problem will get worse as we add more commands.
Both `str` and `raw_str` are used with string literals. This commit
makes sure we don't need to recompute the nested term
`dlist.singleton (repr s)`. This modification saves 0.2 seconds when
parsing `core.lean` on my MacBook.
cc @kha