From c81f4605e300d74d4e6a34e63aeada274ee5a541 Mon Sep 17 00:00:00 2001
From: Sebastian Ullrich <sebasti@nullri.ch>
Date: Fri, 29 May 2020 17:49:06 +0200
Subject: [PATCH] doc: Parser.lean

---
 src/Lean/Parser/Parser.lean | 67 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/src/Lean/Parser/Parser.lean b/src/Lean/Parser/Parser.lean
index 2268f28ab6..0d8af7a2f8 100644
--- a/src/Lean/Parser/Parser.lean
+++ b/src/Lean/Parser/Parser.lean
@@ -3,6 +3,56 @@ Copyright (c) 2019 Microsoft Corporation. All rights reserved.
 Released under Apache 2.0 license as described in the file LICENSE.
 Authors: Leonardo de Moura, Sebastian Ullrich
 -/
+
+/-!
+# Basic Lean parser infrastructure
+
+The Lean parser was developed with the following primary goals in mind:
+
+* flexibility: Lean's grammar is complex and includes indentation and other whitespace sensitivity. It should be
+  possible to introduce such custom "tweaks" locally without having to adjust the fundamental parsing approach.
+* extensibility: Lean's grammar can be extended on the fly within a Lean file, and with Lean 4 we want to extend this
+  to cover embedding domain-specific languages that may look nothing like Lean, down to using a separate set of tokens.
+* losslessness: The parser should produce a concrete syntax tree that preserves all whitespace and other "sub-token"
+  information for the use in tooling.
+* performance: The overhead of the parser building blocks, and the overall parser performance on average-complexity
+  input, should be comparable with that of the previous parser hand-written in C++. No fancy optimizations should be
+  necessary for this.
+
+Given these constraints, we decided to implement a combinatoric, non-monadic, lexer-less, memoizing recursive-descent
+parser. Using combinators instead of some more formal and introspectible grammar representation ensures ultimate
+flexibility as well as efficient extensibility: there is (almost) no pre-processing necessary when extending the grammar
+with a new parser. However, because the all results the combinators produce are of the homogeneous `Syntax` type, the
+basic parser type is not actually a monad but a monomorphic linear function `ParserState → ParserState`, avoiding
+constructing and deconstructing countless monadic return values. Instead of explicitly returning syntax objects, parsers
+push (zero or more of) them onto a syntax stack inside the linear state. Chaining parsers via `>>` accumulates their
+output on the stack. Combinators such as `node` then pop off all syntax objects produced during their invocation and
+wrap them in a single `Syntax.node` object that is again pushed on this stack. Instead of calling `node` directly, we
+usually use the macro `parser! p`, which unfolds to `node k p` where the new syntax node kind `k` is the name of the
+declaration being defined.
+
+The lack of a dedicated lexer ensures we can modify and replace the lexical grammar at any point, and simplifies
+detecting and propagating whitespace. The parser still has a concept of "tokens", however, and caches the most recent
+one for performance: when `tokenFn` is called twice at the same position in the input, it will reuse the result of the
+first call. `tokenFn` recognizes some built-in variable-length tokens such as identifiers as well as any fixed token in
+the `ParserContext`'s `TokenTable` (a trie); however, the same cache field and strategy could be reused by custom token
+parsers. Tokens also play a central role in the `prattParser` combinator, which selects a *leading* parser followed by
+zero or more *trailing* parsers based on the current token (via `peekToken`) and its associated precedence; see the
+documentation of `prattParser` for more details. Token precedences are specified via the `symbol` parser, or with
+`symbolNoWs` for tokens that should not be preceded by whitespace. The `Parser` type is extended with additional
+metadata over the mere parsing function to propagate token information: `collectTokens` collects all tokens within a
+parser for registering. `firstTokens` holds information about the "FIRST" token set used to speed up parser selection in
+`prattParser`. If multiple parsers accept the same current token, `prattParser` tries all of them using the backtracking
+`longestMatchFn` combinator. This is the only case where standard parsers might execute arbitrary backtracking. At the
+moment there is no memoization shared by these parallel parsers apart from the first token, though we might change this
+in the future if the need arises.
+
+Finally, error reporting follows the standard combinatoric approach of collecting a single unexpected token/... and zero
+or more expected tokens (see `Error` below). Expected tokens are e.g. set by `symbol` and merged by `<|>`. Combinators
+running multiple parsers should check if an error message is set in the parser state (`hasError`) and act accordingly.
+Error recovery is left to the designer of the specific language; for example, Lean's top-level `parseCommand` loop skips
+tokens until the next command keyword on error.
+-/
 prelude
 import Lean.Data.Trie
 import Lean.Data.Position
@@ -1548,7 +1598,22 @@ partial def trailingLoop (tables : PrattParsingTables) (c : ParserContext) : Par
         trailingLoop s
 
 /--
-  Implements a recursive precedence parser according to Pratt's algorithm.
+
+  Implements a recursive precedence parser according to Pratt's algorithm: select a parser (or more, via
+  `longestMatchFn`) from `leadingTable` based on the current token (falling back to unindexed `leadingParsers` such as
+  application), then chain with parsers from `trailingTable`/`trailingParsers` as long as the precedence of the current
+  token is higher than `rbp` of the parser state.
+
+  As an extension of the original algorithm, we also check the current token's precedence before calling the leading
+  parser(s). We do this so we can define an n-ary application parser:
+  ```
+  @[builtinTermParser] def app := tparser! many1 (namedArgument <|> termParser appPrec)
+  ```
+  If the nested `termParser appPrec` did not check the precedence of the first token, we would accept bogus input such as
+  `f x fun y => y` because we would never check the precedence of `fun`. Note that, in contrast to trailing tokens, a
+  leading token is accepted if its precedence is *at least* `rbp`. This is necessary so that `f x y` is not parsed as
+  `f (x y)`: the nested `termParser` call must accept the leading token `x` but not the trailing token `y`. But both have
+  the same precedence as identifiers, so we must use different comparisons.
 
   `antiquotParser` should be a `mkAntiquot` parser (or always fail) and is tried before all other parsers.
   It should not be added to the regular leading parsers because it would heavily