This PR updates our lexical structure documentation to mention the newly supported `ⱼ`, which lives in a separate Unicode block and is therefore not captured by the existing ranges.
180 lines
6 KiB
Markdown
Lexical Structure
=================

This section describes the detailed lexical structure of the Lean
language.

A Lean program consists of a stream of UTF-8 tokens where each token
is one of the following:

```
token: symbol | command | ident | string | raw_string | char | numeral |
     : decimal | doc_comment | mod_doc_comment | field_notation
```

Tokens can be separated by the whitespace characters space, tab, line
feed, and carriage return, as well as comments. Single-line comments
start with ``--``, whereas multi-line comments are enclosed by ``/-``
and ``-/`` and can be nested.

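As a small illustration of the comment forms described above, the following sketch shows a single-line comment, a trailing comment, and a nested multi-line comment. The definitions `answer` and `next` are hypothetical names used only for this example.

```lean
-- A single-line comment runs to the end of the line.
def answer : Nat := 42  -- a comment may also follow other tokens

/- A multi-line comment
   /- with a nested comment inside -/
   ends only at the outermost closing delimiter. -/
def next : Nat := answer + 1
```
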
Symbols and Commands
====================

.. *(TODO: list built-in symbols and command tokens?)*

Symbols are static tokens that are used in term notations and
commands. They can be either keyword-like (e.g. the `have
<structured_proofs>` keyword) or use arbitrary Unicode characters.

Command tokens are static tokens that prefix any top-level declaration
or action. They are usually keyword-like, with transitory commands
like `#print <instructions>` prefixed by the ``#`` character. The set
of built-in commands is listed in [Other Commands](./other_commands.md).

Users can dynamically extend the sets of both symbols (via the
commands listed in [Quoted Symbols](#quoted-symbols)) and command
tokens (via the `[user_command] <attributes>` attribute).

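The distinction between command tokens and symbols can be sketched as follows. The name `greeting` and the `+++` operator are hypothetical and exist only for this example. It assumes that declaring an infix notation is one of the ways a new symbol token gets registered.

```lean
-- `def` and `#eval` are command tokens; `:`, `:=` and `++` are symbols.
def greeting : String := "Hello, " ++ "Lean"
#eval greeting   -- "Hello, Lean"

-- Hypothetical operator: the notation declaration adds " +++ " as a symbol.
infixl:65 " +++ " => List.append
#eval [1, 2] +++ [3]   -- [1, 2, 3]
```
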
.. _identifiers:

Identifiers
===========

An *atomic identifier*, or *atomic name*, is (roughly) an alphanumeric
string that does not begin with a numeral. A (hierarchical)
*identifier*, or *name*, consists of one or more atomic names
separated by periods.

Parts of atomic names can be escaped by enclosing them in pairs of French double quotes ``«»``.

```lean
def Foo.«bar.baz» := 0  -- name parts ["Foo", "bar.baz"]
```

```
ident: atomic_ident | ident "." atomic_ident
atomic_ident: atomic_ident_start atomic_ident_rest*
atomic_ident_start: letterlike | "_" | escaped_ident_part
letterlike: [a-zA-Z] | greek | coptic | letterlike_symbols
greek: <[α-ωΑ-Ωἀ-῾] except for [λΠΣ]>
coptic: [ϊ-ϻ]
letterlike_symbols: [℀-⅏]
escaped_ident_part: "«" [^«»\r\n\t]* "»"
atomic_ident_rest: atomic_ident_start | [0-9'ⁿ] | subscript
subscript: [₀-₉ₐ-ₜᵢ-ᵪⱼ]
```

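A few identifiers that the grammar above should accept are sketched below. All names are hypothetical. The `xⱼ` line assumes a Lean version that already recognizes `ⱼ` as a subscript character, and older versions are expected to reject it.

```lean
-- ASCII letters followed by a prime.
def x' : Nat := 1
-- Greek letters are letterlike (the grammar excludes λ, Π and Σ).
def α : Nat := 2
-- A subscript digit may follow the first character.
def x₁ : Nat := 3
-- Assumes a version in which ⱼ belongs to the subscript class.
def xⱼ : Nat := 4
-- A hierarchical name whose second part is escaped and contains a period.
def Foo.«bar.baz» : Nat := 0
example : Foo.«bar.baz» = 0 := rfl
```
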
String Literals
===============

String literals are enclosed by double quotes (``"``). They may contain line breaks, which are preserved in the string value. Backslash (`\`) is a special escape character which can be used to insert the following
special characters:
- `\\` represents an escaped backslash, so this escape causes one backslash to be included in the string.
- `\"` puts a double quote in the string.
- `\'` puts an apostrophe in the string.
- `\n` puts a new line character in the string.
- `\t` puts a tab character in the string.
- `\xHH` puts the character represented by the 2 digit hexadecimal into the string. For example,
  "this \x26 that" becomes "this & that". Values above 0x80 will be interpreted according to the
  [Unicode table](https://unicode-table.com/en/), so "\xA9 Copyright 2021" is "© Copyright 2021".
- `\uHHHH` puts the character represented by the 4 digit hexadecimal into the string, so the
  string "\u65e5\u672c" will become "日本", which means "Japan".
- `\` followed by a newline and then any amount of whitespace is a "gap" that is equivalent to the empty string,
  useful for letting a string literal span across multiple lines. Gaps spanning multiple lines can be confusing,
  so the parser raises an error if the trailing whitespace contains any newlines.

So the complete syntax is:

```
string : '"' string_item* '"'
string_item : string_char | char_escape | string_gap
string_char : [^"\\]
char_escape : "\" ("\" | '"' | "'" | "n" | "t" | "x" hex_char{2} | "u" hex_char{4})
hex_char : [0-9a-fA-F]
string_gap : "\" newline whitespace*
```

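The escapes listed above can be checked with a short sketch like the following. It assumes a Lean version that supports string gaps. Each `example` compares an escaped literal with the string it is expected to denote.

```lean
-- Two-digit hexadecimal escape: 0x26 is '&'.
example : "this \x26 that" = "this & that" := rfl
-- Four-digit hexadecimal escapes.
example : "\u65e5\u672c" = "日本" := rfl
-- A gap: the backslash, the newline, and the following whitespace vanish.
example : "one \
           two" = "one two" := rfl
-- An escaped backslash contributes exactly one character.
#eval "a\\b".length   -- 3
```
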
Raw String Literals
===================

Raw string literals are string literals without any escape character processing.
They begin with `r##...#"` (with zero or more `#` characters) and end with `"#...##` (with the same number of `#` characters).
The contents of a raw string literal may contain `"##..#` so long as the number of `#` characters
is less than the number of `#` characters used to begin the raw string literal.

```
raw_string : raw_string_aux(0) | raw_string_aux(1) | raw_string_aux(2) | ...
raw_string_aux(n) : 'r' '#'{n} '"' raw_string_item(n)* '"' '#'{n}
raw_string_item(n) : raw_string_char | raw_string_quote(n)
raw_string_char : [^"]
raw_string_quote(n) : '"' '#'{0..n-1}
```

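To make the delimiter rule concrete, here is a minimal sketch that compares raw string literals with their escaped equivalents. The sample strings are made up for illustration.

```lean
-- No escape processing: backslashes are kept verbatim.
example : r"C:\temp\new" = "C:\\temp\\new" := rfl
-- One `#` lets the contents include bare double quotes.
example : r#"say "hi""# = "say \"hi\"" := rfl
-- With two `#`, the contents may even include `"#`.
example : r##"ends with "# inside"## = "ends with \"# inside" := rfl
```
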
Char Literals
=============

Char literals are enclosed by single quotes (``'``).

```
char : "'" char_item "'"
char_item : char_char | char_escape
char_char : [^'\\]
```

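Because `char_item` reuses `char_escape`, the escapes from string literals should also work inside char literals. The following sketch illustrates this with a few code points.

```lean
#eval 'a'.toNat    -- 97
#eval '\n'.toNat   -- 10
#eval '\''.toNat   -- 39
-- Hexadecimal escapes denote the same characters as the plain literals.
example : '\x41' = 'A' := rfl
example : '\u65e5' = '日' := rfl
```
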
Numeric Literals
================

Numeric literals can be specified in various bases.

```
numeral : numeral10 | numeral2 | numeral8 | numeral16
numeral10 : [0-9]+ ("_"+ [0-9]+)*
numeral2 : "0" [bB] ("_"* [0-1]+)+
numeral8 : "0" [oO] ("_"* [0-7]+)+
numeral16 : "0" [xX] ("_"* hex_char+)+
```

Floating point literals are also possible, with an optional exponent:

```
float : numeral10 "." numeral10? ([eE] [+-]? numeral10)?
```

For example:

```
def w : Int := 55
def x : Nat := 26085
def y : Nat := 0x65E5
def z : Float := 2.548123e-05
def b : Nat := 0b_11_01_10_00
```

Note that negative numbers are created by applying the "-" negation prefix operator to the number, for example:

```
def w : Int := -55
```

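The relationship between the bases can be verified with a sketch such as the one below. It assumes a Lean version that accepts `_` digit separators as described by the grammar above. The last line illustrates that `-55` is the negation operator applied to the literal `55`.

```lean
example : (0x65E5 : Nat) = 26085 := rfl
example : (0b1101_1000 : Nat) = 216 := rfl
example : (0o17 : Nat) = 15 := rfl
example : (1_000_000 : Nat) = 1000000 := rfl
-- `-55` is not a single token: it is `Neg.neg` applied to `55`.
example : (-55 : Int) = Neg.neg 55 := rfl
```
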
Doc Comments
============

A special form of comments, doc comments are used to document modules
and declarations.

```
doc_comment: "/--" ([^-] | "-" [^/])* "-/"
mod_doc_comment: "/-!" ([^-] | "-" [^/])* "-/"
```

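A minimal sketch of both forms follows. The function `double` and the text of the comments are hypothetical.

```lean
/-! # Arithmetic helpers

A module doc comment describes the file (or a part of it) as a whole. -/

/-- `double n` returns twice `n`. This comment is attached to the
declaration that follows it. -/
def double (n : Nat) : Nat := 2 * n
```
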
Field Notation
==============

Trailing field notation tokens are used in expressions such as
``(1+1).toString``. Note that ``a.toString`` is a single
[Identifier](#identifiers), but may be interpreted as a field
notation expression by the parser.

```
field_notation: "." ([0-9]+ | atomic_ident)
```
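
The two alternatives of `field_notation` (a named field and a numeric index), together with the identifier case mentioned above, can be sketched as follows. The variable names are arbitrary.

```lean
-- `.length` after a parenthesized expression is a trailing field-notation token.
example (xs ys : List Nat) : (xs ++ ys).length = List.length (xs ++ ys) := rfl
-- A numeric field selects a structure field by its (1-based) position.
example (p : Nat × String) : p.1 = p.fst := rfl
-- `xs.length` is lexed as one identifier, yet it elaborates to `List.length xs`.
example (xs : List Nat) : xs.length = List.length xs := rfl
```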