4.7 KiB
Lexical Structure
This section describes the detailed lexical structure of the Lean language.
A Lean program consists of a stream of UTF-8 tokens where each token is one of the following:
token: symbol | command | ident | string | char | numeral |
: decimal | doc_comment | mod_doc_comment | field_notation
Tokens can be separated by the whitespace characters space, tab, line
feed, and carriage return, as well as comments. Single-line comments
start with --, whereas multi-line comments are enclosed by /-
and -/ and can be nested.
Symbols and Commands
.. (TODO: list built-in symbols and command tokens?)
Symbols are static tokens that are used in term notations and
commands. They can be both keyword-like (e.g. the have <structured_proofs> keyword) or use arbitrary Unicode characters.
Command tokens are static tokens that prefix any top-level declaration
or action. They are usually keyword-like, with transitory commands
like #print <instructions> prefixed by the # character. The set
of built-in commands is listed in Other Commands.
Users can dynamically extend the sets of both symbols (via the
commands listed in Quoted Symbols and command
tokens (via the [user_command] <attributes> attribute).
.. _identifiers:
Identifiers
An atomic identifier, or atomic name, is (roughly) an alphanumeric string that does not begin with a numeral. A (hierarchical) identifier, or name, consists of one or more atomic names separated by periods.
Parts of atomic names can be escaped by enclosing them in pairs of French double quotes «».
def Foo.«bar.baz» := 0 -- name parts ["Foo", "bar.baz"]
ident: atomic_ident | ident "." atomic_ident
atomic_ident: atomic_ident_start atomic_ident_rest*
atomic_ident_start: letterlike | "_" | escaped_ident_part
letterlike: [a-zA-Z] | greek | coptic | letterlike_symbols
greek: <[α-ωΑ-Ωἀ-῾] except for [λΠΣ]>
coptic: [ϊ-ϻ]
letterlike_symbols: [℀-⅏]
escaped_ident_part: "«" [^«»\r\n\t]* "»"
atomic_ident_rest: atomic_ident_start | [0-9'ⁿ] | subscript
subscript: [₀-₉ₐ-ₜᵢ-ᵪ]
String Literals
String literals are enclosed by double quotes ("). They may contain line breaks, which are conserved in the string value. Backslash (\) is a special escape character which can be used to the following
special characters:
\\represents an escaped backslash, so this escape causes one backslash to be included in the string.\"puts a double quote in the string.\'puts an apostrophe in the string.\nputs a new line character in the string.\tputs a tab character in the string.\xHHputs the character represented by the 2 digit hexadecimal into the string. For example "this \x26 that" which become "this & that". Values above 0x80 will be interpreted according to the Unicode table so "\xA9 Copyright 2021" is "© Copyright 2021".\uHHHHputs the character represented by the 4 digit hexadecimal into the string, so the following string "\u65e5\u672c" will become "日本" which means "Japan".
So the complete syntax is:
string : '"' string_item '"'
string_item : string_char | string_escape
string_char : [^\\]
string_escape: "\" ("\" | '"' | "'" | "n" | "t" | "x" hex_char{2} | "u" hex_char{4} )
hex_char : [0-9a-fA-F]
Char Literals
Char literals are enclosed by single quotes (').
char: "'" string_item "'"
Numeric Literals
Numeric literals can be specified in various bases.
numeral : numeral10 | numeral2 | numeral8 | numeral16
numeral10 : [0-9]+
numeral2 : "0" [bB] [0-1]+
numeral8 : "0" [oO] [0-7]+
numeral16 : "0" [xX] hex_char+
Floating point literals are also possible with optional exponent:
float : [0-9]+ "." [0-9]+ [[eE[+-][0-9]+]
For example:
constant w : Int := 55
constant x : Nat := 26085
constant y : Nat := 0x65E5
constant z : Float := 2.548123e-05
Note: that negative numbers are created by applying the "-" negation prefix operator to the number, for example:
constant w : Int := -55
Doc Comments
A special form of comments, doc comments are used to document modules and declarations.
doc_comment: "/--" ([^-] | "-" [^/])* "-/"
mod_doc_comment: "/-!" ([^-] | "-" [^/])* "-/"
Field Notation
Trailing field notation tokens are used in expressions such as
(1+1).to_string. Note that a.toString is a single
Identifier, but may be interpreted as a field
notation expression by the parser.
field_notation: "." ([0-9]+ | atomic_ident)