This adds `set_option debug.byAsSorry true` and `decreasing_by sorry` to
various files to allow bootstrapping with Config structure changes. These
changes will be restored after the bootstrap dance is complete.
This PR renames `String.ValidPos` to `String.Pos`, `String.endValidPos`
to `String.endPos` and `String.startValidPos` to `String.startPos`.
Accordingly, the deprecations of `String.Pos` to `String.Pos.Raw` and
`String.endPos` to `String.rawEndPos` are removed early, after an
abbreviated deprecation cycle of two releases.
This PR removes all uses of `String.Iterator` from core, preferring
`String.ValidPos` instead.
In an upcoming PR, `String.Iterator` will be renamed to
`String.Legacy.Iterator`.
This PR splits some low-hanging fruit out of `Init.Data.String.Basic`:
basic material about `String.Pos.Raw`, `String.Substrig`, and
`String.Iterator`.
More splitting required and the remaining material is quite unorganized,
but it's a start.
This PR moves many operations involving `String.Pos.Raw` to a the
`String.Pos.Raw` namespace with the eventual aim of freeing up the
`String` namespace to contain operations using `String.ValidPos` (to be
renamed to `String.Pos`) instead.
This PR adds the `String.ValidPos.set` and `String.ValidPos.modify`
functions.
After this PR, `String.pos_lt_eq` is no longer a `simp` lemma. Add
`String.Pos.Raw.lt_iff` as a `simp` lemma if your proofs break.
This PR defines `ByteArray.validateUTF8`, uses it to show that
`ByteArray.IsValidUtf8` is decidable and redefines `String.fromUTF8` and
friends to use it.
The functions `String.validateUTF8` and `String.utf8DecodeChar?` are
deprecated in favor of the identically named functions in the
`ByteArray` namespace.
This PR introduces safe alternatives to `String.Pos` and `Substring`
that can only represent valid positions/slices.
Specifically, the PR
- introduces the predicate `String.Pos.IsValid`;
- proves several nontrivial equivalent conditions for
`String.Pos.IsValid`;
- introduces `String.ValidPos`, which is a `String.Pos` with an
`IsValid` proof;
- introduces `String.Slice`, which is like `Substring` but made from
`String.ValidPos` instead of `Pos`;
- introduces `String.Pos.IsValidForSlice`, which is like
`String.Pos.IsValid` but for slices;
- introduces `String.Slice.Pos`, which is like `String.ValidPos` but for
slices;
- introduces various functions for converting between the two types of
positions.
The API added in this PR is not complete. It will be expanded in future
PRs with addional operations and verification.
Hi, the doc of `String.fromUTF8` previously said invalid characters are
replaced with 'A'. But the parameter `h : validateUTF8 a` guarantees
there are no invalid characters, so that explanation doesn't make sense
to me. This PR deletes that explanation (and fixes some unrelated
typos).
I also have a patch that uses `h` to prove each of the characters is
valid, eliminating the need for a default character
([pr/chore-String-fromUTF8-prove-valid](27f1ff36b2)),
would you be interested in merging that?
<details>
<summary>Notes on invalid characters from unchecked C++</summary>
I don't know if this function may be called from unchecked C++ with
invalid characters. If it may, I'm not sure what would happen with my
patched function... I'm not familiar with Lean's safety model, but it
seems like a bad idea to have a Lean function that takes a proof of a
proposition but is expected to operate in a certain way even if the
proposition is false. I think the safe approach is to have two functions
-- one that takes a proof and is only called from Lean, and another that
doesn't take a proof and replaces invalid chars (for use from C++, not
sure whether it's useful from Lean); I'd prefer to go even further and
report an error instead of silently replacing invalid characters (I'm
not sure if there is any easy way to report errors/panic in Lean code
called from C++).
</details>
This PR redefines `String` to be the type of byte arrays `b` for which
`b.IsValidUtf8`.
This moves the data model of strings much closer to the actual data
representation at runtime.
In the near future, we will
- provide variants of `String.Pos` and `Substring` that only allow for
valid positions
- redefine all `String` functions to be much closer to their C++
implementations
In the near-to-medium future we will then provide comprehensive
verification of `String` based on these refactors.
This PR moves `String.utf8EncodeChar` to the prelude to prepare for the
imminent redefinition of `String`.
The definition in the prelude uses modulo and division operations on
natural numbers. In `String.Extra`, a `csimp` lemma is provided, showing
that the new definition is equal to the previous one (which is now
called `utf8EncodeCharFast`) which uses bitwise operations on `UInt8`.
This PR adjusts the experimental module system to make `private` the
default visibility modifier in `module`s, introducing `public` as a new
modifier instead. `public section` can be used to revert the default for
an entire section, though this is more intended to ease gradual adoption
of the new semantics such as in `Init` (and soon `Std`) where they
should be replaced by a future decl-by-decl re-review of visibilities.
This PR changes the `show t` tactic to match its documentation.
Previously it was a synonym for `change t`, but now it finds the first
goal that unifies with the term `t` and moves it to the front of the
goal list.
This PR adjusts the experimental module system to not export the bodies
of `def`s unless opted out by the new attribute `@[expose]` on the `def`
or on a surrounding `section`.
---------
Co-authored-by: Markus Himmel <markus@lean-fro.org>
This PR changes the signature of `Array.get` to take a Nat and a proof,
rather than a `Fin`, for consistency with the rest of the (planned)
Array API. Note that because of bootstrapping issues we can't provide
`get_elem_tactic` as an autoparameter for the proof. As users will
mostly use the `xs[i]` notation provided by `GetElem`, this hopefully
isn't a problem.
We may restore `Fin` based versions, either here or downstream, as
needed, but they won't be the "main" functions.
---------
Co-authored-by: David Thrane Christiansen <david@davidchristiansen.dk>
I made a few choices so far that can probably be discussed:
- got rid of `modn` on `UInt`, nobody seems to use it apart from the
definition of `shift` which can use normal `mod`
- removed the previous defeq optimized definition of `USize.size` in
favor for a normal one. The motivation was to allow `OfNat` to work
which doesn't seem to be necessary anymore afaict.
- Minimized uses of `.val`, should we maybe mark it deprecated?
- Mostly got rid of `.val` in basically all theorems as the proper next
level of API would now be `.toBitVec`. We could probably re-prove them
but it would be more annoying given the change of definition.
- Did not yet redefine `log2` in terms of `BitVec` as this would require
a `log2` in `BitVec` as well, do we want this?
- I added a couple of theorems around the relation of `<` on `UInt` and
`Nat`. These were previously not needed because defeq was used all over
the place to save us. I did not yet generalize these to all types as I
wasn't sure if they are the appropriate lemma that we want to have.
Continuation of #3958. To ensure that lean code is able to uphold the
invariant that `String`s are valid UTF-8 (which is assumed by the lean
model), we have to make sure that no lean objects are created with
invalid UTF-8. #3958 covers the case of lean code creating strings via
`fromUTF8Unchecked`, but there are still many cases where C++ code
constructs strings from a `const char *` or `std::string` with unclear
UTF-8 status.
To address this and minimize accidental missed validation, the
`(lean_)mk_string` function is modified to validate UTF-8. The original
function is renamed to `mk_string_unchecked`, with several other
variants depending on whether we know the string is UTF-8 or ASCII and
whether we have the length and/or utf8 char count on hand. I reviewed
every function which leads to `mk_string` or its variants in the C code,
and used the appropriate validation function, defaulting to `mk_string`
if the provenance is unclear.
This PR adds no new error handling paths, meaning that incorrect UTF-8
will still produce incorrect results in e.g. IO functions, they are just
not causing unsound behavior anymore. A subsequent PR will handle adding
better error reporting for bad UTF-8.
It seems:
* there was no actual need for the UInt32 valued version
* downstream we were getting duplicative lemmas about both
* so lets reduce the API surface area!
If anyone would prefer the remaining function is still called
`Char.utf8Size` I will happily change it. (`size` is hopefully still
unambiguous, and it's helpful to rename here so we can give a
deprecation warning that explains the type signature change.)
---------
Co-authored-by: Mac Malone <tydeu@hatpress.net>
To eliminate parsing differences between Windows and other platforms,
the frontend now normalizes all CRLF line endings to LF, like [in
Rust](https://github.com/rust-lang/rust/issues/62865).
Effects:
- This makes Lake hashes be faithful to what Lean sees (Lake already
normalizes line endings before computing hashes).
- Docstrings now have normalized line endings. In particular, this fixes
`#guard_msgs` failing multiline tests for Windows users using CRLF.
- Now strings don't have different lengths depending on the platform.
Before this PR, the following theorem is true for LF and false for CRLF
files.
```lean
example : "
".length = 1 := rfl
```
Note: the normalization will take `\r\r\n` and turn it into `\r\n`. In
the elaborator, we reject loose `\r`'s that appear in whitespace. Rust
instead takes the approach of making the normalization routine fail.
They do this so that there's no downstream confusion about any `\r\n`
that appears.
Implementation note: the LSP maintains its own copy of a source file
that it updates when edit operations are applied. We are assuming that
edit operations never split or join CRLFs. If this assumption is not
correct, then the LSP copy of a source file can become slightly out of
sync. If this is an issue, there is some discussion
[here](https://github.com/leanprover/lean4/pull/3903#discussion_r1592930085).
Because of the last-added-tried-first rule for macros, all the special
purpose `decreasing_trivial` rules are tried for most recursive
definitions out there, and because they use `apply` and `assumption`
with default transparency may cause some definitoins to be unfolded over
and over again.
A quick test with one of the functions in the leansat project shows that
elaboration time goes down from 600ms to 375ms when using
```
decreasing_by all_goals decreasing_with with_reducible decreasing_trivial
```
instead of
```
decreasing_by all_goals decreasing_with decreasing_trivial
```
This change uses `with_reducible` in most of these macros.
This means that these tactics will no longer work when the
relations/definitions they look for is hidden behind a definition.
This affected in particular `Array.sizeOf_get`, which now has a
companion `sizeOf_getElem`.
In addition, there were three tactics using `apply` to apply Nat-related
lemmas
that we now expect `omega` to solve. We still need them when building
`Init` modules
that don’t have access to `omega`, but they now live in
`decreasing_trivial_pre_omega`,
meant to be only used internally.
- [x] Depends on: #3958
- [x] Depends on: #3960
This makes the UTF-8 encode and decode functions have lean definitions,
so that we can prove properties about them downstream.
Previously, there was a function `opaque fromUTF8Unchecked : ByteArray
-> String` which would convert a list of bytes into a string, but as the
name implies it does not validate that the string is UTF-8 before doing
so and as a result it produces unsound results in the compiler (because
the lean model of `String` indirectly asserts UTF-8 validity). This PR
replaces that function by
```lean
opaque validateUTF8 (a : @& ByteArray) : Bool
opaque fromUTF8 (a : @& ByteArray) (h : validateUTF8 a) : String
```
so that while the function is still "unchecked", we have a proof witness
that the string is valid. To recover the original, actually unchecked
version, use `lcProof` or other unsafe methods to produce the proof
witness.
Because this was the only `ByteArray -> String` conversion function, it
was used in several places in an unsound way (e.g. reading untrusted
input from IO and treating it as UTF-8). These have been replaced by
`fromUTF8?` or `fromUTF8!` as appropriate.
[Before](https://github.com/leanprover/lean4/files/14772220/oi.pdf) and
[after](https://github.com/leanprover/lean4/files/14772226/oi2.pdf).
This gets `ByteArray`, `String.Extra`, `ToString.Macro` and `RCases` out
of the imports of `omega`. I'd hoped to get `Array.Subarray` too, but
it's tangled up in the list literal syntax. Further progress could come
from make `split` use available `Decidable` instances, so we could pull
out `Classical` (and possibly some of `PropLemmas`).