Writing a Verified Postfix Expression Calculator
SPARK is an Ada subset which can be used for formal verification. Someone pitched a SPARK-verified seedForth implementation on the Ada forums as a way of bootstrapping a new Ada compiler. I'd heard about Forth in passing for a while but never looked into it much before, so this sent me down a fun rabbit hole: I read the first third of Starting Forth in a single night.
Forth looks really neat in its simplicity, but I don't have training in software verification and wanted to boil down the problem to a smaller subset, so I decided to write a postfix calculator in Ada/SPARK.
Why Ada/SPARK?
I've done quite a few side projects in Ada, and the SPARK subset works alongside it rather unobtrusively since the two can coexist in the same file. The SPARK tools also install directly from Alire, the Ada package manager and toolchain manager, minimizing setup pain. SPARK also has a strong history in production code, such as in the Eurofighter Typhoon, in cars, and in firmware.
What does "verified" mean here?
I decided to pursue never overflowing numerically, having no runtime errors, and obeying all given preconditions and postconditions. AdaCore has a whole level system describing various levels of proof, which I jokingly call the "precious metal and mineral scale". What I'm trying to do looks like "Gold" level: the amalgamation of the lower levels, which are "absence of run-time errors" and "initialization and correct data flow", plus "proof of key integrity properties."
Simplified:
- no uninitialized data usage
- no unintended global variable access
- no runtime exceptions can be raised
- all conditions are met when calling a subprogram (preconditions)
- all conditions are met when exiting a subprogram (postconditions)
I was also able to verify other things about program correctness, such as that `+` leaves the appropriate result on the top of the stack, and that the calculator will always terminate given a command string.
Operations of a Postfix Expression Calculator
Normally we're taught to write our mathematical operations using "infix" notation, where the operators are between the values to calculate, like "1 + 2". Postfix puts the operands first and then the operations, like "1 2 +".
The program processes input left to right: numbers are "pushed" onto the top of the stack, and operations "pop" values off the stack and push new values onto it. In Forth, operations can be defined arbitrarily to operate on and manipulate the stack, but for now I'm just going to stick to some basic operations for a calculator:
- `+`: add the top two numbers and push the result
- `-`: subtract the top two numbers and push the result
- `*`: multiply the top two numbers and push the result
- `/`: divide the top two numbers and push the result
- `negate`: pop the top value and push the negated value
- `dup`: duplicate the top value on the stack
- `.`: print the top value on the stack
Adding, subtracting, multiplying and dividing each operate on the top two values of the stack. These operations remove them, calculate the result, and then place that value on the stack.
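For example, evaluating `3 4 + 2 *` walks the stack like this:

```
3    push 3                  stack: 3
4    push 4                  stack: 3 4
+    pop 4 and 3, add        stack: 7
2    push 2                  stack: 7 2
*    pop 2 and 7, multiply   stack: 14
```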
Process
In the same spirit as "Functional Core, Imperative Shell", I decided to go with "Verified Core, Unverified Shell". Since Ada and SPARK code can coexist in the same project, I could verify the machine and maybe the lexer/parser, then write a thin layer on top for input and output.
The architecture is this:
- an unverified Ada wrapper for input/output
- a SPARK-verified `Machine`, which is the calculator. I called it a "Machine" so I don't have to rename everything if I fork this project to try to implement some flavor of Forth
- a SPARK-verified `Scanner`, verified to produce tokens which contain no whitespace
Setup
If I want to write a program in Ada/SPARK I'm going to need a toolchain, and then I'm going to need a prover. Alire lets you install Ada toolchains, and it also lets you install GNATprove to run provers.
Installing `gnatprove` is easy:
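Assuming a crate already created with `alr init`, it comes in as an ordinary dependency:

```sh
alr with gnatprove
```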
Now to run the prover:
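```sh
alr gnatprove
```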
This simple command line caused a lot of issues for me since it runs only one prover with default settings. A more complicated command line was needed later for the harder proofs.
Annotating code to prove
SPARK is an Ada subset, and you can mix Ada code and SPARK code. SPARK can be enabled and disabled on a per-package and even a per-subprogram (i.e. "function") basis using the `SPARK_Mode` aspect.
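For example:

```ada
package Machine
  with SPARK_Mode => On
is
   --  Declarations here are analyzed by the SPARK tools.
end Machine;
```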
Execution
Running the calculator should operate like this:
- while not done:
  - read a line of text input
  - split the text into tokens
  - for each token:
    - if the token is a number, push it onto the stack
    - if the token is an operation, execute it
The wrapper can be written in plain Ada and call the appropriate SPARK code directly. The call looks exactly as if the SPARK code were written in plain Ada:
Driver code:
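Something like the following, sketched with stand-in names (`Machine_Type`, `Evaluate`) rather than the project's exact API:

```ada
with Ada.Text_IO;
with Machine;

procedure Calculator is
   M : Machine.Machine_Type;
begin
   --  Plain Ada shell: read lines and hand them to the verified core.
   while not Ada.Text_IO.End_Of_File loop
      declare
         Line : constant String := Ada.Text_IO.Get_Line;
      begin
         Machine.Evaluate (M, Line);
      end;
   end loop;
end Calculator;
```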
This is what is being called:
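Again as a sketch; `Max_Line_Length` is a stand-in for whatever bound the project actually uses:

```ada
procedure Evaluate (M : in out Machine_Type; Line : in String)
  with SPARK_Mode => On,
       Pre => Line'Length <= Max_Line_Length;
```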
Dangers of calling SPARK code
There's a weird bit in the above SPARK code which is the precondition for that SPARK procedure:
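In the sketch above it's the input length bound:

```ada
Pre => Line'Length <= Max_Line_Length
```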
If you're using a verified component, you still need to abide by the precondition contracts expected by that code. Verification is useless if you don't follow the conditions under which it has been proved. The easiest way I found to abide by this is to limit the exposed public interface to subprograms with no preconditions.
Running with runtime contract checking enabled finds and reports these issues.
These can be turned on in the `alire.toml`:
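Assuming Alire's build-switches syntax, enabling contract checking across all build profiles looks something like:

```toml
[build-switches]
"*".contracts = "yes"
```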
Preconditions, postconditions, invariants and static predicates are also all valid Ada code. You can add them in normal Ada code and enable runtime checks for them in unverified Ada code across the program or only for specific packages.
The easiest check you can write
The easiest check you can write is a description of "has no global dependencies."
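In SPARK that's the `Global` aspect. A hypothetical signature:

```ada
function Add (A, B : Bounded_Value) return Value
  with Global => null;  --  touches no global state
```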
This check can be enforced across an entire package instead by using the `Pure` aspect. `Pure` also works in regular Ada to ensure that a function or procedure only uses what is passed into it and that the package has no global state.
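For example:

```ada
package Machine
  with Pure
is
   --  Everything declared here must be free of global state.
end Machine;
```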
Initial steps and hurdles
I targeted writing the calculator "machine" first. The basic iteration loop was to write code and then run GNATprove, which felt very similar to doing test-driven development. SPARK's "Stone" level just requires valid SPARK code, so there's no reason for a separate build step.
The machine description heavily leverages Ada/SPARK's type system. Derived types (not "derived" as in object-oriented programming) are new nominal types which cannot be mixed, whereas subtypes are related types with additional constraints. The compiler and provers understand these constraints and use them for runtime and static range checks. SPARK takes this compiler information and supercharges it by feeding it to the provers, ensuring that assignments and operations pass bounds checks.
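As a sketch, with made-up bounds (the project's actual declarations differ):

```ada
--  Derived type: a brand-new type, not mixable with Long_Float.
type Value is new Long_Float;

--  Made-up bound, chosen so squaring it still fits in a 64-bit float.
Bound : constant Value := 1.0E+150;

--  Subtype: the same type with an additional range constraint.
subtype Bounded_Value is Value range -Bound .. Bound;
```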
Floating point simplifications
`Value` is separated from `Bounded_Value` for later checking, to ensure that calculations don't exceed bounds. `Value` here ensures the full range of `Bounded_Value` multiplication can be captured without truly exceeding the bounds of the underlying 64-bit floating point type. This was a simplification for the floating point calculations, since I could only prove the upper bound of `A + B` within bounds using the conditions `A <= Max - B` and `B <= Max - A`. I would think the other side would be `A + B >= Min`, which would be checkable as `A >= Min - B` and `B >= Min - A`, but I wasn't able to get the provers to agree.
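In sketch form, writing `Max` as `Bounded_Value'Last` (hypothetical signature):

```ada
function Add (A, B : Bounded_Value) return Value
  with Pre => A <= Bounded_Value'Last - B
                and then B <= Bounded_Value'Last - A;
```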
There are also issues with how operation results are verified, since the checks don't take into account relative differences between the two values. Cases are missed, such as when a small floating point value is added to a much larger one and the mantissa of the larger value can't represent the smaller value, so no change is observed.
The checks as implemented in this calculator ensure stack values stay within the `Bounded_Value` range and prevent infinities and NaNs.
Using enough provers, and iterating faster
A human would look at this code and probably immediately say that the assert should pass:
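A made-up illustration of the kind of thing that trips up a single prover, a nonlinear fact that a person sees instantly:

```ada
procedure Check_Square (X : Integer)
  with Pre => X in -1_000 .. 1_000
is
begin
   pragma Assert (X * X <= 1_000_000);  --  obvious to a human
   null;
end Check_Square;
```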
SPARK shields you from the provers, but the simple command line `alr gnatprove` doesn't bring all of them to bear: only one prover is run by default, with a very short timeout, so it keeps trying until the timeout and then fails. There are different numerical "levels" you can run which provide reasonable defaults: longer timeouts and more provers. With the right options the above code proves just fine.
My default command line eventually changed to provide more resources: `--level=2` to run multiple provers, and `-j12` to parallelize the proof across 12 threads.
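Assuming the flags forward through Alire to `gnatprove`:

```sh
alr gnatprove --level=2 -j12
```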
If something cannot be proved, the provers will keep going until the timeout. While we can aspire to their level of dedication, it's often useful during iteration to reduce the wait time if you know roughly how many steps your proofs have been needing.
For iteration speed, you can use `--steps` to time out more quickly. The danger here is that on one occasion I was butting my head against the wall because I had limited the step count too much; after removing `--steps`, the proof went through, it just needed more steps.
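Adding a step cap looks like this, with the count set to roughly what your proofs have needed:

```sh
alr gnatprove --level=2 -j12 --steps=1000
```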
My process
When writing the calculator "Machine" and the Scanner I didn't run the program often.
Proving code seems a lot easier when dealing with one failure at a time. My approach was similar to test-driven development, except focused on adding failing conditions one at a time and then iterating on them.
I focused on proving smaller subprograms and on smaller subsets of the inputs and outputs first, and then broadening the proved inputs/outputs to cover all the cases I needed.
I'd specify a small set of preconditions that I'd expect to pass. I use a lot of asserts in other code I write (usually C++), so writing preconditions flows naturally for me. Preconditions get added with the `Pre` aspect:
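A sketch with stand-in names:

```ada
procedure Pop (M : in out Machine_Type; V : out Value)
  with Pre => not Is_Empty (M);
```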
With some preconditions in place, I'd start adding postconditions. These describe what must be true at the end of a subprogram and are annotated with the `Post` aspect:
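Again with stand-in names; `Stack_Size` is a hypothetical capacity constant:

```ada
--  'Old in a procedure postcondition:
procedure Negate (M : in out Machine_Type)
  with Pre  => not Is_Empty (M),
       Post => Top (M) = -Top (M'Old);

--  'Result in a function postcondition:
function Depth (M : Machine_Type) return Natural
  with Post => Depth'Result <= Stack_Size;
```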
The weird `'Name` things are attributes in Ada. `Foo'Result` is the result of the function `Foo` in the postcondition, and `Bar'Old` is the original value of the `Bar` variable on subprogram entry. You can even apply `'Old` to record members, like `Bar.Baz'Old`.
Conditional expressions
Ada allows using `case`-`when` and `if` as expressions by enclosing the condition within parentheses. A special case is the simple syntax for "A implies B":
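```ada
(if A then B)
```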
This is logically equivalent to `not A or B` and fits extremely well into the logic of postconditions.
Expression functions
When dealing with internal state, internal information was often needed. SPARK seems to treat "expression functions" differently from regular functions with regards to how it feeds information to the provers.
Normally you'd write an Ada function as a spec in one file:
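```ada
--  machine.ads (hypothetical Is_Empty, not the project's exact code)
function Is_Empty (M : Machine_Type) return Boolean;
```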
and then a body in an implementation file:
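```ada
--  machine.adb
function Is_Empty (M : Machine_Type) return Boolean is
begin
   return M.Top = 0;
end Is_Empty;
```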
An expression function lets you specify the body as an expression inside parentheses, which is often more succinct.
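```ada
--  The same function as an expression function: the body is one expression.
function Is_Empty (M : Machine_Type) return Boolean is (M.Top = 0);
```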
When splitting some functions across spec and body files I ran into a lot of issues, so I originally wrote `Machine` using a publicly exposed record. Eventually I realized I could preserve encapsulation by forward declaring the function and then putting the expression body in the private section of the spec file, which cleared up quite a few issues.
Debugging proofs
Sometimes I'd run into cases where I'd struggle with a proof. These mechanisms are what I used to get things working.
Assume
You can force the prover to accept a fact with `Assume`:
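For example (names hypothetical):

```ada
pragma Assume (Count <= Max_Tokens);  --  taken as true without proof; use sparingly
```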
While my final program doesn't use `Assume` at all, it was very helpful for verifying that a proof would work when I was struggling with an intermediate step I needed. This led to a few cases where I changed my approach, since the intermediate step I was trying to achieve was not useful for the goal.
Additional Asserts
Asserting provides extra checks to verify conditions. This was useful for checking broader or more specific subsets of inputs to help diagnose broad postconditions. Sometimes I'd add an assert which would fail, which indicated I was missing a precondition or a postcondition check.
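A hypothetical intermediate check:

```ada
pragma Assert (Index in Line'Range);
```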
Assert A LOT more than you need
When I'd get stuck on a particular proof, I'd do a pass over the program and add any additional preconditions and postconditions I could think of to better constrain the expected inputs and outputs. Improving postconditions especially seemed to help with reasoning about the program, and with more proofs down the line. Conditions which appeared multiple times also made prime candidates for refactoring into expression functions.
Contract_Cases
Instead of `Post` you can also use `Contract_Cases` to subdivide the known input sets of a subprogram into separate postconditions. When the left-hand side is true on subprogram entry, the right-hand side must be true on exit.
This segmentation is shown well in how `Push` guarantees an overflow if the stack is full, or otherwise adds to the stack, with all other elements below the top being preserved:
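A sketch of that contract; the helpers (`Is_Full`, `Has_Overflowed`, `Depth`, `Top`, `Element`) are stand-ins for whatever the project actually exposes:

```ada
procedure Push (M : in out Machine_Type; V : in Bounded_Value)
  with Contract_Cases =>
    (Is_Full (M) =>
       Has_Overflowed (M),
     not Is_Full (M) =>
       Depth (M) = Depth (M'Old) + 1
         and then Top (M) = V
         and then (for all I in 1 .. Depth (M'Old) =>
                     Element (M, I) = Element (M'Old, I)));
```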
Sharp edges
Private subprograms and invariants
Normally a value of a type is expected to abide by the invariants given on that type. The caveat is that private subprograms can violate these requirements, which can lead to the obscure message "invariant check failed." Note that it might not be the invariant check itself that's violated: it could be that the invariant isn't guaranteed in the input to that subprogram, or isn't guaranteed after it returns.
A possible resolution to this issue is providing an appropriate precondition on the value which guarantees the invariant, or an appropriate postcondition on the result or `out` parameter that guarantees the invariant. Making the invariant a function simplifies this and provides a common way to enforce these checks:
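Roughly this shape (hypothetical names and capacity):

```ada
package Machine is
   Stack_Size : constant := 64;  --  made-up capacity

   type Machine_Type is private
     with Type_Invariant => Valid (Machine_Type);

   function Valid (M : Machine_Type) return Boolean;

private

   type Machine_Type is record
      Top : Natural := 0;
   end record;

   function Valid (M : Machine_Type) return Boolean is
     (M.Top <= Stack_Size);
end Machine;
```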
Loop invariants
Loop invariants ensure conditions are met at the entry, at the exit, and on every iteration of a loop. An issue I ran across was needing to add every condition checked by anything the loop uses. This makes sense, but it tended to be a lot more information than I expected.
This tends to be more of an issue when mixing public and private subprograms in loops, since the private subprograms don't guarantee invariants of private types on exit and the public subprograms will expect these invariants to be true.
Consistency isn't guaranteed when accessing only a single element in an array or record, so I struggled with the `Lexeme_Range` invariant in `Tokenize`. `Tokenize` chops up a string and returns tokens. It only processes a single token per loop iteration, but a quantified expression was necessary to guarantee the type invariant for all the other token lexeme ranges during the loop:
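The shape of it, with stand-in field names:

```ada
pragma Loop_Invariant
  (for all I in Tokens'First .. Count =>
     Tokens (I).First in Line'Range
       and then Tokens (I).Last in Line'Range
       and then Tokens (I).First <= Tokens (I).Last);
```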
I also had to rewrite some loops to ensure that the loop invariants are expressible. I struggled getting `Tokenize` to prove that it terminated, and had to relax the bounds of the types I was using and use a side-effect-producing function. Most of `Tokenize` is the loop invariant.
Effects of Machine Arithmetic
The coolest part of this is that the results of the calculator's arithmetic operations are checked! This actually guarantees things like the order of the operands pulled from the stack in subtraction, which I originally got wrong.
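For subtraction the postcondition pins down the operand order; sketched with a hypothetical `Second` for the value just below the top:

```ada
--  "A B -" must compute A - B: the top of the stack is the subtrahend.
procedure Execute_Subtract (M : in out Machine_Type)
  with Pre  => Depth (M) >= 2,
       Post => Depth (M) = Depth (M'Old) - 1
                 and then Top (M) = Second (M'Old) - Top (M'Old);
```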
How was writing formally verified code?
Writing this project had huge hurdles before I started preferring expression functions and understood the interactions between loop invariants, type invariants, and private subprograms.
The feedback loop of this method of programming is incredible. It felt very weird at first, but it pushed me towards more conceptually consistent refactors, which enabled further checks. I had to put some hard limits in place for input lengths and the number of tokens returned, which I'd rather not have.
Overall, this was a very enjoyable project and I'm seriously considering trying to fork and extend this into some small Forth interpreter.