Writing a Verified Postfix Expression Calculator
SPARK is an Ada subset which can be used for formal verification. Someone pitched a SPARK-verified seedForth implementation on the Ada forums as a way of bootstrapping a new Ada compiler. I'd heard about Forth in passing for a while but never looked into it much before, so this sent me down a fun rabbit hole: I read the first third of Starting Forth in a single night.
Forth looks really neat in its simplicity, but I don't have training in software verification and wanted to boil down the problem to a smaller subset, so I decided to write a postfix calculator in Ada/SPARK.
Why Ada/SPARK?
I've done quite a few side projects in Ada, and the SPARK subset works alongside it rather unobtrusively since the two can coexist in the same file. The SPARK tools also install directly from Alire, the Ada package manager and toolchain manager, minimizing setup pain. SPARK also has a strong history in production code, such as in the Eurofighter Typhoon, in cars, and in firmware.
What does "verified" mean here?
I decided to pursue never overflowing numerically, having no runtime errors, and obeying all given preconditions and postconditions. AdaCore has a whole level system describing various levels of proof, which I jokingly call the "precious metal and mineral scale". What I'm trying to do looks like "Gold" level: the amalgamation of the lower levels, which are "absence of run-time errors" and "initialization and correct data flow", plus "proof of key integrity properties."
Simplified:
- no uninitialized data usage
- no unintended global variable access
- no runtime exceptions can be raised
- all conditions are met when calling a subprogram (preconditions)
- all conditions are met when exiting a subprogram (postconditions)
I was also able to verify other things about program correctness, such as that `+` leaves the appropriate result on the top of the stack, and that the calculator will always terminate given a command string.
Operations of a Postfix Expression Calculator
Normally we're taught to write our mathematical operations using "infix" notation, where the operators are between the values to calculate, like "1 + 2". Postfix puts the operands first and then the operations, like "1 2 +".
The program processes input left to right: numbers are "pushed" onto the top of the stack, and operations "pop" values off the stack and push new values onto it. In Forth, operations can be defined arbitrarily to operate on and manipulate the stack, but for now I'm just going to stick to some basic operations for a calculator:
- `+`: add the top two numbers and push the result
- `-`: subtract the top two numbers and push the result
- `*`: multiply the top two numbers and push the result
- `/`: divide the top two numbers and push the result
- `negate`: pop the top value and push the negated value
- `dup`: duplicate the top value on the stack
- `.`: print the top value on the stack
Adding, subtracting, multiplying and dividing each operate on the top two values of the stack. These operations remove them, calculate the result, and then place that value on the stack.
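For example, evaluating `3 4 + 2 *` walks the stack like this:

```
3    push 3                  stack: 3
4    push 4                  stack: 3 4
+    pop 4 and 3, add        stack: 7
2    push 2                  stack: 7 2
*    pop 2 and 7, multiply   stack: 14
```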
Process
In the same spirit as "Functional Core, Imperative Shell", I decided to go with "Verified Core, Unverified Shell". Since Ada and SPARK code can coexist in the same project, I could verify the machine and maybe the lexer/parser, then write a thin layer on top for input and output.
The architecture is this:
- an unverified Ada wrapper for input/output
- a SPARK-verified `Machine`, which is the calculator. I called it a "Machine" so I don't have to rename everything if I fork this project to try to implement some flavor of Forth
- a SPARK-verified `Scanner`, verified to produce tokens which contain no whitespace
Setup
If I want to write a program in Ada/SPARK I'm going to need a toolchain, and then I'm going to need a prover. Alire lets you install Ada toolchains, and it also lets you install GNATprove to run provers.
Installing `gnatprove` is easy:
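Assuming a crate already created with `alr init`, it comes in as an ordinary dependency:

```sh
alr with gnatprove
```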
Now to run the prover:
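```sh
alr gnatprove
```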
This simple command line caused a lot of issues for me since it runs only one prover with default settings. A more complicated command line was needed later for the harder proofs.
Annotating code to prove
SPARK is an Ada subset, and you can mix Ada code and SPARK code. SPARK can be enabled and disabled on a per-package and even a per-subprogram (i.e. "function") basis using the `SPARK_Mode` aspect.
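For example:

```ada
package Machine
  with SPARK_Mode => On
is
   --  Declarations here are analyzed by the SPARK tools.
end Machine;
```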
Execution
Running the calculator should operate like this:
- while not done:
  - read a line of text input
  - split the text into tokens
  - for each token:
    - if the token is a number, push it onto the stack
    - if the token is an operation, execute it
The wrapper can be written in plain Ada and call the appropriate SPARK code directly. The call looks exactly as if the SPARK code were written in plain Ada:
Driver code:
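Something like the following, sketched with stand-in names (`Machine_Type`, `Evaluate`) rather than the project's exact API:

```ada
with Ada.Text_IO;
with Machine;

procedure Calculator is
   M : Machine.Machine_Type;
begin
   --  Plain Ada shell: read lines and hand them to the verified core.
   while not Ada.Text_IO.End_Of_File loop
      declare
         Line : constant String := Ada.Text_IO.Get_Line;
      begin
         Machine.Evaluate (M, Line);
      end;
   end loop;
end Calculator;
```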
This is what is being called:
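Again as a sketch; `Max_Line_Length` is a stand-in for whatever bound the project actually uses:

```ada
procedure Evaluate (M : in out Machine_Type; Line : in String)
  with SPARK_Mode => On,
       Pre => Line'Length <= Max_Line_Length;
```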
Dangers of calling SPARK code
There's a weird bit in the above SPARK code which is the precondition for that SPARK procedure:
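In the sketch above it's the input length bound:

```ada
Pre => Line'Length <= Max_Line_Length
```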
If you're using a verified component, you still need to abide by the precondition contracts expected by that code. Verification is useless if you don't follow the conditions under which it has been proved. The easiest way I found to abide by this is to limit the exposed public interface to subprograms with no preconditions.
Running with runtime contract checking enabled finds and reports these issues.
These can be turned on in the `alire.toml`:
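Assuming Alire's build-switches syntax, enabling contract checking across all build profiles looks something like:

```toml
[build-switches]
"*".contracts = "yes"
```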
Preconditions, postconditions, invariants and static predicates are also all valid Ada code. You can add them in normal Ada code and enable runtime checks for them in unverified Ada code across the program or only for specific packages.
The easiest check you can write
The easiest check you can write is a description of "has no global dependencies."
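In SPARK that's the `Global` aspect. A hypothetical signature:

```ada
function Add (A, B : Bounded_Value) return Value
  with Global => null;  --  touches no global state
```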
This check can be enforced across an entire package instead by using the `Pure` aspect. `Pure` also works in regular Ada to ensure that a function or procedure only uses what is passed into it and that the package has no global state.
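For example:

```ada
package Machine
  with Pure
is
   --  Everything declared here must be free of global state.
end Machine;
```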
Initial steps and hurdles
I targeted writing the calculator "machine" first. The basic iteration loop was to write code and then run GNATprove, which felt very similar to doing test-driven development. SPARK's "Stone" level just requires valid SPARK code, so there's no reason for a separate build step.
The machine description heavily leverages Ada/SPARK's type system. Derived types (not "derived" as in object-oriented programming) are new nominal types which cannot be mixed, whereas subtypes are related types with additional constraints. The compiler and provers understand these constraints and use them for runtime and static range checks. SPARK takes this compiler information and supercharges it by feeding it to the provers, ensuring that assignments and operations pass bounds checks.
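As a sketch, with made-up bounds (the project's actual declarations differ):

```ada
--  Derived type: a brand-new type, not mixable with Long_Float.
type Value is new Long_Float;

--  Made-up bound, chosen so squaring it still fits in a 64-bit float.
Bound : constant Value := 1.0E+150;

--  Subtype: the same type with an additional range constraint.
subtype Bounded_Value is Value range -Bound .. Bound;
```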
Floating point simplifications
`Value` is separated from `Bounded_Value` for later checking, to ensure that calculations don't exceed bounds. `Value` here ensures the full range of `Bounded_Value` multiplication can be captured without truly exceeding the bounds of the underlying 64-bit floating point type. This was a simplification for the floating point calculations, since I could only prove the upper bound of `A + B` within bounds using the conditions `A <= Max - B` and `B <= Max - A`. I would think the other side would be `A + B >= Min`, which would be checkable as `A >= Min - B` and `B >= Min - A`, but I wasn't able to get the provers to agree.
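In sketch form, writing `Max` as `Bounded_Value'Last` (hypothetical signature):

```ada
function Add (A, B : Bounded_Value) return Value
  with Pre => A <= Bounded_Value'Last - B
                and then B <= Bounded_Value'Last - A;
```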
There are also issues with how operation results are verified, since the checks don't take into account relative differences between the two values. Cases are missed, such as when a small floating point value is added to a much larger one and the mantissa of the larger value can't represent the smaller value, so no change is observed.
The checks as implemented in this calculator ensure stack values stay within the `Bounded_Value` range and prevent infinities and NaNs.
Using enough provers, and iterating faster
A human would look at this code and probably immediately say that the assert should pass:
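A made-up illustration of the kind of thing that trips up a single prover, a nonlinear fact that a person sees instantly:

```ada
procedure Check_Square (X : Integer)
  with Pre => X in -1_000 .. 1_000
is
begin
   pragma Assert (X * X <= 1_000_000);  --  obvious to a human
   null;
end Check_Square;
```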
SPARK shields you from the provers, but the simple command line `alr gnatprove` doesn't bring all of them to bear: only one prover is run by default, with a very short timeout, so it keeps trying until the timeout and then fails. There are different numerical "levels" you can run which provide reasonable defaults: longer timeouts and more provers. With the right options the above code proves just fine.
My default command line eventually changed to provide more resources: `--level=2` to run multiple provers, and `-j12` to parallelize the proof across 12 threads.
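Assuming the flags forward through Alire to `gnatprove`:

```sh
alr gnatprove --level=2 -j12
```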
If something cannot be proved, the provers will keep going until the timeout. While we can aspire to their level of dedication, it's often useful during iteration to reduce the wait time if you know roughly how many steps your proofs have been needing.
For iteration speed, you can use `--steps` to time out more quickly. The danger here is that on one occasion I was butting my head against the wall because I had limited the step count too much; after removing `--steps`, the proof went through, it just needed more steps.
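Adding a step cap looks like this, with the count set to roughly what your proofs have needed:

```sh
alr gnatprove --level=2 -j12 --steps=1000
```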
My process
When writing the calculator "Machine" and the Scanner I didn't run the program often.
Proving code seems a lot easier when dealing with one failure at a time. My approach was similar to test-driven development, except focused on adding failing conditions one at a time and then iterating on them.
I focused on proving smaller subprograms and on smaller subsets of the inputs and outputs first, and then broadening the proved inputs/outputs to cover all the cases I needed.
I'd specify a small set of preconditions that I'd expect to pass. I use a lot of asserts in other code I write (usually C++), so writing preconditions flows naturally for me. Preconditions get added with the `Pre` aspect:
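A sketch with stand-in names:

```ada
procedure Pop (M : in out Machine_Type; V : out Value)
  with Pre => not Is_Empty (M);
```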
With some preconditions in place, I'd start adding postconditions. These describe what must be true at the end of a subprogram and are annotated with the `Post` aspect:
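Again with stand-in names; `Stack_Size` is a hypothetical capacity constant:

```ada
--  'Old in a procedure postcondition:
procedure Negate (M : in out Machine_Type)
  with Pre  => not Is_Empty (M),
       Post => Top (M) = -Top (M'Old);

--  'Result in a function postcondition:
function Depth (M : Machine_Type) return Natural
  with Post => Depth'Result <= Stack_Size;
```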
The weird `'Name` things are attributes in Ada. `Foo'Result` is the result of the function `Foo` in the postcondition, and `Bar'Old` is the original value of the `Bar` variable on subprogram entry. You can even apply `'Old` to record members, like `Bar.Baz'Old`.
Conditional expressions
Ada allows using `case`-`when` and `if` as expressions by enclosing the condition within parentheses. A special case is the simple syntax for "A implies B":
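```ada
(if A then B)
```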
This is logically equivalent to `not A or B` and fits extremely well into the logic of postconditions.
Expression functions
When dealing with internal state, internal information was often needed. SPARK seems to treat "expression functions" differently from regular functions with regards to how it feeds information to the provers.
Normally you'd write an Ada function as a spec in one file:
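```ada
--  machine.ads (hypothetical Is_Empty, not the project's exact code)
function Is_Empty (M : Machine_Type) return Boolean;
```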
and then a body in an implementation file:
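```ada
--  machine.adb
function Is_Empty (M : Machine_Type) return Boolean is
begin
   return M.Top = 0;
end Is_Empty;
```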
An expression function lets you specify the body as an expression inside parentheses, which is often more succinct.
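```ada
--  The same function as an expression function: the body is one expression.
function Is_Empty (M : Machine_Type) return Boolean is (M.Top = 0);
```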
When splitting some functions across spec and body files I ran into a lot of issues, so I originally wrote `Machine` using a publicly exposed record. Eventually I realized I could preserve encapsulation by forward declaring the function and then putting the expression body in the private section of the spec file, which cleared up quite a few issues.
Debugging proofs
Sometimes I'd run into cases where I'd struggle with a proof. These mechanisms are what I used to get things working.
Assume
You can force the prover to accept a fact with `Assume`:
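For example (names hypothetical):

```ada
pragma Assume (Count <= Max_Tokens);  --  taken as true without proof; use sparingly
```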
While my final program doesn't use `Assume` at all, it was very helpful for verifying that a proof would work when I was struggling with an intermediate step I needed. This led to a few cases where I changed my approach, since the intermediate step I was trying to achieve was not useful for the goal.
Additional Asserts
Asserting provides extra checks to verify conditions. This was useful for checking broader or more specific subsets of inputs to help diagnose broad postconditions. Sometimes I'd add an assert which would fail, which indicated I was missing a precondition or a postcondition check.
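A hypothetical intermediate check:

```ada
pragma Assert (Index in Line'Range);
```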
Assert A LOT more than you need
When I'd get stuck on a particular proof, I'd do a pass over the program and add any additional preconditions and postconditions I could think of to better constrain the expected inputs and outputs. Improving postconditions especially seemed to help with reasoning about the program, and with more proofs down the line. Conditions which appeared multiple times also made prime candidates for refactoring into expression functions.
Contract_Cases
Instead of `Post` you can also use `Contract_Cases` to subdivide the known input sets of a subprogram into separate postconditions. When the left-hand side is true on subprogram entry, the right-hand side must be true on exit.
This segmentation is shown well in how `Push` guarantees an overflow if the stack is full, or otherwise adds to the stack, with all other elements below the top being preserved:
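A sketch of that contract; the helpers (`Is_Full`, `Has_Overflowed`, `Depth`, `Top`, `Element`) are stand-ins for whatever the project actually exposes:

```ada
procedure Push (M : in out Machine_Type; V : in Bounded_Value)
  with Contract_Cases =>
    (Is_Full (M) =>
       Has_Overflowed (M),
     not Is_Full (M) =>
       Depth (M) = Depth (M'Old) + 1
         and then Top (M) = V
         and then (for all I in 1 .. Depth (M'Old) =>
                     Element (M, I) = Element (M'Old, I)));
```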
Sharp edges
Private subprograms and invariants
Normally a value of a type is expected to abide by the invariants given on that type. The caveat is that private subprograms can violate these requirements, which can lead to the obscure message "invariant check failed." Note that it might not be the invariant check itself that's violated: it could be that the invariant isn't guaranteed in the input to that subprogram, or isn't guaranteed after it returns.
A possible resolution to this issue is providing an appropriate precondition on the value which guarantees the invariant, or an appropriate postcondition on the result or `out` parameter that guarantees the invariant. Making the invariant a function simplifies this and provides a common way to enforce these checks:
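Roughly this shape (hypothetical names and capacity):

```ada
package Machine is
   Stack_Size : constant := 64;  --  made-up capacity

   type Machine_Type is private
     with Type_Invariant => Valid (Machine_Type);

   function Valid (M : Machine_Type) return Boolean;

private

   type Machine_Type is record
      Top : Natural := 0;
   end record;

   function Valid (M : Machine_Type) return Boolean is
     (M.Top <= Stack_Size);
end Machine;
```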
Loop invariants
Loop invariants ensure conditions are met at the entry, at the exit, and on every iteration of a loop. An issue I ran across was needing to add every condition checked by anything the loop uses. This makes sense, but it tended to be a lot more information than I expected.
This tends to be more of an issue when mixing public and private subprograms in loops, since the private subprograms don't guarantee invariants of private types on exit and the public subprograms will expect these invariants to be true.
Consistency isn't guaranteed when accessing only a single element in an array or record, so I struggled with the `Lexeme_Range` invariant in `Tokenize`. `Tokenize` chops up a string and returns tokens. It only processes a single token per loop iteration, but a quantified expression was necessary to guarantee the type invariant for all the other token lexeme ranges during the loop:
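The shape of it, with stand-in field names:

```ada
pragma Loop_Invariant
  (for all I in Tokens'First .. Count =>
     Tokens (I).First in Line'Range
       and then Tokens (I).Last in Line'Range
       and then Tokens (I).First <= Tokens (I).Last);
```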
I also had to rewrite some loops to ensure that the loop invariants are expressible. I struggled getting `Tokenize` to prove that it terminated, and had to relax the bounds of the types I was using and use a side-effect-producing function. Most of `Tokenize` is the loop invariant.
Effects of Machine Arithmetic
The coolest part of this is that the results of the calculator's arithmetic operations are checked! This actually guarantees things like the order of the operands pulled from the stack in subtraction, which I originally got wrong.
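For subtraction the postcondition pins down the operand order; sketched with a hypothetical `Second` for the value just below the top:

```ada
--  "A B -" must compute A - B: the top of the stack is the subtrahend.
procedure Execute_Subtract (M : in out Machine_Type)
  with Pre  => Depth (M) >= 2,
       Post => Depth (M) = Depth (M'Old) - 1
                 and then Top (M) = Second (M'Old) - Top (M'Old);
```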
How was writing formally verified code?
Writing this project had huge hurdles before I started preferring expression functions and understood the interactions between loop invariants, type invariants, and private subprograms.
The feedback loop of this method of programming is incredible. It felt very weird at first, but it pushed me towards more conceptually consistent refactors, which enabled further checks. I had to put some hard limits in place for input lengths and the number of tokens returned, which I'd rather not have.
Overall, this was a very enjoyable project and I'm seriously considering trying to fork and extend this into some small Forth interpreter.