Reproducing a Memory-Tracking Bug with TLA+

in #programming6 years ago

I was interested in a first project for getting familiar with TLA+, “a high-level language for modelling programs and systems.” TLA+ has been used to find errors in the design of real-world distributed systems, so I was curious how it compared to model checkers I have previously used such as Alloy. A co-worker recently tracked down a concurrency bug that was causing some test failures. I was interested in seeing if I could model the system with TLA+, and reproduce the failure using its consistency checker, TLC.

The system where the failure occurs is a wrapper around memory allocation. It tracks, per allocation, which “phase” performed the allocation. A phase is just a lightweight replacement for a full backtrace; it’s an identifier representing a particular section of code, maintained in a per-thread stack. In the failing tests, memory usage was reported per-phase by walking a list of allocated memory items and summing their sizes. This process asserted because it found an allocation marked free within the list.

The bug was a simple ordering problem. While the deallocation code should have removed the memory from a per-phase list first, then marked it free, it did the reverse:

wrap_free( void *m ) {
  Wrapper *p = convert_to_wrapper( m );
  p->state = FREE;
  PhaseList *l = get_list( p->phase );

  lock( l->mutex );
  remove_from_list( l, p );
  unlock( l->mutex );
}

The complete model I wrote can be found here:

Modelling allocation and deallocation

In order to demonstrate the concurrency problem, the allocate and deallocate operations have to be split into multiple actions. Otherwise, the model will act as if the entire sequence of events was an atomic operation. But, we don’t need to explicitly represent the memory allocator itself, or the operation of the mutex. Instead, i chose to represent each operation as the portion outside the mutex, and the critical section inside the mutex. In the model these are the actions AllocNoncritical, AllocCritical, FreeNonCritical, and FreeCritical.

I also had to decide what portion of the state is worth including. I chose to represent a memory allocation as a record with three elements: a memory address (to ensure uniqueness), a phase, and a state (free, allocated, or none.)

States == { "alloc", "free", "none" }
MemoryAllocation == [ addr : Address, phase: Phase, state : States ]

Address and Phase are “constants” of the model, which in TLA+ terms means they could be any object at all. When it comes time to check the model with TLC, a specific set of “model objects” are plugged in to make the model finite. The third constant in the model is a set of threads:

CONSTANTS Phase, (* A set of phases *)
          Thread, (* A set of threads *)
          Address (* A set of addresses *)

A point in time of a TLA+ model corresponds to an assignment of values to variables. The variables may (or may not!) change at each step in the evolution of the model. I used one variable to represent a set of allocations “in use”, a second variable to track the per-phase lists, and the third variable to track the local variable inside a particular thread.

VARIABLES memory, (* All in-use allocations *)
          memoryByPhase, (* A mapping from phases to sets of allocated items *)
          memoryByThread (* A mapping from threads to an allocated record *)         

I chose to have the thread state be represented by a MemoryAllocation as well, which may have been a questionable choice. That is why a memory allocation may have a “none” state to represent a thread not currently engaged in an allocation or deallocation. I represented that as a special MemoryAllocation, one which all threads start with. But because the record needs an address and phase, I asked TLA to pick one arbitrarily:

DontCareAddress == CHOOSE a \in Address : TRUE
DontCarePhase == CHOOSE p \in Phase: TRUE
NoAllocation == [ addr |-> DontCareAddress, phase |-> DontCarePhase, state |-> "none" ]
 
Init ==
  /\ memory = {}
  /\ memoryByPhase = [ p \in Phase |-> {} ]
  /\ memoryByThread = [ t \in Thread |-> NoAllocation ]

We can also write a predicate that checks whether the variables have the correct type:

TypeInvariant == 
  /\ memory \in SUBSET MemoryAllocation
  /\ memoryByPhase \in [Phase -> SUBSET MemoryAllocation] (* Each phase has a list, but order not important *) 
  /\ memoryByThread \in [Thread -> MemoryAllocation] (* Each thread can only be doing one allocation or deallocation at a time *)

Now we can write the actions in terms of the current state (unprimed variables) and the next state (primed variables). The action of a particular thread t allocating memory in phase p is represented by the following action:

 (* Non-critical section: allocate some memory from the heap *)       
AllocNoncritical(t,p) == 
  /\ memoryByThread[t].state = "none"
  /\ \E m \in MemoryAllocation :
        /\ m.state = "alloc"
        /\ m.phase = p
        /\ AddressNotInUse(m)  
        /\ memoryByThread' = [ memoryByThread EXCEPT ![t] = m ]
  /\ UNCHANGED <<memoryByPhase, memory>>
        
(* Critical section; add it to the list for the phase it was allocated (and return it for use.) *)        
AllocCritical(t) == 
  LET m == memoryByThread[t] IN
    /\ m.state = "alloc"
    /\ memoryByPhase' = [ memoryByPhase EXCEPT ![m.phase] = memoryByPhase[m.phase] \union { m } ]
    /\ memoryByThread' = [ memoryByThread EXCEPT ![t] = NoAllocation ]
    /\ memory' = memory \union { m }

In TLA+ you can’t “modify” a set or mapping. You instead have to provide a replacement set or mapping for the new time step. The EXCEPT syntax is a way to take a function and provide a new one that’s the same everywhere but one value.

The use of memoryByThread ensures that a particular thread must “execute” AllocNoncritical first (so that the memory it points to is no longer a “none”), and cannot execute AllocNoncritical a second time without going through AllocCritical. This is one way to represent the flow of execution. In practice, it’s probably easier to write the model in PlusCal instead which provides a more natural way of representing a sequence of operations.

Writing the FreeCritical action is so painful that it screams there is a problem here, even before running the model checker. Again, there’s no sense in TLA+ of “changing the MemoryAllocation object”. The record isn’t a variable; it’s not mutable. We have to replace the MemoryAllocation, wherever it occurs, with a replacement record:

FreeNoncritical(t,m) ==
  LET m2 == [ addr |-> m.addr, state |-> "free", phase |-> m.phase ] IN
    /\ m \in memory
    /\ memoryByThread[t].state = "none"
    /\ memoryByThread' = [memoryByThread EXCEPT ![t] = m2]
    /\ memoryByPhase' = [memoryByPhase EXCEPT ![m.phase] = (memoryByPhase[m.phase] \ {m}) \union {m2} ]
    /\ memory' = (memory \ { m }) \union { m2 } 

We finish the specification by combining all the actions and the initial state, along with the temporal operator [] which means “always”, into the top-level specification:

Next == 
 \/ \E t \in Thread, p \in Phase : AllocNoncritical(t, p)
 \/ \E t \in Thread : AllocCritical(t)
 \/ \E t \in Thread, m \in MemoryAllocation : FreeNoncritical(t, m)
 \/ \E t \in Thread : FreeCritical( t )

vars == <<memory,memoryByPhase,memoryByThread>>
    
Spec == Init /\ [][Next]_vars

Checking the model

In the original code, the assertion fired when an element in the per-phase list was not listed as allocated. So, our assertion about the behavior of the model will required that every memory allocation, on every per-phase list, is in the allocated state:

AllAlloced == \A p \in Phase : \A m \in MemoryAllocation : m \in memoryByPhase[p] => m.state = "alloc"

Or, in TLA+’s neatly typeset form:

tla-memory-3.png

To perform an exhaustive search for violations of this invariant, we need to fill in the model’s constants with finite sets. I arbitrarily picked five different addresses, three phases, and two threads. Each of the sets is treated as “symmetrical” which means that the checker can test just one permutation of the elements.

tla-memory-1.png

TLC quickly finds and flags the error:

Invariant AllAlloced is violated

and provides an explorable time sequence that lead to the state where the invariant is false:
tla-memory--2.png

Pitfalls

The first version of the invariant, I wrote like this:

AllAlloced == \A p \in Phase : \A m \in memoryByPhase[p] : m.state = "alloc"

and this is valid TLA+. However, due to other errors in the specification, TLC complained that it couldn’t enumerate all the values of m; it believed that was an infinite set. Unfortunately, “unable to enumerate” is a frequent outcome from errors in the specification.

For example, I got the syntax wrong for records; instead of

NoAllocation == [ addr |-> DontCareAddress, phase |-> DontCarePhase, state |-> "none" ]

I used

NoAllocation == [ addr : DontCareAddress, phase : DontCarePhase, state : "none" ]

which produces the following error:

Attempted to enumerate a set of the form [l1 : v1, ..., ln : vn],
but can't enumerate the value of the `addr' field:
a1
While working on the initial state:
/\ memoryByThread = (t1 :> [addr: a1, phase: p1, state: "none"] @@ t2 :> [addr: a1, phase: p1, state: "none"])
/\ memoryByPhase = (p1 :> {} @@ p2 :> {} @@ p3 :> {})
/\ memory = {}

This led me to make a lot of changes to try to work around the problem: was I setting up model variables incorrectly? Was my weirdo use of CHOOSE the problem? Should my strings be constants as well?

Part of the problem is that the initial state looks OK, to a novice. It’s got the structures I expected to see, but those aren’t structures, they are sets. The real initial state looks like this:

/\  memory = {}
/\  memoryByPhase = (p1 :> {} @@ p2 :> {} @@ p3 :> {})
/\  memoryByThread = ( t1 :> [addr |-> a1, phase |-> p1, state |-> "none"] @@
  t2 :> [addr |-> a1, phase |-> p1, state |-> "none"] )

A further pitfall was that “Specifying Systems”, the authoritative book on TLA+, quickly abandons using the input format in favor of nicely-typeset specifications. This made it hard to reproduce a construct introduced in the text, without referring elsewhere to an example specification. For example, tuples are enclosed in << >> but are typeset with a single angle bracket. Perhaps a Detexifier for TLA+ would help.

Conclusions

My experience getting a toy specification written in TLA+ was much more difficult than other model-checking systems I have previously used, such as Alloy or SPIN. Probably if I had gotten Hillel Wayne’s Practical TLA+ first instead of Specifying Systems, I would have done this exercise is PlusCal instead which may have been an easier path.

The model checker did, as expected, find the bug my co-worked discovered. However, the exercise of writing the specification would probably have been, by itself, sufficient warning that the order of operations was incorrect--- assuming that the model clearly separated out the mutex-protected and non-mutex-protected sections as I did here.

I think there is a lot of opportunity to improve on the TLA+ and TLC experience. Better error messages would help (see https://groups.google.com/d/msg/tlaplus/w3OCMI-5Dz4/sUV9M7_njykJ for an even worse experience than mine, but from 2013.) The TLA toolbox does a good job of packaging everything up so that it works out of the box, but from a usability perspective the maze of tabs and sub-windows doesn’t feel very modern, and sometimes errors were hidden as pop-ups behind tiny icons.

Model-checking also felt like a very manual process rather than one where there were sensible defaults. For example, in Alloy you can specify the maximum size of the model in terms of number of objects. You can also ask Alloy “show me an example of X” and get a nice graphical representation, which I found lacking when trying to understand an issue where memory wasn’t being successfully allocated. TLC can function as a “calculator” for TLA+, which is useful, but it’s far from a smooth, interactive experience, at least through Toolbox.

Further reading;

Fuzz.ai

Fuzz.ai is an early-stage startup dedicated to making software correctness tools easier to use. Fuzzers, model checkers, and property-based testing can make software more robust, expose security vulnerabilities, and speed development.