Property-Based Testing

Hands-on tutorial on property-based testing.

The Big Idea

Instead of hand-picking inputs for unit tests, we state a property that should always hold and let the computer generate inputs at random.

More concretely, we iterate through these three steps.

  1. Define a data-generating process. It can be derived from types or constraints, or constructed inside the test.
  2. Define a predicate that holds for all such data. This is the “property” we are testing. For example, if + is commutative, then a + b = b + a for all a and b.
  3. Let the computer look for the smallest counterexamples.

Commutativity, roundtrips (decoding an encoded value returns the original), idempotence, and agreement with a simpler reference implementation are a few common properties.

We can implement basic PBT in Python without any external libraries. For example, here is a test for the commutativity of Python’s + operator.

import random

def integers() -> int:
    """Generator."""
    return random.randint(-100_000, 100_000)  # randint requires integer bounds

def addition_commutes(a: int, b: int) -> bool:
    """Property."""
    return a + b == b + a

def test_commutativity():
    for _ in range(100):
        a, b = integers(), integers()
        assert addition_commutes(a, b)

A dedicated PBT framework can help build generators, manage random seeds, use complex heuristics, and find smaller counterexamples.

Our Python implementation looks for any random counterexample, which may be very complex. It would be much nicer for the developer if counterexamples were as small as possible. This process is called “shrinking”; it is an active research topic in PBT and a popular feature of PBT frameworks.
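
As a rough illustration of the idea (not how any real framework implements it), a naive shrinker for integer inputs could repeatedly try smaller candidates and keep each one only if the property still fails; the still_fails callback below is a hypothetical stand-in for re-running the failing test.

def shrink_by_halving(counterexample: int, still_fails) -> int:
    """Naive shrinking: keep halving the counterexample while it still fails."""
    current = counterexample
    while True:
        candidate = current // 2
        if candidate == current or not still_fails(candidate):
            return current
        current = candidate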

Using Hypothesis

Hypothesis is a popular PBT library for Python, used among others by developers of NumPy, Pandas, and Stripe. In Hypothesis, we construct a strategy (the data generating process) which we supply to a test using @given.

from hypothesis import strategies, given

@given(strategies.integers(), strategies.integers())
def test_commutativity(a: int, b: int):
    assert addition_commutes(a, b)

We can specify explicit examples with @example, which lets us re-use the PBT suite for traditional unit tests (something otherwise done with ddt or pytest.mark.parametrize). Hypothesis integrates well with PyTest, and there is a plugin for reporting some test statistics.

from hypothesis import example

@example("from an example")
@given(strategies.just("from a strategy"))
def test_explicit_examples(x: str):
    """First runs explicit examples, then generates."""
    print(x)

Strategies convert random bytestrings into Python objects. This transformation is deterministic, so Hypothesis can save the bytestrings behind failing examples and replay them to reproduce failures, without having to serialise the generated objects themselves.

Once Hypothesis finds a counterexample, it will try to mutate the corresponding bytestring into a shorter one, corresponding to a simpler object, for as long as it gets the same error. This process is called “shrinking”.

N = 817384

@given(strategies.integers())
def test_smaller_than(i: int):
    """Shrinking can find this number."""
    assert i < N

Lastly, we can control test runtime with @settings. For example, the default number of examples is 100, and the default deadline per example is 200ms. With derandomize, the test is seeded with a hash of the test function.

from hypothesis import settings

pattern = r"[abc][def][123]"

@settings(
    max_examples=5,
    deadline=None,
    database=None,
    derandomize=True,
    print_blob=True,
)
@given(strategies.from_regex(pattern, fullmatch=True))
def test_regex(s: str):
    assert any(l in s for l in pattern)

Constructing Strategies

Hypothesis has many built-in strategies we can use, such as integers, lists, or from_regex. There are also a few ways to compose more complex strategies.

Map and Filter

Every strategy has map and filter methods. We can also reject examples inside the test using assume. It’s preferable to build more specific strategies first, as rejecting from a more general strategy slows down the generator.

import datetime

def epochs():
    """Epoch timestamps at even seconds."""
    min_value = datetime.datetime(1999, 1, 1)
    max_value = datetime.datetime(2050, 1, 1)
    return (
        strategies.datetimes(min_value, max_value)
        .map(lambda dt: int(dt.timestamp()))
        .filter(lambda ts: ts % 2 == 0)  # keep only even seconds, as the docstring promises
    )
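
The same kind of constraint can instead be enforced inside the test body with assume, which rejects the current example and asks Hypothesis for a new one; a minimal sketch:

from hypothesis import assume

@given(strategies.integers())
def test_even_squares(i: int):
    """Reject odd draws; Hypothesis generates a replacement example."""
    assume(i % 2 == 0)
    assert (i * i) % 2 == 0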

Builds

The builds strategy takes a callable (a function or class) and fills in its missing arguments recursively, inferring a strategy for each argument from its type hints.

With this, we can generate a strategy from a nested dataclass, as long as every type involved has a strategy to infer. We can register custom types to help in this inference, or override fields directly in builds.

from proto_tools.codec.entry import EntryCode

def entry_codes():
    """Instances of EntryCode."""
    return strategies.builds(EntryCode)
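
For illustration, here is a sketch with a hypothetical nested dataclass (Origin and Entry are made up for this example): every field either has its strategy inferred from the type hint, registered with register_type_strategy, or overridden directly in builds.

import dataclasses

@dataclasses.dataclass
class Origin:
    name: str     # inferred from the type hint
    offset: int   # constrained in the registered strategy below

@dataclasses.dataclass
class Entry:
    origin: Origin   # uses the strategy registered for Origin
    payload: bytes   # overridden directly in builds

# Teach Hypothesis how to build Origin whenever one is needed.
strategies.register_type_strategy(
    Origin,
    strategies.builds(Origin, offset=strategies.integers(min_value=0)),
)

def entries():
    """Instances of Entry, with the payload field overridden explicitly."""
    return strategies.builds(Entry, payload=strategies.binary(max_size=16))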

Composite

Another example is composite, the sledgehammer of strategies. We can imperatively draw individual values from other strategies and perform arbitrary operations between draws.

@strategies.composite
def entry_codes_prim(draw: strategies.DrawFn):
    """Imperatively edit every draw from entry_codes."""
    code = draw(entry_codes())
    if code.t_origin:
        code.t_origin = int(code.t_origin)
        code.t_origin += 1
    return code

Limitations & Extensions

Some code may be a good fit for PBT yet too complex for the defaults: with an explosion of possible states, failure cases may be hard to reach with simple heuristics. Default settings in Hypothesis prioritise speed over power, but there are a few ways to adjust this tradeoff.

  1. Run the test for longer. For example, a separate job for a major release could run the PBT suite with different settings than in CI.
  2. Targeted exploration. Maximise some score inside the test. This may slow down the test a little, and might require more examples to make a difference, but it can explore the state space more efficiently.
  3. Coverage-guided exploration. Let the computer maximise the number of visited code branches (or lines, or scopes, depending on the definition of coverage). Available for Hypothesis through HypoFuzz.
  4. Stateful testing (or model-based testing, or state machine testing). Describe relations between components, mutate some shared state and let the framework look for sequences of events violating properties.

CI

We probably don’t want to use the same configuration locally and in CI. We want tests in CI to be deterministic (with the seed explicitly set), to contain explicit examples (with @example), and to run as fast as possible.

While developing locally, we might want more options: the same configuration as in CI for quick feedback while making changes, but also the ability to run for longer when we are not as confident.
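
One way to keep the two setups separate (a sketch; the profile names are made up for this example) is Hypothesis’ settings profiles, registered once and selected with an environment variable:

import os

# Fast, deterministic runs for CI; a longer, randomised run for local deep dives.
settings.register_profile("ci", max_examples=100, derandomize=True)
settings.register_profile("thorough", max_examples=10_000, deadline=None)

# Select a profile per environment, e.g. HYPOTHESIS_PROFILE=thorough pytest
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "ci"))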

Fuzzing

What is “fuzzing”? It’s an older term, originally associated with looking for security bugs in programs. American Fuzzy Lop (AFL) pioneered coverage-guided fuzzing. Fuzzers usually don’t shrink the counterexamples they find.

That said, we can define a property as “this whole program doesn’t crash”. We can also express properties as exceptions inside a program. Fuzzers focus on many random payloads (often independent and attempted concurrently). PBT encourages testing more general properties, and performs this search not randomly, but adaptively.

Targeted Exploration

We can find a float that violates associativity. With target we tell the framework to maximise the error between the two results, guiding the search towards inputs with a large error (shrinking still keeps the reported example small). We also add an event to keep track of the statistics of the differences we see.

from hypothesis import event, target

floats = strategies.floats(0, 1e100)

@given(floats, floats, floats)
def test_associative_floats(a, b, c):
    """IEEE 754 floats are not associative."""
    x = (a + b) + c
    y = a + (b + c)
    diff = abs(x - y)
    event(str(diff))  # event takes a string label; this tracks which differences we observe
    target(diff)
    assert diff == 0

Coverage-Guided Exploration

Given a complex program with many branches (think of “if” statements), we may want to maximise branch coverage from our generated tests.

import string

@given(strategies.text(alphabet=string.ascii_lowercase))
def test_branches(s: str):
    """Example of a bug hidden in the branches."""
    if len(s) == 4:
        if s[0] == "a":
            if s[1] == "s":
                if s[2] == "d":
                    if s[3] == "f":
                        raise RuntimeError(s)

The test above runs fine, but plain Hypothesis won’t get to the RuntimeError. Even after getting into the first branch, there are 26^4 combinations to go through. It turns out that, using some coverage information, we can get there after checking only ~3.7% of the state space.

import pprint
from hypofuzz.hy import FuzzProcess

def fuzz_branches():
    """This will run until an exception occurs or we stop it."""
    process = FuzzProcess.from_hypothesis_test(test_branches)
    while True:
        pprint.pprint(process._json_description)
        process.run_one()

Stateful Testing

Instead of generating plain data, we may want to generate sequences of actions against some shared state. For example, we can define a simple model of a key-value store, express properties as invariants over the state and the operations, and let Hypothesis search for sequences of actions that lead to an inconsistent state.
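
A minimal sketch of that idea, where the dict-backed store stands in for a real system under test:

from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

class KeyValueStoreMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.store = {}   # the system under test (just a plain dict here)
        self.model = {}   # the reference model we compare against

    @rule(key=strategies.text(), value=strategies.integers())
    def put(self, key, value):
        self.store[key] = value
        self.model[key] = value

    @rule(key=strategies.text())
    def delete(self, key):
        self.store.pop(key, None)
        self.model.pop(key, None)

    @invariant()
    def store_agrees_with_model(self):
        assert self.store == self.model

# Expose the state machine as a regular test case for pytest/unittest.
TestKeyValueStore = KeyValueStoreMachine.TestCase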

Formal Verification?

You may be familiar with ways to formally prove the correctness of code, like TLA+, known for its use at Amazon, or languages with strong enough type systems (Coq, Agda, Idris, Lean).

Some call PBT an example of a “lightweight formal method”. Of course, we can’t formally verify anything with random inputs.