A hands-on tutorial on property-based testing (PBT).
Instead of manually writing unit tests with hand-picked examples, we think of a property that should always hold and check it against randomly generated inputs.
More concretely, we iterate through three steps: generate random inputs, run the code under test, and check that the property holds.
Here are a few common properties: a round trip returns the original value (e.g. decoding an encoded message), an operation is idempotent, the output agrees with a simpler reference implementation, or an algebraic law such as commutativity holds.
We can implement basic PBT in Python without any external libraries. For example, here is a test for commutativity of the Python + operator.
import random

def integers() -> int:
    """Generator: a random integer in a fixed range."""
    return random.randint(-10**5, 10**5)

def addition_commutes(a: int, b: int) -> bool:
    """Property: addition is commutative."""
    return a + b == b + a

def test_commutativity():
    for _ in range(100):
        a, b = integers(), integers()
        assert addition_commutes(a, b)
A dedicated PBT framework can help build generators, manage random seeds, use complex heuristics, and find smaller counterexamples.
Our Python implementation looks for any random counterexample, which may be very complex. It would be much nicer for the developer if the counterexamples were as small as possible. This process is called “shrinking”; it is a sizeable research topic in PBT, as well as a popular feature in PBT frameworks.
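As a rough illustration, here is a minimal hand-rolled shrinking loop for our integer pairs; shrink_step and shrink_pair are made up for this sketch, and since addition_commutes never actually fails, the loop would return immediately in practice.

def shrink_step(value: int) -> int:
    """One naive shrink step: halve the value, truncating towards zero."""
    return int(value / 2)

def shrink_pair(a: int, b: int, prop) -> tuple[int, int]:
    """Greedily shrink a failing pair (a, b) while prop keeps failing."""
    improved = True
    while improved:
        improved = False
        for candidate in ((shrink_step(a), b), (a, shrink_step(b))):
            if candidate != (a, b) and not prop(*candidate):
                a, b = candidate
                improved = True
                break
    return a, b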
Hypothesis is a popular PBT library for Python, used by, among others, the developers of NumPy, Pandas, and Stripe. In Hypothesis, we construct a strategy (the data-generating process) and supply it to a test using @given.
from hypothesis import strategies, given

@given(strategies.integers(), strategies.integers())
def test_commutativity(a: int, b: int):
    assert addition_commutes(a, b)
We can specify explicit examples manually using @example. This way we can re-use our PBT suite for traditional unit tests; the same can be achieved with ddt or with pytest.mark.parametrize. Hypothesis integrates very well with pytest, and there is a plugin for reporting on some test statistics.
from hypothesis import example

@example("from an example")
@given(strategies.just("from a strategy"))
def test_explicit_examples(x: str):
    """First runs explicit examples, then generates."""
    print(x)
Strategies convert random bytestrings into Python objects. This transformation is deterministic, so Hypothesis can save earlier failures as bytestrings and replay them to reproduce the errors, without having to serialise anything else.
Once Hypothesis finds a counterexample, it will try to mutate the corresponding bytestring into a shorter one, corresponding to a simpler object, for as long as it gets the same error. This process is called “shrinking”.
N = 817384

@given(strategies.integers())
def test_smaller_than(i: int):
    """Shrinking can find this number."""
    assert i < N
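On a failing run, Hypothesis shrinks integers towards zero, so it should report something like Falsifying example: test_smaller_than(i=817384), the smallest integer that violates the assertion.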
Lastly, we can control test runtime with @settings. For example, the default number of examples is 100, and the default deadline per example is 200ms. With derandomize, the test is seeded with a hash of the test function.
from hypothesis import settings

pattern = r"[abc][def][123]"

@settings(
    max_examples=5,
    deadline=None,
    database=None,
    derandomize=True,
    print_blob=True,
)
@given(strategies.from_regex(pattern, fullmatch=True))
def test_regex(s: str):
    assert any(l in s for l in pattern)
Hypothesis has many built-in strategies we can use, such as integers, lists, or from_regex. There are also a few ways to compose more complex strategies.
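For instance, we can nest and combine the built-ins directly; the concrete bounds below are just an illustration.

# Lists of one to five small integers.
small_int_lists = strategies.lists(
    strategies.integers(min_value=0, max_value=100), min_size=1, max_size=5
)

# Either a short text or None.
maybe_text = strategies.one_of(strategies.text(max_size=10), strategies.none())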
Every strategy has map and filter methods. We can also reject examples inside the test using assume. It’s preferable to build more specific strategies in the first place, since rejecting examples from a more general strategy slows generation down.
import datetime

def epochs():
    """Epoch timestamps at even seconds."""
    min_value = datetime.datetime(1999, 1, 1)
    max_value = datetime.datetime(2050, 1, 1)
    return (
        strategies.datetimes(min_value, max_value)
        .map(lambda dt: int(dt.timestamp()))
        .filter(lambda ts: ts % 2 == 0)
    )
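To illustrate assume, here is a small sketch that rejects zero divisors inside the test body; the floor-division property itself is just an example.

from hypothesis import assume

@given(strategies.integers(), strategies.integers())
def test_floor_division(a: int, b: int):
    """Reject the rare zero divisor instead of encoding it in the strategy."""
    assume(b != 0)
    assert (a // b) * b + a % b == a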
The builds strategy takes a function or class and builds up its arguments recursively, inferring a strategy from type hints for any required argument we don’t supply.
With this, we can generate a strategy for a nested dataclass, as long as every type involved has a strategy to infer. We can register strategies for custom types (with register_type_strategy) to help this inference, or override fields directly in builds.
from proto_tools.codec.entry import EntryCode

def entry_codes():
    """Instances of EntryCode."""
    return strategies.builds(EntryCode)
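EntryCode comes from the project under test; as a self-contained sketch, a nested dataclass works the same way (Header and Entry below are made up), and we can override individual fields while letting the rest be inferred.

import dataclasses

@dataclasses.dataclass
class Header:
    version: int
    created_at: datetime.datetime

@dataclasses.dataclass
class Entry:
    header: Header
    name: str

def entries():
    """Entries with a non-empty name; Header fields are inferred from type hints."""
    return strategies.builds(
        Entry,
        header=strategies.builds(Header),
        name=strategies.text(min_size=1),
    )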
Another example is composite, the sledgehammer of strategies. We can imperatively draw individual values from other strategies and perform arbitrary operations between draws.
@strategies.composite
def entry_codes_prim(draw: strategies.DrawFn):
    """Imperatively edit every draw from entry_codes."""
    code = draw(entry_codes())
    if code.t_origin:
        code.t_origin = int(code.t_origin)
        code.t_origin += 1
    return code
We may have some code that fits PBT but is too complex for it: with an explosion of possible states, failure cases can be hard to reach with simple heuristics. The default settings in Hypothesis prioritise speed over power, but there are a few ways to adjust this trade-off.
We don’t necessarily want to share one configuration between local development and CI. In CI, we want tests to be deterministic (with the seed explicitly set), to contain explicit examples (with @example), and to run as fast as possible.
While developing locally, we might want more options: the same configuration as in CI for quick feedback while making changes, but also the ability to run for longer when we are not as confident.
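One way to organise this is with Hypothesis settings profiles, registered once (typically in conftest.py) and selected with an environment variable; the profile names and numbers below are only an illustration.

import os

from hypothesis import settings

# CI: deterministic and quick, relying on explicit examples and a fixed budget.
settings.register_profile("ci", max_examples=100, derandomize=True, print_blob=True)

# Local deep runs: spend much more time searching for counterexamples.
settings.register_profile("deep", max_examples=10_000, deadline=None)

# Default to the CI profile; override locally with HYPOTHESIS_PROFILE=deep.
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "ci"))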
What is “fuzzing”? It’s an older term, originally related to looking for security bugs in programs. American Fuzzy Lop (AFL) pioneered coverage-guided fuzzing. Fuzzers usually don’t shrink counterexamples.
That said, we can define a property as “this whole program doesn’t crash”, and we can express finer-grained properties as exceptions raised inside the program. Fuzzers throw many random payloads at a program (often independently and concurrently); PBT encourages testing more general properties and performs this search not purely at random, but adaptively.
We can find a float that violates associativity. With target we tell the framework to maximise the difference between the two results, guiding generation towards examples with the biggest error, while shrinking still keeps the examples themselves small. We also record an event to keep track of the statistics of the differences we see.
from hypothesis import event, target

floats = strategies.floats(0, 1e100)

@given(floats, floats, floats)
def test_associative_floats(a, b, c):
    """IEEE 754 floats are not associative."""
    x = (a + b) + c
    y = a + (b + c)
    diff = abs(x - y)
    event(diff)
    target(diff)
    assert diff == 0
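The recorded events and the progress of target can then be inspected by running pytest with --hypothesis-show-statistics.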
Given a complex program with many branches (think of “if” statements), we may want to maximise branch coverage from our generated tests.
import string

@given(strategies.text(alphabet=string.ascii_lowercase))
def test_branches(s: str):
    """Example of a bug hidden in the branches."""
    if len(s) == 4:
        if s[0] == "a":
            if s[1] == "s":
                if s[2] == "d":
                    if s[3] == "f":
                        raise RuntimeError(s)
The test above runs fine, but we won’t reach the RuntimeError with plain Hypothesis. Even after getting into the first branch, there are 26^4 combinations to go through. It turns out that, using coverage information, we can get there after checking only ~3.7% of the state space.
import pprint

from hypofuzz.hy import FuzzProcess

def fuzz_branches():
    """This will run until an exception occurs or we stop it."""
    process = FuzzProcess.from_hypothesis_test(test_branches)
    while True:
        pprint.pprint(process._json_description)
        process.run_one()
Instead of generating data, we may want to generate sequences of actions that operate on some state. For example, we may define a simple model of a key-value store, define properties as invariants on both state and functions, and let Hypothesis find sequences that lead to an inconsistent state.
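Here is a minimal sketch of such a stateful test; SimpleKeyValueStore stands in for a hypothetical store under test, and a plain dict serves as the model.

from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

class SimpleKeyValueStore:
    """Hypothetical store under test: a thin wrapper around a dict."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def size(self):
        return len(self._data)

class KeyValueStoreMachine(RuleBasedStateMachine):
    """Run random sequences of puts and deletes, comparing against the model."""

    def __init__(self):
        super().__init__()
        self.store = SimpleKeyValueStore()
        self.model = {}

    @rule(key=strategies.text(), value=strategies.integers())
    def put(self, key, value):
        self.store.put(key, value)
        self.model[key] = value

    @rule(key=strategies.text())
    def delete(self, key):
        self.store.delete(key)
        self.model.pop(key, None)

    @invariant()
    def agrees_with_model(self):
        assert self.store.size() == len(self.model)
        for key, value in self.model.items():
            assert self.store.get(key) == value

# Collected by pytest as a regular test case.
TestKeyValueStore = KeyValueStoreMachine.TestCase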
You may be familiar with ways to formally prove correctness of code, like TLA+, known for its use at Amazon, or languages with strong enough type systems (Coq, Agda, Idris, Lean).
Some call PBT an example of a “lightweight formal method”. Of course, we can’t formally verify anything with random inputs.