# Property Based Testing

Hands-on tutorial on property-based testing.

## The Big Idea

> Instead of manually writing unit tests,
> think of a property that always holds
> and generate inputs at random.

More concretely, we iterate through these three steps.

1. Define a data generating process.
   It can be derived from types, or constraints,
   or constructed inside the test.
2. Define a predicate that holds for all such data.
   This is the "property" we are testing.
   For example, if `+` is commutative, then `a + b = b + a`.
3. Let the computer look for the smallest counterexamples.

Here are a few common properties.

- Order of arguments doesn't matter: `a + b = b + a`
- Order of operations doesn't matter: `distinct(sort(x)) = sort(distinct(x))`
- Inverse cancels out: `serialise(deserialise(x)) = x`
- Idempotency: `distinct(distinct(x)) = distinct(x)`, `sort(sort(x)) = sort(x)`
- Oracle: optimised vs naive implementation, multi-threaded vs single-threaded
- Domain-specific invariants: sort doesn't change size and contents, balanced trees are balanced

We can implement basic PBT in Python without any external libraries.
For example, here is a test for commutativity of the `+` Python operator.

```python
import random


def integers() -> int:
    """Generator."""
    return random.randint(-100_000, 100_000)


def addition_commutes(a: int, b: int) -> bool:
    """Property."""
    return a + b == b + a


def test_commutativity():
    for _ in range(100):
        a, b = integers(), integers()
        assert addition_commutes(a, b)
```

A dedicated PBT framework can help build generators, manage random seeds,
use complex heuristics, and find smaller counterexamples.

> Our Python implementation
> looks for any random counterexample,
> which may be very complex.
> It would be much nicer for the developer
> if the counterexamples were as small as possible.
> This process is called "shrinking",
> and is a pretty big research topic in PBT,
> as well as a popular feature in PBT frameworks.

## Using Hypothesis

[Hypothesis] is a popular PBT library for Python,
used among others by the developers of [NumPy], [Pandas], and [Stripe].
In Hypothesis, we construct a strategy (the data generating process)
which we supply to a test using `@given`.

[hypothesis]: https://hypothesis.readthedocs.io
[numpy]: https://numpy.org/doc/stable/reference/testing.html#tests-on-random-data
[pandas]: https://github.com/pandas-dev/pandas/blob/main/pandas/_testing/_hypothesis.py
[stripe]: https://hypothesis.readthedocs.io/en/latest/endorsements.html#id1

```python
from hypothesis import strategies, given


@given(strategies.integers(), strategies.integers())
def test_commutativity(a: int, b: int):
    assert addition_commutes(a, b)
```

We can specify explicit examples manually using `@example`.
This way we can re-use our PBT suite for traditional unit tests.
The same can also be achieved with `ddt` or `pytest.mark.parametrize`.
Hypothesis integrates with pytest very well,
and there is a [plugin] for reporting test statistics.

[plugin]: https://hypothesis.readthedocs.io/en/latest/supported.html#testing-frameworks

```python
from hypothesis import example


@example("from an example")
@given(strategies.just("from a strategy"))
def test_explicit_examples(x: str):
    """First runs explicit examples, then generates."""
    print(x)
```

Strategies convert random bytestrings into Python objects.
This transformation is deterministic,
so Hypothesis can save earlier failures as bytestrings
and report them to reproduce failures, without having to serialise anything.
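As a hedged sketch of how this looks in practice (the `test_no_tabs` property below is made up for illustration), a failing run with `print_blob=True` prints a `reproduce_failure(version, blob)` decorator; pasting that decorator above `@given` replays the exact saved bytestring.

```python
from hypothesis import given, settings, strategies


@settings(print_blob=True)  # on failure, print the blob needed to replay it
@given(strategies.text())
def test_no_tabs(s: str):
    """Deliberately failing demo: when this fails, Hypothesis prints a
    @reproduce_failure(version, blob) decorator; pasting it above @given
    replays the exact bytestring, without serialising the failing string."""
    assert "\t" not in s
```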
Once Hypothesis finds a counterexample,
it will try to mutate the corresponding bytestring into a shorter one,
corresponding to a simpler object,
for as long as it triggers the same error.
This process is called "shrinking".

```python
N = 817384


@given(strategies.integers())
def test_smaller_than(i: int):
    """Shrinking can find this number."""
    assert i < N
```

Lastly, we can control test runtime with `@settings`.
For example, the default number of examples is 100,
and the default deadline per example is 200ms.
With `derandomize`, the test is seeded with a hash of the test function.

```python
from hypothesis import settings

pattern = r"[abc][def][123]"


@settings(
    max_examples=5,
    deadline=None,
    database=None,
    derandomize=True,
    print_blob=True,
)
@given(strategies.from_regex(pattern, fullmatch=True))
def test_regex(s: str):
    assert any(l in s for l in pattern)
```

## Constructing Strategies

Hypothesis has many built-in strategies we can use,
such as `integers`, `lists`, or `from_regex`.
There are also a few ways to compose more complex strategies.

### Map and Filter

Every strategy has `map` and `filter` methods.
We can also reject examples inside the test using `assume`.
It's preferable to build more specific strategies first,
as rejecting from a more general strategy slows down the generator.

```python
import datetime


def epochs():
    """Epoch timestamps at even seconds."""
    min_value = datetime.datetime(1999, 1, 1)
    max_value = datetime.datetime(2050, 1, 1)
    return (
        strategies.datetimes(min_value, max_value)
        .map(lambda dt: int(dt.timestamp()))
        .filter(lambda ts: ts % 2 == 0)
    )
```

### Builds

The `builds` strategy takes a function or class,
infers strategies for its arguments from type hints
(recursively, for nested types),
and calls it with the generated values.
With this, we can generate a strategy from a nested dataclass,
as long as every type involved has a strategy to infer.
We can register custom types to help in this inference,
or override fields directly in `builds`.

```python
from proto_tools.codec.entry import EntryCode


def entry_codes():
    """Instances of EntryCode."""
    return strategies.builds(EntryCode)
```

### Composite

Another example is `composite`, the sledgehammer of strategies.
We can imperatively draw individual values from other strategies
and perform arbitrary operations between draws.

```python
@strategies.composite
def entry_codes_prim(draw: strategies.DrawFn):
    """Imperatively edit every draw from entry_codes."""
    code = draw(entry_codes())
    if code.t_origin:
        code.t_origin = int(code.t_origin)
        code.t_origin += 1
    return code
```

## Limitations & Extensions

Some code fits PBT but is too complex for it to work out of the box:
when the space of possible states explodes,
failure cases may be hard to reach with simple heuristics.
While the default settings in Hypothesis prioritise speed over power,
there are a few ways to adjust this tradeoff.

1. Run the test for longer.
   We might run a separate job for a major release
   with different settings than in CI
   (see the profile sketch after this list).
2. Targeted exploration.
   Maximise some scores inside the test.
   This may slow down the test a little,
   and might require more examples to make a difference,
   but it may explore the state space more efficiently.
3. Coverage-guided exploration.
   Let the computer maximise the number of visited code branches
   (or lines, or scopes, depending on the definition of coverage).
   Available for Hypothesis through [HypoFuzz].
4. Stateful testing (or model-based testing, or state machine testing).
   Describe relations between components, mutate some shared state,
   and let the framework look for sequences of events that violate properties.

[hypofuzz]: https://hypofuzz.com
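As a minimal sketch of point 1, Hypothesis's settings profiles let CI and a longer job use different budgets. The profile names and the `HYPOTHESIS_PROFILE` environment variable below are our own convention, not anything built in.

```python
import os

from hypothesis import settings

# Hypothetical profile names; register whatever budgets fit your pipeline.
settings.register_profile("ci", max_examples=100, derandomize=True, print_blob=True)
settings.register_profile("nightly", max_examples=10_000, deadline=None)

# Typically placed in conftest.py and selected per job,
# e.g. HYPOTHESIS_PROFILE=nightly pytest
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "ci"))
```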
### CI

We don't necessarily want the same configuration locally and in CI.
In CI, we want tests to be deterministic (with the seed explicitly set),
to contain explicit examples (with `@example`),
and to run as fast as possible.
While developing locally, we might want more options:
the same configuration as in CI for quick feedback while making changes,
but also the ability to run for longer when we are not as confident.

### Fuzzing

What is "fuzzing"? It's an older term,
originally related to looking for security bugs in programs.
American Fuzzy Lop (AFL) pioneered coverage-guided fuzzing.
Fuzzers usually don't shrink counterexamples.
That said, we can define a property as "this whole program doesn't crash",
and we can express properties as exceptions raised inside a program.
Fuzzers focus on throwing many random payloads at a program
(often independent and attempted concurrently),
while PBT encourages testing more general properties
and performs the search not purely randomly, but adaptively.
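As a small illustration of the "doesn't crash" property (the standard-library `json` parser here is just a stand-in for your own code), we only require that arbitrary input either parses or raises the documented exception type:

```python
import json

from hypothesis import given, strategies


@given(strategies.text())
def test_json_loads_does_not_crash(s: str):
    """Fuzzing-style property: any input either parses or raises ValueError."""
    try:
        json.loads(s)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError; anything else is a bug.
        pass
```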
### Targeted Exploration

We can find a float that violates associativity.
With [target] we tell the framework to maximise the error between two floats,
guiding it to find the smallest examples with the biggest error.
We also record an event to keep track of the statistics of the differences we see.

[target]: https://hypothesis.readthedocs.io/en/latest/details.html#hypothesis.target

```python
from hypothesis import event, target

floats = strategies.floats(0, 1e100)


@given(floats, floats, floats)
def test_associative_floats(a, b, c):
    """IEEE 754 floats are not associative."""
    x = (a + b) + c
    y = a + (b + c)
    diff = abs(x - y)
    event(diff)
    target(diff)
    assert diff == 0
```

### Coverage-Guided Exploration

Given a complex program with many branches (think of `if` statements),
we may want to maximise branch coverage from our generated tests.

```python
import string


@given(strategies.text(alphabet=string.ascii_lowercase))
def test_branches(s: str):
    """Example of a bug hidden in the branches."""
    if len(s) == 4:
        if s[0] == "a":
            if s[1] == "s":
                if s[2] == "d":
                    if s[3] == "f":
                        raise RuntimeError(s)
```

The test above passes, but Hypothesis alone won't reach the RuntimeError:
even after getting into the first branch,
there are 26^4 combinations to go through.
It turns out that, using coverage information,
we can get there after checking only about 3.7% of the state space.

```python
import pprint

from hypofuzz.hy import FuzzProcess


def fuzz_branches():
    """This will run until an exception occurs or we stop it."""
    process = FuzzProcess.from_hypothesis_test(test_branches)
    while True:
        pprint.pprint(process._json_description)
        process.run_one()
```

### Stateful Testing

Instead of generating data, we may want to generate actions that mutate some state.
For example, we may define a simple model of a key-value store,
express properties as invariants on both the state and the operations,
and let Hypothesis find sequences of actions that lead to an inconsistent state.
A minimal sketch of such a state machine closes this tutorial.

## Formal Verification?

You may be familiar with ways to formally prove the correctness of code,
like [TLA+], known for its use at Amazon,
or languages with strong enough type systems ([Coq], Agda, Idris, [Lean]).

[tla+]: https://en.wikipedia.org/wiki/TLA%2B
[coq]: https://en.wikipedia.org/wiki/Coq
[lean]: https://leanprover-community.github.io/

Some call PBT an example of a ["lightweight formal method"][lfm].
Of course, we can't formally verify anything with random inputs.

[lfm]: https://www.cis.upenn.edu/~bcpierce/papers/Yow2022-Pierce.pdf "slide 16"
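Returning to the stateful-testing section above, here is a minimal sketch of the key-value model it describes, using Hypothesis's `hypothesis.stateful` API. The plain dict `store` stands in for whatever system we would actually test, and `model` is the reference we compare it against.

```python
from hypothesis import strategies
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule


class KeyValueStore(RuleBasedStateMachine):
    """Hypothesis generates random sequences of the rules below."""

    def __init__(self):
        super().__init__()
        self.store = {}  # stand-in for the real key-value store under test
        self.model = {}  # trivially correct reference model

    @rule(key=strategies.text(), value=strategies.integers())
    def put(self, key, value):
        self.store[key] = value
        self.model[key] = value

    @rule(key=strategies.text())
    def delete(self, key):
        self.store.pop(key, None)
        self.model.pop(key, None)

    @invariant()
    def agrees_with_model(self):
        # Checked after every step.
        assert self.store == self.model


# Collected by pytest like any other test.
TestKeyValueStore = KeyValueStore.TestCase
```

Hypothesis runs random sequences of `put` and `delete`, checks the invariant after every step, and shrinks any failing run to a minimal sequence of actions.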