How to Write Simple Tests that Scale with Codebases, Organizations, and Changing Requirements

Date: 2024-08-23 | create | tech | testing | software-engineering |

Testing is a chronic pain point and a frequent target of discussion in software engineering. I've recently been in another round of discussions at work about how to write tests for large systems and best practices for orgs / teams. Here I want to share some of these thoughts to spread the discussion and battle-harden these ideas.

I personally think I'm pretty lax on most coding philosophies - if what you wrote is a Simple Scalable System solving the problem at hand then it gets a pass from me. But I've also noticed many patterns of failure over my 7-year career working as a software engineer at companies large and small.

Of note are common patterns of testing that end up complex, hard to reason about, and unable to scale with inevitably changing requirements. Worse, we get 1000s of tests but the system remains brittle and no one can tell whether the tested requirements are useful or not - basically an untested system with extra steps.

Here I'm going to share some observations about testing and some tactics I've found helpful in doing this better at scale. I'm not saying this is the "right" way - there is no right way - but I am saying I've observed these practices to work better than other alternatives in more situations.


Why do we Test?

The goal of a test is to make sure a system does what it's supposed to. A system only exists to produce some impact in the outside world. This means what we care about (and by extension what we test) is - given x inputs, expect y outputs.

Testing is useful for businesses / orgs / eng teams for several reasons:

  • Ensure System follows business rules - If we say we're going to charge you $x and the system charges you $y instead -> that's a problem. The only way to prove these business rules at scale is to have automated proofs - testing.
  • Provide Accurate Documentation - Aside from proving the system follows the rules, tests also serve as a source of truth for what these rules are. Every large business likely has dozens of documents outlining a single business system - artifacts from previous planning sessions, improvements, and projects. At the end of the day the system is what is actually running the business so the system is the source of truth for how this process actually works.
  • Enable faster, safer changes to systems - Software systems are really just virtual buildings. Tests prove that the external requirements are still met which allows us to change how the building is constructed without ruining everything inside the building. If we don't have tests / tests aren't proving our system well - it's easy to accidentally remove the foundation / roof / stairs / etc from the building and not notice til later when the system crashes. Having confidence that a random change doesn't destroy the building entirely helps to allow for faster system changes without destroying everything.

For a further exploration of the software as a virtual building metaphor - see: Why Type-safe Programming Languages are better than Dynamic and Lead to Faster, Safer Software at Scale

Of particular note - the outside world does not care about how this x -> y conversion takes place, just that it takes place within their expected constraints.

This makes sense. Let's say there's a vending machine that takes $x and gives y drink. As a customer if I pay $x I expect y drink. I don't care how that drink gets to me, it just needs to get to me (and probably do so fast / unbroken / etc).

This means it doesn't matter what the mechanism is to provide drink. It could be a robot. It could be a human. It could be a Lovecraftian horror. It doesn't matter - I paid $x, I need y drink.

This may seem like a minor point but this is where I see a lot of tests go wrong - they test the implementation and not the behavior. This often leads to dozens of tests for a given system that enforce that the robot / human / horror is doing something, but miss the simple fact that what we care about is $x -> y drink.
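To make that concrete, here's a rough sketch in Python (the VendingMachine class and all of its fields are made up just for this illustration) showing an implementation-coupled test next to a behavioral one:

from dataclasses import dataclass, field

# Hypothetical vending machine, invented purely for this example.
@dataclass
class VendingMachine:
    prices: dict = field(default_factory=lambda: {"cola": 2.00})
    balance: float = 0.0
    dispense_log: list = field(default_factory=list)  # internal detail

    def insert_cash(self, amount: float) -> None:
        self.balance += amount

    def buy(self, item: str):
        if self.balance < self.prices[item]:
            return None
        self.balance -= self.prices[item]
        self.dispense_log.append(item)  # *how* the drink gets dispensed
        return item

# Implementation test: couples to the internal dispense_log, breaks on any internal rewrite.
def test_buy_appends_to_dispense_log():
    machine = VendingMachine()
    machine.insert_cash(2.00)
    machine.buy("cola")
    assert machine.dispense_log == ["cola"]

# Behavioral test: only checks $x in -> y drink out, survives robot -> human -> horror.
def test_paying_full_price_dispenses_drink():
    machine = VendingMachine()
    machine.insert_cash(2.00)
    assert machine.buy("cola") == "cola"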

Moreover it makes each inevitable requirements change harder to make. If product decides to go from robot -> horror, now all the tests are broken. Is the vending machine actually broken though? Unclear, cause first we gotta change all the internal implementation tests to pass, then look at the behavioral tests. But where were the behavioral tests again? Ah, I can't find them because they're hidden amongst the dozens of implementation tests. Oh, finally found them - wait, are they still testing the right thing? I just changed a ton of tests to get them to pass - was that behavioral or implementation?

Okay this is a bit contrived and extreme but this happens so frequently that frankly I'm frustrated we keep doing this to ourselves.

So to sum up - the reason we write tests is to ensure the system produces the impact it's supposed to (it follows the requirements). A system's impact is its behavior - NOT its implementation. So to test the impact of a system we need to be testing its behavior - given x, output y.

What does it mean for a Test to be Scalable?

I like to build Simple Scalable Systems. This is because:

  • If it's not Simple - Hard to reason about thus hard to maintain / change over time. Change is inevitable and most system engineering is maintenance so !simple generally means worse system.
  • If not Scalable - Breaks when we scale up / down or requirements inevitably change. Change is inevitable so if it doesn't scale, it's not a long-term solution.
  • If not Systemic - We can't reuse the system / its philosophy so we're doing a lot of one-off work. This is okay sometimes but if you see common problems / scenarios it's often useful to invest in a common solution so you can spend your innovation budget elsewhere.

If the goal of a test is to ensure our system is doing what it's supposed to, the test needs to take a form that it can change along with the system it tests. Ideally we want the test to require minimal extra effort (unnecessary complexity) on top of the primary system to handle these changes while still maintaining correctness.

Common changes that tests need to scale for:

  • Understanding current functionality (inputs, outputs)
  • New requirements - new / different (inputs, outputs)
  • New Implementation detail changes (robot -> horror, vendor A -> vendor B)
  • New devs reading / changing / debugging system for first time (inputs, outputs)
  • Product people trying to understand existing system (Requirements vs inputs, outputs)

For real-world testing purposes we have several dimensions of scalability that I think are important to handle these scenarios:

  • Codebase size - The larger the codebase is the less likely a single person understands all the requirements and thus the more important it is for these requirements to be understandable and enforced in code. At the end of the day the code is the source of truth so the more readable it is (and tests are requirements documentation!) the better we scale to larger codebases (which really just means more systemized requirements).
  • Organization size - The larger the org is the less likely a single person understands the full product / domain requirements and the more important it is that the code itself acts as the source of truth for docs / enforcement. Moreover the larger the org is, often the more people are making changes, so if the source of truth is not clear - it's more likely for people to make seemingly innocuous changes that break the whole system.
  • Requirement Change - Change is inevitable. If your tests cannot easily change with the underlying system then you are making more work for yourself in the future. More generally this leads to more cases of 1) NOT adding new tests, so fewer requirements are actually tested / proven and our system slowly becomes less enforced, or 2) adding more tests in a non-scalable manner, leading to dozens of tests for a given feature but no easy way to see what exactly this thing is doing without reading through and understanding 100s or 1000s of lines of test code.

Some strategies I've found that generally help with test scalability:

Data-driven tests

Usually several test cases that contain (inputs, outputs) for a given subsystem. This makes it easy to see what this system's contract is - what it currently does (source of truth), how to add / modify / remove test cases, and how to change the contract entirely (changing the input, output composition).

Moreover this helps to avoid what I consider "false" tests - tests proving one particular scenario of a system that look like they're proving a particular variable leads to a particular outcome when actually they're load-bearing on another variable. Forcing multiple test cases helps to prove that this variable is indeed the controlling factor - think of it like the control in an experiment. (I see these "false" tests very often in codebases - I would posit it's common for 30% of tests to do this.)
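Here's a hedged sketch of what I mean (is_eligible and its rules are invented for illustration). The single-scenario test looks like it proves age controls the outcome, but it's actually load-bearing on the default country. Multiple data-driven cases force each variable to earn its result:

# Hypothetical rule: eligible if 18+ AND in a supported country.
def is_eligible(age: int, country: str = "US") -> bool:
    return age >= 18 and country in {"US", "CA"}

# "False" test: appears to prove age drives eligibility,
# but silently depends on the default country being supported.
def test_adult_is_eligible():
    assert is_eligible(age=30) is True

# Data-driven: each (inputs, output) row isolates what actually drives the result.
CASES = [
    ("adult, supported country",   30, "US", True),
    ("adult, unsupported country", 30, "BR", False),  # exposes the hidden variable
    ("minor, supported country",   15, "US", False),
]

def test_is_eligible_cases():
    for name, age, country, expected in CASES:
        assert is_eligible(age, country) == expected, name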

Test behavior of subsystem (not internal implementations)

This is not a hard and fast rule but ask yourself if each test is enforcing behavior or implementation. Sometimes it's useful to test internal implementations if that part itself is a subsystem that would benefit from behavior assertions (like a particularly complex math calculation). But generally implementation testing is useless without behavior testing (cause bugs from system interaction are usually harder to catch) so you're often better off starting with behavior.

You can have 100%, even 1000% coverage (covering the same code 10x) of your implementations but if the system behavior is still wrong then all of that coverage is useless. (An example of where this fails: each implementation is proven to work in isolation but $x -> y is still broken. This often happens when we have implicit logic - like untyped returns, exception control flows, data mutations, etc - that happens IRL but doesn't occur / isn't tested in isolation.)

A good rule of thumb for whether you're testing subsystem behavior or not is to ask yourself if this is a public API / interface of the system. If what you're testing is not / should not be used by end users (whether a customer or a team using your function / class / library) then it's probably an implementation detail. If it is surfaced then it is most likely a behavior that users could depend on and thus is probably something useful to test.
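As a rough illustration (PriceCalculator and everything in it is hypothetical): the public quote() method is the subsystem's contract and worth behavioral tests, while the underscore-prefixed helper is an implementation detail that's free to change shape at any time:

from typing import Optional

class PriceCalculator:
    # Public API: this is what callers depend on, so this is what we test.
    def quote(self, base_price: float, coupon: Optional[str] = None) -> float:
        return round(base_price - self._discount(base_price, coupon), 2)

    # Implementation detail: private helper, not worth its own tests here.
    def _discount(self, base_price: float, coupon: Optional[str]) -> float:
        return base_price * 0.10 if coupon == "SAVE10" else 0.0

def test_quote_applies_coupon():
    calc = PriceCalculator()
    assert calc.quote(100.0, coupon="SAVE10") == 90.0
    assert calc.quote(100.0) == 100.0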

Avoid mocks

Mocks are lies. You want to avoid mocks as much as possible. Mocks do not behave like production systems, so what are you actually testing? A lie. Every time you use a mock you are testing another lie. This leads to you testing more and more lies until eventually you aren't even testing the system, just a bunch of assumptions. And assumptions are often wrong. So avoid mocks (and lies) as much as possible.

Now there are some cases where mocks are useful (like calling external APIs) but you should always think critically about whether this mock is helping the test (by avoiding irrelevant setup / baggage) or hurting the test (by missing potentially crucial edge / failure cases you won't see til prod).
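One hedged sketch of that trade-off (every name here is invented): use the real collaborator when it's cheap (an in-memory repository) and only fake the thing you genuinely can't call from a test (an external payment API) - remembering that the fake encodes assumptions about the real service:

# Real, in-memory collaborator: behaves like the prod implementation, no mock needed.
class InMemoryOrderRepo:
    def __init__(self):
        self.orders = {}

    def save(self, order_id: str, total: float) -> None:
        self.orders[order_id] = total

# The one boundary worth faking: an external payment API we can't hit in tests.
# NOTE: this fake is a stack of assumptions about the real gateway's behavior.
class FakePaymentGateway:
    def charge(self, amount: float) -> bool:
        return amount > 0

def place_order(order_id: str, total: float, repo, gateway) -> bool:
    if not gateway.charge(total):
        return False
    repo.save(order_id, total)
    return True

def test_place_order_saves_after_successful_charge():
    repo = InMemoryOrderRepo()
    assert place_order("o-1", 25.0, repo, FakePaymentGateway()) is True
    assert repo.orders == {"o-1": 25.0}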

How to Write Simple Scalable Tests

Based on these observations, I've converged on a general pattern of testing that I've found works well across many scenarios at many dimensional scales. There are certainly other ways of testing and this will not be the best method in all scenarios, but it's the best Simple Scalable System I've found for testing thus far and what I use most often, so I'll share it here.

A test has 3 stages:

  • Arrange - Set up your data (usually test cases, inputs, and any data models / mocks the system requires)
  • Act - Call your subsystem (this can almost always be 1-few function calls if you're testing a subsystem's behavior)
  • Assert - Ensure the expected behavior happened (if the Act is pure, look at its returns; if impure, look at the expected mutations)

I even go so far as to label these sections with comments because I find it helps keep tests organized and manageable, which generally helps them remain a clear proof rather than a glob of extra code no one understands.
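In practice a single test ends up looking something like this (apply_discount is a throwaway function invented just for the example):

def apply_discount(price: float, percent: float) -> float:
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    # Arrange
    price, percent = 80.0, 25.0

    # Act
    result = apply_discount(price, percent)

    # Assert
    assert result == 60.0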

If you're testing the behavior of a subsystem, the test body can almost always be pretty short and straightforward. This is because the behavior itself should be pretty deterministic - given x, do y. (It's literally just a list of the requirements for the system.)

If your test is not this straightforward or your behavior is not deterministic, you should really think about why. This is often a sign that your system itself is not operating in a Simple, Scalable way and could benefit from a refactor. In many cases this is due to implicit mutations / dependencies, or the system not modeling the domain effectively - focused on a bunch of technical operations instead of the behavior the system needs them for.

(Aside: The whole basis of Domain-Driven Design is to get systems to more closely resemble the real world the system is built / modeled for so that we can eliminate unnecessary complexity caused by these conceptual gaps between system <> reality. For more: here's the best book I've read on Domain-Driven Design.)

I'm not a strong proponent of TDD - I like testing but I often write my tests after I have a general sense of what my system is going to look like, so I'm more build, then test, then refactor. But I do think that testing is super important to do while you're building up your system. Testing is often the first time your system actually gets used by anyone and therefore the first signal you can get about how this system actually works e2e. So if you - the developer of the system (and de facto expert) - cannot easily use your system in a test then you must expect that it is orders of magnitude harder for some other developer to use successfully in production.

Don't ignore this signal! It is almost always telling you something useful.

How I write tests

Okay now let's look at what an example test following these principles might look like. Note I'm using pseudo code here because the code itself doesn't actually matter - it's the logical setup.

This is similar to the scientific method - it doesn't prescribe a specific implementation but it provides a framework for how to run experiments (tests!) to check (inputs, outputs) while minimizing noise from external factors. So here I'm providing a mechanism for structuring your experiments to try and minimize complexity while maximizing scalability and correctness.

This approach is simple:

  • Clear (input, output) definition for testcases - Product most often gives requirements as a list of (inputs, outputs) so defining our system requirements in a similar fashion helps reduce system<>reality drift.
  • List of requirements via Data-driven test cases - so we can read / understand current behavior, add / remove / modify test cases, and change (input, output) definitions all in one place. This is how we get a nice source-of-truth documentation set for what the system actually does in prod.
  • Simple test body - clear test steps, which allows for deterministic arrange, act, and assert across ALL test cases. This is very important for running experiments - from science class we learned it's easy to accidentally introduce external factors that corrupt the outcomes of the experiment. The simpler the test body, the fewer unknown factors can be introduced. This avoids the very common issue of "false" tests - dozens of tests each with different setup / teardowns so it's unclear what exactly is being tested - what variables lead to what outcomes.

TestCase_ExampleTest:
* name: string
* intVar: int
* expectReturn: bool

let allTestCases = [
	TestCase_ExampleTest(
		name="Failure - 1",
		intVar=1,
		expectReturn=False
	),
	TestCase_ExampleTest(
		name="Pass - 0",
		intVar=0,
		expectReturn=True
	),
]

let testShowExample = fun() ->
	for testCase in allTestCases:
		# Arrange
		
		# ... do your setup here
		
		# Act 
		
		let result = example(testCase.intVar)
		
		# Assert
		
		assert result == testCase.expectReturn

This pseudo code can be translated into any programming language with any testing framework. It does not rely on any particular dependencies and is easy to scale up or down along any of the dimensions above.
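For example, a rough Python / pytest translation might look like this (example() is a stand-in stub, since the pseudo code deliberately never defines the system under test):

from dataclasses import dataclass

import pytest

# Stand-in for the subsystem under test; the pseudo code never defines example(),
# so this stub just matches the test cases (returns True only for 0).
def example(int_var: int) -> bool:
    return int_var == 0

@dataclass
class ExampleTestCase:
    name: str
    int_var: int
    expect_return: bool

ALL_TEST_CASES = [
    ExampleTestCase(name="Failure - 1", int_var=1, expect_return=False),
    ExampleTestCase(name="Pass - 0", int_var=0, expect_return=True),
]

@pytest.mark.parametrize("case", ALL_TEST_CASES, ids=lambda c: c.name)
def test_show_example(case):
    # Arrange
    # ... do your setup here

    # Act
    result = example(case.int_var)

    # Assert
    assert result == case.expect_return

Swapping pytest for another framework (or a plain loop) changes nothing about the structure - the test cases stay the source of truth.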

How does this scale?

  • New requirements? Modify the test cases. Maybe even change the arrange / act / assert clauses (though frequently not necessary if you're doing data-driven behavioral tests)
  • Understanding existing requirements? Just read the test case behavior. The proofs are all there, in one place.
  • New implementation details? We are testing behavior not implementation so if implementation leads to same behavior we're good! Different behavior? Are you sure that's what the requirements ask for? If so - change the test cases. At least now you know if you've broken the external behavior.

Testing like this is too hard for my system!

Don't worry - you're not alone! This is the most frequent complaint I get when I introduce testing like this. It's fair - it is hard to test many systems in this manner. Plus this is likely different from how you've written many tests in your career.

So I see you. You're right. The complaint is valid.

However I would argue that this is a red flag for the system, not the methodology. If you cannot test the system - how do we know it's doing what it's supposed to? If we don't know it's doing what it's supposed to, how likely is it that it's actually doing that?

If it's hard for you to test a system - it's probably even harder for someone else to use that system in prod. If we can't use that system correctly in prod then it's likely it will be misused instead. So if we cannot test that a system does what it's supposed to we kinda need to assume it doesn't do that.

So if you have a system like this - it might be worth thinking about changing it.

When we build systems that are easy to test we get a lot of benefits:

  • More confident system aligns with requirements
  • More confident about the changes we make - faster changes, fewer fires / rework
  • Testable systems are almost always more composable - easier to use, more flexible for more cases, easier to reuse

What makes systems more testable is often that the components under test are more composable. This means that the pieces are easier to control via inputs with more deterministic outputs which also makes this piece easier to use and reuse in other parts of the system.

This means that these more testable components are almost always easier to use / reuse in prod, easier to evolve over time, and we can do so with confidence because they're better tested. This really is the root of what makes "good" software. This is the composition > inheritance argument in a nutshell, this is loose coupling > tight coupling. It all comes down to composability and easy testability is the #1 easiest, most effective rule of thumb I can provide that you can use to check your own system's composability.

(IME good, composable systems feel like building with legos)

While we're here - when you build easy to test (and composable) systems, I've found that a common pattern emerges:

  • DoBehavior
    • Get Data
    • Make Decision
    • Take Actions on Decision

This is essentially:

  • Functional Core - Make Decision
  • Imperative Shell - Grab data, call decision, take actions

Building systems like this makes it easy to test because we can:

  • Test the Decision portion with pure (input, output) requirements
  • Test the imperative shell - given decision, do mutation

This is easy to understand (simple), easy to change the decision / action (inputs, outputs) for new requirements (scalable), and easy to test - giving us confidence our system does what it's supposed to do and is relatively composable, which is a key feature of building solid systems on solid subcomponents.
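Here's a minimal sketch of that shape (the overdue-invoice-reminder domain and every name in it are invented for illustration):

from datetime import date

# Functional core: a pure decision, trivially testable with (input, output) cases.
def should_send_reminder(due: date, today: date, already_sent: bool) -> bool:
    return today > due and not already_sent

# Imperative shell: get data, call the decision, take actions on it.
def process_invoice(invoice_id: str, db, mailer, today: date) -> None:
    invoice = db.load(invoice_id)                                         # Get Data
    if should_send_reminder(invoice["due"], today, invoice["reminded"]):  # Make Decision
        mailer.send(invoice_id)                                           # Take Actions
        db.mark_reminded(invoice_id)

# The core is proven with pure (input, output) rows - no db, no mailer, no mocks.
def test_should_send_reminder():
    cases = [
        (date(2024, 1, 1), date(2024, 1, 2), False, True),     # overdue, not yet reminded
        (date(2024, 1, 1), date(2024, 1, 2), True, False),     # already reminded
        (date(2024, 1, 1), date(2023, 12, 31), False, False),  # not yet due
    ]
    for due, today, already_sent, expected in cases:
        assert should_send_reminder(due, today, already_sent) == expected

The shell stays thin enough that a test or two covers it; the bulk of the requirements live as pure (input, output) rows against the core.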

Next

There is no right way to code or build systems. There are always tradeoffs, each scenario is different, and we're always learning from new experiences. Here I'm sharing patterns that I've observed working with systems that tend to bias towards pits of success rather than failure. YMMV.

Q: What are your strategies for building and testing systems?

