Essay - Published: 2024.08.23 | create | software-engineering | tech | testing |
Testing is a chronic pain point and target of discussion in software engineering. I've recently been in another round of discussions about how to write tests for large systems at work and best practices for orgs / teams. Here I wanted to share some of these thoughts to spread the discussion and battle-harden some of these ideas.
I personally think I'm pretty lax on most coding philosophies - if what you wrote is a Simple Scalable System that solves the problem at hand, it gets a pass from me. But I've also noticed many patterns of failure over my 7-year career working as a software engineer at companies large and small.
Of note are common patterns of testing that end up complex, hard to reason about, and don't scale well for inevitably changing requirements. Worse, we get thousands of tests but the system remains brittle and no one can tell whether the tested requirements are useful or not - basically an untested system with extra steps.
Here I'm going to share some observations about testing and some tactics I've found helpful in doing this better at scale. I'm not saying this is the "right" way - there is no right way - but I am saying I've observed these practices to work better than other alternatives in more situations.
The goal of a test is to make sure a system does what it's supposed to. A system only exists to produce some impact in the outside world. This means what we care about (and by extension what we test) is - given x inputs, expect y outputs.
Testing is useful for businesses / orgs / eng teams for several reasons:
For a further exploration of the software as a virtual building metaphor - see: Why Type-safe Programming Languages are better than Dynamic and Lead to Faster, Safer Software at Scale
Of particular note - the outside world does not care about how this x -> y conversion takes place, just that it takes place within their expected constraints.
This makes sense. Let's say there's a vending machine that takes $x and gives y drink. As a customer if I pay $x I expect y drink. I don't care how that drink gets to me, it just needs to get to me (and probably do so fast / unbroken / etc).
This means it doesn't matter what mechanism provides the drink. It could be a robot. It could be a human. It could be a Lovecraftian horror. It doesn't matter - I paid $x, I need y drink.
This may seem like a minor point but this is where I see a lot of tests go wrong - they test the implementation and not the behavior. This often leads to dozens of tests for a given system that enforce that the robot / human / horror is doing something a particular way, but miss the simple fact that what we care about is $x -> y drink.
Moreover it makes each inevitable requirements change harder to make. If product decides to go from robot -> horror, now all the tests are broken. Is the vending machine actually broken though? Unclear, because first we gotta change all the internal implementation tests to pass, then look at the behavioral tests. But where were the behavioral tests again? Ah, I can't find them because they're hidden among the dozens of implementation tests. Oh, finally found them - wait, are they still testing the right thing? I just changed a ton of tests to get them to pass - was that behavioral or implementation?
Okay this is a bit contrived and extreme but this happens so frequently that frankly I'm frustrated we keep doing this to ourselves.
So to sum up - the reason we write tests is to ensure the system produces the impact it's supposed to (it follows the requirements). A system's impact is its behavior - NOT its implementation. So to test the impact of a system we need to be testing its behavior - given x, output y.
I like to build Simple Scalable Systems. This is because:
If the goal of a test is to ensure our system is doing what it's supposed to, the test needs to take a form that it can change along with the system it tests. Ideally we want the test to require minimal extra effort (unnecessary complexity) on top of the primary system to handle these changes while still maintaining correctness.
Common changes that tests need to scale for:
For real-world testing purposes we have several dimensions of scalability that I think are important to handle these scenarios:
Some strategies I've found that generally help with test scalability:
Data-driven tests
Usually several test cases that contain (inputs, outputs) for a given subsystem. This makes it easy to see what the system's contract is - what it currently does (source of truth), how to add / modify / remove test cases, and how to change the contract entirely (changing the input / output composition).
Moreover this helps to avoid what I consider "false" tests - tests proving one particular scenario of a system that look like they're proving a particular variable leads to a particular outcome when actually they're load-bearing on another variable. Forcing multiple test cases helps to prove that this variable is indeed the controlling factor - think of it like a control case in an experiment. (I see these "false" tests very often in codebases - I would posit it's common for 30% of tests to do this.)
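To sketch what this can look like in practice - here in Python with pytest, using a hypothetical is_zero function as the subsystem under test:

import pytest

# Hypothetical subsystem under test: returns True only when the input is zero.
def is_zero(value: int) -> bool:
    return value == 0

# Each row is one (input, expected output) pair - the table is the contract.
@pytest.mark.parametrize("value, expected", [
    (0, True),    # the case we actually care about
    (1, False),   # control cases proving this input drives the outcome
    (-1, False),
])
def test_is_zero(value, expected):
    assert is_zero(value) == expected

Adding, modifying, or removing a requirement is then just editing a row, and the contract change shows up plainly in the diff.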
Test behavior of subsystem (not internal implementations)
This is not a hard and fast rule, but ask yourself if each test is enforcing behavior or implementation. Sometimes it's useful to test internal implementations if that part itself is a subsystem that would benefit from behavior assertions (like a particularly complex math calculation). But implementation testing is generally useless without behavior testing (because bugs from system interaction are usually harder to catch), so you're often better off starting with behavior.
You can have 100%, even 1000% coverage (covering the same code 10x) of your implementations, but if the system behavior is still wrong then all of that coverage is useless. (An example of this failure: each implementation is proven to work in isolation, but $x -> y is still broken. This often happens when we have implicit logic - untyped returns, exception control flow, data mutations, etc. - that happens IRL but doesn't occur / isn't exercised in isolation.)
A good rule of thumb for checking whether you're testing subsystem behavior is to ask yourself if what you're testing is a public API / interface of the system. If it is not / should not be used by end users (whether a customer or a team using your function / class / library), it's probably an implementation detail. If it is surfaced, it is most likely a behavior that users could depend on and thus is probably something useful to test.
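To illustrate the distinction, here's a rough sketch with a hypothetical pricing subsystem - the public total_price is the behavior worth pinning down, the private helper is not:

def total_price(unit_price: float, quantity: int, discount_rate: float) -> float:
    # Public API: this is the behavior users depend on.
    return round(_apply_discount(unit_price * quantity, discount_rate), 2)

def _apply_discount(amount: float, discount_rate: float) -> float:
    # Internal helper: an implementation detail we're free to inline or rewrite.
    return amount * (1 - discount_rate)

def test_total_price_applies_discount():
    # Behavior test: given x inputs, expect y output.
    assert total_price(unit_price=10.0, quantity=3, discount_rate=0.5) == 15.0

# A test pinned to _apply_discount would break the moment we refactor the helper away,
# even though the behavior users see ($x -> y) hasn't changed.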
Avoid mocks
Mocks are lies. You want to avoid mocks as much as possible. Mocks do not behave like production systems, so what are you actually testing? A lie. Every time you use a mock you are testing another lie. This leads to testing more and more lies until eventually you aren't even testing the system, just a bunch of assumptions. And assumptions are often wrong. So avoid mocks (and lies) as much as possible.
Now there are some cases where mocks are useful (like calling external APIs) but you should always think critically about whether this mock is helping the test (by avoiding irrelevant setup / baggage) or hurting the test (by missing potentially crucial edge / failure cases you won't see til prod).
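As a sketch of that tradeoff (hypothetical names): keep the test double at the true external boundary, and prefer a small fake with real-ish behavior over a pile of per-test mocks.

from typing import Protocol

class PaymentGateway(Protocol):
    # The external dependency we genuinely can't call in a test.
    def charge(self, amount_cents: int) -> bool: ...

class FakeGateway:
    # A small fake with real-ish behavior (records calls, returns a result)
    # instead of a per-test mock whose behavior is whatever we assumed.
    def __init__(self, succeed: bool = True):
        self.succeed = succeed
        self.charges = []

    def charge(self, amount_cents: int) -> bool:
        self.charges.append(amount_cents)
        return self.succeed

def checkout(gateway: PaymentGateway, amount_cents: int) -> str:
    # Real subsystem under test - not mocked.
    return "paid" if gateway.charge(amount_cents) else "declined"

def test_checkout_surfaces_declined_charges():
    gateway = FakeGateway(succeed=False)
    assert checkout(gateway, amount_cents=500) == "declined"
    assert gateway.charges == [500]  # the failure path is still exercised end to end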
Based on these observations, I've converged on a general pattern of testing that I've found works well across many scenarios and scales. There are certainly other ways to test and this will not be the best method in every scenario, but it's the best Simple Scalable System I've found for testing thus far and what I use most often, so I'll share it here.
A test has 3 stages: Arrange, Act, Assert.
I even go so far as to label these sections with comments because I find it helps keep tests organized and manageable, which generally helps them remain a clear proof rather than a glob of extra code no one understands.
If you're testing behavior of a subsystem the test body can almost always be pretty short and straightforward. This is because the behavior itself should be pretty deterministic - given x, do y. (It's literally just a list of the requirements for the system.)
If your test is not this straightforward or your behavior is not deterministic, you should really think about why. This is often a sign that your system itself is not operating in a Simple, Scalable way and could benefit from a refactor. In many cases this is due to implicit mutations / dependencies, or the system is not modeling the domain effectively - it's focused on a bunch of technical operations instead of the behavior it's needed for.
(Aside: The whole basis of Domain-Driven Design is to get systems to more closely resemble the real world the system is built / modeled for so that we can eliminate unnecessary complexity caused by these conceptual gaps between system <> reality. For more: here's the best book I've read on Domain-Driven Design.)
I'm not a strong proponent of TDD - I like testing but I often write my tests after I have a general sense of what my system is going to look like, so I'm more build, then test, then refactor. But I do think that testing is super important to do while you're building up your system. Testing is often the first time your system actually gets used by anyone and therefore the first signal you get about how the system actually works e2e. So if you - the developer of the system (and de facto expert) - cannot easily use your system in a test, then you must expect that it is orders of magnitude harder for some other developer to use successfully in production.
Don't ignore this signal! It is almost always telling you something useful.
Okay now let's look at what an example test following these principles might look like. Note I'm using pseudo code here because the code itself doesn't actually matter - it's the logical setup.
This is similar to the scientific method - it doesn't prescribe a specific implementation but it provides a framework for how to run experiments (tests!) that check (inputs, outputs) while minimizing noise from external factors. So here I'm providing a mechanism for structuring your experiments to minimize complexity while maximizing scalability and correctness.
This approach is simple:
TestCase_ExampleTest:
* name: string
* intVar: int
* expectReturn: bool
let allTestCases = [
    TestCase_ExampleTest(
        name="Failure - 1",
        intVar=1,
        expectReturn=False
    ),
    TestCase_ExampleTest(
        name="Pass - 0",
        intVar=0,
        expectReturn=True
    ),
]

let testShowExample = fun() ->
    for testCase in allTestCases:
        # Arrange
        # ... do your setup here

        # Act
        let result = example(testCase.intVar)

        # Assert
        assert result == testCase.expectReturn
This pseudocode can be translated into any programming language with any testing framework. It doesn't rely on any particular dependencies and is easy to scale up or down along any dimension.
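For instance, a fairly direct translation into Python might look like this (assuming a hypothetical example function that returns True only for 0):

from dataclasses import dataclass

def example(int_var: int) -> bool:
    # Hypothetical subsystem under test.
    return int_var == 0

@dataclass
class ExampleTestCase:
    name: str
    int_var: int
    expect_return: bool

all_test_cases = [
    ExampleTestCase(name="Failure - 1", int_var=1, expect_return=False),
    ExampleTestCase(name="Pass - 0", int_var=0, expect_return=True),
]

def test_show_example():
    for test_case in all_test_cases:
        # Arrange
        # ... do your setup here

        # Act
        result = example(test_case.int_var)

        # Assert
        assert result == test_case.expect_return, test_case.name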
How does this scale?
Don't worry - you're not alone! This is the most frequent complaint I get when I introduce testing like this. It's fair - it is hard to test many systems in this manner. Plus this is likely different from how you've written many tests in your career.
So I see you. You're right. The complaint is valid.
However I would argue that this is a red flag for the system, not the methodology. If you cannot test the system - how do we know it's doing what it's supposed to? If we don't know it's doing what it's supposed to, how likely is it that it's actually doing that?
If it's hard for you to test a system - it's probably even harder for someone else to use that system in prod. If we can't use that system correctly in prod then it's likely it will be misused instead. So if we cannot test that a system does what it's supposed to we kinda need to assume it doesn't do that.
So if you have a system like this - it might be worth thinking about changing it.
When we build systems that are easy to test we get a lot of benefits:
What makes systems more testable is often that the components under test are more composable. This means the pieces are easier to control via inputs, with more deterministic outputs, which also makes each piece easier to use and reuse in other parts of the system.
This means that these more testable components are almost always easier to use / reuse in prod and easier to evolve over time, and we can do so with confidence because they're better tested. This really is the root of what makes "good" software. This is the composition > inheritance argument in a nutshell; this is loose coupling > tight coupling. It all comes down to composability, and easy testability is the simplest, most effective rule of thumb I can give you for checking your own system's composability.
(IME good, composable systems feel like building with legos)
While we're here - when you build easy-to-test (and composable) systems, I've found that a common pattern emerges:
This is essentially:
Building systems like this makes it easy to test because we can:
This is easy to understand (simple), easy to change the decision / action (inputs, outputs) for new requirements (scalable), and easy to test - giving us confidence that our system does what it's supposed to do and is relatively composable, which is a key feature of building solid systems on solid subcomponents.
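To make that concrete, here's one possible shape of that decision / action split (a sketch with hypothetical names, not a prescription): keep the decision a pure function of explicit inputs and push the side-effecting action to the edge.

def should_send_reminder(days_since_signup: int, profile_complete: bool) -> bool:
    # Pure decision: explicit inputs, deterministic output - trivially testable.
    return days_since_signup >= 3 and not profile_complete

def run_reminder_job(user, email_client) -> None:
    # Action at the edge: gather inputs, ask the decision, perform the side effect.
    if should_send_reminder(user.days_since_signup, user.profile_complete):
        email_client.send(user.email, "Finish setting up your profile")

def test_reminder_waits_three_days():
    # No mocks needed to test the part that actually encodes the requirement.
    assert should_send_reminder(days_since_signup=2, profile_complete=False) is False
    assert should_send_reminder(days_since_signup=3, profile_complete=False) is True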
There is no right way to code or build systems. There are always tradeoffs, each scenario is different, and we're always learning from new experiences. Here I'm sharing patterns that I've observed working with systems that tend to bias towards pits of success rather than failure. YMMV.
Q: What are your strategies for building and testing systems?
If you liked this post you might also like:
The best way to support my work is to like / comment / share for the algorithm and subscribe for future updates.