This post discusses pre-populating a database for database unit tests.
Your tests need data to work with. There are several approaches to supplying test data for unit tests. One approach is to use a pre-populated test database. The data in this database could be obtained from an existing customer or generated in some way. This series of posts discuss why you should NOT use a pre-populated test database for unit testing and how you should actually go about creating test data.
There are several reasons to not pre-populate test data for unit testing. Primarily these are: it makes test cases harder to read, more expensive to maintain, and limits the types of tests you can create. Let’s look at these claims in more detail:
First, the test data is in a separate place than the test case. By this, I don’t simply mean that the test data lives in a separate folder or in another database. The test data helps express the intent of the test. For example, if a test is validating a calculation based on the data, then the specific values used are an important part of the test. When reading the test case, if you have to look somewhere else to find out the intent of the test case, then you’re going to slow yourself down considerably.
As a consequence, tests often contain “magic” values which refer to special records in the test database. Trying to keep track of the meaning of special test case records is particularly difficult. For example, you’re likely to end up with a record for “Joe Smith” and everyone on the team needs to remember that Joe Smith was the customer who bought 50 widgets at a total cost of $500 but doesn’t have a credit limit (as opposed to the record for “Johnny Wilkenson” who has the same data, but has a sufficient credit limit, and so on).
If all (or even some) of the test cases share the same data set, then you will find that test cases become more difficult to maintain.
Let’s look at an example: Suppose you’re system contains several reports which provide different ways of considering the same data. The first report you write displays the name of each customer who ordered your most expensive product and the total order value. Your pre-populated test database has the following records in it:
So you’re first test expects to see the following results:
|Customers who ordered the most expensive item|
Now, it’s a month later and the next report must be written. In the meantime, the business has started selling more expensive items. This report will display all the items in the catalog costing more than $100. If you change the price of an existing item or add a new order with a new item, you’ll break the tests for the first report.
In this simple example, you might not be concerned with the amount of work to fix the first test. However, in a large system with many tests, continually fixing your existing tests just because you’ve added new functionality is painful. What normally happens when developers face this pain is they stop writing tests.
If your test database is already populated, the data in there may prevent you from writing certain tests. Consider if, for example, the code has special functionality if one of the tables is empty. You won’t be able to test this functionality if the table is populated. An alternative example is when the code is supposed to do something if data in the database meets some condition and do something else if the data does not meet that condition. Two mutually exclusive conditions are impossible to test when using a pre-populated test database.
A commonly used solution to this is to have some of the test cases modify the database for their special need and then undo the modification after the test. However, this leads to some tests containing the test data and the rest of the test data somewhere else. This means that you need to go to multiple places to update the data when schema changes are made or to get a clear picture of what is happening.
The final point I’d like to expose about using pre-populated test data is that it often leads to limited thinking about test cases. If there’s a lot of data already populated in the database, it is easy to assume that every important case is covered by the existing data. This is almost never true. The data in the database now is not necessarily a good reflection of the data that will be put in the database in the future from the perspective of unit testing. Writing unit tests requires you consider exception scenarios, inputs that don’t exist yet and varying the combinations of inputs.
Furthermore, it is more difficult to see if the important edge cases have been considered if the data is heaped into one large data set. Modifying the test data to support a new test may inadvertently remove an important condition needed for an existing test.
Unfortunately, when faced with these problems, many developers will either give up on database unit testing, or instead miss important test cases. Both of these outcomes lead to reduced code quality and makes me unhappy.
The next post in this series will propose an approach that solves these problems, and the final post will address the common concerns of the proposed approach.