February 9, 2018
Bringing Test-Driven Development to the Data World
Test-Driven Development (TDD) is currently not as widely used in data-oriented projects as it is in software engineering. A quick Google search turns up few relevant results, and empirical studies on TDD focus on software engineering projects. However, my experience on a previous data engagement shows that TDD and other software engineering practices can be valuable tools for data professionals, especially given the increasing overlap between data and software engineering skillsets. Over the course of a year, my team built SQL and Python processes on top of a SQL Server data lake in an Agile (and later SAFe Agile) environment. We used TDD, writing tests with the unittest package in Python and the tSQLt framework for SQL.
What is TDD?
The term "Test-Driven Development" is a bit misleading, because while the first word in the term is "test," the real focus is "development." It is a programming methodology that inverts the "traditional" programming technique, requiring practitioners to write tests before they write functional code.
The steps for TDD are as follows:
1. Write a test.
2. Run the test; the test should fail.
3. Write the minimum amount of code needed to pass the test.
4. Run the test; the test should pass. If not, go back to step 3.
5. Refactor code if needed to improve design and/or eliminate any duplication of code.
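As a rough illustration of this cycle with Python's unittest package, here is a minimal sketch. The function `normalize_state_code` and its behavior are hypothetical, invented for this example rather than taken from the engagement described in this post. The tests in the class were written first; the function body is the small amount of code added afterwards to make them pass.

```python
import unittest


def normalize_state_code(raw):
    """Hypothetical cleaning step: trim whitespace and upper-case a state code."""
    # Written only after the tests below existed and had failed once:
    # the minimum code needed to make them pass.
    return raw.strip().upper()


class NormalizeStateCodeTest(unittest.TestCase):
    # These tests came first. Before normalize_state_code was defined,
    # running them failed, which is the expected "red" state.
    def test_trims_surrounding_whitespace(self):
        self.assertEqual(normalize_state_code("  va "), "VA")

    def test_upper_cases_lowercase_input(self):
        self.assertEqual(normalize_state_code("md"), "MD")
```

Saved to a file, the tests can be run with `python -m unittest <file name>`. Once they pass, any refactoring is checked by re-running the same tests.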
TDD's Upfront Costs And Learning Curve Are Ultimately Rewarding
Stakeholder buy-in is essential for executing Test-Driven Development successfully. Ultimately, TDD requires companies to pay a cost upfront for long-term benefits, such as a suite of unit tests that reduces the time spent debugging issues later. For TDD to succeed, everyone, from data practitioners to upper management, must support it.
My team was fortunate to have client buy-in to support a team norm of TDD. Adopting the methodology was challenging at first, as it required us to pick up an entirely new skillset: testing. However, I eventually saw huge benefits to using it:
- TDD encourages practitioners to focus on requirements and outcomes first, which is important for delivery to clients. We needed to know the expected outcomes (the end results) of the code so we could write tests to validate those outcomes. TDD forced us to focus on requirements early and often, from the beginning of the development process.
- Writing relevant unit tests is easier with TDD than with "test-last development." As hard as it was to get used to TDD, it was even more taxing to write good tests, especially unit tests, after code was already written. TDD requires practitioners to write code in small chunks, which inherently makes the code testable at a finer grain.
- TDD improves test coverage and reduces the time spent debugging issues. Development usually took up entire sprints, and pushing testing to the very end of a sprint caused a rush to write tests. This made it difficult to create a robust test suite, one with enough code coverage to catch more than just high-level acceptance or integration test failures. By practicing TDD, we had a suite of unit tests throughout development. These tests greatly reduced the time we would have otherwise spent fixing issues during development, especially those that involved permissions, configuration, or interface changes. The cost of defects rises as time goes on, and fixes are easier to implement when the code is fresh in developers' minds. Fixing issues early and quickly saves time and money in the long run.
- Running tests quickly and continuously boosts confidence in the code. With TDD, tests are written before code is written. Because our applications were small, our entire test suites were constantly run throughout development. Being able to run the full test suite at any moment was psychologically relieving, and having a bevy of tests to run before each commit instilled confidence in the code.
- TDD gives practitioners the flexibility to improve code. Constantly refactoring code and improving design is an essential part of TDD. Because some tests would fail after refactoring code, we could pinpoint how the changes affected the existing code. Seeing tests work again after we made fixes reassured us that the code was working as expected, even after refactoring. This was especially helpful if we needed to change the code after it had been initially written (and forgotten); we could be sure that the changes weren't breaking existing code.
- TDD encourages and complements "clean code" practices. Modularization and adherence to the single responsibility principle (SRP) make code more organized, testable, and maintainable. Maintainability is especially important, as the code ultimately lives in clients' codebases. TDD reinforced and almost required these best practices for our team, as it forced us to write code in small chunks and refactor constantly. In one case, rather than writing one monolithic SQL stored procedure, we opted to build a set of smaller stored procedures that each accomplished an individual task in the data pipeline. Writing tests also served as a "check" for the code: if a test was too large, it was likely that the resulting code needed to be redesigned and further modularized. In contrast, "clean code" best practices are not reinforced with "test-last development," which can make testing even more painful afterwards.
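To make the modularization point concrete, here is a small Python sketch of the same idea. It is an analogue, not the SQL stored procedures my team actually wrote: the function names, the row format (lists of dictionaries), and the pipeline itself are all assumptions made up for illustration. Each step does one thing and has its own focused test, and a thin `run_pipeline` function composes them.

```python
import unittest


def drop_incomplete_rows(rows):
    """Hypothetical step 1: keep rows that have both a customer and an amount."""
    return [
        r for r in rows
        if r.get("customer") and r.get("amount") is not None
    ]


def total_by_customer(rows):
    """Hypothetical step 2: sum amounts per customer."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals


def run_pipeline(rows):
    """Compose the small steps instead of writing one monolithic routine."""
    return total_by_customer(drop_incomplete_rows(rows))


class PipelineStepTests(unittest.TestCase):
    # One small test per step keeps failures easy to localize.
    def test_drop_incomplete_rows_removes_missing_amount(self):
        rows = [{"customer": "a", "amount": 5}, {"customer": "b", "amount": None}]
        self.assertEqual(drop_incomplete_rows(rows), [{"customer": "a", "amount": 5}])

    def test_total_by_customer_sums_amounts(self):
        rows = [{"customer": "a", "amount": 5}, {"customer": "a", "amount": 7}]
        self.assertEqual(total_by_customer(rows), {"a": 12})

    def test_run_pipeline_composes_steps(self):
        rows = [
            {"customer": "a", "amount": 5},
            {"customer": None, "amount": 3},
            {"customer": "a", "amount": 7},
        ]
        self.assertEqual(run_pipeline(rows), {"a": 12})
```

If a test for one of these steps starts needing a large amount of setup, that is usually a hint that the step is doing too much and should be split further.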
TDD is a Tool; Use it Wisely
Learning and practicing Test-Driven Development requires discipline from practitioners. At the same time, TDD is a technique, not a law. As with any methodology, TDD alone does not produce good code, and following TDD 100% of the time is rarely appropriate. For example, TDD does not stop practitioners from thinking through a good high-level design before writing any tests, nor does it disallow experimentation. Even when using TDD, my team hashed out proof-of-concept code when we did not know how to accomplish a task, and we often had whiteboarding sessions to discuss initial code design.
When used with the right judgment, TDD can help practitioners write better code and more robust test suites. Adopting this technique has given me a new perspective on coding and has helped me become not only a better data engineer, but a better consultant as well. In my experience, TDD reinforces coding best practices that improve the quality of deliverables, driving practitioners to focus on requirements and increasing the maintainability of code after a project ends.
Martin, Robert C., et al. Clean Code: A Handbook of Agile Software Craftsmanship. Prentice Hall, 2016.