Testing Data and Data Pipelines: Test-Driven Data Analysis (TDDA)

Data Analysis as if the Answers Actually Matter

Today, most software development uses some form of test-driven development (TDD), whereby extensive tests are written for software, often ahead of writing the code, and are run by “continuous integration” systems so that new bugs are likely to be identified as soon as they are introduced. This test-heavy approach was a centrepiece of eXtreme Programming (XP) and the Agile Manifesto. Just as increasing safety in transport systems allows greater speed, the safety afforded by comprehensive tests allows greater speed and freedom when developing and altering software.

Analytical data processes—modelling, reporting, scoring, inference, automated decisioning systems etc.—have traditionally used more informal processes with less emphasis on rigorous testing, but the same principles apply. As the world embraces ever more highly automated data-based decisioning, control, and reporting systems, the need for stronger testing, validation, and monitoring becomes ever greater.

Stochastic Solutions practices, teaches and advocates test-driven data analysis (TDDA), a methodology for data processes that carries the ideas of test-driven development for testing software correctness into the realm of data science, while extending them to encompass a focus on the correctness and validity of data at all stages of the pipeline, the meaningfulness of the analysis, and correctness of interpretation when formulating and communicating analyses.

A Typical Analytical Pipeline and its Failure Modes

Most analytical processes map directly onto this diagram. There are characteristic failures that happen at each stage of the pipeline. TDDA is built around identifying and tackling each of these failure modes.

Choice of Approach. When developing an analysis, we first need to formulate our approach. It is easy to fail to understand the data, the problem domain, or the methods. These are examples of errors of interpretation, more specifically errors of formulation. You can read more about how these can be made less likely, and see examples of how we have avoided these in the past, here.
Develop the Analytical Process. Having chosen an approach, we must implement it, either by writing software or using some combination of tools such as spreadsheets, checklists, and reporting systems. We call mistakes at this stage errors of implementation, which are most commonly software bugs. The primary tool for avoiding this category of errors is automated testing, but the style of tests required is slightly different in complex analytical systems, where we call them reference tests.
Run the Process. Most analysis processes, once developed, are used repeatedly, either by formally deploying them or scheduling their regular use, or more informally (“Say, could you repeat that with this month’s numbers?”). Every time a pipeline is run, there are opportunities for errors of process: feeding in the wrong inputs, collecting the wrong outputs, using the wrong version of the process, setting the parameters incorrectly, and so forth. Automation can reduce the likelihood of these mistakes but may increase their severity when they do occur. Data validation and careful monitoring can help detect errors of process, and the methodology also provides other tools to help reduce their likelihood.
Produce the Results. A characteristic failure mode for analytical systems is that they are developed in one context or situation and then applied in a different situation. At the simplest level, relationships and data change over time, typically leading at least to gradual degradation of performance, and sometimes to abrupt failure. More fundamentally, a system developed on one population is likely to perform less well when applied to a population that is different, whether demographically, geographically, attitudinally or in some other way. Data validation and monitoring of input and output populations and model performance are key ways of avoiding and detecting such errors of applicability.
Interpret the Results. When we have developed or run an analysis, its results usually have to be interpreted, either by ourselves or by others. It is at this point that the other main kind of error of interpretation can occur, which we call errors of communication. These can be anything but subtle: confusions about whether a larger number represents a better or worse outcome are common, as are confusions about units, timescales, relative vs. absolute risks and many others. As with some other categories of errors, it is unrealistic to expect software solutions to be able to eradicate errors of communication, but there are many best practices that can dramatically reduce their occurrence.
Act on the Results. Finally, any harms produced by an analytical process have to be included in assessing its success. An analysis that is likely to be incorrect or misleading is usually worse than useless. Appropriately communicating qualifications of outputs when there are known limitations and uncertainties is fundamental to the value of any output, as is the taking of suitable care in producing the outputs. If an analysis leads to real-world harms—whether to the environment, to people, or the public realm—that analysis should be regarded as harmful and negative, even if well intentioned.
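
To make the idea of a reference test from the Develop the Analytical Process stage more concrete, here is a minimal sketch. It assumes a pipeline that produces a text report containing a run timestamp, and compares fresh output against a previously verified reference after masking the parts that are expected to vary between runs. The names normalise and reference_diffs, the timestamp pattern, and the sample reports are all illustrative assumptions; this is not the API of the tdda library.

```python
import re

# Assumption: the only run-specific content is an ISO-style timestamp.
# A real pipeline would list its own variable fields (paths, hostnames, versions).
VARIABLE_PARTS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"), "<TIMESTAMP>"),
]


def normalise(text):
    """Mask content that legitimately differs from run to run."""
    for pattern, placeholder in VARIABLE_PARTS:
        text = pattern.sub(placeholder, text)
    return text.strip().splitlines()


def reference_diffs(actual, reference):
    """Return (line_number, actual_line, reference_line) for each mismatch."""
    actual_lines = normalise(actual)
    reference_lines = normalise(reference)
    length = max(len(actual_lines), len(reference_lines))
    diffs = []
    for i in range(length):
        a = actual_lines[i] if i < len(actual_lines) else None
        r = reference_lines[i] if i < len(reference_lines) else None
        if a != r:
            diffs.append((i + 1, a, r))
    return diffs


# Hypothetical verified reference output and two fresh runs.
REFERENCE = "Report generated 2024-01-31 09:00:00\nCustomers: 1200\nMean spend: 41.50\n"
GOOD_RUN = "Report generated 2024-02-29 09:00:02\nCustomers: 1200\nMean spend: 41.50\n"
BAD_RUN = "Report generated 2024-02-29 09:00:02\nCustomers: 1200\nMean spend: 14.50\n"

print(reference_diffs(GOOD_RUN, REFERENCE))  # no differences once the timestamp is masked
print(reference_diffs(BAD_RUN, REFERENCE))   # reports the changed line 3
```

The design choice here is to compare whole outputs rather than individual function results, which suits analytical systems where the result is a report, dataset, or file rather than a single value.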
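
The data validation mentioned under Run the Process and Produce the Results can be pictured as a set of declared constraints checked against each incoming batch of data. The following is a minimal sketch under stated assumptions: the field names, the constraint keys (min, max, allowed, nullable), and the verify function are hypothetical, chosen for illustration, and are not the tdda library's constraint format or API.

```python
# Hypothetical constraints, as might be derived from the data the process was developed on.
CONSTRAINTS = {
    "age": {"min": 18, "max": 100, "nullable": False},
    "region": {"allowed": {"north", "south", "east", "west"}, "nullable": False},
    "spend": {"min": 0.0, "nullable": True},
}


def verify(rows, constraints):
    """Return a list of (row_index, field, problem) for every violation found."""
    failures = []
    for i, row in enumerate(rows):
        for field, rules in constraints.items():
            value = row.get(field)
            if value is None:
                # Missing values are only a failure if the field is declared non-nullable.
                if not rules.get("nullable", True):
                    failures.append((i, field, "missing value"))
                continue
            if "min" in rules and value < rules["min"]:
                failures.append((i, field, "below minimum"))
            if "max" in rules and value > rules["max"]:
                failures.append((i, field, "above maximum"))
            if "allowed" in rules and value not in rules["allowed"]:
                failures.append((i, field, "unexpected value"))
    return failures


# Hypothetical incoming batch: the second and third rows break the constraints.
batch = [
    {"age": 34, "region": "north", "spend": 12.5},
    {"age": 17, "region": "south", "spend": None},
    {"age": 52, "region": "central", "spend": -3.0},
]

for failure in verify(batch, CONSTRAINTS):
    print(failure)
```

Running the same check on both inputs and outputs each time the process runs is one way to catch wrong inputs early, and unexpected values such as a new region can hint that the process is being applied to a population it was not developed on.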

Can We Help?

Whether you are just getting started with data science, have some processes that you suspect can be improved, or need a detailed audit of existing functionality, we can help. Examples of typical engagements include:

building data science teams that practice sound analytics with TDDA principles
auditing and advising on upgrading existing pipelines
carrying out analyses informed by these principles
advising senior non-technical staff on all aspects of high-quality analytical data processes in your company.

Resources

We want everyone to benefit from better analytical processes, so we make a lot of material freely or widely available. The book Test-Driven Data Analysis, by our founder Nick Radcliffe, is available from all good booksellers and all sellers of good books, and is being released to read for free online, a chapter a week.

There’s also an open-source Python library (tdda) with powerful command-line tools, described here.

Company number SC329851. Registered office: 16 Summerside Street, Edinburgh, EH6 4NU.
