Stochastic Solutions

Testing Data and Data Pipelines: Test-Driven Data Analysis (TDDA)

Overview of Test-Driven Data Analysis

Test-driven data analysis (TDDA) is an approach to improving the correctness and robustness of analytical processes by transferring the ideas of test-driven development from the arena of software development to the domain of data analysis, extending and adjusting them where appropriate.

A Methodology and a Toolset

TDDA is primarily a methodology that can be implemented in many different ways, but good tool support can facilitate and drive the uptake of TDDA. Stochastic Solutions provides an open-source (MIT-licensed) Python module, tdda, for this purpose.

Key Ideas

Reference Tests. Reproducible research emphasises the need to capture executable analytical processes and inputs to allow others to reproduce and verify them. Reference tests build on these ideas by also capturing expected outputs and a verification procedure (a “diff” tool) for validating that the output is as expected. The tdda Python module supports testing using comparisons of complex objects with exclusions and regeneration of verified reference outputs.

Constraint Discovery & Verification. There are often things we know should be true of input, output and intermediate datasets, that can be expressed as constraints—allowed ranges of values, uniqueness and existence constraints, allowability of nulls etc. The Python tdda module not only verifies constraints, but generates them from example datasets, thus significantly reducing the effort needed to capture and maintain constraints as processes are used and evolve. Constraints can be thought of as (unit) tests for data.

Motivation

Getting data analysis right is hard. In addition to all the ordinary problems of software development, with data analysis we often face other challenges, including poorly specified analytical goals problematical input data—poorly specified, missing values, incorrect linkage, outliers, data corruption possibility of misapplying methods problems with interpreting input data and results changes in distributions of inputs, invalidating previous analytical choices.

TDDA Resources

Python Library:
pip install tdda git clone https://github.com/tdda/tdda.git
Book: TDDA Book
Blog: TDDA Blog
Mastodon: @tdda@mathstodon.xy
The cover of the book Test-Driven Data Analysis by Nicholas J. Radcliffe.
It is published by Chapman and Hall, part of CRC Press, from Taylor & Francis Group, and is part of the DATA SCIENCE SERIES.
The cover is black with mostly white text and a white graphic.
The graphic is a 3-row by 4-column grid of squares.
Each sqwuare contains a number of dots laid out on a regular 32x32 grid.
The top-left square has 1024 dots (“full”) and working along each row
in turn, the number of dots roughly halves each time, apparently at
random (and, actually, pseudo-randomly).
The last rows boxes have six, two, two, and one dot. 20% discount on the TDDA Book in any format from the publisher with code 26SMA1 until 30 June 2026.
Company number SC329851. Registered office: 16 Summerside Street, Edinburgh, EH6 4NU.
Copyright © Stochastic Solutions Limited 2007–2026.