Stochastic Solutions

Testing Data and Data Pipelines: Test-Driven Data Analysis (TDDA)

Overview of Test-Driven Data Analysis

Test-driven data analysis (TDDA) is an approach to improving the correctness and robustness of analytical processes by transferring the ideas of test-driven development from the arena of software development to the domain of data analysis, extending and adjusting them where appropriate.

A Methodology and a Toolset

TDDA is primarily a methodology that can be implemented in many different ways, but good tool support can facilitate and drive the uptake of TDDA. Stochastic Solutions provides an open-source (MIT-licensed) Python module, tdda, for this purpose.

Key Ideas

Reference Tests. Reproducible research emphasises the need to capture executable analytical processes and inputs to allow others to reproduce and verify them. Reference tests build on these ideas by also capturing expected outputs and a verification procedure (a “diff” tool) for validating that the output is as expected. The tdda Python module supports testing using comparisons of complex objects with exclusions and regeneration of verified reference outputs.

Constraint Discovery & Verification. There are often things we know should be true of input, output and intermediate datasets, that can be expressed as constraints—allowed ranges of values, uniqueness and existence constraints, allowability of nulls etc. The Python tdda module not only verifies constraints, but generates them from example datasets, thus significantly reducing the effort needed to capture and maintain constraints as processes are used and evolve. Constraints can be thought of as (unit) tests for data.

Motivation

Getting data analysis right is hard. In addition to all the ordinary problems of software development, with data analysis we often face other challenges, including poorly specified analytical goals problematical input data—poorly specified, missing values, incorrect linkage, outliers, data corruption possibility of misapplying methods problems with interpreting input data and results changes in distributions of inputs, invalidating previous analytical choices.

TDDA Resources

Python Library:
pip install tdda git clone https://github.com/tdda/tdda.git
Blog: TDDA Blog
Twitter: @tdda0
Company number SC329851. Registered office: 16 Summerside Street, Edinburgh, EH6 4NU.
Copyright © Stochastic Solutions Limited 2007–2023.