Stochastic Solutions

Testing Data and Data Pipelines: Test-Driven Data Analysis (TDDA)

Data Analysis as if the Answers Actually Matter

Today, most software development uses some form of test-driven development (TDD), whereby extensive tests are written for software, often ahead of writing the code, and are run by “continuous integration” systems so that new bugs are likely to be identified as soon as they are introduced. This test-heavy approach was a centrepiece of eXtreme Programming (XP) and the Agile Manifesto. Just as increasing safety in transport systems allows greater speed, the safety afforded by comprehensive tests allows greater speed and freedom when developing and altering software.

Analytical data processes—modelling, reporting, scoring, inference, automated decisioning systems etc.—have traditionally used more informal processes with less emphasis on rigorous testing, but the same principles apply. As the world embraces ever more highly automated data-based decisioning, control, and reporting systems, the need for stronger testing, validation, and monitoring becomes ever greater.

Stochastic Solutions practices, teaches and advocates test-driven data analysis (TDDA), a methodology for data processes that carries the ideas of test-driven development for testing software correctness into the realm of data science, while extending them to encompass a focus on the correctness and validity of data at all stages of the pipeline, the meaningfulness of the analysis, and correctness of interpretation when formulating and communicating analyses.

A Typical Analytical Pipeline and its Failure Modes

Block diagram showing the phases of a typical data science project, with ticks showing the successful path and crosses marking the various failure modes. The main part of the diagram consists of six circles from left to right. The first five circles have failure mode text under them and an error class below that.

The first circle is CHOOSE APPROACH, with the failure mode ‘Fail to understand data, problem domain, or methods’ and the associated error class ERROR OF INTERPRETATION (error of formulation). The second circle is DEVELOP ANALYTICAL PROCESS, with the failure mode ‘Mistakes during coding’ and the associated error class ERROR OF IMPLEMENTATION (bug). The third circle is RUN ANALYTICAL PROCESS, with the failure mode ‘Use the software incorrectly’ and the associated error class ERROR OF PROCESS (operator error). The fourth circle is PRODUCE ANALYTICAL RESULTS, with the failure mode ‘Mismatch between development data or assumptions and deployment data’ and the associated error class ERROR OF APPLICABILITY (category error). The fifth circle is INTERPRET ANALYTICAL RESULTS, with the failure mode ‘Misinterpret the results’ and the associated error class ERROR OF INTERPRETATION (communication error).

Arrows with ticks point from each circle to the one to its right, representing the happy/successful path. Arrows with crosses point down from each circle to its failure mode and then to the error class. The final circle is labelled ‘First, Do No Harm’. It has a ticked arrow pointing down to a terminal state of SUCCESS (a square box). Underneath that is the error class ERROR OF JUDGEMENT (real harm), with an arrow pointing to a square, terminal FAILURE state. An arrow with a cross points from First, Do No Harm to ERROR OF JUDGEMENT. All of the error classes have an arrow pointing to FAILURE.

Above the first two circles is a box labelled DEVELOPMENT PHASE, with the description ‘Using sample/initial datasets and inputs to develop the process’. Above the next three circles is a box labelled OPERATIONAL PHASE, with the description ‘Using the process with other datasets and inputs, possibly having different characteristics’. There is a dashed arrow from the last circle (First, Do No Harm) back to the circle RUN ANALYTICAL PROCESS, representing the fact that the operational phase is repeated.

Most analytical processes map directly onto this diagram. There are characteristic failures that happen at each stage of the pipeline. TDDA is built around identifying and tackling each of these failure modes.
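To make one of these failure modes concrete, here is a minimal sketch of guarding against an error of applicability: constraints are derived from development data and then checked against deployment data. It is an illustration only, using plain Python. The names `FieldConstraints`, `discover` and `verify`, and the sample `age` data, are hypothetical and are not the API of the tdda library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldConstraints:
    """Constraints observed for one numeric field in the development data."""
    minimum: float
    maximum: float
    allows_missing: bool


def discover(rows, field):
    """Derive simple constraints for `field` from development rows (dicts)."""
    values = [row.get(field) for row in rows]
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError(f"no non-missing values for field {field!r}")
    return FieldConstraints(
        minimum=min(present),
        maximum=max(present),
        allows_missing=len(present) < len(values),
    )


def verify(rows, field, constraints):
    """Return a list of human-readable constraint violations found in `rows`."""
    failures = []
    for i, row in enumerate(rows):
        value = row.get(field)
        if value is None:
            if not constraints.allows_missing:
                failures.append(f"row {i}: {field} is missing")
        elif not constraints.minimum <= value <= constraints.maximum:
            failures.append(
                f"row {i}: {field}={value} outside "
                f"[{constraints.minimum}, {constraints.maximum}]"
            )
    return failures


# Hypothetical development (sample) data and later deployment data.
development = [{"age": 34}, {"age": 51}, {"age": 27}]
deployment = [{"age": 40}, {"age": -1}, {"age": None}]

age_constraints = discover(development, "age")
problems = verify(deployment, "age", age_constraints)
for problem in problems:
    print(problem)
```

Constraints discovered this way only describe what the development data happened to contain, so in practice they would usually be reviewed and loosened or tightened before being used to reject operational data.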

Can We Help?

Whether you are just getting started with data science, have some processes that you suspect can be improved, or need a detailed audit of existing functionality, we can help. Examples of typical engagements include:

Resources

We want everyone to benefit from better analytical processes, so we make a lot of material freely or widely available. Test-Driven Data Analysis, the book by our founder, Nick Radcliffe, is available from all good booksellers and all sellers of good books, and is also being released online, free to read, a chapter a week.

Cover of book: Test-Driven Data Analysis, by Nicholas J. Radcliffe. Published by Chapman and Hall/CRC Press (Taylor & Francis Group), part of the Data Science Series. The cover is black with mostly white text and a white graphic. The graphic is a 3-row by 4-column grid of squares, each containing dots laid out on a regular 32x32 grid. The top-left square is full (1024 dots) and working along each row in turn, the number of dots roughly halves each time, apparently at random. The last row's boxes have six, two, two, and one dot.

There’s also an open-source Python library (tdda) with powerful command-line tools.

Company number SC329851. Registered office: 16 Summerside Street, Edinburgh, EH6 4NU.
Copyright © Stochastic Solutions Limited 2007–2026.