Testing Data & Data Processes with AI & Python
Half-day Training • Edinburgh • 20th March 2019
Location: BMA Scotland, 14 Queen Street, Edinburgh, EH2 1LL, Scotland.
DataFest 2019 brings together local and international talent, industry, academia and enthusiasts who all share at least one interest — data! With a desire across sectors to succeed at Data Driven Innovation, how can we be sure that our data — our raw material — is as good as it should be?
This training brings the ideas and benefits of test-driven development to data analysis. Using the open-source Python TDDA (test-driven data analysis) library, we'll work with data in CSV files, Pandas DataFrames, and relational databases.
Part 1: Testing Data Processes and Pipelines
Introduction to reference tests and how these can be written for various kinds of analytical processes over different data types. Topics will include:
- Motivation for and introduction to testing
- Special considerations for testing analytical software and processes
- Testing and regenerating complex and partially variable outputs, and supporting diff tools.
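As a rough illustration of the reference-test idea in the topics above, the sketch below compares a process's text output with a stored reference while masking a part that legitimately varies between runs. It uses only the Python standard library and is not the TDDA API; the timestamp pattern, the function names and the sample reports are assumptions made up for this example.

```python
import difflib
import re

# Hypothetical pattern for the part of the output that varies between runs.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def normalise(text):
    """Replace variable parts with a fixed token so that runs are comparable."""
    return [TIMESTAMP.sub("<TIMESTAMP>", line) for line in text.splitlines()]


def check_against_reference(actual, reference):
    """Return unified-diff lines; an empty list means the output matches."""
    return list(difflib.unified_diff(
        normalise(reference), normalise(actual),
        fromfile="reference", tofile="actual", lineterm=""))


# Made-up outputs: the reference, a rerun that differs only in its timestamp,
# and a rerun in which a computed value has changed.
reference = "Report generated 2019-03-01 09:00:00\nrows: 3\nmean: 4.5"
same_run = "Report generated 2019-03-20 14:00:00\nrows: 3\nmean: 4.5"
bad_run = "Report generated 2019-03-20 14:00:00\nrows: 3\nmean: 4.7"

assert check_against_reference(same_run, reference) == []

# A real change is reported as a diff that a reviewer can inspect.
for line in check_against_reference(bad_run, reference):
    print(line)
```

When a change in output is intended, the stored reference would be regenerated from the new output after review, rather than edited by hand.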
Part 2: Using AI to Generate Constraints from Data and Using Them to Detect Bad Data
Using constraints to verify data, including:
- identification of unexpected changes, outliers, duplicates, missing and disallowed values
- advanced string verification, including automatic generation of regular expressions to characterise patterns in text data using rexpy.
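To give a feel for what generating a regular expression from example strings involves, here is a deliberately simplified sketch. It is not rexpy and is far less capable: it only handles examples that share the same sequence of character classes. The function names are hypothetical, and apart from the venue postcode the sample strings are made up for illustration.

```python
import re
import string
from itertools import groupby


def char_class(ch):
    """Map a character to a regex fragment: a class or an escaped literal."""
    if ch in string.digits:
        return r"\d"
    if ch in string.ascii_uppercase:
        return "[A-Z]"
    if ch in string.ascii_lowercase:
        return "[a-z]"
    return re.escape(ch)


def structure(s):
    """Run-length encode a string as [(fragment, count), ...]."""
    return [(frag, len(list(run))) for frag, run in groupby(map(char_class, s))]


def infer_regex(examples):
    """Infer one regex covering all examples, or None if their shapes differ."""
    if not examples:
        return None
    shapes = [structure(s) for s in examples]
    fragments = [frag for frag, _ in shapes[0]]
    if any([f for f, _ in shape] != fragments for shape in shapes):
        return None
    parts = []
    for i, frag in enumerate(fragments):
        counts = [shape[i][1] for shape in shapes]
        lo, hi = min(counts), max(counts)
        if lo == hi == 1:
            parts.append(frag)
        elif lo == hi:
            parts.append(f"{frag}{{{lo}}}")
        else:
            parts.append(f"{frag}{{{lo},{hi}}}")
    return "^" + "".join(parts) + "$"


# Illustrative strings in a postcode-like format.
postcodes = ["EH2 1LL", "EH12 9DQ", "G1 1XQ"]
pattern = infer_regex(postcodes)
print(pattern)

# The inferred pattern can then be used to flag values that do not conform.
for value in ["EH2 1LL", "not a postcode"]:
    print(value, bool(re.match(pattern, value)))
```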
Crucially, we will show not only how constraints can be used to detect change and problems in data, but also how those constraints can be automatically generated using AI methods in the tdda library.
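The two halves of that workflow, generating constraints from known-good data and then verifying new data against them, can be sketched in plain Python as follows. This is an illustration of the idea only, not the tdda library's API or its discovery method; the constraint names, the functions and the sample records are assumptions invented for the example, and every field is assumed to have at least one non-missing value.

```python
def discover(rows):
    """Infer simple per-field constraints from known-good rows (list of dicts)."""
    constraints = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        present = [v for v in values if v is not None]
        c = {
            "allow_nulls": len(present) < len(values),
            "no_duplicates": len(set(present)) == len(present),
        }
        if isinstance(present[0], (int, float)):
            c["min"], c["max"] = min(present), max(present)
        else:
            c["allowed_values"] = sorted(set(present))
        constraints[field] = c
    return constraints


def verify(rows, constraints):
    """Return (row_index, field, problem) tuples for every violation found."""
    failures = []
    for field, c in constraints.items():
        seen = set()
        for i, row in enumerate(rows):
            v = row.get(field)
            if v is None:
                if not c["allow_nulls"]:
                    failures.append((i, field, "missing value"))
                continue
            if "min" in c and not (c["min"] <= v <= c["max"]):
                failures.append((i, field, "out of range"))
            if "allowed_values" in c and v not in c["allowed_values"]:
                failures.append((i, field, "disallowed value"))
            if c["no_duplicates"] and v in seen:
                failures.append((i, field, "duplicate"))
            seen.add(v)
    return failures


# Made-up records: a known-good batch and a later batch with problems.
good = [
    {"id": 1, "region": "north", "amount": 10.0},
    {"id": 2, "region": "south", "amount": 25.5},
    {"id": 3, "region": "north", "amount": 17.25},
]
new = [
    {"id": 1, "region": "north", "amount": 12.0},
    {"id": 1, "region": "east", "amount": 900.0},
    {"id": 3, "region": "south", "amount": None},
]

constraints = discover(good)
for failure in verify(new, constraints):
    print(failure)
```

Constraints discovered this way are only as general as the example data, so in practice they would be reviewed and loosened or tightened before being relied on.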
The methods and tools are applicable to structured data and data pipelines using any software, not just Python.
WHO IS THIS TRAINING FOR?
The course is primarily aimed at practising data scientists with some familiarity with Python, or programmers coming to data science. Previous experience of testing and Pandas will be advantageous but is not required.
Although the library used is written in Python, the data testing is almost entirely language-neutral, and even data processes written in other languages can be tested from within a Python test script.
Non-programmers with an interest in QA for data and data processes will also benefit from some of the overview material, and are welcome to attend, but may need more help with the hands-on parts of the course.
It is essential that attendees bring a laptop (Mac, Linux or Windows) with a working Python environment that includes Pandas, NumPy and the TDDA library (tdda; available with pip from PyPI, and in source form on GitHub).
Detailed instructions on system configuration will be supplied to registered attendees before the session, as well as instructions on how to test the installation. These instructions are also available here.
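As a quick preliminary check, and not a substitute for the official installation test supplied to attendees, a short script along these lines reports whether the three required packages can be found by the Python interpreter in use. The function name is hypothetical; the package names are the ones listed above.

```python
import importlib.util

# Packages named in the requirements above.
REQUIRED = ["pandas", "numpy", "tdda"]


def missing_packages(names=REQUIRED):
    """Return the names of packages that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All required packages were found.")
```

This only confirms that the packages can be located, not that they work correctly.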
Help will be available at the venue in the 30 minutes before the workshop starts (from 13:30) for anyone who has been unable to configure their environment.