TEST-DRIVEN DATA ANALYSIS

[Figure: stages of an analytical process and the error that can arise at each. Choose approach (misinterpret problem or methods: error of interpretation) -> Develop (mistakes during coding: error of implementation, i.e. a bug) -> Run (use the software incorrectly: error of process) -> Produce results (data drift: error of applicability) -> Interpret (misinterpret results: error of interpretation) -> Success, then rerun on updated data.]
  • Is your data science as good as it could be?

  • How much of the time do you think your analytical results are even broadly correct?

  • Data science as if the answers actually mattered?

  • Why should anyone believe your analytical results?

Test-Driven Data Analysis (TDDA)

Overview of Test-Driven Data Analysis

Test-driven data analysis (TDDA) is an approach to improving the correctness and robustness of analytical processes by transferring the ideas of test-driven development from the arena of software development to the domain of data analysis, extending and adjusting them where appropriate.

A Methodology and a Toolset

TDDA is primarily a methodology that can be implemented in many different ways, but good tool support can facilitate and drive the uptake of TDDA. Stochastic Solutions provides an open-source (MIT-licensed) Python module, tdda, for this purpose.

Key Ideas

Reference Tests. Reproducible research emphasises the need to capture executable analytical processes and inputs to allow others to reproduce and verify them. Reference tests build on these ideas by also capturing expected outputs and a verification procedure (a “diff” tool) for validating that the output is as expected. The tdda Python module supports testing using comparisons of complex objects with exclusions and regeneration of verified reference outputs.
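The reference-test idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the tdda module's own API: the function names (`strip_excluded`, `reference_test`, `make_report`), the report format and the exclusion pattern are all hypothetical.

```python
import os
import re
import tempfile

def strip_excluded(text, ignore_patterns):
    """Drop lines matching any exclusion pattern (e.g. run dates)."""
    patterns = [re.compile(p) for p in ignore_patterns]
    return [line for line in text.splitlines()
            if not any(p.search(line) for p in patterns)]

def reference_test(actual, ref_path, ignore_patterns=(), regenerate=False):
    """Compare `actual` with the stored reference output.

    Returns a list of differences (empty means the test passes).
    With regenerate=True, the verified output becomes the new reference.
    """
    if regenerate:
        with open(ref_path, "w") as f:
            f.write(actual)
        return []
    with open(ref_path) as f:
        expected = f.read()
    got = strip_excluded(actual, ignore_patterns)
    want = strip_excluded(expected, ignore_patterns)
    diffs = [(n, w, g)
             for n, (w, g) in enumerate(zip(want, got), start=1) if w != g]
    if len(got) != len(want):
        diffs.append(("line count", len(want), len(got)))
    return diffs

def make_report(total, run_date):
    """Stand-in for an analytical process that produces text output."""
    return "Report generated: %s\nTotal sales: %d\n" % (run_date, total)

ref_path = os.path.join(tempfile.mkdtemp(), "report.txt")
exclusions = [r"^Report generated:"]

# Store a verified output as the reference.
reference_test(make_report(1250, "2024-01-01"), ref_path, regenerate=True)

# A later run differs only in the excluded date line, so it passes.
same = reference_test(make_report(1250, "2024-06-30"), ref_path, exclusions)

# A run with a different total is reported as a difference.
changed = reference_test(make_report(1300, "2024-06-30"), ref_path, exclusions)
print(same)     # []
print(changed)  # [(1, 'Total sales: 1250', 'Total sales: 1300')]
```

The exclusion list is what makes such tests practical for analytical output, which often contains dates, version strings or host names that legitimately vary between runs.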

Constraint Discovery & Verification. There are often things we know should be true of input, output and intermediate datasets, that can be expressed as constraints—allowed ranges of values, uniqueness and existence constraints, allowability of nulls etc. The Python tdda module not only verifies constraints, but generates them from example datasets, thus significantly reducing the effort needed to capture and maintain constraints as processes are used and evolve. Constraints can be thought of as (unit) tests for data.
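As a rough illustration of discovery followed by verification, the sketch below infers simple constraints from example records and then checks a new batch against them. It uses only the standard library and does not reproduce the tdda module's interface or file format; the function names (`discover`, `verify`), the constraint names and the data are invented for the example.

```python
def discover(rows):
    """Infer per-field constraints from example rows (a list of dicts).

    Assumes every field has at least one non-null example value.
    """
    constraints = {}
    for field in rows[0]:
        values = [r[field] for r in rows]
        present = [v for v in values if v is not None]
        c = {
            "type": type(present[0]).__name__,
            "allow_null": len(present) < len(values),
            "unique": len(set(present)) == len(present),
        }
        # Range constraints only make sense here for numeric fields.
        if all(isinstance(v, (int, float)) for v in present):
            c["min"] = min(present)
            c["max"] = max(present)
        constraints[field] = c
    return constraints

def verify(rows, constraints):
    """Return a list of (field, constraint, row_index) failures."""
    failures = []
    for field, c in constraints.items():
        seen = set()
        for i, row in enumerate(rows):
            v = row.get(field)
            if v is None:
                if not c["allow_null"]:
                    failures.append((field, "allow_null", i))
                continue
            if type(v).__name__ != c["type"]:
                failures.append((field, "type", i))
                continue
            if "min" in c and v < c["min"]:
                failures.append((field, "min", i))
            if "max" in c and v > c["max"]:
                failures.append((field, "max", i))
            if c["unique"] and v in seen:
                failures.append((field, "unique", i))
            seen.add(v)
    return failures

examples = [
    {"id": 101, "age": 34, "email": "a@example.com"},
    {"id": 205, "age": 51, "email": None},
    {"id": 309, "age": 27, "email": "c@example.com"},
]
constraints = discover(examples)

new_batch = [
    {"id": 150, "age": 40, "email": "d@example.com"},
    {"id": 150, "age": 130, "email": None},   # duplicate id, implausible age
]
failures = verify(new_batch, constraints)
print(failures)  # [('id', 'unique', 1), ('age', 'max', 1)]
```

Discovered constraints are only as good as the example data, so in practice they would normally be reviewed and loosened or tightened by someone who knows the data before being relied on.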

Motivation

Getting data analysis right is hard. In addition to all the ordinary problems of software development, with data analysis we often face other challenges, including

  • poorly specified analytical goals
  • problematical input data: poorly specified, missing values, incorrect linkage, outliers, data corruption
  • possibility of misapplying methods
  • problems with interpreting input data and results
  • changes in distributions of inputs, invalidating previous analytical choices.

TDDA Resources

Python Library:
pip install tdda
git clone https://github.com/tdda/tdda.git
Blog: TDDA Blog
Twitter: @tdda0

Services

TARGETING

SEGMENTATION & PROFILING

DATA QUALITY SYSTEMS

ETL AND DATA CONSOLIDATION

REPORTING

FACILITATED, DATA-INFORMED STRATEGY

Why Us?

Lots of people can build customer behaviour models for you, or audit your analytical marketing, or discuss your customer management strategy. Most of them are bigger and better known than Stochastic Solutions. So why us?

What we're best at is aligning all the maths and stats and technologies that businesses use to deliver effective customer management towards the organization's goals. We can engage across the full spectrum, from setting good marketing goals through accurate measurement of success to segmentation, modelling and optimization. In short, we concentrate on asking the right questions. Often, that leads to a change of goal and problem formulation. When it does, sometimes the same methods suffice to tackle the new formulation, and sometimes new or different methods are needed; if they are, we develop or find those.

Targeting

While targeting using conventional response modelling is generally much more effective than either a "gut-feel" approach or blanket contact, there are some unpalatable and under-appreciated facts.

  • It is normally assumed that the worst outcome direct marketing activity can have is to waste money. In fact, some direct marketing provably drives away business within certain segments, and it is not unknown for it to drive away more business in total than it generates. This is especially true in retention activity.
  • The use of control groups is a cornerstone of state-of-the-art customer targeting, and is certainly a prerequisite for allowing companies to measure the true incremental impact of any one-to-one customer management approach. However, measuring the net effect of a marketing programme is not the same as optimizing that net effect.
  • Even in the most analytically sophisticated companies, it is surprisingly common for false conclusions to be drawn from control groups. There are many and varied causes of this. One common cause is that somewhere between conception and execution of the campaign, some influence causes control groups to be invalidated. Another is that post-campaign analysis fails, in one way or another, to perform a valid like-for-like comparison, again leading to invalid conclusions.

Stochastic Solutions staff have deep experience of both the design of direct marketing programmes and their post-campaign analysis. We can use this expertise to audit and verify the effectiveness of current practices, and to work with companies to help ensure the best planning of future activity.

In addition to this, we have deep expertise in a scientific approach to taking marketing to the next stage, using uplift modelling to optimize the targeting of direct marketing and customer management activity to maximize the net (or incremental) impact of campaigns.

Of course, uplift modelling is no panacea, and will not always lead to better results. In some situations, the uplift approach adds nothing because an uplift model ends up targeting the same people as a conventional approach. This situation pertains when incremental impact and purchase rates are strongly correlated. In other cases, typically when control groups are very small, there is too much noise in the data for an uplift approach to be effective at all, though remarkable strides have been made in extracting meaningful patterns even with unreasonably small control groups.

Frequently, however, the difference an uplift approach makes is breathtaking. We have used the uplift approach to double the profitability of already highly profitable campaigns; in other cases, we have taken campaigns that were heavily loss-making, sometimes because of the sort of negative impacts discussed above, and found segments of customers who can be profitably targeted.

Whatever stage of sophistication your business is at with targeting or other customer decisioning, Stochastic Solutions can help you to take it to the next level. If there is potential to benefit from more sophisticated use of control groups and incremental modelling, we can help you chart a path to gaining it. If there's not, we can at least ensure that you have in place the tools and methods to allow you to detect that potential if and when it arises.

Better Retention Targeting With Uplift Modelling

Most retention is implicitly based on the idea that the best people to target are those most likely to leave. This is rather like trying to improve an exam pass rate by directing most attention to the lowest achievers: it may be heroically worthwhile, but it probably isn't the easiest way to achieve the stated goal.

Churn and attrition models prioritize customers whose probability of leaving is highest. Such customers tend to be dissatisfied, so are usually hard to retain. To make matters worse, in many cases, the only thing currently keeping them is inertia, and interventions run a serious risk of back-firing, triggering the very defections they seek to avoid.

It is more profitable to focus retention activity on those people who are easiest to save—those most receptive to our retention programmes. Like focusing effort on students who are otherwise likely narrowly to fail the exam, this is generally the most efficient strategy for improving the measured outcome.

Uplift retention Venn Diagram

The customers who generate a positive return on investment from retention activity are those in red: the people who will leave without an intervention, but who can be persuaded to stay. Uplift models allow you to target them, and them alone. At all costs, you want to avoid targeting the group in black (so-called Sleeping Dogs), whose defection you are likely to trigger by your intervention. Again, uplift models can direct you away from those customers.

In contrast, standard approaches based on churn or attrition scores tend to direct attention towards the wrong groups, including, in many cases, the Sleeping Dogs. Targeting them is a disaster, as the organization actually spends money to drive away business. Even where this is avoided, traditional targeting inevitably focuses attention on customers who are hard to save, while overlooking those who are more receptive.
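To make the distinction between these groups concrete, here is a small sketch that estimates retention uplift per segment from treated and control groups and labels each segment accordingly. It is a simple difference-of-rates calculation, not an uplift model, and it ignores statistical noise. The segment names, counts, tolerance and function names are all made up for illustration.

```python
def retention_uplift(treated_kept, treated_n, control_kept, control_n):
    """Retention rate among treated customers minus the rate among controls."""
    return treated_kept / treated_n - control_kept / control_n

def classify(uplift, tolerance=0.01):
    """Label a segment by the sign of its estimated uplift."""
    if uplift > tolerance:
        return "persuadable"
    if uplift < -tolerance:
        return "sleeping dog"
    return "unaffected"

# segment: (treated retained, treated size, control retained, control size)
segments = {
    "new, on promotional tariff": (880, 1000, 800, 1000),
    "long-tenure, out of contract": (700, 1000, 780, 1000),
    "mid-tenure, in contract": (950, 1000, 950, 1000),
}

labels = {name: classify(retention_uplift(*counts))
          for name, counts in segments.items()}
for name, label in labels.items():
    print(name, "->", label)
```

In this invented data the second segment has the lowest retention, so a churn score would rank it first for contact, yet contacting it lowers retention further. Only the first segment repays the intervention.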

Stochastic Solutions has unparalleled experience in helping companies to build uplift models that predict the incremental impact on retention of targeting each customer. Standard stats packages and methods simply cannot build uplift models, so you need a specialist approach. By using such incremental models, you align your targeting with the outcome that you measure (the net increase in retention achieved by your campaign) and the very metric that determines the value of the retention activity.

Contact Stochastic Solutions on +44 7713 787 602 or at info@StochasticSolutions.com, and let us help you increase sales by targeting the people whose behaviour is actually positively influenced by your marketing.

Cross-Selling With Uplift Modelling

You probably already use a control group to measure the net impact of your marketing. You do this because you know that some of the people who buy after being exposed to your marketing would have bought anyway. The control group allows you to measure the incremental impact or uplift.
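The arithmetic behind a control-group measurement is short enough to show directly. The sketch below uses invented campaign numbers, and the function name `incremental_sales` is hypothetical.

```python
def incremental_sales(treated_buyers, treated_n, control_buyers, control_n):
    """Return (uplift in purchase rate, estimated extra sales among treated)."""
    treated_rate = treated_buyers / treated_n
    control_rate = control_buyers / control_n
    uplift = treated_rate - control_rate
    return uplift, uplift * treated_n

# 100,000 people mailed, 3,000 bought; 10,000 held out, 200 bought.
uplift, extra = incremental_sales(3000, 100000, 200, 10000)
print(round(uplift, 4), round(extra))  # 0.01 1000
```

On these numbers, only about 1,000 of the 3,000 purchases among mailed customers are attributable to the campaign; the other 2,000 or so would have happened anyway.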

But unless you're very unusual, when choosing who to target, you don't use an incremental approach: you just use a response model, or a propensity model, to try to find people who are likely to buy, with no regard to incrementality.

Uplift Cross-Sell Venn Diagram

The only prospects that generate a return on marketing investment are those in red—the people who buy only when they receive your marketing. Uplift models allow you to target them, and them alone.

In contrast, standard approaches based on response or propensity models direct the bulk of their effort at those shown in white (people unaffected by the marketing), and possibly even at the group shown in black (people negatively affected by your marketing), while sometimes missing some of the persuadable reds. This is doubly bad, resulting in wasted spend, targeting people who would have bought anyway; and missed opportunities, failing to target people who may not be very likely to buy even if you do target them, but are almost certain not to if you don't.

Stochastic Solutions has unparalleled experience in helping companies to build uplift models that predict the incremental impact on sales of targeting each person in your prospect pool. Standard stats packages and methods simply cannot build uplift models, so you need a specialist approach. By using such incremental models, you align your targeting with the outcome that you measure (the lift of your cross-sales campaign) and the very metric that determines the volume of sales you make.

Triggered Churn

Optimization

Randomized, but not Random

The first thing to understand about randomized (stochastic) search is that it is not the same thing as random search. Not even close.

It is this fundamental confusion that is behind many people's difficulty with the idea that evolution could possibly have produced the richness and sophistication of life we see on Earth. They focus on the "random" nature of mutation and reason that just changing things randomly can't possibly produce a brain, a butterfly, an oak tree or even a single-cell organism. And they're right. It's selection that does the heavy lifting. The random nature of mutation simply provides variation for selection (survival of the fittest) to winnow down. Most mutations are harmful, destroying useful features that have been built up, and most of those that aren't harmful are neutral, neither improving nor harming the organism. It's the rare few that actually make something better, and it's the role of selection to favour those few. Even then, the process isn't automatic: an organism with an advantageous mutation axiomatically has a better chance of surviving and reproducing than the same organism without it (because that's how we define selective advantage). But that organism can be unlucky and die young or fail to reproduce. So selection too has a strong random element. However, even a small and probabilistic selective advantage is multiplied exponentially through the generations, with the consequence that improving mutations build up.

Some of the stochastic search methods we use at Stochastic Solutions are directly modelled on natural evolution — techniques such as genetic algorithms, evolution strategies and genetic programming. Others, like simulated annealing, take their inspiration from other natural stochastic processes, such as the way a metal cools.
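The difference between random search and randomized search with selection can be seen on a toy problem. The sketch below compares the two on the task of maximizing the number of 1s in a bit string, giving each the same number of evaluations. It is a deliberately simple mutate-and-select loop, chosen for illustration rather than taken from any method used by Stochastic Solutions; the problem size, budget and seeds are arbitrary.

```python
import random

def fitness(bits):
    """Toy objective: count of 1s in the bit string."""
    return sum(bits)

def random_search(n_bits, evaluations, rng):
    """Sample fresh random strings and remember the best fitness seen."""
    best = 0
    for _ in range(evaluations):
        candidate = [rng.randint(0, 1) for _ in range(n_bits)]
        best = max(best, fitness(candidate))
    return best

def mutate_and_select(n_bits, evaluations, rng):
    """Randomly mutate the current string; keep the child if it is no worse."""
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    current_fit = fitness(current)
    for _ in range(evaluations - 1):
        # Each bit flips independently with probability 1/n_bits.
        child = [b ^ 1 if rng.random() < 1.0 / n_bits else b for b in current]
        child_fit = fitness(child)
        if child_fit >= current_fit:          # selection
            current, current_fit = child, child_fit
    return current_fit

random_best = random_search(64, 2000, random.Random(0))
selected_best = mutate_and_select(64, 2000, random.Random(0))
print("random search:", random_best, "of 64")
print("mutation plus selection:", selected_best, "of 64")
```

Both loops are driven entirely by random numbers. The only difference is that the second one keeps what works, which is enough to let small improvements accumulate.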

Representation • Domain Knowledge • Move Operators

Our approach to search is informed by the insight that three features are dominant in determining the effectiveness of optimization methods. These are domain knowledge, problem representation and choice of move operators.

Red triangle illustrating the central roles of representation, domain knowledge and move operators in search

It all starts with domain knowledge, because without that stochastic methods are reduced to the very aimless wandering that is evolution's caricature.* So our first step is always to capture what is known about the problem from whatever sources of information are available. This can include interviewing domain experts, studying current and previous approaches, reviewing the literature and, where possible, directly probing or studying whatever system is being optimized.

The domain knowledge then has to be encapsulated in a way that makes it available in a useful form to the search algorithm. This is achieved through a combination of the choice of problem representation (logical, rather than physical, normally) and the move operators to be employed during the search.
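A standard textbook illustration of this point is route planning, where the domain knowledge is that a valid route visits every location exactly once. The sketch below is an assumption-laden example and not an instance of forma analysis: it represents a route as a permutation and uses a move operator (reversing a stretch of the route) that can only ever produce another valid route and changes at most two of its links. All names and the random test data are hypothetical.

```python
import math
import random

def tour_length(tour, points):
    """Total length of the closed tour visiting points in the given order."""
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def reverse_segment(tour, rng):
    """Move operator: reverse a random stretch of the tour.

    The result is always a permutation, so no invalid routes are generated.
    """
    i, j = sorted(rng.sample(range(len(tour)), 2))
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def improve(points, steps, rng):
    """Stochastic local search: accept a move only if the tour gets no longer."""
    tour = list(range(len(points)))
    rng.shuffle(tour)
    start_length = best_length = tour_length(tour, points)
    for _ in range(steps):
        candidate = reverse_segment(tour, rng)
        length = tour_length(candidate, points)
        if length <= best_length:
            tour, best_length = candidate, length
    return tour, start_length, best_length

rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(20)]
tour, start_length, best_length = improve(points, 2000, rng)
print("start: %.2f  after search: %.2f" % (start_length, best_length))
```

A naive representation, such as a list of 20 arbitrary location indices with a mutation that overwrites one entry, would spend most of its effort on candidates that skip or repeat locations. Here that knowledge is built into the representation and operator, so the search never has to rediscover it.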

Nick Radcliffe, who founded Stochastic Solutions, has worked for many years on the relationship between these three pivotal aspects of search, and has developed, through a series of publications, a solid theory of representation for stochastic search in general, and evolutionary algorithms more particularly, called forma analysis. This is an intensely practical theory that helps move from specific insights about a problem, through a systematic process that aids the production of suitable problem representations and move operators. These can then be used directly, or modified further, using heuristic insights, to produce a sound and effective approach to the problem at hand.

*The careful reader may wonder where natural evolution's "domain knowledge" comes from. The difference here arises because our goal is to harness the power of evolution to a particular end, usually to optimize a function. In natural evolution, the goal is implicit: it is survival through the generations. It is in bending evolution to our own ends that the requirement for domain knowledge surfaces.

Hybridization

Staff at Stochastic Solutions have a long history of harnessing and exploiting the power of random variation and using it to solve challenging industrial and commercial problems. We do this by combining strong theoretical and technical knowledge of cutting-edge techniques with ruthlessly practical and pragmatic approaches to exploiting all other information and methods that can help to crack the problem in question. This leads us to favour hybrid approaches, whereby we try to incorporate existing search and optimization approaches into either evaluation functions or move operators. Because stochastic search methods, especially those based on evolutionary paradigms, provide excellent frameworks for this approach, this usually allows us to produce systems that out-perform both the existing approaches and a purer methodology based on a single stochastic search paradigm. We love theory, and admire purity, but in the end we do whatever it takes to get the job done.
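One way to picture a hybrid is a stochastic search whose move operator calls an existing heuristic. The sketch below does this for a small knapsack problem: a random perturbation is followed by a conventional greedy fill, and the search starts from the greedy solution so it can never do worse than the heuristic alone. The problem instance and all function names are invented for illustration and do not describe any particular client system.

```python
import random

def weight(chosen, items):
    return sum(items[i][1] for i in chosen)

def value(chosen, items):
    return sum(items[i][0] for i in chosen)

def greedy_fill(chosen, items, capacity):
    """Existing heuristic: add items by value per unit weight while they fit."""
    chosen = set(chosen)
    by_density = sorted(range(len(items)),
                        key=lambda i: items[i][0] / items[i][1], reverse=True)
    for i in by_density:
        if i not in chosen and weight(chosen, items) + items[i][1] <= capacity:
            chosen.add(i)
    return chosen

def hybrid_search(items, capacity, steps, rng):
    """Stochastic search whose move operator embeds the greedy heuristic."""
    best = greedy_fill(set(), items, capacity)
    for _ in range(steps):
        candidate = set(best)
        # Random perturbation: drop one or two chosen items...
        k = min(rng.randint(1, 2), len(candidate))
        for i in rng.sample(sorted(candidate), k):
            candidate.discard(i)
        # ...and try to force in one item the current best leaves out.
        outside = [i for i in range(len(items)) if i not in best]
        if outside:
            j = rng.choice(outside)
            if weight(candidate, items) + items[j][1] <= capacity:
                candidate.add(j)
        # Hand the partial solution back to the existing heuristic.
        candidate = greedy_fill(candidate, items, capacity)
        if value(candidate, items) > value(best, items):
            best = candidate
    return best

items = [(60, 10), (100, 20), (120, 30)]   # (value, weight)
capacity = 50
greedy_only = greedy_fill(set(), items, capacity)
hybrid = hybrid_search(items, capacity, 200, random.Random(0))
print("greedy alone:", value(greedy_only, items))
print("hybrid:", value(hybrid, items))
```

On this instance the greedy heuristic alone selects the first two items for a value of 160, whereas the best feasible selection (the last two items) is worth 220. The random perturbation gives the heuristic a chance to escape its own fixed ordering.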

Applications

Successful applications of this approach by staff at Stochastic Solutions have come in many industrial and commercial settings. One application was optimizing the design of gas pipelines to supply cities. Here, the goal was to minimize the cost of the pipeline while satisfying all engineering and safety constraints. Another was credit scoring, where we produced a hybrid solution that combined best-practice scorecarding with an evolutionary approach that produced a solution better than had previously been believed to be possible. We have also applied these methods successfully in fields as diverse as retail dealership location, oil production scheduling and computational process placement. More recently, we have harnessed the power of stochastic search to optimize the data-preparation phase that typically dominates the time spent in predictive modelling and data mining.

Whatever your requirement for optimization, search, covering, or constraint satisfaction, Stochastic Solutions will work with you to harness modern search methods to solve your problem.

Our Miró software is an integrated analytical tool covering data extraction, manipulation, exploration, reporting, prediction, and test-driven data analysis. It features a web-based interface for mixed text and graphical output, as well as off-line script execution, and a Python API. It is currently in integrated production use at client sites as well as being a core tool for our consulting engagements.

Exploratory Analysis

Almost every data science project begins with an exploratory phase in which the analyst learns about the data and tests ideas, usually using a mixture of fast-counts and aggregations, visualization, filtering, segmentation, deriving new fields and so forth. Miró is particularly well-suited to this phase, and enhances its utility by keeping an executable audit trail of what has been done, allowing this initial analysis to be efficiently translated into a more production-ready phase.

Production-Oriented Analytics

Miró implements production-oriented analytics, meaning that it focuses on allowing analysts to get results as quickly and painlessly as possible, from data import to production-ready or near-production-ready output. Its Unix-style command-line interface is normally accessed through a web browser, allowing rich text and graphical output, but is also fully functional through a plain-text terminal, locally or on a remote server.

Miró generates high-quality, sometimes graphical output, drawing inspiration from Edward Tufte, minimizing chart junk and maximizing meaningful information content. It also has the ability to produce animated output, HTML reports, text files, Excel spreadsheets and to write directly to database tables.

Test-Driven Data Analysis

Miró includes all the functionality from our open-source TDDA library for test-driven data analysis, together with various enhancements including constraint generation in the presence of bad data, support for between-field constraints, integrated reporting and history tracking and associated profile-and-audit functionality. Miró reads and writes the same TDDA files as the open-source version, allowing the two to be mixed, but gives a more seamless, polished, supported experience compared with the open-source package.

Web Applications

Once an analytical process has been developed using Miró, it is extremely simple to turn it into a web app with an arbitrary user interface. Miró can present any input parameters to a user, run analytical processes, and present the output, all through a standard web browser. There are then layers of customization that can easily be performed to take more control over the input controls, the output layout, the styling etc. through a combination of writing HTML templates, CSS, and—for more interactive applications—JavaScript.

Interfaces

Miró provides multiple interfaces, including a programmatic interface (an API), a command-line/scripting interface and interactive web access. The API layer makes it a powerful base for embedded analytical applications. Miró also includes a very powerful expression language for data manipulation.

Audit-Trail

Miró datasets contain an audit trail showing the sequence of operations that resulted in any final dataset, allowing diagnosis of problems and tracking of data provenance. It also allows the full history of datasets to be reliably traced, even when they may have been worked on across multiple sessions, perhaps on multiple machines, by multiple people.

Scripting by Doing

Miró automatically generates detailed logs providing not only a further audit trail, but also the ability to rerun analysis sessions, either verbatim or with specified modifications. It logs both command sequences and output (in multiple forms), meaning that work is never accidentally lost, results can always be traced, and ad hoc analyses can always be repeated or turned into re-usable scripts.

Cross-Platform

Miró is cross-platform (across Unix, Linux, Mac and Windows) with a focus on standards compliance.

Native and Database Back Ends

All Miró functionality is available using its native back end, in which data is stored in its own column-oriented data store and all manipulations are performed directly by Miró code. This is suitable for both interactive use and batch use.

A significant subset of Miró's functionality is also available using a database back end. In this mode, Miró connects to a database and collects metadata, but does not extract the main data from tables. Rather, Miró issues SQL (and in some cases calls in-database functions) to perform equivalent operations. Depending on the relative power and capacity of the machine running Miró and the database hardware, as well as data volume and the nature of the operations being performed, this can sometimes be faster and sometimes slower than extracting the data into Miró, performing whatever analysis is required, and writing any results back. The level of support varies across database systems, but includes Postgres, Greenplum, MySQL, SQLite and MongoDB.

This approach also allows analytical workflows to be developed in one mode (most commonly using the native back end) and then deployed, with minimal or no changes, using a database. This is a popular development-production split for some clients.
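The general idea of running the same operation either on extracted data or inside the database can be illustrated with Python's built-in sqlite3 module. This is a generic sketch of the two styles and says nothing about how Miró itself is implemented; the table, column and function names are invented.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 20.0)],
)

def totals_extracted(conn):
    """Pull every row out of the database and aggregate in Python."""
    totals = defaultdict(float)
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] += amount
    return dict(totals)

def totals_in_database(conn):
    """Issue SQL so that the database does the aggregation itself."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query))

extracted = totals_extracted(conn)
pushed_down = totals_in_database(conn)
print(extracted)  # {'north': 150.0, 'south': 100.0}
```

Both functions return the same totals. Which is faster depends on how much data has to move and where the computing power sits, which is the trade-off described above.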

About Us

Stochastic Solutions delivers consulting and software in the area of data analysis with a specific focus on customer behaviour modelling. We combine a modern software engineering mindset with deep knowledge and experience of large-scale data and predictive modelling. As a result, we deploy high-quality, tested, large-scale self-monitoring modelling and analysis systems to our clients, using a mixture of standard, packaged and custom software.

Our team combines experience and perspectives from mathematics, statistics, machine learning, software engineering, quality assurance and testing, parallel processing, visualization, and operational research. We produce our own software for data analysis (Miró and the Artists Suite) which we use in conjunction with standard (mostly free and open source) software to deliver client solutions. We place great emphasis on correctness and robustness of solutions, and carry over many of the ideas from software engineering (such as test-driven development, regression testing, automation, revision control) to the analytical domain, ensuring that as we develop and when we deliver solutions to clients, there can be confidence in the correctness and reliability of those solutions. Our analysis software, Miró, is specifically designed to allow efficient exploratory analysis while automatically logging both executable scripts and full results, as well as creating a powerful audit trail and production-ready output. As a result, we are able to move seamlessly from exploratory analysis and prototyping to deliverable solutions without the need to translate or re-implement algorithms or code.

Our People

Nick Radcliffe: Chief Executive Officer
Sam Rhynas: Head of Operations
Simon Brown: Head of Engineering

Nick Radcliffe

Stochastic Solutions was founded by Nick Radcliffe to help companies with targeting and optimization.

Prior to founding Stochastic Solutions, Nick founded and acted as Chief Technology Officer for Quadstone Limited, an Edinburgh-based software house that specialized in helping companies to improve their customer targeting. While there, he led the development of a radically new algorithmic approach to targeting direct marketing which has repeatedly proved capable of delivering dramatic improvements to the profitability of both traditional outbound and more modern inbound marketing approaches, in an approach known as uplift modelling. Quadstone was acquired by Portrait Software in late 2005.

Through working with many companies in financial services, telecommunications and other sectors, it became clear to Nick that uplift modelling can provably increase the profitability of direct marketing for most large B2C companies. However, it became equally clear that there are many non-analytical challenges that prevent the majority of companies from being ready even to evaluate this approach at present, let alone to benefit from it. One of the founding visions of Stochastic Solutions is to help companies improve their approach to the systematic design and measurement of direct marketing activities in ways that bring immediate benefits while also preparing them to be able to evaluate properly the potentially huge benefits of adopting this radical new approach. The concepts around uplift modelling are discussed in his blog, The Scientific Marketer.

Nick is also a Visiting Professor of Mathematics at the University of Edinburgh, working in the Operational Research group. His research has focused on the use of randomized (stochastic) approaches to optimization, and he was one of the early researchers in the now established field of genetic algorithms and evolutionary computation. He has over many years successfully applied stochastic methods to real-world industrial and commercial problems as diverse as retail dealership location, credit scoring, production scheduling and gas pipeline design, and has published several dozen research papers in the area. He has also, while at Quadstone, combined stochastic optimization with data mining to allow new classes of problems to be tackled.

Sam Rhynas

With over 20 years of experience in software development, Sam’s focus lies in delivering meaningful, usable and high-quality solutions to customer problems. She has a background in QA, Release Management and Service Delivery.

Her work on adapting ideas from agile development processes so that they apply to, strengthen and enhance data science projects has contributed to the development of Test-Driven Data Analysis within Stochastic Solutions.

Prior to Stochastic Solutions, Sam headed up the Release and Quality operations group at Aridhia, a healthcare analytics start-up delivering Software as a Service to the NHS and private health care providers abroad. Additionally, as Product Owner and Project Manager on a number of projects, she delivered innovative solutions using data to address a number of key problem areas, from Primary Care Risk Management of patients to Patient Pathway Management and reporting, and an app-based, real-time Symptom Management Alerting System for patients on chemotherapy.

Previous roles include one at Quadstone, where she led the team responsible for QA and for the development of test and deployment frameworks for interactive data analysis tools, including predictive behaviour modelling on Big Data.

Simon Brown

Simon Brown has some 30 years' experience of software development and data analysis, including a particular focus on high-performance, large-scale parallel systems. Prior to Stochastic Solutions, Simon worked for Meiko (a UK parallel computer manufacturer), Quadstone and Aridhia.

He believes strongly in the benefits of the collaborative aspects of agile software development, especially pair-programming, test-driven development, and continual evolution through refactoring. He is particularly interested in how these patterns extend from software development into data analysis and data science.

His work at Stochastic Solutions involves a mixture of investigative analysis of client data and development of bespoke services on live streams of data, working closely with client teams. Alongside this, he contributes to the functionality of Miró, Stochastic Solutions' in-house general-purpose data analysis toolset. He is particularly interested in integrations for live real-time deployment of predictive models, and the frameworks based on emerging standards for this.

At Aridhia, an innovative healthcare startup company, Simon headed up the product engineering group, with responsibility for the development of all of Aridhia's software products and services. These projects all involved taking NHS (and other healthcare) data, processing it, and presenting results to clinical users as live web-application services. For example, he implemented systems to deploy analytical models on live NHS Primary Care data to predict emergency hospital admissions and drug prescription safety, including leading the development teams involved and acting as Product Owner and Project Manager.

Previously, he led the analytics software development team at Quadstone, focusing on building interactive tools and frameworks for predictive behaviour modelling on Big Data.

Get In Touch

Where to Find Us

18 Forth Street
Edinburgh
EH1 3LH

Email Us At

info@StochasticSolutions.com

Call us on

Phone: +44 7713 787 602

COMPANY INFORMATION

Company number SC329851. Registered office: 16 Summerside Street, Edinburgh, EH6 4NU.