This workshop is about data plumbing and the practice of data science. In particular, we’ll discuss how to add tests to your data processing pipeline.
Most people use ad-hoc and implicit “sanity checks” to tell them when transformations have gone off the rails. Borrowing ideas from software development, we’ll talk about ways to make this process more explicit and reproducible.
Questions we’ll address along the way:
– what should these sanity checks look like?
– when should they be run?
– are there tools that can help with this?
– how do these techniques scale?
– can we actually design our data pipelines around such tests?
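To make the idea concrete before the group exercise, here is a minimal, hypothetical sketch of what an explicit sanity check might look like: a transformation step that asserts its own post-conditions instead of relying on an implicit eyeball test. The function names (`check`, `clean_records`) and the data are illustrative assumptions, not part of any particular tool.

```python
def check(condition, message):
    """Fail loudly instead of silently propagating bad data."""
    if not condition:
        raise ValueError(f"sanity check failed: {message}")

def clean_records(records):
    """Example transformation: drop records with a missing age."""
    cleaned = [r for r in records if r.get("age") is not None]
    # Post-conditions: turn the implicit "eyeball the output" step
    # into explicit, reproducible checks that run every time.
    check(len(cleaned) > 0, "all records were dropped")
    check(all(0 <= r["age"] <= 120 for r in cleaned), "age out of range")
    return cleaned

raw = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": None}]
result = clean_records(raw)
```

Checks like these can live inline in the pipeline (run on every execution) or be lifted into a test suite run against sample data; we'll discuss the trade-offs of each placement.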
We’ll spend the second part of the session walking through example data sets and problems, working as a group to identify appropriate sanity checks and their implementations.