Assistant Professor and head of the Big Data Management Group at TU Berlin
A Holistic Approach for Effective Error Detection
Data cleaning is one of the most time-consuming and tedious steps in most data-driven tasks. Typically, it entails identifying erroneous values and correcting them. Effective error detection can significantly improve the subsequent correction step. Research in error detection has produced a variety of approaches, most of which require some prior knowledge about the dataset in order to select and configure the approach with rules and sensitivity thresholds. Often, these approaches cover only a certain type of error. Recently, novel machine learning techniques have been proposed that treat error detection as a classification task. These approaches still require large amounts of training data, scaling with the size of the dataset, to cover the variety of error types residing in a dataset. In this talk, I will present our work in progress towards a holistic error detection system that significantly reduces the number of required labels by leveraging label propagation techniques and meta-learning. In a nutshell, we leverage existing error detection techniques as feature generators. First, I discuss how manually configured off-the-shelf error detection techniques can be aggregated and automatically selected. Then I show how both approaches can be combined and refined into a configuration-free error detection system that requires only about 20 labeled tuples to outperform state-of-the-art techniques.
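
To make the "detectors as feature generators" idea concrete, here is a minimal sketch and not the system presented in the talk. It assumes three toy detectors (missing values, z-score outliers, dominant-pattern mismatch), all hypothetical stand-ins for off-the-shelf tools, and it swaps in a plain nearest-neighbour rule over the detector outputs in place of the label propagation and meta-learning mentioned above. All function names and the sample column are illustrative assumptions.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def missing_detector(column):
    """Flag empty or None cells (hypothetical stand-in for a real tool)."""
    return [1 if v is None or str(v).strip() == "" else 0 for v in column]


def outlier_detector(column, threshold=2.0):
    """Flag numeric cells whose z-score exceeds an assumed threshold."""
    nums = []
    for v in column:
        try:
            nums.append(float(v))
        except (TypeError, ValueError):
            nums.append(None)  # non-numeric cells are left to other detectors
    valid = [x for x in nums if x is not None]
    if len(valid) < 2:
        return [0] * len(column)
    mu, sigma = mean(valid), pstdev(valid)
    if sigma == 0:
        return [0] * len(column)
    return [1 if x is not None and abs(x - mu) / sigma > threshold else 0
            for x in nums]


def shape(value):
    """Collapse digit runs to '9' and letter runs to 'a'."""
    text = "" if value is None else str(value)
    text = re.sub(r"[0-9]+", "9", text)
    return re.sub(r"[A-Za-z]+", "a", text)


def pattern_detector(column):
    """Flag cells whose shape differs from the column's most common shape."""
    if not column:
        return []
    shapes = [shape(v) for v in column]
    dominant = Counter(shapes).most_common(1)[0][0]
    return [0 if s == dominant else 1 for s in shapes]


DETECTORS = (missing_detector, outlier_detector, pattern_detector)


def featurize(column):
    """One feature vector per cell: the binary output of every detector."""
    signals = [detector(column) for detector in DETECTORS]
    return [tuple(cell) for cell in zip(*signals)]


def classify(features, labels):
    """Give each cell the label of the nearest labeled cell (Hamming distance).

    `labels` maps a cell index to 0 (clean) or 1 (error). This is a simple
    stand-in for a learned classifier, not the method from the talk.
    """
    def nearest_label(vec):
        idx = min(labels,
                  key=lambda i: sum(a != b for a, b in zip(vec, features[i])))
        return labels[idx]
    return [nearest_label(vec) for vec in features]


# Illustrative column with an empty cell, a non-numeric cell, and an outlier.
ages = ["34", "29", "", "41", "abc", "38", "950", "27", "31", "45"]
features = featurize(ages)
# Only three cells are labeled by hand: one clean cell and two errors.
predicted = classify(features, {0: 0, 4: 1, 6: 1})
```

In this toy run, the three hand labels are enough to also flag the unlabeled empty cell, because its detector signature is closest to that of the labeled non-numeric cell. A real system would replace the nearest-neighbour rule with a trained classifier and a principled way of spreading labels.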