In ease.ml, our belief (which is of course far from perfect) is that there are two key reasons behind unsatisfactory model quality: (1) the existing data is too noisy and needs to be cleaned, or (2) the existing dataset is too small and more data needs to be acquired.
CPClean aims to provide insight into the impact of data noise on the trained ML model. In principle, it asks the following question: given a noise distribution over each data value (which can be constructed using an ensemble of state-of-the-art data cleaning tools), what is the highest possible accuracy that an ML model can achieve? This value gives the user more insight into what to do next: if such an accuracy is high enough, the user should conduct data cleaning using state-of-the-art techniques; otherwise, data cleaning might simply be a waste of effort, and the user is probably better off acquiring more data.
The technical challenge is that there are exponentially many possible combinations of noisy values; how can we find the one that yields the best ML accuracy? We show that, surprisingly, for k-nearest-neighbour classifiers, this problem can be solved in polynomial (and sometimes linear) time! By using a k-nearest-neighbour classifier as a proxy, CPClean gives users a signal about the potential benefit of data cleaning.
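To make the possible-worlds semantics concrete, here is a toy brute-force sketch, not the polynomial-time algorithm from the paper: it enumerates every combination of candidate repairs for the dirty cells and checks whether a 1-NN classifier predicts the same label in all resulting worlds. All data values and candidate repairs below are made up for illustration.

```python
from itertools import product

# Training set of (feature, label) pairs; None marks a dirty/missing value.
train = [(1.0, "A"), (None, "B"), (4.0, "A")]
# Candidate repairs for each dirty cell (a single dirty cell here).
candidates = [[2.0, 5.0]]

def predict_1nn(world, x):
    # world: (feature, label) pairs with every value filled in
    return min(world, key=lambda p: abs(p[0] - x))[1]

def certain_prediction(train, candidates, x):
    dirty = [i for i, (f, _) in enumerate(train) if f is None]
    labels = set()
    for choice in product(*candidates):  # one iteration per possible world
        world = list(train)
        for i, v in zip(dirty, choice):
            world[i] = (v, world[i][1])
        labels.add(predict_1nn(world, x))
    # Certainly predicted iff every possible world yields the same label
    return len(labels) == 1, labels

print(certain_prediction(train, candidates, 0.5))  # CP'ed: nearest neighbour is (1.0, "A") in every world
print(certain_prediction(train, candidates, 4.6))  # not CP'ed: the repair 5.0 flips the label to "B"
```

This enumeration is exponential in the number of dirty cells; the point of the paper is that for k-NN the same answer can be computed without materialising the possible worlds.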
As a by-product, there is an interesting connection between CPClean and robustness (especially randomised smoothing). We show that it is possible to extend the idea behind CPClean to provide the first provably robust defence to backdoor attacks.
Input: An augmented, machine-readable dataset.
Output: The system’s belief about the best possible KNN accuracy after data cleaning.
Action: (1) If the potential is high, proceed to state-of-the-art data cleaning tools and rerun the loop starting from ease.ml/DataMagic; (2) if the potential is low, proceed to Market to acquire more data.
[VLDB] Proceedings of the VLDB Endowment
Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of “Certain Predictions” (CP): a test data example can be certainly predicted (CP’ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP’ed; and (Q2) a counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions about the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed: we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of “data cleaning for machine learning (DC for ML).” We show that our proposed CPClean approach, built on CP, can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% of the gap on average by cleaning only 36% of the dirty data, while the best automatic cleaning approach, BoostClean, can only close 14% of the gap on average.
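The counting query (Q2) can likewise be illustrated by brute force on a tiny made-up dataset: tally how many possible worlds support each label for a 1-NN classifier. The paper's contribution is answering this efficiently; this sketch enumerates all worlds and is purely illustrative.

```python
from collections import Counter
from itertools import product

# (feature, label) training pairs; None marks a dirty cell.
train = [(1.0, "A"), (None, "B"), (4.0, "A"), (None, "B")]
# Candidate repairs for the two dirty cells, in order of appearance.
candidates = [[0.2, 3.0], [2.5, 6.0]]

def predict_1nn(world, x):
    return min(world, key=lambda p: abs(p[0] - x))[1]

def count_support(train, candidates, x):
    dirty = [i for i, (f, _) in enumerate(train) if f is None]
    tally = Counter()
    for choice in product(*candidates):  # 2 x 2 = 4 possible worlds
        world = list(train)
        for i, v in zip(dirty, choice):
            world[i] = (v, world[i][1])
        tally[predict_1nn(world, x)] += 1
    return tally

print(count_support(train, candidates, 2.2))  # 3 of the 4 worlds predict "B"
```

Intuitively, the counts act as votes: the larger the majority across possible worlds, the less the prediction depends on how the dirty cells are actually repaired.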
[ICDE] IEEE International Conference on Data Engineering
Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there does not exist a rigorous study on how exactly cleaning affects ML: the ML community usually focuses on developing ML algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied the problem of data cleaning alone, without considering how data is consumed by downstream ML analytics. We propose CleanML, a study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both commonly used algorithms in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a systematic way to derive many interesting and nontrivial observations, and we put forward multiple research directions for researchers.
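For readers unfamiliar with the BY procedure mentioned above, here is a pure-Python sketch of the Benjamini-Yekutieli step-up rule: sort the p-values, find the largest rank i with p_(i) <= i * alpha / (m * c(m)), where c(m) is the harmonic correction factor, and reject the i smallest p-values. The p-values in the example are made up for illustration.

```python
def benjamini_yekutieli(pvalues, alpha=0.05):
    m = len(pvalues)
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic correction factor c(m)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank i whose sorted p-value clears the BY threshold
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    rejected = set(order[:k_max])  # reject the k_max smallest p-values
    return [i in rejected for i in range(m)]

print(benjamini_yekutieli([0.001, 0.02, 0.4, 0.6]))  # only the first test survives BY correction
```

The c(m) factor makes BY more conservative than Benjamini-Hochberg, in exchange for controlling the false discovery rate under arbitrary dependence between tests.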