The Project


Machine learning is no panacea: if your data are 20% wrong but you are hoping for 90% accuracy to make a profit from your model, the project is very likely doomed to fail. In our experience, we were surprised by how often end users have a staggering mismatch between the quality of their data and the accuracy they expect an ML model to achieve.

Our approach to this problem is to offer the end user an automatic feasibility study: given a dataset and a target accuracy, we provide a best-effort “belief” on whether the target is achievable, before the user fires up expensive ML processes, much as real-world ML consultants do with their customers. Of course, such a “belief” will never be perfect, but we hope that providing this signal will help end users better calibrate their expectations.

From a technical perspective, what we are estimating is the Bayes error, a fundamental ML concept: the minimum error rate that any classifier can achieve on the task. In this project, we designed a simple yet effective Bayes error estimator, enabled by recent advances in representation learning and the increasing availability of pre-trained feature embeddings.

Input: (1) Augmented, machine-readable dataset; (2) Target accuracy.

Output: {Feasible, Not Feasible} as the belief of the system.

Action: (1) If “Feasible”, proceeds to; (2) If “Not Feasible”, proceeds to CPClean.
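
The input/output contract above can be illustrated with a small sketch. It is not the system's actual estimator: it assumes the classical Cover and Hart inequality, which turns the (asymptotic) error of a 1-nearest-neighbor classifier into a lower bound on the Bayes error, and it treats a measured 1NN error as an approximation of that asymptotic value. The function names `cover_hart_lower_bound` and `feasibility_belief` are hypothetical and chosen only for illustration.

```python
import math


def cover_hart_lower_bound(nn_error: float, num_classes: int) -> float:
    """Lower bound on the Bayes error implied by a 1NN error rate.

    Assumption: this inverts the Cover-Hart inequality
    R_1NN <= R* (2 - C / (C - 1) * R*), which holds asymptotically.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    c = num_classes
    # Clamp the term under the square root: a finite-sample 1NN error can
    # exceed the asymptotic maximum of (c - 1) / c.
    inner = max(0.0, 1.0 - c / (c - 1) * nn_error)
    return (c - 1) / c * (1.0 - math.sqrt(inner))


def feasibility_belief(ber_estimate: float, target_accuracy: float) -> str:
    """Compare the best achievable accuracy (1 - Bayes error) with the target."""
    best_achievable = 1.0 - ber_estimate
    return "Feasible" if best_achievable >= target_accuracy else "Not Feasible"


if __name__ == "__main__":
    # Hypothetical numbers: a 10-class task with a measured 1NN error of 25%.
    ber = cover_hart_lower_bound(nn_error=0.25, num_classes=10)
    print(feasibility_belief(ber, target_accuracy=0.9))
```

Because the bound is a lower bound on the Bayes error, a “Not Feasible” answer from this sketch is the more informative one: if even the optimistic estimate falls short of the target, no classifier is expected to reach it on the given data.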


2020 Towards Automatic Feasibility Study for Machine Learning Applications
C Renggli, L Rimanić, L Kolar, N Hollenstein, W Wu, C Zhang
[arXiv] arXiv preprint arXiv:2010.08410

In our experience of working with domain experts who are using today’s AutoML systems, a common problem we encountered is what we call “unrealistic expectations”: users face a very challenging task with a noisy data acquisition process, whilst being expected to achieve startlingly high accuracy with machine learning (ML). Consequently, many computationally expensive AutoML runs and labour-intensive ML development processes are predestined to fail from the beginning. In traditional software engineering, this problem is addressed via a feasibility study, an indispensable step before developing any software system. In this paper, we present a system with the goal of performing an automatic feasibility study before building ML applications or collecting too many samples. A user provides inputs in the form of a dataset, which is representative of the task and data acquisition process, and a quality target (e.g., expected accuracy > 0.8). The system returns its deduction on whether this target is achievable using ML given the input data. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error. The key technical contribution of this work is the design of a practical Bayes error estimator. We carefully evaluate the benefits and limitations of running the system prior to training ML models on datasets that are too noisy to reach the desired target accuracy. By including the automatic feasibility study in the iterative label cleaning process, users are able to save substantial labeling time and monetary effort.

On Convergence of Nearest Neighbor Classifiers over Feature Transformations
L Rimanić, C Renggli, B Li, C Zhang
[NeurIPS] Advances in Neural Information Processing Systems

The k-Nearest Neighbors (kNN) classifier is a fundamental non-parametric machine learning algorithm. However, it is well known that it suffers from the curse of dimensionality, which is why in practice one often applies a kNN classifier on top of a (pre-trained) feature transformation. From a theoretical perspective, most, if not all theoretical results aimed at understanding the kNN classifier are derived for the raw feature space. This leads to an emerging gap between our theoretical understanding of kNN and its practical applications.

In this paper, we take a first step towards bridging this gap. We provide a novel analysis on the convergence rates of a kNN classifier over transformed features. This analysis requires in-depth understanding of the properties that connect both the transformed space and the raw feature space. More precisely, we build our convergence bound upon two key properties of the transformed space: (1) safety – how well can one recover the raw posterior from the transformed space, and (2) smoothness – how complex this recovery function is. Based on our result, we are able to explain why some (pre-trained) feature transformations are better suited for a kNN classifier than others. We empirically validate that both properties have an impact on the kNN convergence on 30 feature transformations with 6 benchmark datasets spanning from the vision to the text domain.

in action: Towards automatic feasibility analysis for machine learning application development
C Renggli, L Rimanić, L Kolar, W Wu, C Zhang
[VLDB Demo] Proceedings of the VLDB Endowment

We demonstrate a data analytics system that performs feasibility analysis for machine learning (ML) applications before they are developed. Given a performance target of an ML application (e.g., accuracy above 0.95), the system provides a decisive answer to ML developers regarding whether the target is achievable or not. We formulate the feasibility analysis problem as an instance of Bayes error estimation. That is, for the data (distribution) on which the ML application should be performed, the system provides an estimate of the Bayes error, the minimum error rate that can be achieved by any classifier. It is well known that estimating the Bayes error is a notoriously hard task. In this system, we explore and employ estimators based on the combination of (1) nearest neighbor (NN) classifiers and (2) pre-trained feature transformations. To the best of our knowledge, this is the first work on Bayes error estimation that combines (1) and (2). In today’s cost-driven business world, the feasibility of an ML project is an ideal piece of information for ML application developers: the system plays the role of a reliable “consultant.”
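
The combination of (1) and (2) can be sketched roughly as follows: compute the 1NN test error on top of each candidate feature transformation and keep the smallest one. This is only an illustration under stated assumptions. The transformations below are toy stand-ins rather than real pre-trained embeddings, the helper names (`one_nn_error`, `best_one_nn_error`) are hypothetical, and taking the minimum over transformations is one plausible aggregation, not necessarily the one used in the demonstrated system.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Transform = Callable[[Vector], Vector]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def one_nn_error(train_x: List[Vector], train_y: List[int],
                 test_x: List[Vector], test_y: List[int]) -> float:
    """Fraction of test points misclassified by a 1-nearest-neighbor rule."""
    mistakes = 0
    for x, y in zip(test_x, test_y):
        nearest = min(range(len(train_x)),
                      key=lambda i: squared_distance(x, train_x[i]))
        if train_y[nearest] != y:
            mistakes += 1
    return mistakes / len(test_x)


def best_one_nn_error(transforms: Dict[str, Transform],
                      train_x: List[Vector], train_y: List[int],
                      test_x: List[Vector], test_y: List[int]) -> Tuple[str, float]:
    """Name and 1NN test error of the transformation with the lowest error."""
    errors = {}
    for name, f in transforms.items():
        errors[name] = one_nn_error([f(x) for x in train_x], train_y,
                                    [f(x) for x in test_x], test_y)
    best = min(errors, key=errors.get)
    return best, errors[best]


if __name__ == "__main__":
    # Toy data: the class depends only on the first coordinate; the second
    # coordinate is a large-scale distractor that misleads 1NN on raw features.
    train_x, train_y = [[0.0, 0.0], [1.0, 100.0]], [0, 1]
    test_x, test_y = [[0.1, 100.0], [0.9, 0.0]], [0, 1]
    transforms = {"raw": lambda v: v, "first_coordinate": lambda v: v[:1]}
    print(best_one_nn_error(transforms, train_x, train_y, test_x, test_y))
```

The toy example is built so that the raw features mislead the nearest-neighbor rule while the second transformation does not, which mirrors the motivation for evaluating several feature transformations instead of relying on a single representation.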


Evaluating Bayes Error Estimators on Real-World Datasets with FeeBee
C Renggli, L Rimanić, N Hollenstein, C Zhang
[NeurIPS Datasets and Benchmarks] Advances in Neural Information Processing Systems

The Bayes error rate (BER) is a fundamental concept in machine learning that quantifies the best possible accuracy any classifier can achieve on a fixed probability distribution. Despite years of research on building estimators of lower and upper bounds for the BER, these were usually compared only on synthetic datasets with known probability distributions, leaving two key questions unanswered: (1) How well do they perform on realistic, non-synthetic datasets?, and (2) How practical are they? Answering these is not trivial. Apart from the obvious challenge of an unknown BER for real-world datasets, there are two main aspects any BER estimator needs to overcome in order to be applicable in real-world settings: (1) the computational and sample complexity, and (2) the sensitivity and selection of hyper-parameters. In this work, we propose FeeBee, the first principled framework for analyzing and comparing BER estimators on modern real-world datasets with unknown probability distribution. We achieve this by injecting a controlled amount of label noise and performing multiple evaluations on a series of different noise levels, supported by a theoretical result which allows drawing conclusions about the evolution of the BER. By implementing and analyzing 7 multi-class BER estimators on 6 commonly used datasets of the computer vision and NLP domains, FeeBee allows a thorough study of these estimators, clearly identifying strengths and weaknesses of each, whilst being easily deployable on any future BER estimator.
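
The noise-injection idea can be sketched under one specific assumption about the noise model, which may differ from the one used in the paper: with probability ρ, a label is replaced by a label drawn uniformly from all C classes. Under that model the noisy posterior is (1 − ρ)·p(y|x) + ρ/C, so the BER evolves linearly as BER_ρ = BER + ρ·(1 − 1/C − BER). The function names below are hypothetical.

```python
import random
from typing import List


def inject_uniform_label_noise(labels: List[int], rho: float,
                               num_classes: int, seed: int = 0) -> List[int]:
    """With probability rho, replace each label by a uniformly random class.

    Assumed noise model: the replacement is drawn from all classes, so it may
    coincide with the original label.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_classes) if rng.random() < rho else y
            for y in labels]


def ber_under_uniform_noise(ber: float, rho: float, num_classes: int) -> float:
    """BER after uniform label noise: BER + rho * (1 - 1/C - BER)."""
    return ber + rho * (1.0 - 1.0 / num_classes - ber)


if __name__ == "__main__":
    clean = [0, 1, 2, 3, 4] * 4
    for rho in (0.0, 0.2, 0.4):
        noisy = inject_uniform_label_noise(clean, rho, num_classes=5)
        changed = sum(a != b for a, b in zip(clean, noisy))
        # Hypothetical clean BER of 5%, used only to show the linear growth.
        print(rho, changed, ber_under_uniform_noise(0.05, rho, num_classes=5))
```

Since the BER of the noisy dataset is known up to the unknown clean BER, an estimator can be judged by how closely its outputs follow this straight line across noise levels, even though the clean BER itself is never observed.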