The ease.ml Project

# Publications

This is a list of publications that resulted from the ease.ml project.

### 2018

ease.ml in action: towards multi-tenant declarative learning services
B Karlaš, J Liu, W Wu, C Zhang
[VLDB Demo] Proceedings of the VLDB Endowment
Abstract

We demonstrate ease.ml, a multi-tenant machine learning service we host at ETH Zurich for various research groups. Unlike existing machine learning services, ease.ml presents a novel architecture that supports multi-tenant, cost-aware model selection that optimizes for minimizing total regrets of all users. Moreover, it provides a novel user interface that enables declarative machine learning at a higher level: Users only need to specify the input/output schemata of their learning tasks and ease.ml can handle the rest. In this demonstration, we present the design principles of ease.ml, highlight the implementation of its key components, and showcase how ease.ml can help ease machine learning tasks that often perplex even experienced users.

Ease.ml: Towards multi-tenant resource sharing for machine learning workloads
T Li, J Zhong, J Liu, W Wu, C Zhang
[VLDB] Proceedings of the VLDB Endowment
Abstract

We present ease.ml, a declarative machine learning service platform. With ease.ml, a user defines the high-level schema of an ML application and submits the task via a Web interface. The system then deals with the rest, such as model selection and data movement. The ultimate question we hope to understand is that, as a “service provider” that manages a shared cluster of machines running machine learning workloads, what is the resource sharing strategy that maximizes the global satisfaction of all our users?

This paper does not completely answer this general question, but focuses on solving the first technical challenge we were facing when trying to build ease.ml. We observe that resource sharing is a critical yet subtle issue in this multi-tenant scenario, as we have to balance between efficiency and fairness. We first formalize the problem that we call multi-tenant model selection, aiming for minimizing the total regret of all users running automatic model selection tasks. We then develop a novel algorithm that combines multi-armed bandits with Bayesian optimization and prove a regret bound under the multi-tenant setting. Finally, we report our evaluation of ease.ml on synthetic data and on two services we are providing to our users, namely, image classification with deep neural networks and binary classification with Azure ML Studio. Our experimental evaluation results show that our proposed solution can be up to 9.8x faster in achieving the same global average accuracy for all users as the two popular heuristics used by our users before ease.ml, and 4.1 x faster than state-of-the-art systems.

### 2019

Towards efficient data valuation based on the shapley value
R Jia R, D Dao, B Wang, FA Hubis, N Hynes, NM Gürel NM, B Li, C Zhang, D Song, CJ Spanos
[AISTATS] International Conference on Artificial Intelligence and Statistics
Abstract

“How much is my data worth?” is an increasingly common question posed by organizations and individuals alike. An answer to this question could allow, for instance, fairly distributing profits among multiple data contributors and determining prospective compensation when data breaches happen. In this paper, we study the problem of data valuation by utilizing the Shapley value, a popular notion of value which originated in coopoerative game theory. The Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value. However, the Shapley value often requires exponential time to compute. To meet this challenge, we propose a repertoire of efficient algorithms for approximating the Shapley value. We also demonstrate the value of each training instance for various benchmark datasets.

Automl from service provider’s perspective: Multi-device, multi-tenant model selection with gp-ei
C Yu, B Karlaš, J Zhong, C Zhang, J Liu
[AISTATS] 22nd International Conference on Artificial Intelligence and Statistics
Abstract

AutoML has become a popular service that is provided by most leading cloud service providers today. In this paper, we focus on the AutoML problem from the\emph {service provider’s perspective}, motivated by the following practical consideration: When an AutoML service needs to serve {\em multiple users} with {\em multiple devices} at the same time, how can we allocate these devices to users in an efficient way? We focus on GP-EI, one of the most popular algorithms for automatic model selection and hyperparameter tuning, used by systems such as Google Vizer. The technical contribution of this paper is the first multi-device, multi-tenant algorithm for GP-EI that is aware of\emph {multiple} computation devices and multiple users sharing the same set of computation devices. Theoretically, given users and devices, we obtain a regret bound of $O ((\text {\bf {MIU}}(T, K)+ M)\frac {N^ 2}{M})$, where $\text {\bf {MIU}}(T, K)$ refers to the maximal incremental uncertainty up to time for the covariance matrix . Empirically, we evaluate our algorithm on two applications of automatic model selection, and show that our algorithm significantly outperforms the strategy of serving users independently. Moreover, when multiple computation devices are available, we achieve near-linear speedup when the number of users is much larger than the number of devices.

ease.ml/ci and ease.ml/meter in action: towards data management for statistical generalization
C Renggli, FA Hubis, B Karlaš, K Schawinski, W Wu, C Zhang
[VLDB Demo] Proceedings of the VLDB Endowment
Abstract

Developing machine learning (ML) applications is similar to developing traditional software — it is often an iterative process in which developers navigate within a rich space of requirements, design decisions, implementations, empirical quality, and performance. In traditional software development, software engineering is the field of study which provides principled guidelines for this iterative process. However, as of today, the counterpart of “software engineering for ML” is largely missing — developers of ML applications are left with powerful tools (e.g., TensorFlow and PyTorch) but little guidance regarding the development lifecycle itself.In this paper, we view the management of ML development life-cycles from a data management perspective. We demonstrate two closely related systems, ease.ml/ci and ease.ml/meter, that provide some “principled guidelines” for ML application development: ci is a continuous …

Efficient task-specific data valuation for nearest neighbor algorithms
R Jia, D Dao, B Wang, FA Hubis, NM Gürel, B Li, C Zhang, CJ Spanos, D Song
[VLDB] Proceedings of the VLDB Endowment
Abstract

Given a data set D containing millions of data points and a data consumer who is willing to pay for X to train a machine learning (ML) model over D, how should we distribute this \$X to each data point to reflect its “value”? In this paper, we define the “relative value of data” via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all N data points, it requires O(2N) model evaluations for exact computation and O(N log N) for (ϵ, δ)-approximation.

In this paper, we focus on one popular family of ML models relying on K-nearest neighbors (KNN). The most surprising result is that for unweighted KNN classifiers and regressors, the Shapley value of all N data points can be computed, exactly, in O(N log N) time - an exponential improvement on computational complexity! Moreover, for (ϵ, δ)-approximation, we are able to develop an algorithm based on Locality Sensitive Hashing (LSH) with only sublinear complexity O(Nh(ϵ, K) log N) when ϵ is not too small and K is not too large. We empirically evaluate our algorithms on up to 10 million data points and even our exact algorithm is up to three orders of magnitude faster than the baseline approximation algorithm. The LSH-based approximation algorithm can accelerate the value calculation process even further.

We then extend our algorithm to other scenarios such as (1) weighed KNN classifiers, (2) different data points are clustered by different data curators, and (3) there are data analysts providing computation who also requires proper valuation. Some of these extensions, although also being improved exponentially, are less practical for exact computation (e.g., O(NK) complexity for weigthed KNN). We thus propose an Monte Carlo approximation algorithm, which is O(N(log N)2/(log K)2) times more efficient than the baseline approximation algorithm.

Continuous Integration of Machine Learning Models with ease.ml/ci: A Rigorous Yet Practical Treatment
C Renggli, B Karlaš, B Ding, F Liu, K Schawinski, W Wu, C Zhang
[MLSYS] Proceedings of Machine Learning and Systems
Abstract

Continuous integration is an indispensable step of modern software engineering practices to systematically manage the life cycles of system development. Developing a machine learning model is no difference — it is an engineering process with a life cycle, including design, implementation, tuning, testing, and deployment. However, most, if not all, existing continuous integration engines do not support machine learning as first-class citizens. In this paper, we present ease.ml/ci, to our best knowledge, the first continuous integration system for machine learning. The challenge of building ease.ml/ci is to provide rigorous guarantees, e.g., single accuracy point error tolerance with 0.999 reliability, with a practical amount of labeling effort, e.g., 2K labels per test. We design a domain specific language that allows users to specify integration conditions with reliability constraints, and develop simple novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions popularly used in real production systems.

### 2020

CleanML: a study for evaluating the impact of data cleaning on ml classification tasks
P Li, X Rao, J Blase, Y Zhang, X Chu, C Zhang
[ICDE] IEEE International Conference on Data Engineering
Abstract

Data quality affects machine learning (ML) model performances, and data scientists spend considerable amount of time on data cleaning before model training. However, to date, there does not exist a rigorous study on how exactly cleaning affects ML - ML community usually focuses on developing ML algorithms that are robust to some particular noise types of certain distributions, while database (DB) community has been mostly studying the problem of data cleaning alone without considering how data is consumed by downstream ML analytics.We propose a CleanML study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both commonly used algorithms in practice as well as state-of-the-art solutions in academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we also control false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a systematic way to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers.

Nearest neighbor classifiers over incomplete information: From certain answers to certain predictions
B Karlaš, P Li, R Wu, NM Gürel, X Chu, W Wu, C Zhang
[VLDB] Proceedings of the VLDB Endowment
Abstract

Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of “Certain Predictions” (CP) — a test data example can be certainly predicted (CP’ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) checking query that determines whether a data example can be CP’ed; and (Q2) counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumption over the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed — we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of “data cleaning for machine learning (DC for ML).” We show that our proposed CPClean approach built based on CP can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% gap on average by cleaning 36% of dirty data on average, while the best automatic cleaning approach BoostClean can only close 14% gap on average.

Building continuous integration services for machine learning
B Karlaš, M Interlandi, C Renggli, W Wu, C Zhang, DMI Babu, J Edwards, C Lauren, A Xu, M Weimer
[SIGKDD] Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
Abstract

Continuous integration (CI) has been a de facto standard for building industrial-strength software. Yet, there is little attention towards applying CI to the development of machine learning (ML) applications until the very recent effort on the theoretical side. In this paper, we take a step forward to bring the theory into practice.

Ease.ml/snoopy in action: Towards automatic feasibility analysis for machine learning application development
C Renggli, L Rimanić, L Kolar, W Wu, C Zhang
[VLDB Demo] Proceedings of the VLDB Endowment
Abstract

We demonstrate ease.ml/snoopy, a data analytics system that performs feasibility analysis for machine learning (ML) applications before they are developed. Given a performance target of an ML application (e.g., accuracy above 0.95), ease.ml/snoopy provides a decisive answer to ML developers regarding whether the target is achievable or not. We formulate the feasibility analysis problem as an instance of Bayes error estimation. That is, for a data (distribution) on which the ML application should be performed, ease.ml/snoopy provides an estimate of the Bayes error - the minimum error rate that can be achieved by any classifier. It is well-known that estimating the Bayes error is a notoriously hard task. In ease.ml/snoopy we explore and employ estimators based on the combination of (1) nearest neighbor (NN) classifiers and (2) pre-trained feature transformations. To the best of our knowledge, this is the first work on Bayes error estimation that combines (1) and (2). In today’s cost-driven business world, feasibility of an ML project is an ideal piece of information for ML application developers - ease.ml/snoopy plays the role of a reliable “consultant.”

Control, Generate, Augment: A Scalable Framework for Multi-​Attribute Text Generation
G Russo, N Hollenstein, C Musat, C Zhang
[EMNLP] Findings of the Association for Computational Linguistics
Abstract

We introduce CGA, a conditional VAE architecture, to control, generate, and augment text. CGA is able to generate natural English sentences controlling multiple semantic and syntactic attributes by combining adversarial learning with a context-aware loss and a cyclical word dropout routine. We demonstrate the value of the individual model components in an ablation study. The scalability of our approach is ensured through a single discriminator, independently of the number of attributes. We show high quality, diversity and attribute control in the generated sentences through a series of automatic and human assessments. As the main application of our work, we test the potential of this new NLG model in a data augmentation scenario. In a downstream NLP task, the sentences generated by our CGA model show significant improvements over a strong baseline, and a classification performance often comparable to adding same amount of additional real data.

On Convergence of Nearest Neighbor Classifiers over Feature Transformations
L Rimanić, C Renggli, B Li, C Zhang
[NeurIPS] Advances in Neural Information Processing Systems
Abstract

The k-Nearest Neighbors (kNN) classifier is a fundamental non-parametric machine learning algorithm. However, it is well known that it suffers from the curse of dimensionality, which is why in practice one often applies a kNN classifier on top of a (pre-trained) feature transformation. From a theoretical perspective, most, if not all theoretical results aimed at understanding the kNN classifier are derived for the raw feature space. This leads to an emerging gap between our theoretical understanding of kNN and its practical applications.

In this paper, we take a first step towards bridging this gap. We provide a novel analysis on the convergence rates of a kNN classifier over transformed features. This analysis requires in-depth understanding of the properties that connect both the transformed space and the raw feature space. More precisely, we build our convergence bound upon two key properties of the transformed space: (1) safety – how well can one recover the raw posterior from the transformed space, and (2) smoothness – how complex this recovery function is. Based on our result, we are able to explain why some (pre-trained) feature transformations are better suited for a kNN classifier than others. We empirically validate that both properties have an impact on the kNN convergence on 30 feature transformations with 6 benchmark datasets spanning from the vision to the text domain.

Ease.ml/snoopy: Towards Automatic Feasibility Study for Machine Learning Applications
C Renggli, L Rimanić, L Kolar, N Hollenstein, W Wu, C Zhang
[arXiv] arXiv preprint arXiv:2010.08410
Abstract

In our experience of working with domain experts who are using today’s AutoML systems, a common problem we encountered is what we call “unrealistic expectations” – when users are facing a very challenging task with noisy data acquisition process, whilst being expected to achieve startlingly high accuracy with machine learning (ML). Consequently, many computationally expensive AutoML runs and labour-intensive ML development processes are predestined to fail from the beginning. In traditional software engineering, this problem is addressed via a feasibility study, an indispensable step before developing any software system. In this paper, we present ease.ml/snoopy with the goal of preforming an automatic feasibility study before building ML applications or collecting too many samples. A user provides inputs in the form of a dataset, which is representative for the task and data acquisition process, and a quality target (e.g., expected accuracy > 0.8). The system returns its deduction on whether this target is achievable using ML given the input data. We approach this problem by estimating the irreducible error of the underlying task, also known as Bayes error. The technical key contribution of this work is the design of a practical Bayes error estimator. We carefully evaluate the benefits and limitations of running ease.ml/snoopy prior to training ML models on too noisy datasets for reaching the desired target accuracy. By including the automatic feasibility study into the iterative label cleaning process, users are able to save substantial labeling time and monetary efforts.

Docparser: Hierarchical structure parsing of document renderings
J Rausch, O Martinez, F Bissig, C Zhang, S Feuerriegel
[AAAI] Proceedings of the AAAI Conference on Artificial Intelligence
Abstract

Translating renderings (e. g. PDFs, scans) into hierarchical document structures is extensively demanded in the daily routines of many real-world applications. However, a holistic, principled approach to inferring the complete hierarchical structure of documents is missing. As a remedy, we developed “DocParser”: an end-to-end system for parsing the complete document structure - including all text elements, nested figures, tables, and table cell structures. Our second contribution is to provide a dataset for evaluating hierarchical document structure parsing. Our third contribution is to propose a scalable learning framework for settings where domain-specific data are scarce, which we address by a novel approach to weak supervision that significantly improves the document structure parsing performance. Our experiments confirm the effectiveness of our proposed weak supervision: Compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 39.1 % and improves the F1 score of classifying hierarchical relations by 35.8 %.

Efficient Automatic CASH via Rising Bandits
Y Li, J Jiang, J Gao, Y Shao, C Zhang, B Cui
[AAAI] Proceedings of the AAAI Conference on Artificial Intelligence
Abstract

The Combined Algorithm Selection and Hyperparameter optimization (CASH) is one of the most fundamental problems in Automatic Machine Learning (AutoML). The existing Bayesian optimization (BO) based solutions turn the CASH problem into a Hyperparameter Optimization (HPO) problem by combining the hyperparameters of all machine learning (ML) algorithms, and use BO methods to solve it. As a result, these methods suffer from the low-efficiency problem due to the huge hyperparameter space in CASH. To alleviate this issue, we propose the alternating optimization framework, where the HPO problem for each ML algorithm and the algorithm selection problem are optimized alternately. In this framework, the BO methods are used to solve the HPO problem for each ML algorithm separately, incorporating a much smaller hyperparameter space for BO methods. Furthermore, we introduce Rising Bandits, a CASH-oriented Multi-Armed Bandits (MAB) variant, to model the algorithm selection in CASH. This framework can take the advantages of both BO in solving the HPO problem with a relatively small hyperparameter space and the MABs in accelerating the algorithm selection. Moreover, we further develop an efficient online algorithm to solve the Rising Bandits with provably theoretical guarantees. The extensive experiments on 30 OpenML datasets demonstrate the superiority of the proposed approach over the competitive baselines.

TextNAS: A neural architecture search space tailored for text representation
Y Wang, Y Yang, Y Chen, J Bai, C Zhang, G Su, X Kou, Y Tong, M Yang, L Zhou
[AAAI] Proceedings of the AAAI Conference on Artificial Intelligence
Abstract

Learning text representation is crucial for text classification and other language related tasks. There are a diverse set of text representation networks in the literature, and how to find the optimal one is a non-trivial problem. Recently, the emerging Neural Architecture Search (NAS) techniques have demonstrated good potential to solve the problem. Nevertheless, most of the existing works of NAS focus on the search algorithms and pay little attention to the search space. In this paper, we argue that the search space is also an important human prior to the success of NAS in different applications. Thus, we propose a novel search space tailored for text representation. Through automatic search, the discovered network architecture outperforms state-of-the-art models on various public datasets on text classification and natural language inference tasks. Furthermore, some of the design principles found in the automatic network agree well with human intuition.

### 2021

Ease.ML: A Lifecycle Management System for MLDev and MLOps
LA Melgar, D Dao, S Gan, NM Gürel, N Hollenstein, J Jiang, B Karlaš, T Lemmin, T Li, Y Li, S Rao, J Rausch, C Renggli, L Rimanic, M Weber, S Zhang, Z Zhao, K Schawinski, W Wu, C Zhang
[CIDR] Conference on Innovative Data Systems Research
Abstract

We present Ease.ML, a lifecycle management system for machine learning (ML). Unlike many existing works, which focus on improving individual steps during the lifecycle of ML application development, Ease.ML focuses on managing and automating the entire lifecycle itself. We present user scenarios that have motivated the development of Ease.ML, the eight-step Ease.ML process that covers the lifecycle of ML application development; the foundation of Ease.ML in terms of a probabilistic database model and its connection to information theory; and our lessons learned, which hopefully can inspire future research.

VolcanoML: speeding up end-to-end AutoML via scalable search space decomposition
Y Li, Y Shen, W Zhang, J Jiang, B Ding, Y Li, J Zhou, Z Yang, W Wu, C Zhang, B Cui
[VLDB] Proceedings of the VLDB Endowment
Abstract

End-to-end AutoML has attracted intensive interests from both academia and industry, which automatically searches for ML pipelines in a space induced by feature engineering, algorithm/model selection, and hyper-parameter tuning. Existing AutoML systems, however, suffer from scalability issues when applying to application domains with large, high-dimensional search spaces. We present VolcanoML, a scalable and extensible framework that facilitates systematic exploration of large AutoML search spaces. VolcanoML introduces and implements basic building blocks that decompose a large search space into smaller ones, and allows users to utilize these building blocks to compose an execution plan for the AutoML problem at hand. VolcanoML further supports a Volcano-style execution model - akin to the one supported by modern database systems - to execute the plan constructed. Our evaluation demonstrates that, not only does VolcanoML raise the level of expressiveness for search space decomposition in AutoML, it also leads to actual findings of decomposition strategies that are significantly more efficient than the ones employed by state-of-the-art AutoML systems such as auto-sklearn.

An empirical and comparative analysis of data valuation with scalable algorithms
R Jia, X Sun, J Xu, C Zhang, B Li, D Song
[CVPR] Conference on Computer Vision and Pattern Recognition
Abstract

Quantifying the importance of each training point to a learning task is a fundamental problem in machine learning and the estimated importance scores have been leveraged to guide a range of data workflows such as data summarization and domain adaption. One simple idea is to use the leave-one-out error of each training point to indicate its importance. Recent work has also proposed to use the Shapley value, as it defines a unique value distribution scheme that satisfies a set of appealing properties. However, calculating Shapley values is often expensive, which limits its applicability in real-world applications at scale. Multiple heuristics to improve the scalability of calculating Shapley values have been proposed recently, with the potential risk of compromising their utility in real-world applications.

How well do existing data quantification methods perform on existing workflows? How do these methods compare with each other, empirically and theoretically? Must we sacrifice scalability for the utility in these workflows when using these methods? In this paper, we conduct a novel theoretical analysis comparing the utility of different importance quantification methods, and report extensive experimental studies on existing and proposed workflows such as noisy label detection, watermark removal, data summarization, data acquisition, and domain adaptation. We show that Shapley value approximation based on a KNN surrogate over pre-trained feature embeddings obtains comparable utility with existing algorithms while achieving significant scalability improvement, often by orders of magnitude. Our theoretical analysis also justifies its advantage over the leave-one-out error.

Evaluating Bayes Error Estimators on Real-World Datasets with FeeBee
C Renggli, L Rimanić, N Hollenstein, C Zhang
[NeurIPS Datasets and Benchmarks] Advances in Neural Information Processing Systems
Abstract

The Bayes error rate (BER) is a fundamental concept in machine learning that quantifies the best possible accuracy any classifier can achieve on a fixed probability distribution. Despite years of research on building estimators of lower and upper bounds for the BER, these were usually compared only on synthetic datasets with known probability distributions, leaving two key questions unanswered: (1) How well do they perform on realistic, non-synthetic datasets?, and (2) How practical are they? Answering these is not trivial. Apart from the obvious challenge of an unknown BER for real-world datasets, there are two main aspects any BER estimator needs to overcome in order to be applicable in real-world settings: (1) the computational and sample complexity, and (2) the sensitivity and selection of hyper-parameters.In this work, we propose FeeBee, the first principled framework for analyzing and comparing BER estimators on modern real-world datasets with unknown probability distribution. We achieve this by injecting a controlled amount of label noise and performing multiple evaluations on a series of different noise levels, supported by a theoretical result which allows drawing conclusions about the evolution of the BER. By implementing and analyzing 7 multi-class BER estimators on 6 commonly used datasets of the computer vision and NLP domains, FeeBee allows a thorough study of these estimators, clearly identifying strengths and weaknesses of each, whilst being easily deployable on any future BER estimator.

Online Active Model Selection for Pre-trained Classifiers
MR Karimi, NM Gürel, B Karlaš, J Rausch, C Zhang, A Krause
[AISTATS] International Conference on Artificial Intelligence and Statistics
Abstract

Given pre-trained classifiers and a stream of unlabeled data examples, how can we actively decide when to query a label so that we can distinguish the best model from the rest while making a small number of queries? Answering this question has a profound impact on a range of practical scenarios. In this work, we design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round. Our algorithm can also be used for online prediction tasks for both adversarial and stochastic streams. We establish several theoretical guarantees for our algorithm and extensively demonstrate its effectiveness in our experimental studies.

### 2022

Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines
B Karlaš, D Dao, M Interlandi, B Li, S Schelter, W Wu, C Zhang
[arXiv] arXiv preprint arXiv:2204.11131
Abstract

Developing modern machine learning (ML) applications is data-centric, of which one fundamental challenge is to understand the influence of data quality to ML training – “Which training examples are ‘guilty’ in making the trained ML model predictions inaccurate or unfair?” Modeling data influence for ML training has attracted intensive interest over the last decade, and one popular framework is to compute the Shapley value of each training example with respect to utilities such as validation accuracy and fairness of the trained ML model. Unfortunately, despite recent intensive interest and research, existing methods only consider a single ML model “in isolation” and do not consider an end-to-end ML pipeline that consists of data transformations, feature extractors, and ML training.

We present DataScope (ease.ml/datascope), the first system that efficiently computes Shapley values of training examples over an end-to-end ML pipeline, and illustrate its applications in data debugging for ML training. To this end, we first develop a novel algorithmic framework that computes Shapley value over a specific family of ML pipelines that we call canonical pipelines: a positive relational algebra query followed by a K-nearest-neighbor (KNN) classifier. We show that, for many subfamilies of canonical pipelines, computing Shapley value is in PTIME, contrasting the exponential complexity of computing Shapley value in general. We then put this to practice – given an sklearn pipeline, we approximate it with a canonical pipeline to use as a proxy. We conduct extensive experiments illustrating different use cases and utilities. Our results show that DataScope is up to four orders of magnitude faster over state-of-the-art Monte Carlo-based methods, while being comparably, and often even more, effective in data debugging.