The ease.ml Project

DataScope

If data cleaning won’t increase your accuracy by much, another potential reason for unsatisfactory ML quality is simply that you don’t have enough data. If CPClean advises the user against data cleaning, she needs to acquire more data. Market is the next ease.ml component that helps the user with this.


We focus on a specific case of data acquisition: given a pool of data examples, how can we choose which ones to include (and to label, if they are unlabelled) so as to maximize the accuracy of ML models? In principle, we hope to pick out those data examples that are most “valuable” to the downstream ML models.

In Market, we decide whether a data example is valuable by using the Shapley value, a well-established concept in game theory, treating the accuracy of the ML model as the utility. Unfortunately, simply evaluating the Shapley value can be expensive: for generic ML models, the cost is exponential in the number of data examples. Fortunately, we show that, for k-nearest neighbour classifiers, the exact Shapley value can be computed in nearly linear time! In Market, we therefore use k-nearest neighbour classifiers as a proxy and provide guidance to the user on which new data examples to acquire.
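
To make the k-nearest-neighbour shortcut concrete, below is a minimal sketch in Python (an illustration only, not the actual Market/DataScope implementation; the function and variable names are ours) of the exact Shapley computation for an unweighted K-NN classifier and a single test point, following the result from the VLDB 2019 paper listed under Publications: sort the training points by distance to the test point, then fill in the values from the farthest point to the nearest with a simple recursion.

import numpy as np

def knn_shapley_single_test(X_train, y_train, x_test, y_test, K):
    # Exact Shapley value of every training point for an unweighted K-NN
    # classifier, where the utility is the fraction of the K nearest
    # neighbours whose label matches y_test. Sorting dominates the cost,
    # so one test point takes O(N log N) time.
    N = len(X_train)
    order = np.argsort(np.linalg.norm(X_train - x_test, axis=1))  # nearest first
    match = (y_train[order] == y_test).astype(float)              # 1 if label agrees
    s = np.zeros(N)
    s[order[N - 1]] = match[N - 1] / N                            # farthest point
    for i in range(N - 2, -1, -1):                                # walk towards the nearest
        s[order[i]] = (s[order[i + 1]]
                       + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1))
    return s

Averaging these per-test-point values over a validation set gives one score per training example; applying the same idea to candidate examples is what allows the system to rank data worth acquiring.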

Input: (1) Augmented, machine-readable dataset; (2) A pool of potential data examples that one can acquire.

Output: The system’s belief about the value of each potential data example that the user could acquire.

Action: The user acquires more data and reruns the loop starting from ease.ml/DataMagic.
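
As a hypothetical end-to-end illustration (again not the DataScope interface; names such as X_pool, X_val and rank_pool_for_acquisition are made up, and the sketch assumes the pool examples already carry labels), the per-test-point routine sketched above can be turned into an acquisition ranking as follows:

def rank_pool_for_acquisition(X_train, y_train, X_pool, y_pool, X_val, y_val, K=5):
    # Value every point of (current data + candidate pool) against a small
    # validation set and return pool indices, most valuable first.
    X_all = np.vstack([X_train, X_pool])
    y_all = np.concatenate([y_train, y_pool])
    values = np.zeros(len(X_all))
    for x_v, y_v in zip(X_val, y_val):
        values += knn_shapley_single_test(X_all, y_all, x_v, y_v, K)
    values /= len(X_val)
    return np.argsort(-values[len(X_train):])

A real deployment would additionally account for the acquisition budget and labelling cost, which this sketch ignores.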

Publications

2024

Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines
B Karlaš, D Dao, M Interlandi, S Schelter, W Wu, C Zhang
[ICLR] The Twelfth International Conference on Learning Representations
Abstract

When a machine learning (ML) model exhibits poor quality (e.g., poor accuracy or fairness), the problem can often be traced back to errors in the training data. Being able to discover the data examples that are the most likely culprits is a fundamental concern that has received a lot of attention recently. One prominent way to measure “data importance” with respect to model quality is the Shapley value. Unfortunately, existing methods only focus on the ML model in isolation, without considering the broader ML pipeline for data preparation and feature extraction, which appears in the majority of real-world ML code. This presents a major limitation to applying existing methods in practical settings. In this paper, we propose Datascope, a method for efficiently computing Shapley-based data importance over ML pipelines. We introduce several approximations that lead to dramatic improvements in terms of computational speed. Finally, our experimental evaluation demonstrates that our methods are capable of data error discovery that is as effective as existing Monte Carlo baselines, and in some cases even outperform them.

2021

An empirical and comparative analysis of data valuation with scalable algorithms
R Jia, X Sun, J Xu, C Zhang, B Li, D Song
[CVPR] Conference on Computer Vision and Pattern Recognition
Abstract

Quantifying the importance of each training point to a learning task is a fundamental problem in machine learning, and the estimated importance scores have been leveraged to guide a range of data workflows such as data summarization and domain adaptation. One simple idea is to use the leave-one-out error of each training point to indicate its importance. Recent work has also proposed to use the Shapley value, as it defines a unique value distribution scheme that satisfies a set of appealing properties. However, calculating Shapley values is often expensive, which limits their applicability in real-world applications at scale. Multiple heuristics to improve the scalability of calculating Shapley values have been proposed recently, with the potential risk of compromising their utility in real-world applications.

How well do existing data quantification methods perform on existing workflows? How do these methods compare with each other, empirically and theoretically? Must we sacrifice scalability for utility in these workflows when using these methods? In this paper, we conduct a novel theoretical analysis comparing the utility of different importance quantification methods, and report extensive experimental studies on existing and proposed workflows such as noisy label detection, watermark removal, data summarization, data acquisition, and domain adaptation. We show that Shapley value approximation based on a KNN surrogate over pre-trained feature embeddings obtains utility comparable to that of existing algorithms while achieving significant scalability improvements, often by orders of magnitude. Our theoretical analysis also justifies its advantage over the leave-one-out error.

2019

Towards efficient data valuation based on the Shapley value
R Jia, D Dao, B Wang, FA Hubis, N Hynes, NM Gürel, B Li, C Zhang, D Song, CJ Spanos
[AISTATS] International Conference on Artificial Intelligence and Statistics
Abstract

“How much is my data worth?” is an increasingly common question posed by organizations and individuals alike. An answer to this question could allow, for instance, fairly distributing profits among multiple data contributors and determining prospective compensation when data breaches happen. In this paper, we study the problem of data valuation by utilizing the Shapley value, a popular notion of value which originated in cooperative game theory. The Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value. However, the Shapley value often requires exponential time to compute. To meet this challenge, we propose a repertoire of efficient algorithms for approximating the Shapley value. We also demonstrate the value of each training instance for various benchmark datasets.
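
For readers unfamiliar with it, the Shapley value assigns to each data point i its average marginal contribution to the utility U (e.g., validation accuracy) over all subsets S of the remaining points in the training set D of size N; in standard notation (not taken from the abstract above):

φ_i = Σ_{S ⊆ D \ {i}} [ |S|! (N − |S| − 1)! / N! ] · ( U(S ∪ {i}) − U(S) )

Because the sum ranges over all 2^(N−1) subsets, exact evaluation is exponential in N for a generic utility, which is precisely what the approximation algorithms above target.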

Efficient task-specific data valuation for nearest neighbor algorithms
R Jia, D Dao, B Wang, FA Hubis, NM Gürel, B Li, C Zhang, CJ Spanos, D Song
[VLDB] Proceedings of the VLDB Endowment
Abstract

Given a data set D containing millions of data points and a data consumer who is willing to pay $X to train a machine learning (ML) model over D, how should we distribute this $X to each data point to reflect its “value”? In this paper, we define the “relative value of data” via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all N data points, it requires O(2^N) model evaluations for exact computation and O(N log N) for (ϵ, δ)-approximation.

In this paper, we focus on one popular family of ML models relying on K-nearest neighbors (KNN). The most surprising result is that for unweighted KNN classifiers and regressors, the Shapley value of all N data points can be computed, exactly, in O(N log N) time, an exponential improvement in computational complexity! Moreover, for (ϵ, δ)-approximation, we are able to develop an algorithm based on Locality Sensitive Hashing (LSH) with only sublinear complexity O(N^{h(ϵ, K)} log N) when ϵ is not too small and K is not too large. We empirically evaluate our algorithms on up to 10 million data points, and even our exact algorithm is up to three orders of magnitude faster than the baseline approximation algorithm. The LSH-based approximation algorithm can accelerate the value calculation process even further.

We then extend our algorithm to other scenarios such as (1) weighted KNN classifiers, (2) settings where different data points are clustered by different data curators, and (3) settings with data analysts providing computation who also require proper valuation. Some of these extensions, although also exponentially improved, are less practical for exact computation (e.g., O(N^K) complexity for weighted KNN). We thus propose a Monte Carlo approximation algorithm, which is O(N (log N)^2 / (log K)^2) times more efficient than the baseline approximation algorithm.