opendataval.dataval package#

Subpackages#

Submodules#

opendataval.dataval.api module#

class opendataval.dataval.api.DataEvaluator(*args, **kwargs)#

Bases: ABC, ReprMixin

Abstract class of Data Evaluators. Facilitates Data Evaluation computation.

The following is an example of how the api would work:

dataval = (
    DataEvaluator(*args, **kwargs)
    .input_data(x_train, y_train, x_valid, y_valid)
    .train_data_values(batch_size, epochs)
    .evaluate_data_values()
)
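The chained calls above work because each step returns the evaluator itself. The following is a minimal, self-contained sketch of that pattern, not the library's implementation: `ToyEvaluator` and `LabelAgreement` are hypothetical stand-ins, plain Python lists replace tensors, and the valuation rule is invented purely for illustration.

```python
from abc import ABC, abstractmethod


class ToyEvaluator(ABC):
    """Hypothetical stand-in mirroring the three-step DataEvaluator flow."""

    def input_data(self, x_train, y_train, x_valid, y_valid):
        # Store the data and return self so that calls can be chained.
        self.x_train, self.y_train = x_train, y_train
        self.x_valid, self.y_valid = x_valid, y_valid
        return self

    @abstractmethod
    def train_data_values(self, *args, **kwargs):
        """Compute whatever state is needed, then return self."""

    @abstractmethod
    def evaluate_data_values(self):
        """Return one value per training point."""


class LabelAgreement(ToyEvaluator):
    """Toy rule (an assumption): value = share of validation labels matched."""

    def train_data_values(self, *args, **kwargs):
        n_valid = len(self.y_valid)
        self._values = [
            sum(y == v for v in self.y_valid) / n_valid for y in self.y_train
        ]
        return self

    def evaluate_data_values(self):
        return self._values


values = (
    LabelAgreement()
    .input_data([[0.0], [1.0], [2.0]], [0, 1, 1], [[0.5], [1.5]], [1, 1])
    .train_data_values()
    .evaluate_data_values()
)
```

Only the last call in the chain returns the data values; every earlier call returns the evaluator.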

Parameters#

random_state: RandomState, optional

Random initial state, by default None

args: tuple[Any]

DataEvaluator positional arguments

kwargs: Dict[str, Any]

DataEvaluator keyword arguments

Attributes#

pred_model: Model

Prediction model to find how much each training datum contributes towards it.

data_values: np.array

Cached data values, used by opendataval.experiment.exper_methods

Evaluators: ClassVar[dict[str, Self]] = {
    'ame': <class 'opendataval.dataval.ame.ame.AME'>,
    'baggingevaluator': <class 'opendataval.dataval.ame.ame.BaggingEvaluator'>,
    'betashapley': <class 'opendataval.dataval.margcontrib.betashap.BetaShapley'>,
    'classwiseshapley': <class 'opendataval.dataval.csshap.csshap.ClassWiseShapley'>,
    'databanzhaf': <class 'opendataval.dataval.margcontrib.banzhaf.DataBanzhaf'>,
    'databanzhafmargcontrib': <class 'opendataval.dataval.margcontrib.banzhaf.DataBanzhafMargContrib'>,
    'dataoob': <class 'opendataval.dataval.oob.oob.DataOob'>,
    'datashapley': <class 'opendataval.dataval.margcontrib.datashap.DataShapley'>,
    'dvrl': <class 'opendataval.dataval.dvrl.dvrl.DVRL'>,
    'influencefunction': <class 'opendataval.dataval.influence.influence.InfluenceFunction'>,
    'influencesubsample': <class 'opendataval.dataval.influence.infsub.InfluenceSubsample'>,
    'knnshapley': <class 'opendataval.dataval.knnshap.knnshap.KNNShapley'>,
    'lavaevaluator': <class 'opendataval.dataval.lava.lava.LavaEvaluator'>,
    'leaveoneout': <class 'opendataval.dataval.margcontrib.loo.LeaveOneOut'>,
    'randomevaluator': <class 'opendataval.dataval.random.random.RandomEvaluator'>,
    'robustvolumeshapley': <class 'opendataval.dataval.volume.rvs.RobustVolumeShapley'>,
    'shapevaluator': <class 'opendataval.dataval.margcontrib.shap.ShapEvaluator'>,
}#

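
The keys of this mapping are the lowercased class names. One common way to build such a registry is the `__init_subclass__` hook; the sketch below illustrates that general technique only and is an assumption, not a statement of how the library populates `Evaluators`. `Registered` is a hypothetical base class, and the two subclasses are empty placeholders that merely borrow familiar names.

```python
from typing import ClassVar


class Registered:
    """Hypothetical base class that records every subclass by lowercased name."""

    Evaluators: ClassVar[dict] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register the new subclass under its lowercased class name.
        Registered.Evaluators[cls.__name__.lower()] = cls


class DataOob(Registered):  # placeholder, no behavior
    pass


class KNNShapley(Registered):  # placeholder, no behavior
    pass


# Look a class up by its string key, as a config file or CLI might do.
evaluator_cls = Registered.Evaluators["knnshapley"]
```

A string-keyed registry like this lets callers select an evaluator by name without importing each class explicitly.
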
property data_values: ndarray#

Cached data values.

abstract evaluate_data_values() → ndarray#

Return data values for each training data point.

Returns#

np.ndarray

Predicted data values/selection for each training input data point

input_data(x_train: Tensor | Dataset, y_train: Tensor, x_valid: Tensor | Dataset, y_valid: Tensor)#

Store and transform input data for DataEvaluator.

Parameters#

x_train: torch.Tensor

Data covariates

y_train: torch.Tensor

Data labels

x_valid: torch.Tensor

Test and held-out covariates

y_valid: torch.Tensor

Test and held-out labels

Returns#

self: object

Returns a Data Evaluator.

input_fetcher(fetcher: DataFetcher)#

Input data from a DataFetcher object. Alternative way of adding data.

setup(fetcher: DataFetcher, pred_model: Model | None = None, metric: Callable[[Tensor, Tensor], float] | None = None)#

Inputs model, metric and data into Data Evaluator.

Parameters#

fetcher: DataFetcher

DataFetcher containing the training and validation data set.

pred_model: Model, optional

Prediction model, not required if the DataFetcher is model-less

metric: Callable[[torch.Tensor, torch.Tensor], float], optional

Evaluation function to determine prediction model performance, by default None, in which case either -MSE or ACC is assigned depending on whether the data is categorical

args: tuple[Any], optional

Training positional args

kwargs: dict[str, Any], optional

Training keyword arguments

Returns#

self: object

Returns a Data Evaluator.

train(fetcher: DataFetcher, pred_model: Model | None = None, metric: Callable[[Tensor, Tensor], float] | None = None, *args, **kwargs)#

Store and transform data, then train model to predict data values.

Trains the Data Evaluator and the underlying prediction model. Wrapper for self.input_data and self.train_data_values under one method.

Parameters#

fetcher: DataFetcher

DataFetcher containing the training and validation data set.

pred_model: Model, optional

Prediction model, not required if the DataFetcher is model-less

metric: Callable[[torch.Tensor, torch.Tensor], float], optional

Evaluation function to determine prediction model performance, by default None, in which case either -MSE or ACC is assigned depending on whether the data is categorical

args: tuple[Any], optional

Training positional args

kwargs: dict[str, Any], optional

Training keyword arguments

Returns#

self: object

Returns a Data Evaluator.

abstract train_data_values(*args, **kwargs)#

Trains model to predict data values.

Parameters#

args: tuple[Any], optional

Training positional args

kwargs: dict[str, Any], optional

Training keyword arguments

Returns#

self: object

Returns a trained Data Evaluator.

class opendataval.dataval.api.ModelLessMixin#

Bases: object

Mixin for DataEvaluators that use embeddings instead of a prediction model.

Computing embeddings first and then predicting the data values is the approach used by the Ruoxi Jia Group in their KNN Shapley and LAVA data evaluators.

Attributes#

embedding_model: Model

Embedding model used by model-less DataEvaluator to compute the data values for the embeddings and not the raw input.

pred_model: Model

The pred_model is unused for training, but to compare a series of models on the same algorithm, we compare against a shared prediction algorithm.

embeddings(*tensors: tuple[Dataset | Tensor, ...]) → tuple[Tensor, ...]#

Returns embeddings for the input tensors.

Returns#

tuple[torch.Tensor, …]

Returns a tuple of tensors, one for each input tensor
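
The one-in, one-out contract described above can be sketched as follows. This is a hedged illustration only: the free function `embeddings`, the `embed` parameter, and `row_sum` are hypothetical, and nested lists stand in for tensors.

```python
def embeddings(embed, *tensors):
    """Apply a hypothetical `embed` callable to each input, preserving order."""
    # One output per input, so callers can unpack in the same order they passed.
    return tuple(embed(t) for t in tensors)


def row_sum(rows):
    """Stand-in 'embedding' (an assumption): collapse each row to its sum."""
    return [sum(row) for row in rows]


x_train_emb, x_valid_emb = embeddings(row_sum, [[1, 2], [3, 4]], [[5, 6]])
```

Because the output tuple has the same length and order as the inputs, training and validation data can be embedded in a single call and unpacked directly.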

class opendataval.dataval.api.ModelMixin#

Bases: object

evaluate(y: Tensor, y_hat: Tensor)#

Evaluate performance of the specified metric between label and predictions.

Moves the input tensors to the CPU to avoid errors that arise when the tensors are not on the same device.

Parameters#

y: torch.Tensor

Labels against which the predictions are evaluated

y_hat: torch.Tensor

Predictions of labels

Returns#

float

Performance metric

input_metric(metric: Callable[[Tensor, Tensor], float])#

Input the evaluation metric.

Parameters#

metric: Callable[[torch.Tensor, torch.Tensor], float]

Evaluation function to determine prediction model performance
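
A metric is any callable that takes labels and predictions and returns a float, where larger means better. The sketch below shows the two defaults named elsewhere on this page (-MSE and ACC) written against plain Python sequences rather than tensors; `neg_mse`, `accuracy`, and `default_metric` are hypothetical names introduced for illustration, not library functions.

```python
from typing import Callable, Sequence

# Shape of a metric in this sketch: (labels, predictions) -> score.
Metric = Callable[[Sequence[float], Sequence[float]], float]


def neg_mse(y, y_hat):
    """Negative mean squared error, so that a larger score is better."""
    return -sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def accuracy(y, y_hat):
    """Fraction of predictions that equal their label."""
    return sum(a == b for a, b in zip(y, y_hat)) / len(y)


def default_metric(is_categorical: bool) -> Metric:
    # Hypothetical helper mirroring the documented default choice.
    return accuracy if is_categorical else neg_mse
```

Negating the MSE keeps both metrics oriented the same way, so code that compares scores does not need to know which metric is in use.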

input_model(pred_model: Model)#

Input the prediction model.

Parameters#

pred_model: Model

Prediction model

input_model_metric(pred_model: Model, metric: Callable[[Tensor, Tensor], float])#

Input the prediction model and the evaluation metric.

Parameters#

pred_model: Model

Prediction model

metric: Callable[[torch.Tensor, torch.Tensor], float]

Evaluation function to determine prediction model performance

Returns#

self: object

Returns a Data Evaluator.

Module contents#

Create DataEvaluator to quantify the value of data.

Data Evaluator#

Provides an ABC for DataEvaluator to inherit from. The workflow is as follows: Register, DataFetcher -> DataEvaluator -> exper_methods

Catalog#

DataEvaluator(*args, **kwargs)

Abstract class of Data Evaluators.

ModelMixin()

ModelLessMixin()

Mixin for DataEvaluators that use embeddings instead of a prediction model.

AME(*args, **kwargs)

Implementation of Average Marginal Effect Data Valuation.

DVRL(*args, **kwargs)

Data valuation using reinforcement learning class, implemented with PyTorch.

InfluenceFunction(*args, **kwargs)

Influence Function Data evaluation implementation.

InfluenceSubsample(*args, **kwargs)

Influence computed through subsamples implementation.

KNNShapley(*args, **kwargs)

Data valuation using KNNShapley implementation.

DataOob(*args, **kwargs)

Data Out-of-Bag data valuation implementation.

DataBanzhaf(*args, **kwargs)

Data Banzhaf implementation.

BetaShapley(*args, **kwargs)

Beta Shapley implementation.

DataShapley(*args, **kwargs)

Data Shapley implementation.

LavaEvaluator(*args, **kwargs)

Data valuation using LAVA implementation.

LeaveOneOut(*args, **kwargs)

Leave One Out data valuation implementation.

ShapEvaluator(*args, **kwargs)

Abstract class for all semivalue-based methods of computing data values.

RandomEvaluator(*args, **kwargs)

Completely Random DataEvaluator for baseline comparison purposes.

RobustVolumeShapley(*args, **kwargs)

Robust Volume Shapley and Volume Shapley data valuation implementation.

Sampler(*args, **kwargs)

Abstract Sampler class for marginal contribution based data evaluators.

TMCSampler(*args, **kwargs)

TMCShapley sampler for semivalue-based methods of computing data values.

GrTMCSampler(*args, **kwargs)

TMC Sampler with terminator for semivalue-based methods of computing data values.