opendataval.dataval package#
Subpackages#
- opendataval.dataval.ame package
- opendataval.dataval.csshap package
- opendataval.dataval.dvrl package
- opendataval.dataval.influence package
- opendataval.dataval.knnshap package
- opendataval.dataval.lava package
- opendataval.dataval.margcontrib package
- Submodules
- opendataval.dataval.margcontrib.banzhaf module
- opendataval.dataval.margcontrib.betashap module
- opendataval.dataval.margcontrib.datashap module
- opendataval.dataval.margcontrib.loo module
- opendataval.dataval.margcontrib.sampler module
- opendataval.dataval.margcontrib.shap module
- Module contents
- opendataval.dataval.oob package
- opendataval.dataval.random package
- opendataval.dataval.volume package
Submodules#
opendataval.dataval.api module#
- class opendataval.dataval.api.DataEvaluator(*args, **kwargs)#
Bases: ABC, ReprMixin
Abstract class of Data Evaluators. Facilitates Data Evaluation computation.
The following is an example of how the API would work:

dataval = (
    DataEvaluator(*args, **kwargs)
    .input_data(x_train, y_train, x_valid, y_valid)
    .train_data_values(batch_size, epochs)
    .evaluate_data_values()
)
Parameters#
- random_state: RandomState, optional
Random initial state, by default None
- args: tuple[Any]
DataEvaluator positional arguments
- kwargs: dict[str, Any]
DataEvaluator keyword arguments
Attributes#
- pred_model: Model
Prediction model used to determine how much each training datum contributes to its performance.
- data_values: np.ndarray
Cached data values, used by opendataval.experiment.exper_methods
- Evaluators: ClassVar[dict[str, Self]] = {'ame': <class 'opendataval.dataval.ame.ame.AME'>, 'baggingevaluator': <class 'opendataval.dataval.ame.ame.BaggingEvaluator'>, 'betashapley': <class 'opendataval.dataval.margcontrib.betashap.BetaShapley'>, 'classwiseshapley': <class 'opendataval.dataval.csshap.csshap.ClassWiseShapley'>, 'databanzhaf': <class 'opendataval.dataval.margcontrib.banzhaf.DataBanzhaf'>, 'databanzhafmargcontrib': <class 'opendataval.dataval.margcontrib.banzhaf.DataBanzhafMargContrib'>, 'dataoob': <class 'opendataval.dataval.oob.oob.DataOob'>, 'datashapley': <class 'opendataval.dataval.margcontrib.datashap.DataShapley'>, 'dvrl': <class 'opendataval.dataval.dvrl.dvrl.DVRL'>, 'influencefunction': <class 'opendataval.dataval.influence.influence.InfluenceFunction'>, 'influencesubsample': <class 'opendataval.dataval.influence.infsub.InfluenceSubsample'>, 'knnshapley': <class 'opendataval.dataval.knnshap.knnshap.KNNShapley'>, 'lavaevaluator': <class 'opendataval.dataval.lava.lava.LavaEvaluator'>, 'leaveoneout': <class 'opendataval.dataval.margcontrib.loo.LeaveOneOut'>, 'randomevaluator': <class 'opendataval.dataval.random.random.RandomEvaluator'>, 'robustvolumeshapley': <class 'opendataval.dataval.volume.rvs.RobustVolumeShapley'>, 'shapevaluator': <class 'opendataval.dataval.margcontrib.shap.ShapEvaluator'>}#
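Because Evaluators is a plain class-level dictionary, a registered evaluator can be looked up and instantiated by its lower-cased name. A minimal sketch, assuming the chosen class can be constructed with default arguments:

```python
from opendataval.dataval.api import DataEvaluator

# Look up a registered evaluator class by its lower-cased name
# (the keys shown in the Evaluators ClassVar above).
evaluator_cls = DataEvaluator.Evaluators["knnshapley"]

# Assumption: the chosen class accepts default constructor arguments.
evaluator = evaluator_cls()
```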
- property data_values: ndarray#
Cached data values.
- abstract evaluate_data_values() → ndarray#
Return data values for each training data point.
Returns#
- np.ndarray
Predicted data values/selection for each training input data point
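As a concrete illustration of the abstract interface, the sketch below subclasses DataEvaluator with a toy evaluator that assigns every training point the same value. It assumes that train_data_values must also be overridden and that input_data stores the training covariates as self.x_train; both are assumptions about the base class internals, and UniformEvaluator is hypothetical.

```python
import numpy as np
from opendataval.dataval.api import DataEvaluator

class UniformEvaluator(DataEvaluator):
    """Hypothetical evaluator: every training point gets the same value."""

    def train_data_values(self, *args, **kwargs):
        # Nothing to train for this toy example; real evaluators fit models
        # or estimate marginal contributions here. Returning self keeps the
        # chained-call pattern from the class-level example working.
        return self

    def evaluate_data_values(self) -> np.ndarray:
        # One value per training data point (x_train assumed stored by input_data).
        return np.ones(len(self.x_train))
```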
- input_data(x_train: Tensor | Dataset, y_train: Tensor, x_valid: Tensor | Dataset, y_valid: Tensor)#
Store and transform input data for DataEvaluator.
Parameters#
- x_train: torch.Tensor
Data covariates
- y_train: torch.Tensor
Data labels
- x_valid: torch.Tensor
Test+Held-out covariates
- y_valid: torch.Tensor
Test+Held-out labels
Returns#
- self: object
Returns a Data Evaluator.
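Since input_data returns self, it chains directly with training and evaluation, as in the class-level example above. A hedged sketch with random tensors; RandomEvaluator is chosen only because it needs no prediction model, and the top-level import path is an assumption:

```python
import torch
from opendataval.dataval import RandomEvaluator  # assumed re-export from the package

x_train, y_train = torch.rand(100, 10), torch.rand(100, 1)
x_valid, y_valid = torch.rand(20, 10), torch.rand(20, 1)

data_values = (
    RandomEvaluator()
    .input_data(x_train, y_train, x_valid, y_valid)
    .train_data_values()
    .evaluate_data_values()
)
```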
- input_fetcher(fetcher: DataFetcher)#
Input data from a DataFetcher object. Alternative way of adding data.
- setup(fetcher: DataFetcher, pred_model: Model | None = None, metric: Callable[[Tensor, Tensor], float] | None = None)#
Inputs model, metric and data into Data Evaluator.
Parameters#
- fetcher: DataFetcher
DataFetcher containing the training and validation data set.
- pred_model: Model, optional
Prediction model, not required if the Data Evaluator is model-less
- metric: Callable[[torch.Tensor, torch.Tensor], float], optional
Evaluation function to determine prediction model performance, by default None, which assigns either negative MSE or accuracy depending on whether the labels are categorical
- args: tuple[Any], optional
Training positional arguments
- kwargs: dict[str, Any], optional
Training keyword arguments
Returns#
- self: object
Returns a Data Evaluator.
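A hedged sketch of wiring a DataFetcher and a prediction model into an evaluator through setup and then chaining into training and evaluation. The DataFetcher constructor arguments, the dataset name, the DataOob evaluator, and the LogisticRegression model class and its signature are all assumptions, not prescribed by this API:

```python
from opendataval.dataloader import DataFetcher     # assumed import path
from opendataval.dataval import DataOob            # assumed re-export
from opendataval.model import LogisticRegression   # assumed model class/signature

fetcher = DataFetcher(dataset_name="iris")          # hypothetical registered dataset
model = LogisticRegression(input_dim=4, num_classes=3)

data_values = (
    DataOob()
    .setup(fetcher, pred_model=model)   # metric defaults as described above
    .train_data_values()
    .evaluate_data_values()
)
```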
- train(fetcher: DataFetcher, pred_model: Model | None = None, metric: Callable[[Tensor, Tensor], float] | None = None, *args, **kwargs)#
Store and transform data, then train model to predict data values.
Trains the Data Evaluator and the underlying prediction model. Wrapper for self.input_data and self.train_data_values under one method.
Parameters#
- fetcher: DataFetcher
DataFetcher containing the training and validation data set.
- pred_model: Model, optional
Prediction model, not required if the Data Evaluator is model-less
- metric: Callable[[torch.Tensor, torch.Tensor], float], optional
Evaluation function to determine prediction model performance, by default None, which assigns either negative MSE or accuracy depending on whether the labels are categorical
- args: tuple[Any], optional
Training positional arguments
- kwargs: dict[str, Any], optional
Training keyword arguments
Returns#
- self: object
Returns a Data Evaluator.
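Because train wraps the data-input and training steps, the setup chain from the previous sketch collapses into a single call; this continues the same hypothetical fetcher and model:

```python
# Equivalent one-call form of the previous sketch: input data, train the
# prediction model and the evaluator, then read the data values.
data_values = DataOob().train(fetcher, pred_model=model).evaluate_data_values()
```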
- class opendataval.dataval.api.ModelLessMixin#
Bases: object
Mixin for DataEvaluators that have no prediction model and use embeddings instead.
Using embeddings and then predicting the data values has been used by the Ruoxi Jia Group with their KNN Shapley and LAVA data evaluators.
Attributes#
- embedding_model: Model
Embedding model used by a model-less DataEvaluator to compute data values from the embeddings rather than the raw input.
- pred_model: Model
The pred_model is unused for training, but is kept so that a series of data evaluators can be compared against a shared prediction algorithm.
- class opendataval.dataval.api.ModelMixin#
Bases: object
- evaluate(y: Tensor, y_hat: Tensor)#
Evaluate performance of the specified metric between label and predictions.
Moves input tensors to the CPU because of bugs/errors that arise when the tensors are not on the same device.
Parameters#
- y: torch.Tensor
Labels used to evaluate the performance of the predictions
- y_hat: torch.Tensor
Predictions of labels
Returns#
- float
Performance metric
- input_metric(metric: Callable[[Tensor, Tensor], float])#
Input the evaluation metric.
Parameters#
- metric: Callable[[torch.Tensor, torch.Tensor], float]
Evaluation function to determine prediction model performance
- input_model(pred_model: Model)#
Input the prediction model.
Parameters#
- pred_model: Model
Prediction model
- input_model_metric(pred_model: Model, metric: Callable[[Tensor, Tensor], float])#
Input the prediction model and the evaluation metric.
Parameters#
- pred_model: Model
Prediction model
- metric: Callable[[torch.Tensor, torch.Tensor], float]
Evaluation function to determine prediction model performance
Returns#
- self: object
Returns a Data Evaluator.
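A hedged sketch of supplying a prediction model and a custom metric through the mixin; the accuracy function is a hypothetical stand-in for a metric, and DataOob plus the LogisticRegression signature are assumptions carried over from the earlier sketches:

```python
import torch
from opendataval.dataval import DataOob            # assumed re-export
from opendataval.model import LogisticRegression   # assumed model class/signature

def accuracy(y: torch.Tensor, y_hat: torch.Tensor) -> float:
    # Hypothetical metric: fraction of matching argmax labels.
    return (y.argmax(dim=1) == y_hat.argmax(dim=1)).float().mean().item()

evaluator = DataOob().input_model_metric(
    pred_model=LogisticRegression(input_dim=4, num_classes=3),
    metric=accuracy,
)
# The same can be done in two steps with input_model(...) and input_metric(...).
```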
Module contents#
Create DataEvaluator to quantify the value of data.
Data Evaluator#
Provides an ABC for DataEvaluator to inherit from. The workflow is as follows:
Register, DataFetcher -> DataEvaluator -> exper_methods
Catalog#
- DataEvaluator: Abstract class of Data Evaluators.
- ModelLessMixin: Mixin for DataEvaluators that have no prediction model and use embeddings instead.
- AME: Implementation of Average Marginal Effect Data Valuation.
- DVRL: Data valuation using reinforcement learning class, implemented with PyTorch.
- InfluenceFunction: Influence Function Data evaluation implementation.
- InfluenceSubsample: Influence computed through subsamples implementation.
- KNNShapley: Data valuation using KNNShapley implementation.
- DataOob: Data Out-of-Bag data valuation implementation.
- DataBanzhaf: Data Banzhaf implementation.
- BetaShapley: Beta Shapley implementation.
- DataShapley: Data Shapley implementation.
- LavaEvaluator: Data valuation using LAVA implementation.
- LeaveOneOut: Leave One Out data valuation implementation.
- ShapEvaluator: Abstract class for all semivalue-based methods of computing data values.
- RandomEvaluator: Completely Random DataEvaluator for baseline comparison purposes.
- RobustVolumeShapley: Robust Volume Shapley and Volume Shapley data valuation implementation.
- Sampler: Abstract Sampler class for marginal contribution based data evaluators.
- TMCSampler: TMCShapley sampler for semivalue-based methods of computing data values.
- GrTMCSampler: TMC Sampler with terminator for semivalue-based methods of computing data values.