Compute data values using the Data Banzhaf data valuator. For each data
point, takes the difference between the average performance of all subsets
that include the point and the average performance of all subsets that
exclude it.
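To make the definition concrete, the following is a minimal brute-force
sketch; "utility" is a hypothetical stand-in for the performance of a model
trained on a subset and is not part of opendataval's API::

    from itertools import combinations

    def banzhaf_values(num_points, utility):
        # Average marginal contribution of each point over all subsets of the rest.
        values = []
        for i in range(num_points):
            rest = [p for p in range(num_points) if p != i]
            margins = [
                utility(set(subset) | {i}) - utility(set(subset))
                for size in range(len(rest) + 1)
                for subset in combinations(rest, size)
            ]
            values.append(sum(margins) / len(margins))
        return values

    # Toy utility: every point adds exactly 1 to performance.
    print(banzhaf_values(3, len))  # [1.0, 1.0, 1.0]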
Data Banzhaf implementation using the marginal contributions.
Data Banzhaf implementation built on the ShapEvaluator, which already computes
the marginal contributions for other evaluators. Run from scratch, this
approach may be less efficient than the previous one, but it is recommended
to minimize compute time when a previous computation can be reused from the
cache, as shown in the sketch after the parameter list below.
gr_threshold : float, optional
Convergence threshold for the Gelman-Rubin statistic. Computing exact
Shapley values is NP-hard, so we resort to MCMC sampling, by default 1.05
max_mc_epochs : int, optional
Max number of outer epochs of MCMC sampling, by default 100
models_per_epoch : int, optional
Number of model fittings to take per epoch prior to checking GR convergence,
by default 100
min_models : int, optional
Minimum samples before checking MCMC convergence, by default 1000
min_cardinality : int, optional
Minimum cardinality of a training set, must be passed as kwarg, by default 5
cache_name : str, optional
Unique cache_name of the model used to cache marginal contributions; set to
None to disable caching, by default "" which is resolved to a unique value
per object
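A hypothetical usage sketch of the caching behavior described above; exact
import paths and constructor signatures may differ across opendataval
versions::

    from opendataval.dataval import DataBanzhafMargContrib, DataShapley

    # Both evaluators share a cache_name, so the marginal contributions are
    # trained once by the first evaluator and reused by the second.
    shapley = DataShapley(gr_threshold=1.05, cache_name="shared_run")
    banzhaf = DataBanzhafMargContrib(cache_name="shared_run")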
Compute the marginal contributions for semivalue-based data evaluators.
Computes the marginal contributions by sampling and checks MCMC convergence
every 100 iterations using the Gelman-Rubin statistic.
NOTE: if the marginal contributions have not been calculated yet, they are
looked up in a cache of already trained ShapEvaluators; only on a cache miss
are they computed from scratch.
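A minimal sketch of such a Gelman-Rubin check, assuming the sampled model
evaluations are split into equal-length chains; this mirrors the convergence
test described above rather than opendataval's exact implementation::

    import numpy as np

    def gelman_rubin(samples, num_chains=10):
        # samples: (num_models, num_points); returns the max R-hat over points.
        n = samples.shape[0] // num_chains
        chains = samples[: n * num_chains].reshape(num_chains, n, -1)
        within = chains.var(axis=1, ddof=1).mean(axis=0)       # W: within-chain
        between = n * chains.mean(axis=1).var(axis=0, ddof=1)  # B: between-chain
        return float(np.sqrt(((n - 1) / n * within + between / n) / within).max())

    rng = np.random.default_rng(0)
    # Sampling stops once the statistic drops below gr_threshold (1.05 above).
    print(gelman_rubin(rng.normal(size=(1000, 5))))  # ~1.0 for i.i.d. draws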
Monte Carlo sampler for semivalue-based methods of computing data values.
Evaluators that share marginal contributions should share a sampler. The
sampler draws mc_epochs random permutations and computes each point's
marginal contribution from prefixes of those permutations. This is the
simplest implementation but the least practical; a sketch follows the
parameter list below.
mc_epochs : int, optional
Number of outer epochs of MCMC sampling, by default 1000
min_cardinality : int, optional
Minimum cardinality of a training set, must be passed as kwarg, by default 5
cache_name : str, optional
Unique cache_name of the model used to cache marginal contributions; set to
None to disable caching, by default "" which is resolved to a unique value
per object
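A minimal sketch of the permutation-based sampling described above; "utility"
is again a hypothetical stand-in for training and scoring a model on a
coalition::

    import numpy as np

    def monte_carlo_marginals(num_points, utility, mc_epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        total = np.zeros((num_points, num_points))  # dim 0: point, dim 1: cardinality
        count = np.zeros((num_points, num_points))
        for _ in range(mc_epochs):
            perm = rng.permutation(num_points)
            coalition, prev = set(), utility(set())
            for cardinality, idx in enumerate(perm):
                coalition.add(idx)
                curr = utility(coalition)
                total[idx, cardinality] += curr - prev
                count[idx, cardinality] += 1
                prev = curr
        return total / np.maximum(count, 1)  # average marginal contributions

    marginals = monte_carlo_marginals(4, lambda s: len(s) ** 0.5, mc_epochs=50)
    print(marginals.shape)  # (4, 4): the array layout described below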
Abstract Sampler class for marginal-contribution-based data evaluators.
Many marginal-contribution-based data evaluators depend on a sampling method
because computing marginal contributions exactly is typically very expensive.
The Sampler class provides a blueprint of the required methods, and the
following samplers provide ways of caching computed marginal contributions
when given a "cache_name".
Marginal contribution array per data point for each coalition size. Dim 0 is
the index of the added data point; Dim 1 is the cardinality at which the data
point is added.
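A hypothetical sketch of that blueprint and of how an evaluator reduces the
marginal contribution array to data values; the class and method names are
illustrative, not opendataval's exact interface::

    from abc import ABC, abstractmethod
    import numpy as np

    class MargContribSampler(ABC):  # hypothetical name for the blueprint
        @abstractmethod
        def compute_marginal_contribution(self, *args, **kwargs) -> np.ndarray:
            # Returns an array of shape (num_points, num_points); entry [i, j]
            # is the average marginal contribution of point i at cardinality j.
            ...

    def shapley_from_marginals(marginals: np.ndarray) -> np.ndarray:
        # The Shapley value weights every cardinality equally, so evaluators
        # can share one cached array and differ only in how they weight it.
        return marginals.mean(axis=1)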
mc_epochs : int, optional
Number of outer epochs of MCMC sampling, by default 1000
min_cardinality : int, optional
Minimum cardinality of a training set, must be passed as kwarg, by default 5
cache_name : str, optional
Unique cache_name of the model used to cache marginal contributions; set to
None to disable caching, by default "" which is resolved to a unique value
per object
sampler : Sampler, optional
Sampler used to compute the marginal contributions, found in
opendataval/margcontrib/sampler.py; by default GrTMCSampler, with additional
keyword arguments passed through to the sampler's constructor. A usage
sketch follows the parameter list below.
gr_threshold : float, optional
Convergence threshold for the Gelman-Rubin statistic. Computing exact
Shapley values is NP-hard, so we resort to MCMC sampling, by default 1.05
max_mc_epochs : int, optional
Max number of outer epochs of MCMC sampling, by default 100
models_per_epoch : int, optional
Number of model fittings to take per epoch prior to checking GR convergence,
by default 100
min_models : int, optional
Minimum samples before checking MCMC convergence, by default 1000
min_cardinality : int, optional
Minimum cardinality of a training set, must be passed as kwarg, by default 5
cache_name : str, optional
Unique cache_name of the model used to cache marginal contributions; set to
None to disable caching, by default "" which is resolved to a unique value
per object
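A hypothetical end-to-end sketch of configuring a sampler and handing it to
an evaluator; exact signatures may differ across opendataval versions::

    from opendataval.dataval import DataShapley
    from opendataval.margcontrib.sampler import GrTMCSampler

    # Tighter convergence threshold and a shared cache key; the same sampler
    # instance can be shared by several semivalue evaluators.
    sampler = GrTMCSampler(gr_threshold=1.02, max_mc_epochs=100, cache_name="run1")
    shapley = DataShapley(sampler=sampler)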