opendataval.dataval.margcontrib package#

Submodules#

opendataval.dataval.margcontrib.banzhaf module#

class opendataval.dataval.margcontrib.banzhaf.DataBanzhaf(*args, **kwargs)#

Bases: DataEvaluator, ModelMixin

Data Banzhaf implementation.

Parameters#

num_models : int, optional

Number of models to fit when computing Banzhaf values, by default 1000

random_state : RandomState, optional

Random initial state, by default None

evaluate_data_values() → ndarray#

Return data values for each training data point.

Compute data values using the Data Banzhaf data valuator: each value is the difference between the average performance of the sampled subsets that include the data point and the average performance of those that do not.

Returns#

np.ndarray

Predicted data values/selection for every training data point

input_data(x_train: Tensor, y_train: Tensor, x_valid: Tensor, y_valid: Tensor)#

Store and transform input data for Data Banzhaf.

Parameters#

x_train : torch.Tensor

Data covariates

y_train : torch.Tensor

Data labels

x_valid : torch.Tensor

Test+Held-out covariates

y_valid : torch.Tensor

Test+Held-out labels

train_data_values(*args, **kwargs)#

Trains model to predict data values.

Trains the Data Banzhaf values by sampling subsets from the powerset and comparing the average performance of the subsets that include each data point against those that do not.

Parameters#

args : tuple[Any], optional

Training positional args

kwargs : dict[str, Any], optional

Training keyword arguments
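
A minimal usage sketch of the methods above (hedged: it assumes the prediction model and evaluation metric have already been attached to the evaluator, e.g. through the ModelMixin interface, which is outside this module's scope; the tensor shapes are placeholders):

import torch
from opendataval.dataval.margcontrib.banzhaf import DataBanzhaf

# Placeholder covariates/labels; in practice these come from the package's data loading utilities.
x_train, y_train = torch.rand(100, 10), torch.rand(100, 2)
x_valid, y_valid = torch.rand(20, 10), torch.rand(20, 2)

banzhaf = DataBanzhaf(num_models=1000)                  # number of subsets/models to sample
banzhaf.input_data(x_train, y_train, x_valid, y_valid)
banzhaf.train_data_values()                             # fits a model per sampled subset
values = banzhaf.evaluate_data_values()                 # np.ndarray, one value per training point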

class opendataval.dataval.margcontrib.banzhaf.DataBanzhafMargContrib(*args, **kwargs)#

Bases: ShapEvaluator

Data Banzhaf implementation using the marginal contributions.

Data Banzhaf implementation built on the ShapEvaluator, which already computes marginal contributions for other evaluators. This approach may be less efficient than DataBanzhaf above, but it is recommended when a previously computed (cached) set of marginal contributions can be reused, as that minimizes compute time.

Parameters#

sampler : Sampler, optional

Sampler used to compute the marginal contributions; available samplers are in the sampler module. By default, *args and **kwargs are passed to GrTMCSampler.
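
A sketch of the caching recommendation above: two semivalue evaluators that share one sampler (and therefore one cache entry) reuse the same marginal contributions, so the expensive model fittings are paid for only once. The constructor arguments follow the parameter lists on this page:

from opendataval.dataval.margcontrib.banzhaf import DataBanzhafMargContrib
from opendataval.dataval.margcontrib.datashap import DataShapley
from opendataval.dataval.margcontrib.sampler import GrTMCSampler

# One sampler instance, one cache entry: marginal contributions computed while
# training the first evaluator are reused by the second instead of recomputed.
shared_sampler = GrTMCSampler(gr_threshold=1.05, max_mc_epochs=100, cache_name="shared")

shapley = DataShapley(sampler=shared_sampler)
banzhaf = DataBanzhafMargContrib(sampler=shared_sampler)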

compute_weight() → float#

Compute weights for each cardinality of training set.

Banzhaf weights the marginal contribution at each cardinality \(j\) in proportion to the number of subsets of that cardinality, \(\tbinom{n-1}{j-1}\), out of the \(2^{n-1}\) subsets of the remaining data points.

Returns#

np.ndarray

Weights by cardinality of subset
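
A sketch of one way such cardinality weights can be computed (a straightforward reading of the description above, not necessarily the exact implementation of compute_weight):

import numpy as np
from scipy.special import comb

def banzhaf_cardinality_weights(num_points: int) -> np.ndarray:
    # Weight for a point added at cardinality j (j = 1, ..., n): the fraction of
    # the 2^(n-1) subsets of the remaining points that have size j - 1.
    j = np.arange(1, num_points + 1)
    return comb(num_points - 1, j - 1) / 2 ** (num_points - 1)

weights = banzhaf_cardinality_weights(10)
assert np.isclose(weights.sum(), 1.0)   # the weights over cardinalities sum to one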

opendataval.dataval.margcontrib.betashap module#

class opendataval.dataval.margcontrib.betashap.BetaShapley(*args, **kwargs)#

Bases: ShapEvaluator

Beta Shapley implementation. Must specify alpha/beta values for beta function.

Parameters#

sampler : Sampler, optional

Sampler used to compute the marginal contributions; available samplers are in the sampler module. By default, *args and **kwargs are passed to GrTMCSampler.

alpha : int, optional

Alpha parameter for the beta distribution used in the weight function, by default 4

beta : int, optional

Beta parameter for the beta distribution used in the weight function, by default 1

compute_weight() → ndarray#

Compute weights for each cardinality of training set.

BetaShap weight computation [1], where \(\alpha\) and \(\beta\) are the parameters of the beta distribution and \(j\) is the cardinality; see Equations (3) and (5).

\[w(j) := \frac{1}{n} w^{(n)}(j) \tbinom{n-1}{j-1} \propto \frac{\mathrm{Beta}(j + \beta - 1, n - j + \alpha)}{\mathrm{Beta}(\alpha, \beta)} \tbinom{n-1}{j-1}\]

Returns#

np.ndarray

Weights by cardinality of subset
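
A sketch of the weight formula above using scipy's beta function; the normalization at the end is an assumption (the actual compute_weight may scale differently), and very large n may call for a log-space computation to avoid underflow:

import numpy as np
from scipy.special import beta as beta_fn, comb

def beta_shapley_weights(num_points: int, alpha: int = 4, beta: int = 1) -> np.ndarray:
    # w(j) proportional to Beta(j + beta - 1, n - j + alpha) / Beta(alpha, beta) * C(n-1, j-1),
    # following the displayed equation; normalized here so the weights sum to one.
    j = np.arange(1, num_points + 1)
    w = beta_fn(j + beta - 1, num_points - j + alpha) / beta_fn(alpha, beta)
    w = w * comb(num_points - 1, j - 1)
    return w / w.sum()

weights = beta_shapley_weights(100, alpha=4, beta=1)   # alpha > beta skews weight toward small cardinalities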

opendataval.dataval.margcontrib.datashap module#

class opendataval.dataval.margcontrib.datashap.DataShapley(*args, **kwargs)#

Bases: ShapEvaluator

Data Shapley implementation.

Parameters#

sampler : Sampler, optional

Sampler used to compute the marginal contributions; available samplers are in the sampler module. By default, *args and **kwargs are passed to GrTMCSampler.

compute_weight() → float#

Compute weights (uniform) for each cardinality of training set.

Shapley values take a simple average of the marginal contributions across all different cardinalities.

Returns#

np.ndarray

Weights by cardinality of subset
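
In other words, the weight per cardinality is simply 1/n; a trivial sketch:

import numpy as np

def shapley_cardinality_weights(num_points: int) -> np.ndarray:
    # The Shapley value averages the per-cardinality marginal contributions uniformly.
    return np.full(num_points, 1 / num_points)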

opendataval.dataval.margcontrib.loo module#

class opendataval.dataval.margcontrib.loo.LeaveOneOut(*args, **kwargs)#

Bases: DataEvaluator, ModelMixin

Leave One Out data valuation implementation.

Parameters#

random_state : RandomState, optional

Random initial state, by default None

evaluate_data_values() → ndarray#

Compute data values using Leave One Out data valuation.

Returns#

np.ndarray

Predicted data values/selection for every training data point

input_data(x_train: Tensor, y_train: Tensor, x_valid: Tensor, y_valid: Tensor)#

Store and transform input data for Leave-One-Out data valuation.

Parameters#

x_train : torch.Tensor | Dataset

Data covariates

y_train : torch.Tensor

Data labels

x_valid : torch.Tensor | Dataset

Test+Held-out covariates

y_valid : torch.Tensor

Test+Held-out labels

train_data_values(*args, **kwargs)#

Trains model to predict data values.

Compute the data values using the Leave-One-Out data valuation. Equivalently, LOO can be computed from the marginal contributions as it’s a semivalue.

Parameters#

args : tuple[Any], optional

Training positional args

kwargs : dict[str, Any], optional

Training keyword arguments
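
Conceptually, the LOO value of a training point is the drop in validation performance when that point is removed from the full training set. A minimal sketch, where utility is a hypothetical stand-in for the evaluator's train-then-score step:

import numpy as np

def leave_one_out_values(utility, num_points: int) -> np.ndarray:
    # utility(indices) -> float: validation performance of a model trained on `indices`.
    full = list(range(num_points))
    baseline = utility(full)
    return np.array(
        [baseline - utility([k for k in full if k != i]) for i in range(num_points)]
    )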

opendataval.dataval.margcontrib.sampler module#

class opendataval.dataval.margcontrib.sampler.GrTMCSampler(*args, **kwargs)#

Bases: Sampler

TMC Sampler with terminator for semivalue-based methods of computing data values.

Evaluators that share marginal contributions should share a sampler.

Parameters#

gr_threshold : float, optional

Convergence threshold for the Gelman-Rubin statistic. Computing exact Shapley values is NP-hard, so we resort to MCMC sampling. By default 1.05

max_mc_epochs : int, optional

Max number of outer epochs of MCMC sampling, by default 100

models_per_epoch : int, optional

Number of model fittings to take per epoch prior to checking GR convergence, by default 100

min_models : int, optional

Minimum number of samples before checking MCMC convergence, by default 1000

min_cardinality : int, optional

Minimum cardinality of a training set, must be passed as kwarg, by default 5

cache_name : str, optional

Unique cache_name of the model used to cache marginal contributions; set to None to disable caching. By default “”, which is replaced with a unique value per object.

random_state : RandomState, optional

Random initial state, by default None

CACHE: ClassVar[dict[str, ndarray]] = {}#

Cached marginal contributions.

GR_MAX = 100#

Default maximum Gelman-Rubin statistic. Used for burn-in.

compute_marginal_contribution(*args, **kwargs)#

Compute the marginal contributions for semivalue based data evaluators.

Computes the marginal contributions by sampling. Checks MCMC convergence every 100 iterations using the Gelman-Rubin statistic. NOTE: if the marginal contributions have already been computed (cached by another ShapEvaluator under the same cache_name), they are looked up rather than recomputed; otherwise they are computed from scratch.

Parameters#

args : tuple[Any], optional

Training positional args

kwargs : dict[str, Any], optional

Training keyword arguments

Notes#

marginal_increment_array_stack : np.ndarray

Marginal increments when one data point is added.
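
For reference, a textbook form of the Gelman-Rubin statistic computed over several sampling chains of marginal-contribution estimates; the sampler's internal bookkeeping may differ, so treat this as an illustrative sketch:

import numpy as np

def gelman_rubin(samples: np.ndarray) -> float:
    # samples: shape (num_chains, num_samples) of per-chain draws.
    num_chains, num_samples = samples.shape
    chain_means = samples.mean(axis=1)
    within = samples.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = num_samples * chain_means.var(ddof=1)    # B: between-chain variance
    var_hat = (num_samples - 1) / num_samples * within + between / num_samples
    return float(np.sqrt(var_hat / within))            # R-hat; close to 1.0 at convergence

# Sampling stops once R-hat drops below gr_threshold (e.g. 1.05) or max_mc_epochs is reached.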

set_coalition(coalition: Tensor)#

Initializes storage to find marginal contribution of each data point

class opendataval.dataval.margcontrib.sampler.MonteCarloSampler(*args, **kwargs)#

Bases: Sampler

Monte Carlo sampler for semivalue-based methods of computing data values.

Evaluators that share marginal contributions should share a sampler. We take mc_epochs permutations and compute the marginal contributions. Simplest implementation but the least practical.

Parameters#

mc_epochs : int, optional

Number of outer epochs of MCMC sampling, by default 1000

min_cardinality : int, optional

Minimum cardinality of a training set, must be passed as kwarg, by default 5

cache_name : str, optional

Unique cache_name of the model used to cache marginal contributions; set to None to disable caching. By default “”, which is replaced with a unique value per object.

random_state : RandomState, optional

Random initial state, by default None

CACHE: ClassVar[dict[str, ndarray]] = {}#

Cached marginal contributions.

compute_marginal_contribution(*args, **kwargs)#

Compute marginal contributions via permutation sampling.

Uses permutation sampling to find the marginal contribution of each data point, takes self.mc_epochs number of permutations.
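
A conceptual sketch of plain permutation sampling (caching and the min_cardinality handling documented above are omitted); utility stands in for the value function supplied via set_evaluator:

import numpy as np

def permutation_marginal_contributions(utility, num_points, mc_epochs, rng):
    # marg[i, j]: average marginal contribution of point i when it is added at cardinality j + 1.
    marg = np.zeros((num_points, num_points))
    counts = np.zeros((num_points, num_points))
    for _ in range(mc_epochs):
        perm = rng.permutation(num_points)
        coalition, prev = [], utility([])
        for j, i in enumerate(perm):
            coalition.append(i)
            curr = utility(coalition)
            marg[i, j] += curr - prev
            counts[i, j] += 1
            prev = curr
    return marg / np.maximum(counts, 1)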

set_coalition(coalition: Tensor)#

Initializes storage to find marginal contribution of each data point

class opendataval.dataval.margcontrib.sampler.Sampler(*args, **kwargs)#

Bases: ABC, ReprMixin

Abstract Sampler class for marginal contribution based data evaluators.

Many marginal contribution based data evaluators depend on a sampling method, as they can be very computationally expensive. The Sampler class provides a blueprint of the required methods, and the samplers below provide ways of caching computed marginal contributions when given a “cache_name”.
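
The intended call pattern, sketched with one of the concrete samplers below; the value_func here is a throwaway placeholder for the utility function a DataEvaluator would supply:

import torch
from opendataval.dataval.margcontrib.sampler import TMCSampler

def value_func(subset, *args, **kwargs) -> float:
    # Placeholder utility: a real evaluator would train a model on `subset`
    # (a list of training indices) and return its validation performance.
    return float(len(subset))

sampler = TMCSampler(mc_epochs=100, cache_name="demo")
sampler.set_coalition(torch.rand(50, 10))        # coalition: the training data (50 points)
sampler.set_evaluator(value_func)
marg_contrib = sampler.compute_marginal_contribution()   # (50, 50) array, per point x cardinality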

abstract compute_marginal_contribution(*args, **kwargs) → ndarray#

Given args and kwargs for the value func, computes marginal contribution.

Returns#

np.ndarray

Marginal contribution array per data point for each coalition size. Dim 0 is the index of the added data point, Dim 1 is the cardinality when the data point is added.

abstract set_coalition(coalition: Tensor) → Self#

Given the coalition, initializes data structures to compute marginal contrib.

Parameters#

coalition : torch.Tensor

Coalition of data to compute the marginal contribution of each data point.

set_evaluator(value_func: Callable[[list[int], ...], float])#

Sets the evaluator function to evaluate the utility of a coalition

Parameters#

value_func : Callable[[list[int], ...], float]

This function sets the utility function, which computes the utility for a given coalition of indices.

The following is an example of how the API would be used within a DataEvaluator:

self.sampler.set_evaluator(self._evaluate_model)

class opendataval.dataval.margcontrib.sampler.TMCSampler(*args, **kwargs)#

Bases: Sampler

TMCShapley sampler for semivalue-based methods of computing data values.

Evaluators that share marginal contributions should share a sampler.

Parameters#

mc_epochs : int, optional

Number of outer epochs of MCMC sampling, by default 1000

min_cardinality : int, optional

Minimum cardinality of a training set, must be passed as kwarg, by default 5

cache_name : str, optional

Unique cache_name of the model used to cache marginal contributions; set to None to disable caching. By default “”, which is replaced with a unique value per object.

random_state : RandomState, optional

Random initial state, by default None

CACHE: ClassVar[dict[str, ndarray]] = {}#

Cached marginal contributions.

compute_marginal_contribution(*args, **kwargs)#

Computes marginal contribution through TMC Shapley.

Uses TMC-Shapley sampling to find the marginal contribution of each data point, takes self.mc_epochs number of samples.
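
The truncation idea behind TMC-Shapley, sketched for a single permutation pass (names and the tolerance are illustrative, not taken from this module): once the running utility is within a tolerance of the full-coalition utility, the remaining marginal contributions are treated as negligible and model fitting stops early.

import numpy as np

def truncated_permutation_pass(utility, perm, full_score, tol=0.01):
    # perm: a permutation of the training indices; marginals[i] is point i's
    # marginal contribution in this pass (left at zero once truncated).
    marginals = np.zeros(len(perm))
    coalition, prev = [], utility([])
    for i in perm:
        if abs(full_score - prev) <= tol:
            break                       # truncate: remaining points add roughly nothing
        coalition.append(i)
        curr = utility(coalition)
        marginals[i] = curr - prev
        prev = curr
    return marginals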

set_coalition(coalition: Tensor)#

Initializes storage to find marginal contribution of each data point

opendataval.dataval.margcontrib.shap module#

class opendataval.dataval.margcontrib.shap.ShapEvaluator(*args, **kwargs)#

Bases: DataEvaluator, ModelMixin, ABC

Abstract class for all semivalue-based methods of computing data values.

Attributes#

sampler : Sampler, optional

Sampler used to compute the marginal contributions; by default TMC-Shapley sampling with a Gelman-Rubin statistic terminator (GrTMCSampler). Available samplers are found in the sampler module.

Parameters#

sampler : Sampler, optional

Sampler used to compute the marginal contributions; available samplers are in opendataval/dataval/margcontrib/sampler.py. By default GrTMCSampler, with additional arguments passed to the sampler's constructor.

gr_threshold : float, optional

Convergence threshold for the Gelman-Rubin statistic. Computing exact Shapley values is NP-hard, so we resort to MCMC sampling. By default 1.05

max_mc_epochs : int, optional

Max number of outer epochs of MCMC sampling, by default 100

models_per_epoch : int, optional

Number of model fittings to take per epoch prior to checking GR convergence, by default 100

min_models : int, optional

Minimum number of samples before checking MCMC convergence, by default 1000

min_cardinality : int, optional

Minimum cardinality of a training set, must be passed as kwarg, by default 5

cache_name : str, optional

Unique cache_name of the model used to cache marginal contributions; set to None to disable caching. By default “”, which is replaced with a unique value per object.

random_state : RandomState, optional

Random initial state, by default None

abstract compute_weight() → ndarray#

Compute the weights for each cardinality of training set.

evaluate_data_values() → ndarray#

Return data values for each training data point.

Multiplies the marginal contributions by their respective weights to obtain data values for semivalue-based estimators.

Returns#

np.ndarray

Predicted data values/selection for every input data point
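
In matrix form this is just a weighted sum over cardinalities. A sketch assuming the (num_points, num_points) marginal-contribution layout described under Sampler, with weights as returned by a concrete compute_weight:

import numpy as np

num_points = 100
marg_contrib = np.random.rand(num_points, num_points)   # [i, j]: contribution of point i at cardinality j + 1
weights = np.full(num_points, 1 / num_points)            # e.g. Data Shapley's uniform cardinality weights
data_values = marg_contrib @ weights                     # one semivalue per training point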

input_data(x_train: Tensor, y_train: Tensor, x_valid: Tensor, y_valid: Tensor)#

Store and transform input data for semi-value samplers.

Parameters#

x_train : torch.Tensor

Data covariates

y_train : torch.Tensor

Data labels

x_valid : torch.Tensor

Test+Held-out covariates

y_valid : torch.Tensor

Test+Held-out labels

train_data_values(*args, **kwargs)#

Uses the sampler to train models, finding the marginal contributions and data values.

Module contents#