Compute data values using the Data Banzhaf data valuator. For each data
point, takes the difference between the average performance of all subsets
that include the point and the average performance of all subsets that
exclude it.
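To make the definition concrete, the following is a minimal brute-force
sketch; "utility" is a hypothetical stand-in for the performance of a model
trained on a subset and is not part of opendataval's API::

    from itertools import combinations

    def banzhaf_values(num_points, utility):
        # Average marginal contribution of each point over all subsets of the rest.
        values = []
        for i in range(num_points):
            rest = [p for p in range(num_points) if p != i]
            margins = [
                utility(set(subset) | {i}) - utility(set(subset))
                for size in range(len(rest) + 1)
                for subset in combinations(rest, size)
            ]
            values.append(sum(margins) / len(margins))
        return values

    # Toy utility: every point adds exactly 1 to performance.
    print(banzhaf_values(3, len))  # [1.0, 1.0, 1.0]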
Data Banzhaf implementation using the marginal contributions.
Data Banzhaf implementation built on the ShapEvaluator, which already computes
the marginal contributions for other evaluators. Run from scratch, this
approach may be less efficient than the previous one, but it is recommended
to minimize compute time when a previous computation can be reused from the
cache, as shown in the sketch after the parameter list below.
gr_threshold : float, optional
Convergence threshold for the Gelman-Rubin statistic. Computing exact
Shapley values is NP-hard, so we resort to MCMC sampling, by default 1.05
max_mc_epochs : int, optional
Max number of outer epochs of MCMC sampling, by default 100
models_per_epoch : int, optional
Number of model fittings to take per epoch prior to checking GR convergence,
by default 100
min_models : int, optional
Minimum samples before checking MCMC convergence, by default 1000
min_cardinality : int, optional
Minimum cardinality of a training set, must be passed as kwarg, by default 5
cache_name : str, optional
Unique cache_name of the model used to cache marginal contributions; set to
None to disable caching, by default "" which is resolved to a unique value
per object
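A hypothetical usage sketch of the caching behavior described above; exact
import paths and constructor signatures may differ across opendataval
versions::

    from opendataval.dataval import DataBanzhafMargContrib, DataShapley

    # Both evaluators share a cache_name, so the marginal contributions are
    # trained once by the first evaluator and reused by the second.
    shapley = DataShapley(gr_threshold=1.05, cache_name="shared_run")
    banzhaf = DataBanzhafMargContrib(cache_name="shared_run")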
Compute the marginal contributions for semivalue-based data evaluators.
Computes the marginal contributions by sampling and checks MCMC convergence
every 100 iterations using the Gelman-Rubin statistic.
NOTE: if the marginal contributions have not been calculated yet, they are
looked up in a cache of already trained ShapEvaluators; only on a cache miss
are they computed from scratch.
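A minimal sketch of such a Gelman-Rubin check, assuming the sampled model
evaluations are split into equal-length chains; this mirrors the convergence
test described above rather than opendataval's exact implementation::

    import numpy as np

    def gelman_rubin(samples, num_chains=10):
        # samples: (num_models, num_points); returns the max R-hat over points.
        n = samples.shape[0] // num_chains
        chains = samples[: n * num_chains].reshape(num_chains, n, -1)
        within = chains.var(axis=1, ddof=1).mean(axis=0)       # W: within-chain
        between = n * chains.mean(axis=1).var(axis=0, ddof=1)  # B: between-chain
        return float(np.sqrt(((n - 1) / n * within + between / n) / within).max())

    rng = np.random.default_rng(0)
    # Sampling stops once the statistic drops below gr_threshold (1.05 above).
    print(gelman_rubin(rng.normal(size=(1000, 5))))  # ~1.0 for i.i.d. draws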
Monte Carlo sampler for semivalue-based methods of computing data values.
Evaluators that share marginal contributions should share a sampler. The
sampler draws mc_epochs random permutations and computes each point's
marginal contribution from prefixes of those permutations. This is the
simplest implementation but the least practical; a sketch follows the
parameter list below.
mc_epochs : int, optional
Number of outer epochs of MCMC sampling, by default 1000
min_cardinality : int, optional
Minimum cardinality of a training set, must be passed as kwarg, by default 5
cache_name : str, optional
Unique cache_name of the model used to cache marginal contributions; set to
None to disable caching, by default "" which is resolved to a unique value
per object
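A minimal sketch of the permutation-based sampling described above; "utility"
is again a hypothetical stand-in for training and scoring a model on a
coalition::

    import numpy as np

    def monte_carlo_marginals(num_points, utility, mc_epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        total = np.zeros((num_points, num_points))  # dim 0: point, dim 1: cardinality
        count = np.zeros((num_points, num_points))
        for _ in range(mc_epochs):
            perm = rng.permutation(num_points)
            coalition, prev = set(), utility(set())
            for cardinality, idx in enumerate(perm):
                coalition.add(idx)
                curr = utility(coalition)
                total[idx, cardinality] += curr - prev
                count[idx, cardinality] += 1
                prev = curr
        return total / np.maximum(count, 1)  # average marginal contributions

    marginals = monte_carlo_marginals(4, lambda s: len(s) ** 0.5, mc_epochs=50)
    print(marginals.shape)  # (4, 4): the array layout described below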
Abstract Sampler class for marginal-contribution-based data evaluators.
Many marginal-contribution-based data evaluators depend on a sampling method
because computing marginal contributions exactly is typically very expensive.
The Sampler class provides a blueprint of the required methods, and the
following samplers provide ways of caching computed marginal contributions
when given a "cache_name".
Marginal contribution array per data point for each coalition size. Dim 0 is
the index of the added data point; Dim 1 is the cardinality at which the data
point is added.
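A hypothetical sketch of that blueprint and of how an evaluator reduces the
marginal contribution array to data values; the class and method names are
illustrative, not opendataval's exact interface::

    from abc import ABC, abstractmethod
    import numpy as np

    class MargContribSampler(ABC):  # hypothetical name for the blueprint
        @abstractmethod
        def compute_marginal_contribution(self, *args, **kwargs) -> np.ndarray:
            # Returns an array of shape (num_points, num_points); entry [i, j]
            # is the average marginal contribution of point i at cardinality j.
            ...

    def shapley_from_marginals(marginals: np.ndarray) -> np.ndarray:
        # The Shapley value weights every cardinality equally, so evaluators
        # can share one cached array and differ only in how they weight it.
        return marginals.mean(axis=1)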
mc_epochs : int, optional
Number of outer epochs of MCMC sampling, by default 1000
min_cardinality : int, optional
Minimum cardinality of a training set, must be passed as kwarg, by default 5
cache_name : str, optional
Unique cache_name of the model used to cache marginal contributions; set to
None to disable caching, by default "" which is resolved to a unique value
per object
sampler : Sampler, optional
Sampler used to compute the marginal contributions, found in
opendataval/margcontrib/sampler.py; by default GrTMCSampler, with additional
keyword arguments passed through to the sampler's constructor. A usage
sketch follows the parameter list below.
gr_threshold : float, optional
Convergence threshold for the Gelman-Rubin statistic. Computing exact
Shapley values is NP-hard, so we resort to MCMC sampling, by default 1.05
max_mc_epochs : int, optional
Max number of outer epochs of MCMC sampling, by default 100
models_per_epoch : int, optional
Number of model fittings to take per epoch prior to checking GR convergence,
by default 100
min_models : int, optional
Minimum samples before checking MCMC convergence, by default 1000
min_cardinality : int, optional
Minimum cardinality of a training set, must be passed as kwarg, by default 5
cache_name : str, optional
Unique cache_name of the model used to cache marginal contributions; set to
None to disable caching, by default "" which is resolved to a unique value
per object
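A hypothetical end-to-end sketch of configuring a sampler and handing it to
an evaluator; exact signatures may differ across opendataval versions::

    from opendataval.dataval import DataShapley
    from opendataval.margcontrib.sampler import GrTMCSampler

    # Tighter convergence threshold and a shared cache key; the same sampler
    # instance can be shared by several semivalue evaluators.
    sampler = GrTMCSampler(gr_threshold=1.02, max_mc_epochs=100, cache_name="run1")
    shapley = DataShapley(sampler=sampler)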