opendataval.dataval.lava package#

Submodules#

opendataval.dataval.lava.lava module#

class opendataval.dataval.lava.lava.LavaEvaluator(*args, **kwargs)#

Bases: DataEvaluator, ModelLessMixin

Data valuation using the LAVA implementation.

Parameters#

device : torch.device, optional

Tensor device for acceleration, by default torch.device("cpu")

random_state : RandomState, optional

Random initial state, by default None

Mixins#

ModelLessMixin

Mixin for a data evaluator that doesn’t require a model or evaluation metric.

evaluate_data_values() → ndarray#

Return data values for each training data point.

Gets the calibrated gradient of the dual solution, which can be interpreted as the data values.

Returns#

np.ndarray

Predicted data values/selection for each training input data point

train_data_values(*args, **kwargs)#

Trains the model to predict data values.

Computes the class-wise Wasserstein distance between the training and the validation set.
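A minimal usage sketch, assuming the standard DataEvaluator flow of input_data, then train_data_values, then evaluate_data_values; the tensors below are hypothetical stand-ins for a real train/validation split:

```python
import torch
from opendataval.dataval.lava.lava import LavaEvaluator

# Hypothetical data: 100 train / 50 validation points, 3 classes (one-hot).
x_train = torch.randn(100, 10)
y_train = torch.eye(3)[torch.randint(0, 3, (100,))]
x_valid = torch.randn(50, 10)
y_valid = torch.eye(3)[torch.randint(0, 3, (50,))]

evaluator = LavaEvaluator(device=torch.device("cpu"))
evaluator.input_data(x_train, y_train, x_valid, y_valid)
evaluator.train_data_values()                   # class-wise Wasserstein distance
data_values = evaluator.evaluate_data_values()  # one value per training point
```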

opendataval.dataval.lava.lava.macos_fix()#

The geomloss package has a bug on macOS, remedied as follows.

Link to a similar bug: https://github.com/NVlabs/stylegan3/issues/75
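A sketch of the kind of workaround the linked issue settles on (allowing duplicate OpenMP runtimes via an environment variable); whether macos_fix uses exactly this mechanism is an assumption:

```python
import os
import platform

# Assumed workaround, mirroring the linked stylegan3 issue: permit
# duplicate OpenMP runtimes on macOS instead of crashing.
if platform.system() == "Darwin":
    os.environ["KMP_DUPLICATE_LIB_OK"] = "True"
```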

opendataval.dataval.lava.otdd module#

Main module for computing the exact Wasserstein distance between two datasets.

OTDD Repository.

Legacy notation:

- X1, X2: feature tensors of the two datasets
- Y1, Y2: label tensors of the two datasets
- N1, N2 (or N, M): number of samples in the datasets
- D1, D2: (feature) dimensions of the datasets
- C1, C2: number of classes in the datasets

class opendataval.dataval.lava.otdd.DatasetDistance(x_train: Tensor, y_train: Tensor, x_valid: Tensor, y_valid: Tensor, feature_cost: Literal['euclidean'] | Callable[[...], Tensor] = 'euclidean', p: int = 2, entreg: float = 0.1, lam_x: float = 1.0, lam_y: float = 1.0, inner_ot_loss: str = 'sinkhorn', inner_ot_debiased: bool = False, inner_ot_p: int = 2, inner_ot_entreg: float = 0.1, device: device = device(type='cpu'))#

Bases: object

The main class for the Optimal Transport Dataset Distance.

An object of this class is instantiated with two datasets (the source and target), which it stores, along with various arguments determining how the exact Wasserstein distance is to be computed.

Parameters#

x_train : torch.Tensor

Covariates of the first distribution

y_train : torch.Tensor

Labels of the first distribution

x_valid : torch.Tensor

Covariates of the second/validation distribution

y_valid : torch.Tensor

Labels of the second/validation distribution

feature_cost : Literal["euclidean"] | Callable, optional

If not "euclidean", must be a callable that implements a cost function between feature vectors, by default "euclidean"

p : int, optional

The coefficient in the OT cost (i.e., the p in p-Wasserstein), by default 2

entreg : float, optional

The strength of entropy regularization for sinkhorn, by default 0.1

lam_x : float, optional

Weight parameter for the feature component of the distance, by default 1.0

lam_y : float, optional

Weight parameter for the label component of the distance, by default 1.0

inner_ot_loss : str, optional

Loss type for the inner exact OT problem, by default "sinkhorn"

inner_ot_debiased : bool, optional

Whether to use the debiased version of sinkhorn in the inner OT problem, by default False

inner_ot_p : int, optional

The coefficient in the inner OT cost, by default 2

inner_ot_entreg : float, optional

The strength of entropy regularization for sinkhorn in the inner OT problem, by default 0.1

device : torch.device, optional

Tensor device for acceleration, by default torch.device("cpu")

dual_sol() → tuple[float, Tensor]#

Compute dataset distance.

Note:

Currently requires fully loading the dataset into memory; this could probably be avoided, e.g., via subsampling.

Returns#

tuple[float, torch.Tensor]

dist (float): the optimal transport dataset distance value.
pi (tensor, optional): the optimal transport coupling.
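A minimal sketch of the intended call pattern, with random tensors standing in for real datasets:

```python
import torch
from opendataval.dataval.lava.otdd import DatasetDistance

# Hypothetical source/target datasets with integer labels.
x_train, y_train = torch.randn(100, 10), torch.randint(0, 3, (100,))
x_valid, y_valid = torch.randn(50, 10), torch.randint(0, 3, (50,))

dist = DatasetDistance(
    x_train, y_train, x_valid, y_valid,
    p=2, entreg=0.1, lam_x=1.0, lam_y=1.0,
    device=torch.device("cpu"),
)
dual = dist.dual_sol()  # per the Returns section above: (distance, coupling)
```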

class opendataval.dataval.lava.otdd.FeatureCost(src_embedding=None, tgt_embedding=None, src_dim=None, tgt_dim=None, p=2, device='cpu')#

Bases: object

Class implementing a cost (or distance) between feature vectors.

Arguments:

p (int): the coefficient in the OT cost (i.e., the p in p-Wasserstein).

src_embedding (callable, optional): if provided, source data will be embedded using this function prior to distance computation.

tgt_embedding (callable, optional): if provided, target data will be embedded using this function prior to distance computation.
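A minimal sketch of constructing a FeatureCost with the same embedding on both sides; the torch.nn.Linear embedder is a stand-in, and passing input shapes for src_dim/tgt_dim is an assumption:

```python
import torch
from opendataval.dataval.lava.otdd import FeatureCost

# Hypothetical embedding: a fixed linear map into a 4-dimensional space.
embedder = torch.nn.Linear(10, 4)

feature_cost = FeatureCost(
    src_embedding=embedder, tgt_embedding=embedder,
    src_dim=(10,), tgt_dim=(10,),  # assumed: input shapes of src/tgt data
    p=2, device="cpu",
)
```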

opendataval.dataval.lava.otdd.batch_augmented_cost(Z1: Tensor, Z2: Tensor, W: Tensor | None = None, feature_cost: str | None = None, p: int = 2, lam_x: float = 1.0, lam_y: float = 1.0)#

Batch ground cost computation on augmented datasets.

Parameters#

Z1 : torch.Tensor

Tensor of size (B, N, D1), where the last position in the last dim corresponds to the label Y.

Z2 : torch.Tensor

Tensor of size (B, M, D2), where the last position in the last dim corresponds to the label Y.

W : torch.Tensor, optional

Tensor of size (V1, V2) of precomputed pairwise label distances for all labels V1, V2, used to return a batched cost matrix as a (B, N, M) tensor. W is expected to be congruent with p; i.e., if p=2, W[i, j] should be the squared Wasserstein distance, by default None

feature_cost : str, optional

If None or "euclidean", uses euclidean distances as the feature metric, otherwise uses this function as the metric, by default None

p : int, optional

Power of the cost (i.e., order of the p-Wasserstein distance), by default 2

lam_x : float, optional

Weight parameter for the feature component of the distance, by default 1.0

lam_y : float, optional

Weight parameter for the label component of the distance, by default 1.0

Returns#

torch.Tensor

Batched cost matrix of size (B, N, M)

Raises#

ValueError

If W is not provided
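A sketch of the documented contract, with random tensors; shapes follow the parameter descriptions above (labels appended as the last feature position):

```python
import torch
from opendataval.dataval.lava.otdd import batch_augmented_cost

B, N, M, D, V = 1, 5, 7, 3, 4  # batch, sample counts, feature dim, labels

# Append the integer label as the last feature position, per the docs.
Z1 = torch.cat([torch.randn(B, N, D), torch.randint(0, V, (B, N, 1)).float()], dim=-1)
Z2 = torch.cat([torch.randn(B, M, D), torch.randint(0, V, (B, M, 1)).float()], dim=-1)

# Precomputed pairwise label distances, congruent with p=2 (i.e., squared).
W = torch.rand(V, V)

cost = batch_augmented_cost(Z1, Z2, W=W, p=2, lam_x=1.0, lam_y=1.0)
print(cost.shape)  # expected: (B, N, M)
```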

opendataval.dataval.lava.otdd.extract_dataset(x_input: Tensor, y_input: Tensor, batch_size: int = 256, reindex_start: int = 0) → tuple[Tensor, Tensor]#

Loads full dataset into memory and reindexes the labels.

Parameters#

x_input : Dataset | torch.Tensor

Covariate Dataset/tensor to be loaded

y_input : Dataset | torch.Tensor

Label Dataset/tensor to be loaded

batch_size : int, optional

Batch size of data to be loaded at a time, by default 256

reindex_start : int, optional

How much to offset the labels by; useful when comparing different datasets, so that the two datasets do not share label values, by default 0

Returns#

tuple[torch.Tensor, torch.Tensor]

x_tensor : Covariates stacked along the first dimension.
y_tensor : Labels, no longer one-hot encoded and offset by reindex_start.
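A sketch of the intended use of reindex_start when preparing two datasets for comparison; the input tensors are hypothetical:

```python
import torch
from opendataval.dataval.lava.otdd import extract_dataset

# Hypothetical inputs with 3 classes each.
x_a, y_a = torch.randn(100, 10), torch.randint(0, 3, (100,))
x_b, y_b = torch.randn(50, 10), torch.randint(0, 3, (50,))

xa_t, ya_t = extract_dataset(x_a, y_a, batch_size=256, reindex_start=0)
# Offset the second dataset's labels past the first dataset's 3 classes,
# so the two datasets occupy disjoint label ranges.
xb_t, yb_t = extract_dataset(x_b, y_b, batch_size=256, reindex_start=3)
```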

opendataval.dataval.lava.otdd.pwdist_exact(X1: Tensor, Y1: Tensor, X2: Tensor | None = None, Y2: Tensor | None = None, symmetric: bool = False, loss: str = 'sinkhorn', cost_function: Literal['euclidean'] | Callable[[...], Tensor] = 'euclidean', p: int = 2, debias: bool = True, entreg: float = 0.1, device: device = device(type='cpu'))#

Computation of pairwise Wasserstein distances.

Efficient computation of pairwise label-to-label Wasserstein distances between multiple distributions, without using Gaussian assumption.

Parameters#

X1 : torch.Tensor

Covariates of the first distribution

Y1 : torch.Tensor

Labels of the first distribution

X2 : torch.Tensor, optional

Covariates of the second distribution; if None, the distributions are treated as the same, by default None

Y2 : torch.Tensor, optional

Labels of the second distribution; if None, the distributions are treated as the same, by default None

symmetric : bool, optional

Whether X1/Y1 and X2/Y2 are to be treated as the same dataset, by default False

loss : str, optional

The loss function to compute. Sinkhorn divergence interpolates between Wasserstein (blur=0) and kernel (blur=\(+\infty\)) distances, by default "sinkhorn"

cost_function : Literal["euclidean"] | Callable[..., torch.Tensor], optional

Cost function that should be used instead of \(\tfrac{1}{p}\|x-y\|^p\), by default "euclidean"

p : int, optional

Power of the cost (i.e., order of the p-Wasserstein distance), by default 2

debias : bool, optional

If True, uses the debiased sinkhorn divergence, by default True

entreg : float, optional

The strength of entropy regularization for sinkhorn, by default 0.1

device : torch.device, optional

Tensor device for acceleration, by default torch.device("cpu")

Returns#

torch.Tensor

Computed pairwise label-to-label Wasserstein distances
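A sketch computing the pairwise label-to-label distances within a single (hypothetical) dataset, using symmetric mode:

```python
import torch
from opendataval.dataval.lava.otdd import pwdist_exact

# Hypothetical dataset: 200 points, 4 classes.
X = torch.randn(200, 5)
Y = torch.randint(0, 4, (200,))

# symmetric=True treats X1/Y1 and X2/Y2 as the same dataset, so only
# label pairs within the one dataset are compared.
W = pwdist_exact(X, Y, symmetric=True, loss="sinkhorn",
                 p=2, debias=True, entreg=0.1,
                 device=torch.device("cpu"))
```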

Module contents#