opendataval.dataval.lava package#
Submodules#
opendataval.dataval.lava.lava module#
- class opendataval.dataval.lava.lava.LavaEvaluator(*args, **kwargs)#
Bases: DataEvaluator, ModelLessMixin
Data valuation using LAVA implementation.
References#
[1] H. A. Just, F. Kang, T. Wang, Y. Zeng, M. Ko, M. Jin, and R. Jia, LAVA: Data Valuation without Pre-Specified Learning Algorithms, 2023. Available: https://openreview.net/forum?id=JJuP86nBl4q.
Parameters#
- device : torch.device, optional
Tensor device for acceleration, by default torch.device("cpu")
- random_state : RandomState, optional
Random initial state, by default None
Mixins#
- ModelLessMixin
Mixin for a data evaluator that doesn’t require a model or evaluation metric.
- evaluate_data_values() ndarray #
Return data values for each training data point.
Gets the calibrated gradient of the dual solution, which can be interpreted as the data values.
Returns#
- np.ndarray
Predicted data values/selection for each training input data point
- train_data_values(*args, **kwargs)#
Trains the model to predict data values.
Computes the class-wise Wasserstein distance between the training and the validation set; a usage sketch follows the references below.
References#
[1] H. A. Just, F. Kang, T. Wang, Y. Zeng, M. Ko, M. Jin, and R. Jia, LAVA: Data Valuation without Pre-Specified Learning Algorithms, 2023. Available: https://openreview.net/forum?id=JJuP86nBl4q.
[2] D. Alvarez-Melis and N. Fusi, Geometric Dataset Distances via Optimal Transport, arXiv.org, 2020. Available: https://arxiv.org/abs/2002.02923.
[3] D. Alvarez-Melis and N. Fusi, Dataset Dynamics via Gradient Flows in Probability Space, arXiv.org, 2020. Available: https://arxiv.org/abs/2010.12760.
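A minimal usage sketch, assuming hypothetical random toy tensors in place of a real dataset and the input_data method inherited from the DataEvaluator base class:

```python
import torch
from opendataval.dataval.lava.lava import LavaEvaluator

# Hypothetical toy data: 100 train / 50 validation points, 3 one-hot classes.
x_train = torch.rand(100, 10)
y_train = torch.nn.functional.one_hot(torch.randint(0, 3, (100,)), 3).float()
x_valid = torch.rand(50, 10)
y_valid = torch.nn.functional.one_hot(torch.randint(0, 3, (50,)), 3).float()

evaluator = LavaEvaluator(device=torch.device("cpu"))
evaluator.input_data(x_train, y_train, x_valid, y_valid)  # from DataEvaluator
evaluator.train_data_values()  # computes the class-wise Wasserstein distance
values = evaluator.evaluate_data_values()  # np.ndarray, one value per train point
```

Because of ModelLessMixin, no prediction model or evaluation metric is needed between these calls.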
- opendataval.dataval.lava.lava.macos_fix()#
The geomloss package has a bug on macOS, remedied as follows.
Link to a similar bug: https://github.com/NVlabs/stylegan3/issues/75.
opendataval.dataval.lava.otdd module#
Main module for computing the exact Wasserstein distance between two datasets.
References#
[1] D. Alvarez-Melis and N. Fusi, Geometric Dataset Distances via Optimal Transport, arXiv.org, 2020. Available: https://arxiv.org/abs/2002.02923.
[2] D. Alvarez-Melis and N. Fusi, Dataset Dynamics via Gradient Flows in Probability Space, arXiv.org, 2020. Available: https://arxiv.org/abs/2010.12760.
[3] OTDD repo. The following implementation was taken from this repository. It is intended as a strict subset of the options provided in the repository, computing only the class-wise Wasserstein distance as needed by the LAVA paper by H. A. Just et al.
- Legacy notation:
X1, X2: feature tensors of the two datasets
Y1, Y2: label tensors of the two datasets
N1, N2 (or N, M): number of samples in the datasets
D1, D2: (feature) dimension of the datasets
C1, C2: number of classes in the datasets
- class opendataval.dataval.lava.otdd.DatasetDistance(x_train: Tensor, y_train: Tensor, x_valid: Tensor, y_valid: Tensor, feature_cost: Literal['euclidean'] | Callable[[...], Tensor] = 'euclidean', p: int = 2, entreg: float = 0.1, lam_x: float = 1.0, lam_y: float = 1.0, inner_ot_loss: str = 'sinkhorn', inner_ot_debiased: bool = False, inner_ot_p: int = 2, inner_ot_entreg: float = 0.1, device: device = device(type='cpu'))#
Bases: object
The main class for the Optimal Transport Dataset Distance.
An object of this class is instantiated with two datasets (the source and target), which it stores, along with various arguments determining how the exact Wasserstein distance is computed.
Parameters#
- x_train : torch.Tensor
Covariates of the first distribution
- y_train : torch.Tensor
Labels of the first distribution
- x_valid : torch.Tensor
Covariates of the second/validation distribution
- y_valid : torch.Tensor
Labels of the second/validation distribution
- feature_cost : Literal["euclidean"] | Callable, optional
If not "euclidean", must be a callable that implements a cost function between feature vectors, by default "euclidean"
- p : int, optional
The coefficient in the OT cost (i.e., the p in p-Wasserstein), by default 2
- entreg : float, optional
The strength of entropy regularization for Sinkhorn, by default 0.1
- lam_x : float, optional
Weight parameter for the feature component of the distance, by default 1.0
- lam_y : float, optional
Weight parameter for the label component of the distance, by default 1.0
- inner_ot_loss : str, optional
Loss type for the inner exact OT problem, by default "sinkhorn"
- inner_ot_debiased : bool, optional
Whether to use the debiased version of Sinkhorn in the inner OT problem, by default False
- inner_ot_p : int, optional
The coefficient in the inner OT cost, by default 2
- inner_ot_entreg : float, optional
The strength of entropy regularization for Sinkhorn in the inner OT problem, by default 0.1
- device : torch.device, optional
Tensor device for acceleration, by default torch.device("cpu")
- dual_sol() tuple[float, Tensor] #
Compute dataset distance.
- Note:
Currently requires fully loading the dataset into memory; this could probably be avoided, e.g., via subsampling.
Returns#
- tuple[float, torch.Tensor]
dist (float): the optimal transport dataset distance value.
pi (Tensor): the optimal transport coupling.
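A minimal sketch of computing the distance and coupling, using hypothetical toy tensors and assuming one-hot labels (which extract_dataset below converts to integer indices):

```python
import torch
from opendataval.dataval.lava.otdd import DatasetDistance

# Hypothetical toy datasets with 4 one-hot classes.
x_train = torch.rand(64, 8)
y_train = torch.nn.functional.one_hot(torch.randint(0, 4, (64,)), 4).float()
x_valid = torch.rand(32, 8)
y_valid = torch.nn.functional.one_hot(torch.randint(0, 4, (32,)), 4).float()

dist = DatasetDistance(x_train, y_train, x_valid, y_valid,
                       lam_x=1.0, lam_y=1.0, device=torch.device("cpu"))
value, pi = dist.dual_sol()  # OT dataset distance and optimal coupling
```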
- class opendataval.dataval.lava.otdd.FeatureCost(src_embedding=None, tgt_embedding=None, src_dim=None, tgt_dim=None, p=2, device='cpu')#
Bases: object
Class implementing a cost (or distance) between feature vectors.
- Arguments:
p (int): the coefficient in the OT cost (i.e., the p in p-Wasserstein).
src_embedding (callable, optional): if provided, source data will be embedded using this function prior to distance computation.
tgt_embedding (callable, optional): if provided, target data will be embedded using this function prior to distance computation.
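A minimal sketch, assuming (as in the upstream OTDD repository) that a FeatureCost instance can be passed as the feature_cost callable of DatasetDistance, and using torch.nn.Flatten as a hypothetical stand-in for a learned embedding:

```python
import torch
from opendataval.dataval.lava.otdd import FeatureCost

embedder = torch.nn.Flatten()  # hypothetical embedding; a trained CNN is typical
cost = FeatureCost(src_embedding=embedder, src_dim=(3, 8, 8),
                   tgt_embedding=embedder, tgt_dim=(3, 8, 8),
                   p=2, device="cpu")
# `cost` can then be supplied as feature_cost=cost to DatasetDistance.
```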
- opendataval.dataval.lava.otdd.batch_augmented_cost(Z1: Tensor, Z2: Tensor, W: Tensor | None = None, feature_cost: str | None = None, p: int = 2, lam_x: float = 1.0, lam_y: float = 1.0)#
Batch ground cost computation on augmented datasets.
Parameters#
- Z1 : torch.Tensor
Tensor of size (B, N, D1), where the last position in the last dim corresponds to the label Y.
- Z2 : torch.Tensor
Tensor of size (B, M, D2), where the last position in the last dim corresponds to the label Y.
- W : torch.Tensor, optional
Tensor of size (V1, V2) of precomputed pairwise label distances for all labels V1, V2, yielding a batched cost matrix as a (B, N, M) tensor. W is expected to be congruent with p; i.e., if p=2, W[i, j] should be the squared Wasserstein distance, by default None
- feature_cost : str, optional
If None or "euclidean", uses euclidean distances as the feature metric, otherwise uses this function as the metric, by default None
- p : int, optional
Power of the cost (i.e., order of the p-Wasserstein distance), by default 2
- lam_x : float, optional
Weight parameter for the feature component of the distance, by default 1.0
- lam_y : float, optional
Weight parameter for the label component of the distance, by default 1.0
Returns#
- torch.Tensor
Batched cost matrix of size (B, N, M)
Raises#
- ValueError
If W is not provided
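A minimal sketch with hypothetical inputs, using a random stand-in for the precomputed label-distance matrix W (in practice W would hold squared label-to-label Wasserstein distances, congruent with p=2):

```python
import torch
from opendataval.dataval.lava.otdd import batch_augmented_cost

B, N, M, D = 1, 5, 7, 3
# Append an integer label index as the last coordinate of each feature vector.
Z1 = torch.cat([torch.rand(B, N, D), torch.randint(0, 2, (B, N, 1)).float()], dim=-1)
Z2 = torch.cat([torch.rand(B, M, D), torch.randint(0, 2, (B, M, 1)).float()], dim=-1)

W = torch.rand(2, 2)  # hypothetical label-distance matrix for the 2 labels
C = batch_augmented_cost(Z1, Z2, W=W, p=2, lam_x=1.0, lam_y=1.0)  # (B, N, M)
```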
- opendataval.dataval.lava.otdd.extract_dataset(x_input: Tensor, y_input: Tensor, batch_size: int = 256, reindex_start: int = 0) tuple[Tensor, Tensor] #
Loads the full dataset into memory and reindexes the labels.
Parameters#
- x_input : Dataset | torch.Tensor
Covariate Dataset/tensor to be loaded
- y_input : Dataset | torch.Tensor
Label Dataset/tensor to be loaded
- batch_size : int, optional
Batch size of data to be loaded at a time, by default 256
- reindex_start : int, optional
How much to offset the labels by, useful when comparing different datasets so that their label sets do not overlap, by default 0
Returns#
- tuple[torch.Tensor, torch.Tensor]
x_tensor: covariates stacked along the first dimension.
y_tensor: labels, no longer one-hot encoded and offset by reindex_start.
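A minimal sketch with hypothetical one-hot labels, offsetting the labels so they would not collide with those of another dataset:

```python
import torch
from opendataval.dataval.lava.otdd import extract_dataset

x = torch.rand(10, 4)
y = torch.nn.functional.one_hot(torch.randint(0, 3, (10,)), 3).float()

x_tensor, y_tensor = extract_dataset(x, y, batch_size=256, reindex_start=3)
# y_tensor now holds integer class indices in [3, 6) instead of one-hot rows.
```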
- opendataval.dataval.lava.otdd.pwdist_exact(X1: Tensor, Y1: Tensor, X2: Tensor | None = None, Y2: Tensor | None = None, symmetric: bool = False, loss: str = 'sinkhorn', cost_function: Literal['euclidean'] | Callable[[...], Tensor] = 'euclidean', p: int = 2, debias: bool = True, entreg: float = 0.1, device: device = device(type='cpu'))#
Computation of pairwise Wasserstein distances.
Efficient computation of pairwise label-to-label Wasserstein distances between multiple distributions, without using Gaussian assumption.
Parameters#
- X1 : torch.Tensor
Covariates of the first distribution
- Y1 : torch.Tensor
Labels of the first distribution
- X2 : torch.Tensor, optional
Covariates of the second distribution, if None the distributions are treated as the same, by default None
- Y2 : torch.Tensor, optional
Labels of the second distribution, if None the distributions are treated as the same, by default None
- symmetric : bool, optional
Whether X1/Y1 and X2/Y2 are to be treated as the same dataset, by default False
- loss : str, optional
The loss function to compute. The Sinkhorn divergence interpolates between Wasserstein (blur=0) and kernel (blur= \(+\infty\)) distances, by default "sinkhorn"
- cost_function : Literal["euclidean"] | Callable[..., torch.Tensor], optional
Cost function that should be used instead of \(\tfrac{1}{p}\|x-y\|^p\), by default "euclidean"
- p : int, optional
Power of the cost (i.e., order of the p-Wasserstein distance), by default 2
- debias : bool, optional
If True, uses the debiased Sinkhorn divergence, by default True
- entreg : float, optional
The strength of entropy regularization for Sinkhorn, by default 0.1
- device : torch.device, optional
Tensor device for acceleration, by default torch.device("cpu")
Returns#
- torch.Tensor
Matrix of computed pairwise label-to-label Wasserstein distances
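A minimal sketch computing label-to-label distances between two hypothetical toy datasets with integer class labels:

```python
import torch
from opendataval.dataval.lava.otdd import pwdist_exact

X1, Y1 = torch.rand(60, 5), torch.randint(0, 3, (60,))  # 3 classes
X2, Y2 = torch.rand(40, 5), torch.randint(0, 2, (40,))  # 2 classes

# Entry [i, j] compares the points of class i in (X1, Y1) with the
# points of class j in (X2, Y2) via the Sinkhorn divergence.
D = pwdist_exact(X1, Y1, X2, Y2, loss="sinkhorn", p=2, entreg=0.1)
```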