opendataval.dataval.lava package#
Submodules#
opendataval.dataval.lava.lava module#
- class opendataval.dataval.lava.lava.LavaEvaluator(*args, **kwargs)#
Bases: DataEvaluator, ModelLessMixin
Data valuation using LAVA implementation.
References#
[1] H. A. Just, F. Kang, T. Wang, Y. Zeng, M. Ko, M. Jin, and R. Jia, LAVA: Data Valuation without Pre-Specified Learning Algorithms, 2023. Available: https://openreview.net/forum?id=JJuP86nBl4q.
Parameters#
- device : torch.device, optional
Tensor device for acceleration, by default torch.device("cpu")
- random_state : RandomState, optional
Random initial state, by default None
Mixins#
- ModelLessMixin
Mixin for a data evaluator that doesn’t require a model or evaluation metric.
- evaluate_data_values() ndarray #
Return data values for each training data point.
Gets the calibrated gradient of the dual solution, which can be interpreted as the data values.
Returns#
- np.ndarray
Predicted data values/selection for each training input data point
- train_data_values(*args, **kwargs)#
Trains the model to predict data values.
Computes the class-wise Wasserstein distance between the training and the validation set; a usage sketch follows the references below.
References#
[1] H. A. Just, F. Kang, T. Wang, Y. Zeng, M. Ko, M. Jin, and R. Jia, LAVA: Data Valuation without Pre-Specified Learning Algorithms, 2023. Available: https://openreview.net/forum?id=JJuP86nBl4q.
[2] D. Alvarez-Melis and N. Fusi, Geometric Dataset Distances via Optimal Transport, arXiv.org, 2020. Available: https://arxiv.org/abs/2002.02923.
[3] D. Alvarez-Melis and N. Fusi, Dataset Dynamics via Gradient Flows in Probability Space, arXiv.org, 2020. Available: https://arxiv.org/abs/2010.12760.
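A minimal usage sketch, assuming hypothetical random toy tensors in place of a real dataset and the input_data method inherited from the DataEvaluator base class:

```python
import torch
from opendataval.dataval.lava.lava import LavaEvaluator

# Hypothetical toy data: 100 train / 50 validation points, 3 one-hot classes.
x_train = torch.rand(100, 10)
y_train = torch.nn.functional.one_hot(torch.randint(0, 3, (100,)), 3).float()
x_valid = torch.rand(50, 10)
y_valid = torch.nn.functional.one_hot(torch.randint(0, 3, (50,)), 3).float()

evaluator = LavaEvaluator(device=torch.device("cpu"))
evaluator.input_data(x_train, y_train, x_valid, y_valid)  # from DataEvaluator
evaluator.train_data_values()  # computes the class-wise Wasserstein distance
values = evaluator.evaluate_data_values()  # np.ndarray, one value per train point
```

Because of ModelLessMixin, no prediction model or evaluation metric is needed between these calls.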
- opendataval.dataval.lava.lava.macos_fix()#
The geomloss package has a bug on macOS, remedied as follows.
Link to a similar bug: https://github.com/NVlabs/stylegan3/issues/75.
opendataval.dataval.lava.otdd module#
Main module for computing the exact Wasserstein distance between two datasets.
References#
[1] D. Alvarez-Melis and N. Fusi, Geometric Dataset Distances via Optimal Transport, arXiv.org, 2020. Available: https://arxiv.org/abs/2002.02923.
[2] D. Alvarez-Melis and N. Fusi, Dataset Dynamics via Gradient Flows in Probability Space, arXiv.org, 2020. Available: https://arxiv.org/abs/2010.12760.
[3] OTDD repo. The following implementation was taken from this repository. It is intended as a strict subset of the options provided in the repository, computing only the class-wise Wasserstein distance as needed by the LAVA paper by H. A. Just et al.
- Legacy notation:
X1, X2: feature tensors of the two datasets
Y1, Y2: label tensors of the two datasets
N1, N2 (or N, M): number of samples in the datasets
D1, D2: (feature) dimension of the datasets
C1, C2: number of classes in the datasets
- class opendataval.dataval.lava.otdd.DatasetDistance(x_train: Tensor, y_train: Tensor, x_valid: Tensor, y_valid: Tensor, feature_cost: Literal['euclidean'] | Callable[[...], Tensor] = 'euclidean', p: int = 2, entreg: float = 0.1, lam_x: float = 1.0, lam_y: float = 1.0, inner_ot_loss: str = 'sinkhorn', inner_ot_debiased: bool = False, inner_ot_p: int = 2, inner_ot_entreg: float = 0.1, device: device = device(type='cpu'))#
Bases: object
The main class for the Optimal Transport Dataset Distance.
An object of this class is instantiated with two datasets (the source and target), which it stores, along with various arguments determining how the exact Wasserstein distance is computed.
Parameters#
- x_train : torch.Tensor
Covariates of the first distribution
- y_train : torch.Tensor
Labels of the first distribution
- x_valid : torch.Tensor
Covariates of the second/validation distribution
- y_valid : torch.Tensor
Labels of the second/validation distribution
- feature_cost : Literal["euclidean"] | Callable, optional
If not "euclidean", must be a callable that implements a cost function between feature vectors, by default "euclidean"
- p : int, optional
The coefficient in the OT cost (i.e., the p in p-Wasserstein), by default 2
- entreg : float, optional
The strength of entropy regularization for Sinkhorn, by default 0.1
- lam_x : float, optional
Weight parameter for the feature component of the distance, by default 1.0
- lam_y : float, optional
Weight parameter for the label component of the distance, by default 1.0
- inner_ot_loss : str, optional
Loss type for the inner exact OT problem, by default "sinkhorn"
- inner_ot_debiased : bool, optional
Whether to use the debiased version of Sinkhorn in the inner OT problem, by default False
- inner_ot_p : int, optional
The coefficient in the inner OT cost, by default 2
- inner_ot_entreg : float, optional
The strength of entropy regularization for Sinkhorn in the inner OT problem, by default 0.1
- device : torch.device, optional
Tensor device for acceleration, by default torch.device("cpu")
- dual_sol() tuple[float, Tensor] #
Compute dataset distance.
- Note:
Currently requires fully loading the dataset into memory; this could probably be avoided, e.g., via subsampling.
Returns#
- tuple[float, torch.Tensor]
dist (float): the optimal transport dataset distance value.
pi (Tensor): the optimal transport coupling.
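A minimal sketch of computing the distance and coupling, using hypothetical toy tensors and assuming one-hot labels (which extract_dataset below converts to integer indices):

```python
import torch
from opendataval.dataval.lava.otdd import DatasetDistance

# Hypothetical toy datasets with 4 one-hot classes.
x_train = torch.rand(64, 8)
y_train = torch.nn.functional.one_hot(torch.randint(0, 4, (64,)), 4).float()
x_valid = torch.rand(32, 8)
y_valid = torch.nn.functional.one_hot(torch.randint(0, 4, (32,)), 4).float()

dist = DatasetDistance(x_train, y_train, x_valid, y_valid,
                       lam_x=1.0, lam_y=1.0, device=torch.device("cpu"))
value, pi = dist.dual_sol()  # OT dataset distance and optimal coupling
```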
- class opendataval.dataval.lava.otdd.FeatureCost(src_embedding=None, tgt_embedding=None, src_dim=None, tgt_dim=None, p=2, device='cpu')#
Bases: object
Class implementing a cost (or distance) between feature vectors.
- Arguments:
p (int): the coefficient in the OT cost (i.e., the p in p-Wasserstein).
src_embedding (callable, optional): if provided, source data will be embedded using this function prior to distance computation.
tgt_embedding (callable, optional): if provided, target data will be embedded using this function prior to distance computation.
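A minimal sketch, assuming (as in the upstream OTDD repository) that a FeatureCost instance can be passed as the feature_cost callable of DatasetDistance, and using torch.nn.Flatten as a hypothetical stand-in for a learned embedding:

```python
import torch
from opendataval.dataval.lava.otdd import FeatureCost

embedder = torch.nn.Flatten()  # hypothetical embedding; a trained CNN is typical
cost = FeatureCost(src_embedding=embedder, src_dim=(3, 8, 8),
                   tgt_embedding=embedder, tgt_dim=(3, 8, 8),
                   p=2, device="cpu")
# `cost` can then be supplied as feature_cost=cost to DatasetDistance.
```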
- opendataval.dataval.lava.otdd.batch_augmented_cost(Z1: Tensor, Z2: Tensor, W: Tensor | None = None, feature_cost: str | None = None, p: int = 2, lam_x: float = 1.0, lam_y: float = 1.0)#
Batch ground cost computation on augmented datasets.
Parameters#
- Z1 : torch.Tensor
Tensor of size (B, N, D1), where the last position in the last dim corresponds to the label Y.
- Z2 : torch.Tensor
Tensor of size (B, M, D2), where the last position in the last dim corresponds to the label Y.
- W : torch.Tensor, optional
Tensor of size (V1, V2) of precomputed pairwise label distances for all labels V1, V2, yielding a batched cost matrix as a (B, N, M) tensor. W is expected to be congruent with p; i.e., if p=2, W[i, j] should be the squared Wasserstein distance, by default None
- feature_cost : str, optional
If None or "euclidean", uses euclidean distances as the feature metric, otherwise uses this function as the metric, by default None
- p : int, optional
Power of the cost (i.e., order of the p-Wasserstein distance), by default 2
- lam_x : float, optional
Weight parameter for the feature component of the distance, by default 1.0
- lam_y : float, optional
Weight parameter for the label component of the distance, by default 1.0
Returns#
- torch.Tensor
Batched cost matrix of size (B, N, M)
Raises#
- ValueError
If W is not provided
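A minimal sketch with hypothetical inputs, using a random stand-in for the precomputed label-distance matrix W (in practice W would hold squared label-to-label Wasserstein distances, congruent with p=2):

```python
import torch
from opendataval.dataval.lava.otdd import batch_augmented_cost

B, N, M, D = 1, 5, 7, 3
# Append an integer label index as the last coordinate of each feature vector.
Z1 = torch.cat([torch.rand(B, N, D), torch.randint(0, 2, (B, N, 1)).float()], dim=-1)
Z2 = torch.cat([torch.rand(B, M, D), torch.randint(0, 2, (B, M, 1)).float()], dim=-1)

W = torch.rand(2, 2)  # hypothetical label-distance matrix for the 2 labels
C = batch_augmented_cost(Z1, Z2, W=W, p=2, lam_x=1.0, lam_y=1.0)  # (B, N, M)
```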
- opendataval.dataval.lava.otdd.extract_dataset(x_input: Tensor, y_input: Tensor, batch_size: int = 256, reindex_start: int = 0) tuple[Tensor, Tensor] #
Loads the full dataset into memory and reindexes the labels.
Parameters#
- x_input : Dataset | torch.Tensor
Covariate Dataset/tensor to be loaded
- y_input : Dataset | torch.Tensor
Label Dataset/tensor to be loaded
- batch_size : int, optional
Batch size of data to be loaded at a time, by default 256
- reindex_start : int, optional
How much to offset the labels by, useful when comparing different datasets so that their label sets do not overlap, by default 0
Returns#
- tuple[torch.Tensor, torch.Tensor]
x_tensor: covariates stacked along the first dimension.
y_tensor: labels, no longer one-hot encoded and offset by reindex_start.
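A minimal sketch with hypothetical one-hot labels, offsetting the labels so they would not collide with those of another dataset:

```python
import torch
from opendataval.dataval.lava.otdd import extract_dataset

x = torch.rand(10, 4)
y = torch.nn.functional.one_hot(torch.randint(0, 3, (10,)), 3).float()

x_tensor, y_tensor = extract_dataset(x, y, batch_size=256, reindex_start=3)
# y_tensor now holds integer class indices in [3, 6) instead of one-hot rows.
```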
- opendataval.dataval.lava.otdd.pwdist_exact(X1: Tensor, Y1: Tensor, X2: Tensor | None = None, Y2: Tensor | None = None, symmetric: bool = False, loss: str = 'sinkhorn', cost_function: Literal['euclidean'] | Callable[[...], Tensor] = 'euclidean', p: int = 2, debias: bool = True, entreg: float = 0.1, device: device = device(type='cpu'))#
Computation of pairwise Wasserstein distances.
Efficient computation of pairwise label-to-label Wasserstein distances between multiple distributions, without using Gaussian assumption.
Parameters#
- X1 : torch.Tensor
Covariates of the first distribution
- Y1 : torch.Tensor
Labels of the first distribution
- X2 : torch.Tensor, optional
Covariates of the second distribution, if None the distributions are treated as the same, by default None
- Y2 : torch.Tensor, optional
Labels of the second distribution, if None the distributions are treated as the same, by default None
- symmetric : bool, optional
Whether X1/Y1 and X2/Y2 are to be treated as the same dataset, by default False
- loss : str, optional
The loss function to compute. The Sinkhorn divergence interpolates between Wasserstein (blur=0) and kernel (blur= \(+\infty\)) distances, by default "sinkhorn"
- cost_function : Literal["euclidean"] | Callable[..., torch.Tensor], optional
Cost function that should be used instead of \(\tfrac{1}{p}\|x-y\|^p\), by default "euclidean"
- p : int, optional
Power of the cost (i.e., order of the p-Wasserstein distance), by default 2
- debias : bool, optional
If True, uses the debiased Sinkhorn divergence, by default True
- entreg : float, optional
The strength of entropy regularization for Sinkhorn, by default 0.1
- device : torch.device, optional
Tensor device for acceleration, by default torch.device("cpu")
Returns#
- torch.Tensor
Matrix of computed pairwise label-to-label Wasserstein distances
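A minimal sketch computing label-to-label distances between two hypothetical toy datasets with integer class labels:

```python
import torch
from opendataval.dataval.lava.otdd import pwdist_exact

X1, Y1 = torch.rand(60, 5), torch.randint(0, 3, (60,))  # 3 classes
X2, Y2 = torch.rand(40, 5), torch.randint(0, 2, (40,))  # 2 classes

# Entry [i, j] compares the points of class i in (X1, Y1) with the
# points of class j in (X2, Y2) via the Sinkhorn divergence.
D = pwdist_exact(X1, Y1, X2, Y2, loss="sinkhorn", p=2, entreg=0.1)
```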