opendataval.dataval.dvrl package#

Submodules#

opendataval.dataval.dvrl.dvrl module#

class opendataval.dataval.dvrl.dvrl.DVRL(*args, **kwargs)#

Bases: DataEvaluator, ModelMixin

Data valuation using reinforcement learning class, implemented with PyTorch.

Parameters#

hidden_dim (int, optional): Hidden dimension of the RL multilayer perceptron Value Estimator (VE) (details in the DataValueEstimatorRL class), by default 100
layer_number (int, optional): Number of hidden layers for the Value Estimator (VE), by default 5
comb_dim (int, optional): Width of the layers that combine the concatenated inputs, should be much smaller than hidden_dim, by default 10
rl_epochs (int, optional): Number of training epochs for the VE, by default 1000
rl_batch_size (int, optional): Batch size for training the VE, by default 32
lr (float, optional): Learning rate for the VE, by default 0.01
threshold (float, optional): Search rate threshold. The VE may get stuck near the bounds of \([0, 1]\), so searching is encouraged whenever the predicted values fall outside of \([1-threshold, threshold]\), by default 0.9
device (torch.device, optional): Tensor device for acceleration, by default torch.device("cpu")
random_state (RandomState, optional): Random initial state, by default None

evaluate_data_values() → ndarray#

Return data values for each training data point.

Compute data values for DVRL using the Value Estimator MLP.

Returns#

np.ndarray: Predicted data values/selection for each training input data point

input_data(x_train: Tensor, y_train: Tensor, x_valid: Tensor, y_valid: Tensor)#

Store and transform input data for DVRL.

Parameters#

x_train (torch.Tensor): Data covariates
y_train (torch.Tensor): Data labels
x_valid (torch.Tensor): Test+Held-out covariates
y_valid (torch.Tensor): Test+Held-out labels

train_data_values(*args, num_workers: int = 0, **kwargs)#

Trains model to predict data values.

Trains the VE to assign probabilities of each data point being selected using a signal from the evaluation performance.

Parameters#

args (tuple[Any], optional): Training positional args
num_workers (int, optional): Number of workers used to load data, by default 0 (data is loaded in the main process)
kwargs (dict[str, Any], optional): Training keyword arguments
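
The selection-and-reward signal described above can be illustrated with a minimal, self-contained sketch. It is not the library's training loop: the least-squares predictor `fit_and_score`, the hand-set `probs`, and the reward definition (performance on the sampled subset minus performance on the full sample) are all assumptions made for illustration.

```python
import torch


def fit_and_score(x_tr, y_tr, x_val, y_val):
    """Hypothetical stand-in predictor: least-squares fit scored by negative MSE."""
    w = torch.linalg.lstsq(x_tr, y_tr).solution
    return -torch.mean((x_val @ w - y_val) ** 2).item()


gen = torch.Generator().manual_seed(0)

# Toy regression data where the first 50 training labels are corrupted
true_w = torch.tensor([[1.0], [-2.0], [0.5]])
x_train = torch.randn(200, 3, generator=gen)
y_train = x_train @ true_w
y_train[:50] += 5.0 * torch.randn(50, 1, generator=gen)
x_valid = torch.randn(100, 3, generator=gen)
y_valid = x_valid @ true_w

# Selection probabilities as a value estimator might emit them (hand-set here)
probs = torch.full((200,), 0.9)
probs[:50] = 0.1

# Sample which points are selected, one Bernoulli draw per data point
selected = torch.bernoulli(probs, generator=gen).bool()

# Reward: evaluation performance on the selected subset relative to the full sample
baseline = fit_and_score(x_train, y_train, x_valid, y_valid)
perf = fit_and_score(x_train[selected], y_train[selected], x_valid, y_valid)
reward = perf - baseline
```

In this sketch a positive `reward` would mean the sampled subset produced a better predictor than the full sample, which is the kind of signal the VE is trained on.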

class opendataval.dataval.dvrl.dvrl.DataValueEstimatorRL(x_dim: int, y_dim: int, hidden_dim: int, layer_number: int, comb_dim: int, random_state: RandomState | None = None)#

Bases: Module

Value Estimator model.

Here, we assume a simple multi-layer perceptron architecture for the data value evaluator model. For data types like tabular data, a multi-layer perceptron is already efficient at extracting the relevant information. For high-dimensional data types like images or text, it is important to introduce inductive biases into the architecture to extract information efficiently. In such cases, there are two options:

(i) Input the encoded representations (e.g. the last layer activations of ResNet for images, or the last layer activations of BERT for text) and use the multi-layer perceptron on top of them. The encoded representations can simply come from a pre-trained predictor model using the entire dataset.
(ii) Modify the data value evaluator model definition to have the appropriate inductive bias (e.g. using convolutional layers for images, or attention layers for text).
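
Option (i) can be sketched as follows. The small convolutional `encoder` here is a hypothetical, randomly initialized stand-in for a pre-trained model such as ResNet. The sketch only shows the shape of the workflow: freeze the encoder, embed the inputs once, and use the embedding width as `x_dim`.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained encoder (assumption, not a real checkpoint)
encoder = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # global average pool to shape (N, 8, 1, 1)
    nn.Flatten(),             # flatten to shape (N, 8)
)
encoder.eval()
for param in encoder.parameters():
    param.requires_grad_(False)  # the encoder stays frozen

images = torch.randn(16, 3, 32, 32)  # toy batch of RGB images

# Encode once; the value estimator MLP would consume these embeddings
with torch.no_grad():
    embeddings = encoder(images)

x_dim = embeddings.shape[1]  # covariate dimension to pass to the value estimator
```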

Parameters#

x_dim (int): Data covariates dimension, can be the flattened dimension size
y_dim (int): Data labels dimension, can be the flattened dimension size
hidden_dim (int): Hidden dimension for the Value Estimator
layer_number (int): Number of hidden layers for the Value Estimator
comb_dim (int): Width of the layers that combine the concatenated inputs, should be much smaller than hidden_dim
random_state (RandomState, optional): Random initial state, by default None

forward(x: Tensor, y: Tensor, y_hat: Tensor) → Tensor#

Forward pass of inputs through value estimator for data values of input.

Forward pass through the Value Estimator. Returns selection probabilities. Concatenates the difference between the labels and the predicted labels to compute the selection probabilities.

Parameters#

x (torch.Tensor): Data covariates
y (torch.Tensor): Data labels
y_hat (torch.Tensor): Data label predictions (from prediction model)

Returns#

torch.Tensor: Selection probabilities per covariate data point
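
A minimal sketch of such a forward pass is given below, under stated assumptions. `TinyValueEstimator` is a hypothetical class, not the library's DataValueEstimatorRL: the exact layer layout, the use of the absolute difference `|y - y_hat|`, and the `(n, 1)` output shape are choices made for illustration.

```python
import torch
from torch import nn


class TinyValueEstimator(nn.Module):
    """Hypothetical MLP value estimator; the layer layout is an assumption."""

    def __init__(self, x_dim, y_dim, hidden_dim=100, layer_number=5, comb_dim=10):
        super().__init__()
        layers = [nn.Linear(x_dim + y_dim, hidden_dim), nn.ReLU()]
        for _ in range(layer_number - 1):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        layers += [nn.Linear(hidden_dim, comb_dim), nn.ReLU()]
        self.mlp = nn.Sequential(*layers)
        # Combines the MLP features with the label/prediction difference
        self.head = nn.Sequential(
            nn.Linear(comb_dim + y_dim, comb_dim),
            nn.ReLU(),
            nn.Linear(comb_dim, 1),
        )

    def forward(self, x, y, y_hat):
        features = self.mlp(torch.cat([x, y], dim=1))
        diff = torch.abs(y - y_hat)  # assumption: absolute difference is used
        logits = self.head(torch.cat([features, diff], dim=1))
        return torch.sigmoid(logits)  # selection probabilities in (0, 1)


torch.manual_seed(0)
model = TinyValueEstimator(x_dim=4, y_dim=2, hidden_dim=16, layer_number=2, comb_dim=4)
x = torch.randn(8, 4)
y = torch.rand(8, 2)
y_hat = torch.rand(8, 2)
probs = model(x, y, y_hat)  # one probability per data point, shape (8, 1)
```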

class opendataval.dataval.dvrl.dvrl.DveLoss(threshold: float = 0.9, exploration_weight: float = 1000.0)#

Bases: Module

Compute Loss for Value Estimator.

Custom loss function for the value estimator RL model. Uses BCE loss and checks that the average predicted value is within the threshold to encourage exploration.

Parameters#

threshold (float, optional): Search rate threshold. The VE may get stuck near the bounds of \([0, 1]\), so searching is encouraged whenever the predicted values fall outside of \([1-threshold, threshold]\), by default 0.9
exploration_weight (float, optional): Large constant to encourage exploration in the Value Estimator, by default 1e3

forward(pred_dataval: Tensor, selector_input: Tensor, reward_input: float) → Tensor#

Compute the loss for the Value Estimator.

Uses the REINFORCE algorithm to compute a loss for the Value Estimator. pred_dataval holds the predicted data values, and selector_input is a Bernoulli random variable with p=pred_dataval. The loss is the BCE between pred_dataval and selector_input multiplied by the reward signal, plus an additional penalty if the Value Estimator is getting stuck outside the threshold.

Parameters#

pred_dataval (torch.Tensor): Predicted values from the value estimator
selector_input (torch.Tensor): 1 for selected, 0 for not selected, a Bernoulli random variable
reward_input (float): Reward/performance signal of the prediction model trained on the points chosen by selector_input. A positive value indicates it performs better than the naive model trained on the full sample.

Returns#

torch.Tensor: Computed loss tensor for Value Estimator
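
The description above can be written out as a short sketch. `dve_loss_sketch` is a hypothetical function, not the library's DveLoss. The summed BCE reduction, the sign convention (a positive reward makes the sampled selection more likely when the loss is minimized), and the use of the mean predicted value for the threshold check are assumptions based on the description.

```python
import torch
import torch.nn.functional as F


def dve_loss_sketch(pred_dataval, selector_input, reward_input,
                    threshold=0.9, exploration_weight=1e3):
    """Hypothetical REINFORCE-style loss; sign and reduction are assumptions."""
    # BCE equals the negative log-likelihood of the sampled selection under pred_dataval
    bce = F.binary_cross_entropy(pred_dataval, selector_input, reduction="sum")
    reward_loss = reward_input * bce

    # Penalty is nonzero only when the mean value leaves [1 - threshold, threshold]
    mean_val = pred_dataval.mean()
    search_loss = F.relu(mean_val - threshold) + F.relu((1 - threshold) - mean_val)
    return reward_loss + exploration_weight * search_loss


selection = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Mean inside the interval and zero reward: no loss at all
loss_inside = dve_loss_sketch(torch.full((4,), 0.5), selection, reward_input=0.0)

# Mean of 0.99 exceeds threshold 0.9: penalty of about 1e3 * 0.09 = 90
loss_stuck = dve_loss_sketch(torch.full((4,), 0.99), selection, reward_input=0.0)
```

With a zero reward the first term vanishes, so the two calls isolate the exploration penalty: it is zero when the mean lies inside the interval and grows linearly with the distance outside it.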

Module contents#