opendataval.dataloader.DataFetcher

class opendataval.dataloader.DataFetcher(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None)

Load data for an experiment from an input data set name.

Facade for the Register object. It prepares the data and provides an API for subsequently splitting the data, adding noise, and transforming the result into tensors.

Parameters

dataset_name : str

Name of the data set. It must be registered with Register.

cache_dir : str, optional

Directory in which to cache the loaded data. By default None, which uses Register.CACHE_DIR.

force_download : bool, optional

Forces a download from the source URL, by default False.

random_state : RandomState, optional

Random initial state, by default None.

Attributes

datapoints : tuple[torch.Tensor, ...]

Train, validation, and test covariates and labels.

covar_dim : tuple[int, ...]

Covariate dimension of the loaded data set.

label_dim : tuple[int, ...]

Label dimension of the loaded data set.

num_points : int

Number of data points in the total data set.

one_hot : bool

If True, the data set has categorical labels as one-hot encodings.

[train/valid/test]_indices : np.ndarray[int]

The indices of the original data set used to make the [training/validation/test] data set.

noisy_train_indices : np.ndarray[int]

The indices of training data points with noise added to them.

covar : Dataset | np.ndarray

Covariates returned by a data set function.

labels : np.ndarray

Corresponding labels for the covariates returned by a data set function.

[x/y]_[train/valid/test] : np.ndarray

Access to the raw split of the [covariate/label] [train/valid/test] data set prior to being transformed into a tensor. Useful for functions that add noise.
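As a rough illustration of how the index attributes relate to the raw split attributes, the sketch below slices a toy data set by position. It uses plain Python lists rather than the library itself. The helper `split_by_indices` and the toy values are hypothetical, and the real class stores NumPy arrays or Dataset objects instead of lists.

```python
def split_by_indices(covar, labels, indices):
    """Select the covariates and labels at the given positions (hypothetical helper)."""
    x = [covar[i] for i in indices]
    y = [labels[i] for i in indices]
    return x, y


# Toy data set with five points; not taken from any registered data set.
covar = [[0.0], [1.0], [2.0], [3.0], [4.0]]
labels = [0, 1, 0, 1, 0]

# Stand-ins for the train_indices, valid_indices, and test_indices attributes.
train_indices = [4, 0, 2]
valid_indices = [1]
test_indices = [3]

# Stand-ins for the x_train/y_train, x_valid/y_valid, and x_test/y_test attributes.
x_train, y_train = split_by_indices(covar, labels, train_indices)
x_valid, y_valid = split_by_indices(covar, labels, valid_indices)
x_test, y_test = split_by_indices(covar, labels, test_indices)
```

Each raw split keeps the order given by its index array, so `x_train[0]` here is the point that sat at position 4 of the original data.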

Raises

KeyError

Raised if the data set is not registered. To use a data set, you must first register it by creating a Register.

ValueError

Loaded data set covariates and labels must be of the same length.

ValueError

All covariates must be of the same dimension. All labels must be of the same dimension.

ValueError

Splits must not exceed the length of the data set. If the splits are ints, their values must be less than the length; if they are floats, they must be less than 1.0. A split of any other type raises an error.

ValueError

Specified indices must not repeat and must not be outside the range of the data set.
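The ValueError conditions above can be pictured as a few small checks. The sketch below is not the library's implementation. The function names are hypothetical, and it makes two assumptions: that "must not exceed" applies to the total of the integer splits, and that indices must be unique across all splits.

```python
def check_same_length(covar, labels):
    """Covariates and labels must contain the same number of points."""
    if len(covar) != len(labels):
        raise ValueError("Covariates and labels must be of same length.")


def check_counts(num_points, *counts):
    """Integer split sizes must fit within the data set (assumed: in total)."""
    if not all(isinstance(c, int) and not isinstance(c, bool) for c in counts):
        raise ValueError("Split counts must be ints.")
    if sum(counts) > num_points:
        raise ValueError("Split counts exceed the length of the data set.")


def check_indices(num_points, *index_groups):
    """Indices must lie in [0, num_points) and must not repeat (assumed: across splits)."""
    seen = set()
    for group in index_groups:
        for i in group:
            if not 0 <= i < num_points:
                raise ValueError(f"Index {i} is outside the data set.")
            if i in seen:
                raise ValueError(f"Index {i} is repeated.")
            seen.add(i)


# Valid inputs pass silently.
check_same_length([[0.0], [1.0]], [0, 1])
check_counts(10, 7, 2, 1)
check_indices(5, [0, 1], [2], [3, 4])
```

Invalid inputs, such as `check_counts(10, 7, 3, 1)` or `check_indices(5, [0, 1], [1])`, raise ValueError in this sketch.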

__init__(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None)

Methods

__init__(dataset_name[, cache_dir, ...])

datasets_available()

Get set of available data set names.

export_dataset(covariates_names, labels_names)

from_data(covar, labels, one_hot[, random_state])

Return DataFetcher from input Covariates and Labels.

from_data_splits(x_train, y_train, x_valid, ...)

Return DataFetcher from already split data.

noisify([add_noise])

Add noise to the data points.

setup(dataset_name[, cache_dir, ...])

Create, split, and add noise to DataFetcher from input arguments.

split_dataset_by_count([train_count, ...])

Split the covariates and labels to the specified counts.

split_dataset_by_indices([train_indices, ...])

Split the covariates and labels to the specified indices.

split_dataset_by_prop([train_prop, ...])

Split the covariates and labels to the specified proportions.
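To show how splitting by proportion relates to splitting by count, the sketch below converts proportions into counts for a data set of a given size. It is an assumption about one reasonable conversion (rounding each proportion times the number of points), not a statement of how the library computes it. `props_to_counts` is a hypothetical name.

```python
def props_to_counts(num_points, train_prop=0.0, valid_prop=0.0, test_prop=0.0):
    """Convert split proportions to point counts (hypothetical helper)."""
    props = (train_prop, valid_prop, test_prop)
    if sum(props) > 1.0:
        raise ValueError("Split proportions must not exceed 1.0.")
    # Rounding is an assumed choice; truncation would be another option.
    return tuple(round(num_points * p) for p in props)


counts = props_to_counts(1000, train_prop=0.5, valid_prop=0.25, test_prop=0.25)
print(counts)  # (500, 250, 250)
```

The resulting counts could then be handed to a count-based split, which is why the proportion and count variants accept parallel arguments.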

Attributes

covar_dim

Get covar dimensions.

datapoints

Return split data points to be input into a DataEvaluator as tensors.

label_dim

Get label dimensions.

num_points

Get total number of data points.