opendataval.dataloader.DataFetcher

class opendataval.dataloader.DataFetcher(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None)

Load data for an experiment from an input data set name.

Acts as a facade for the Register object: it prepares the data and provides an API for subsequently splitting it, adding noise, and transforming it into tensors.

Parameters

dataset_name (str): Name of the data set; it must be registered with Register.
cache_dir (str, optional): Directory in which to cache the loaded data. Defaults to None, which uses Register.CACHE_DIR.
force_download (bool, optional): If True, forces a download from the source URL. Defaults to False.
random_state (RandomState, optional): Initial random state. Defaults to None.

Attributes

datapoints (tuple[torch.Tensor, …]): Train, validation, and test covariates and labels.
covar_dim (tuple[int, …]): Covariate dimension of the loaded data set.
label_dim (tuple[int, …]): Label dimension of the loaded data set.
num_points (int): Number of data points in the total data set.
one_hot (bool): If True, the data set has categorical labels as one-hot encodings.
[train/valid/test]_indices (np.ndarray[int]): Indices into the original data set that make up the [training/validation/test] split.
noisy_train_indices (np.ndarray[int]): Indices of the training data points that have had noise added to them.
covar (Dataset | np.ndarray): Covariates returned by a dataset function.
labels (np.ndarray): Labels corresponding to the covariates returned by a dataset function.
[x/y]_[train/valid/test] (np.ndarray): Access to the raw [covariate/label] split of the [train/valid/test] data set prior to being transformed into a tensor. Useful for functions that add noise.
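
The relationship between the index attributes and the raw splits can be illustrated with a minimal, library-free sketch. The helper `split_indices` and the toy data are hypothetical, and the seeded shuffle is an assumption about how a random split might be drawn; this is not the DataFetcher implementation.

```python
import random


def split_indices(
    num_points: int,
    train_count: int,
    valid_count: int,
    test_count: int,
    seed: int = 0,
) -> tuple[list[int], list[int], list[int]]:
    """Hypothetical helper: draw disjoint train/valid/test index lists."""
    rng = random.Random(seed)  # seeded, so the split is reproducible
    order = list(range(num_points))
    rng.shuffle(order)
    train_end = train_count
    valid_end = train_end + valid_count
    test_end = valid_end + test_count
    return order[:train_end], order[train_end:valid_end], order[valid_end:test_end]


# Toy stand-ins for the `covar` and `labels` attributes.
covar = [[float(i), float(i) * 2.0] for i in range(10)]
labels = [i % 2 for i in range(10)]

train_idx, valid_idx, test_idx = split_indices(10, 6, 2, 2)

# The raw splits are the original rows selected by each index list.
x_train = [covar[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
x_valid = [covar[i] for i in valid_idx]
y_valid = [labels[i] for i in valid_idx]
```

In this sketch `x_train[k]` is always `covar[train_idx[k]]`, which is the property that makes the index attributes useful for tracing a split data point back to the original data set.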

Raises

KeyError: In order to use a data set, you must register it by creating a Register.
ValueError: Loaded data set covariates and labels must be of the same length.
ValueError: All covariates must be of the same dimension. All labels must be of the same dimension.
ValueError: Splits must not exceed the length of the data set. If the splits are ints, their values must be less than the length; if they are floats, they must be less than 1.0. Any other type raises an error.
ValueError: Specified indices must not repeat and must not be outside the range of the data set.
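
The split and index conditions above can be sketched as two standalone checks. The helpers `resolve_split_counts` and `validate_indices` are hypothetical names, and applying the limits to the sum of the splits is one possible reading of the rule; the library's actual checks may differ.

```python
from typing import Sequence, Union

Number = Union[int, float]


def resolve_split_counts(num_points: int, splits: Sequence[Number]) -> list[int]:
    """Hypothetical helper: convert int counts or float proportions to counts."""
    if all(isinstance(s, int) and not isinstance(s, bool) for s in splits):
        counts = [int(s) for s in splits]
    elif all(isinstance(s, float) for s in splits):
        # Assumption: the proportions together may not exceed the whole data set.
        if sum(splits) > 1.0:
            raise ValueError("Proportions must not sum to more than 1.0")
        counts = [int(num_points * s) for s in splits]  # floor, never overshoots
    else:
        raise ValueError("Splits must be all ints or all floats")
    if any(c < 0 for c in counts) or sum(counts) > num_points:
        raise ValueError("Splits must not exceed the length of the data set")
    return counts


def validate_indices(num_points: int, *index_groups: Sequence[int]) -> None:
    """Hypothetical helper: reject repeated or out-of-range indices."""
    seen: set[int] = set()
    for group in index_groups:
        for idx in group:
            if not 0 <= idx < num_points:
                raise ValueError(f"Index {idx} is outside the data set")
            if idx in seen:
                raise ValueError(f"Index {idx} is repeated")
            seen.add(idx)
```

Under these assumptions, `resolve_split_counts(8, [0.5, 0.25, 0.25])` and `resolve_split_counts(8, [4, 2, 2])` describe the same split, while `validate_indices(5, [0, 1], [1, 2])` raises because index 1 appears in two groups.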

__init__(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None)

Methods

`__init__`(dataset_name[, cache_dir, ...])
`datasets_available`()	Get set of available data set names.
`export_dataset`(covariates_names, labels_names)
`from_data`(covar, labels, one_hot[, random_state])	Return DataFetcher from input Covariates and Labels.
`from_data_splits`(x_train, y_train, x_valid, ...)	Return DataFetcher from already split data.
`noisify`([add_noise])	Add noise to the data points.
`setup`(dataset_name[, cache_dir, ...])	Create, split, and add noise to DataFetcher from input arguments.
`split_dataset_by_count`([train_count, ...])	Split the covariates and labels to the specified counts.
`split_dataset_by_indices`([train_indices, ...])	Split the covariates and labels to the specified indices.
`split_dataset_by_prop`([train_prop, ...])	Split the covariates and labels to the specified proportions.

Attributes

`covar_dim`	Get covar dimensions.
`datapoints`	Return split data points to be input into a DataEvaluator as tensors.
`label_dim`	Get label dimensions.
`num_points`	Get total number of data points.