opendataval.dataloader.DataFetcher
- class opendataval.dataloader.DataFetcher(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None)
Load data for an experiment from an input data set name.
Facade for Register object, prepares the data and provides an API for subsequent splitting, adding noise, and transforming into a tensor.
Parameters
- dataset_name : str
Name of the data set, must be registered with Register
- cache_dir : str, optional
Directory in which to cache the loaded data, by default None, which uses Register.CACHE_DIR
- force_download : bool, optional
Forces download from source URL, by default False
- random_state : RandomState, optional
Random initial state, by default None
Attributes
- datapoints : tuple[torch.Tensor, …]
Train+Valid+Test covariates and labels
- covar_dim : tuple[int, …]
Covariates dimension of the loaded data set.
- label_dim : tuple[int, …]
Label dimension of the loaded data set.
- num_points : int
Number of data points in the total data set
- one_hot : bool
If True, the data set has categorical labels as one hot encodings
- [train/valid/test]_indices : np.ndarray[int]
The indices of the original data set used to make the training, validation, and test data sets, respectively.
- noisy_train_indices : np.ndarray[int]
The indices of training data points with noise added to them.
- covar : Dataset | np.ndarray
Covariate data set from a dataset function
- labels : np.ndarray
Corresponding labels for covariates from a dataset function
- [x/y]_[train/valid/test] : np.ndarray
Access to the raw split of the [covariate/label] [train/valid/test] data set prior to being transformed into a tensor. Useful for functions that add noise.
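The relationship between the index attributes and the raw splits can be sketched with plain Python lists. This is a minimal illustration under stated assumptions, not the library's implementation: the toy data and the helper `take` are hypothetical, and the real attributes are NumPy arrays rather than lists.

```python
# Toy stand-ins for the `covar` and `labels` attributes (hypothetical data).
covar = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
labels = [0, 1, 0, 1, 0]

# Each index attribute selects rows of the original data set for one split.
train_indices = [0, 3, 4]
valid_indices = [1]
test_indices = [2]


def take(rows, indices):
    """Select rows by position, mimicking array indexing (hypothetical helper)."""
    return [rows[i] for i in indices]


# The raw splits are the original rows picked out by the matching indices.
x_train, y_train = take(covar, train_indices), take(labels, train_indices)
x_valid, y_valid = take(covar, valid_indices), take(labels, valid_indices)
x_test, y_test = take(covar, test_indices), take(labels, test_indices)
```

Because the three index lists do not overlap, every row of the original data lands in at most one split.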
Raises
- KeyError
In order to use a data set, you must register it by creating a Register
- ValueError
Loaded Data set covariates and labels must be of same length.
- ValueError
All covariates must be of same dimension. All labels must be of same dimension.
- ValueError
Splits must not exceed the length of the data set. In other words, if the splits are ints, the values must be less than the length. If they are floats, they must be less than 1.0. Any other type raises an error.
- ValueError
Specified indices must not repeat and must not be outside range of the data set
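The split and index checks described above can be sketched in plain Python. This is a hedged illustration of the kind of validation involved, not the library's code: `resolve_counts` and `check_indices` are hypothetical names, and details such as rounding proportions or allowing ints and floats in the same call are assumptions.

```python
def resolve_counts(num_points, train=0, valid=0, test=0):
    """Convert int counts or float proportions into counts.

    Raises ValueError for unsupported types, negative values, or splits
    that together exceed the data set length.
    """
    counts = []
    for split in (train, valid, test):
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(split, bool) or not isinstance(split, (int, float)):
            raise ValueError(f"Split must be an int or a float, got {type(split).__name__}")
        # Assumption: proportions are converted by rounding to the nearest count.
        count = split if isinstance(split, int) else round(split * num_points)
        if count < 0:
            raise ValueError("Split must not be negative")
        counts.append(count)
    if sum(counts) > num_points:
        raise ValueError("Splits must not exceed the length of the data set")
    return tuple(counts)


def check_indices(num_points, *index_groups):
    """Raise ValueError if any index repeats or falls outside the data set."""
    seen = set()
    for group in index_groups:
        for idx in group:
            if not 0 <= idx < num_points:
                raise ValueError(f"Index {idx} is outside the range of the data set")
            if idx in seen:
                raise ValueError(f"Index {idx} is repeated")
            seen.add(idx)
```

Under these assumptions, `resolve_counts(10, 6, 2, 2)` passes, while `resolve_counts(10, 11)` and `check_indices(5, [0, 1], [1])` both raise `ValueError`.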
- __init__(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None)
Methods
- __init__(dataset_name[, cache_dir, ...])
- datasets_available()
Get set of available data set names.
- export_dataset(covariates_names, labels_names)
- from_data(covar, labels, one_hot[, random_state])
Return DataFetcher from input Covariates and Labels.
- from_data_splits(x_train, y_train, x_valid, ...)
Return DataFetcher from already split data.
- noisify([add_noise])
Add noise to the data points.
- setup(dataset_name[, cache_dir, ...])
Create, split, and add noise to DataFetcher from input arguments.
- split_dataset_by_count([train_count, ...])
Split the covariates and labels to the specified counts.
- split_dataset_by_indices([train_indices, ...])
Split the covariates and labels to the specified indices.
- split_dataset_by_prop([train_prop, ...])
Split the covariates and labels to the specified proportions.
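A typical call sequence using the methods listed above might look like the following sketch. It assumes `opendataval` is installed and relies only on names shown in this reference (`from_data`, `split_dataset_by_prop`, `train_prop`). The helper `build_fetcher` is hypothetical, and the behavior of the arguments elided in the summary is not shown here.

```python
def build_fetcher(covar, labels, train_prop=0.7):
    """Wrap in-memory data in a DataFetcher and split off a training set.

    Hypothetical helper; assumes opendataval is installed.
    """
    # Imported lazily so the sketch can be defined without the library present.
    from opendataval.dataloader import DataFetcher

    # from_data(covar, labels, one_hot[, random_state]) as listed in the summary.
    fetcher = DataFetcher.from_data(covar, labels, one_hot=False)

    # Only train_prop appears in the summary; other arguments keep their defaults.
    fetcher.split_dataset_by_prop(train_prop=train_prop)
    return fetcher
```

After the split, the raw arrays should be reachable through attributes such as `fetcher.x_train` and `fetcher.y_train`, and `fetcher.datapoints` returns the tensors for a DataEvaluator.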
Attributes
- covar_dim
Get covar dimensions.
- datapoints
Return split data points to be input into a DataEvaluator as tensors.
- label_dim
Get label dimensions.
- num_points
Get total number of data points.