opendataval.dataloader package#
Subpackages#
Submodules#
opendataval.dataloader.fetcher module#
- class opendataval.dataloader.fetcher.DataFetcher(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None)#
Bases: object
Load data for an experiment from an input data set name.
Facade for a Register object; prepares the data and provides an API for subsequent splitting, adding noise, and transforming into a tensor.
Parameters#
- dataset_name : str
Name of the data set; must be registered with Register.
- cache_dir : str, optional
Directory where the loaded data is cached, by default None, which uses Register.CACHE_DIR.
- force_download : bool, optional
Forces download from the source URL, by default False.
- random_state : RandomState, optional
Random initial state, by default None.
Attributes#
- datapoints : tuple[torch.Tensor, ...]
Train + valid + test covariates and labels.
- covar_dim : tuple[int, ...]
Covariates dimension of the loaded data set.
- label_dim : tuple[int, ...]
Label dimension of the loaded data set.
- num_points : int
Number of data points in the total data set.
- one_hot : bool
If True, the data set has categorical labels as one-hot encodings.
- [train/valid/test]_indices : np.ndarray[int]
The indices of the original data set used to make each split.
- noisy_train_indices : np.ndarray[int]
The indices of training data points with noise added to them.
- covar : Dataset | np.ndarray
Covariate dataset from a dataset function.
- labels : np.ndarray
Corresponding labels for the covariates from a dataset function.
- [x/y]_[train/valid/test] : np.ndarray
Access to the raw [covariate/label] [train/valid/test] split prior to being transformed into a tensor. Useful for noise-adding functions.
Raises#
- KeyError
To use a data set, it must first be registered by creating a Register object with its name.
- ValueError
Loaded data set covariates and labels must be of the same length.
- ValueError
All covariates must be of the same dimension; all labels must be of the same dimension.
- ValueError
Splits must not exceed the length of the data set: if the splits are ints, the values must be less than the length; if they are floats, they must be less than 1.0; anything else raises an error.
- ValueError
Specified indices must not repeat and must not be outside the range of the data set.
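Examples#
A minimal usage sketch; the split counts are illustrative, and "iris" is one of the registered names listed under Register.Datasets below. The final unpacking assumes datapoints yields the six train/valid/test arrays in order, as described in the attribute list above.
>>> from opendataval.dataloader.fetcher import DataFetcher
>>> fetcher = DataFetcher(dataset_name="iris")
>>> fetcher = fetcher.split_dataset_by_count(train_count=100, valid_count=25, test_count=25)
>>> # datapoints yields the train/valid/test covariates and labels in order
>>> x_train, y_train, x_valid, y_valid, x_test, y_test = fetcher.datapoints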
- property covar_dim: tuple[int, ...]#
Get covar dimensions.
- property datapoints#
Return split data points to be input into a DataEvaluator as tensors.
Returns#
- (torch.Tensor | Dataset, torch.Tensor)
Training covariates, training labels
- (torch.Tensor | Dataset, torch.Tensor)
Validation covariates, validation labels
- (torch.Tensor | Dataset, torch.Tensor)
Test covariates, test labels
- static datasets_available() → set[str]#
Get set of available data set names.
- export_dataset(covariates_names: list[str], labels_names: list[str], output_directory: Path = PosixPath('/home/runner/work/opendataval/opendataval/docs'))#
- classmethod from_data(covar: Dataset | ndarray, labels: ndarray, one_hot: bool, random_state: RandomState | None = None)#
Return a DataFetcher from input covariates and labels.
Parameters#
- covar : Union[Dataset, np.ndarray]
Input covariates.
- labels : np.ndarray
Input labels. No transformation is applied, so if the labels should be one-hot encoded, encode them before passing them in.
- one_hot : bool
Whether the input labels have already been one-hot encoded. This is just a flag; no transform will be applied.
- random_state : RandomState, optional
Initial random state, by default None.
Raises#
- ValueError
Input covariates and labels are of different lengths, so there is no 1-to-1 mapping.
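Examples#
A sketch with hypothetical numpy arrays (100 points, 10 features, integer class labels):
>>> import numpy as np
>>> from opendataval.dataloader.fetcher import DataFetcher
>>> covar = np.random.rand(100, 10)
>>> labels = np.random.randint(0, 3, size=(100,))
>>> fetcher = DataFetcher.from_data(covar, labels, one_hot=False)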
- classmethod from_data_splits(x_train: Dataset | ndarray, y_train: ndarray, x_valid: Dataset | ndarray, y_valid: ndarray, x_test: Dataset | ndarray, y_test: ndarray, one_hot: bool, random_state: RandomState | None = None)#
Return a DataFetcher from already-split data.
Parameters#
- x_train : Union[Dataset, np.ndarray]
Input training covariates.
- y_train : np.ndarray
Input training labels.
- x_valid : Union[Dataset, np.ndarray]
Input validation covariates.
- y_valid : np.ndarray
Input validation labels.
- x_test : Union[Dataset, np.ndarray]
Input testing covariates.
- y_test : np.ndarray
Input testing labels.
- one_hot : bool
Whether the label data has already been one-hot encoded. This is just a flag; no transform will be applied.
- random_state : RandomState, optional
Initial random state, by default None.
Raises#
- ValueError
Loaded data set covariates and labels must be of the same length.
- ValueError
All covariates must be of the same dimension; all labels must be of the same dimension.
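Examples#
A sketch with hypothetical pre-split arrays; no further splitting is needed afterwards:
>>> import numpy as np
>>> from opendataval.dataloader.fetcher import DataFetcher
>>> rng = np.random.default_rng(0)
>>> x_train, x_valid, x_test = rng.random((80, 5)), rng.random((10, 5)), rng.random((10, 5))
>>> y_train, y_valid, y_test = rng.integers(0, 2, 80), rng.integers(0, 2, 10), rng.integers(0, 2, 10)
>>> fetcher = DataFetcher.from_data_splits(
...     x_train, y_train, x_valid, y_valid, x_test, y_test, one_hot=False
... )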
- property label_dim: tuple[int, ...]#
Get label dimensions.
- noisify(add_noise: Callable[[Self, Any, ...], dict[str, Any]] | str | None = None, *noise_args, **noise_kwargs)#
Add noise to the data points.
Adds noise to the data set and saves the indices of the noisy data. The return value of add_noise is a dict whose keys signify how the data are updated: {'x_train', 'y_train', 'x_valid', 'y_valid', 'x_test', 'y_test', 'noisy_train_indices'}.
Parameters#
- add_noise : Callable
If None, no changes are made. Takes the DataFetcher as its required argument and adds noise to its data points as needed. Returns a dict[str, np.ndarray] of the updated arrays, used to update the data loader, with the following keys:
"x_train" – Updated training covariates with noise, optional
"y_train" – Updated training labels with noise, optional
"x_valid" – Updated validation covariates with noise, optional
"y_valid" – Updated validation labels with noise, optional
"x_test" – Updated testing covariates with noise, optional
"y_test" – Updated testing labels with noise, optional
"noisy_train_indices" – Indices of the training data points with noise.
- args : tuple[Any]
Additional positional arguments passed to add_noise.
- kwargs : dict[str, Any]
Additional keyword arguments passed to add_noise.
Returns#
- self : object
Returns the DataFetcher with noise added to the data set.
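Examples#
A sketch using the mix_labels noise function documented below; the noise rate is illustrative, and fetcher is an already-split DataFetcher:
>>> from opendataval.dataloader.noisify import mix_labels
>>> fetcher = fetcher.noisify(mix_labels, noise_rate=0.1)
>>> corrupted = fetcher.noisy_train_indices  # indices that received noise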
- property num_points: int#
Get total number of data points.
- classmethod setup(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None, train_count: int | float = 0, valid_count: int | float = 0, test_count: int | float = 0, add_noise: Callable[[Self, Any, ...], dict[str, Any]] | None = None, noise_kwargs: dict[str, Any] | None = None)#
Create, split, and add noise to DataFetcher from input arguments.
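Examples#
A sketch equivalent to constructing, splitting, and noisifying in separate steps; the data set name, counts, and noise rate are illustrative:
>>> from opendataval.dataloader.fetcher import DataFetcher
>>> from opendataval.dataloader.noisify import mix_labels
>>> fetcher = DataFetcher.setup(
...     dataset_name="iris",
...     train_count=100,
...     valid_count=25,
...     test_count=25,
...     add_noise=mix_labels,
...     noise_kwargs={"noise_rate": 0.1},
... )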
- split_dataset_by_count(train_count: int = 0, valid_count: int = 0, test_count: int = 0)#
Split the covariates and labels into the specified counts.
Parameters#
- train_count : int
Number/proportion of training points.
- valid_count : int
Number/proportion of validation points.
- test_count : int
Number/proportion of test points.
Returns#
- self : object
Returns the DataFetcher with covariates and labels split into train/valid/test.
Raises#
- ValueError
Invalid input for splitting the data set: either a proportion is more than 1 or the total of the splits is greater than len(dataset).
- split_dataset_by_indices(train_indices: Sequence[int] | None = None, valid_indices: Sequence[int] | None = None, test_indices: Sequence[int] | None = None)#
Split the covariates and labels at the specified indices.
Parameters#
- train_indices : Sequence[int]
Indices of the training data set.
- valid_indices : Sequence[int]
Indices of the validation data set.
- test_indices : Sequence[int]
Indices of the test data set.
Returns#
- self : object
Returns the DataFetcher with covariates and labels split into train/valid/test.
Raises#
- ValueError
Invalid indices for the train, valid, or test data set: at least one data point appears in more than one split.
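Examples#
A sketch with explicit, non-overlapping index splits over a hypothetical 100-point data set:
>>> fetcher = fetcher.split_dataset_by_indices(
...     train_indices=range(0, 80),
...     valid_indices=range(80, 90),
...     test_indices=range(90, 100),
... )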
- split_dataset_by_prop(train_prop: float = 0.0, valid_prop: float = 0.0, test_prop: float = 0.0)#
Split the covariates and labels into the specified proportions.
opendataval.dataloader.noisify module#
- class opendataval.dataloader.noisify.NoiseFunc(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)#
Bases: FuncEnum
- ADD_GAUSS_NOISE = add_gauss_noise#
- MIX_LABELS = mix_labels#
- opendataval.dataloader.noisify.add_gauss_noise(fetcher: DataFetcher, noise_rate: float = 0.2, mu: float = 0.0, sigma: float = 1.0) → dict[str, Dataset | ndarray]#
Add gaussian noise to covariates.
Parameters#
- fetcher : DataFetcher
DataFetcher object housing the data to add noise to.
- noise_rate : float
Proportion of training covariates to add noise to.
- mu : float, optional
Center of the gaussian distribution the noise is generated from, by default 0.
- sigma : float, optional
Standard deviation of the gaussian distribution, by default 1.
Returns#
- dict[str, np.ndarray]
Dictionary of updated data points:
"x_train" – Updated training covariates with added gaussian noise
"noisy_train_indices" – Indices of the training data points with added noise
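Examples#
A sketch calling the function directly on an already-split fetcher; it can also be passed to DataFetcher.noisify. The rate, mu, and sigma are illustrative:
>>> from opendataval.dataloader.noisify import add_gauss_noise
>>> updates = add_gauss_noise(fetcher, noise_rate=0.2, mu=0.0, sigma=1.0)
>>> x_train_noisy = updates["x_train"]
>>> noisy_idx = updates["noisy_train_indices"]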
- opendataval.dataloader.noisify.mix_labels(fetcher: DataFetcher, noise_rate: float = 0.2) → dict[str, ndarray]#
Mixes the y_train labels of a DataFetcher, adding noise to the data.
For a given set of n unique labels, each noisy label is shifted forward by 1 to n-1 steps, which prevents selecting the same label when noise is added.
Parameters#
- fetcher : DataFetcher
DataFetcher object housing the data to add noise to.
- noise_rate : float
Proportion of labels to add noise to.
Returns#
- dict[str, np.ndarray]
Dictionary of updated data points:
"y_train" – Updated training labels, mixed
"y_valid" – Updated validation labels, mixed
"noisy_train_indices" – Indices of the training data points with mixed labels
opendataval.dataloader.register module#
- class opendataval.dataloader.register.Register(dataset_name: str, one_hot: bool = False, cacheable: bool = False, presplit: bool = False)#
Bases: object
Register a data set by defining its name and adding functions to retrieve data.
Registers data sets to be fetched by the DataFetcher and allows specific transformations to be applied to a data set. This gives the benefit of creating distinct Register objects to distinguish separate data sets.
Parameters#
- dataset_name : str
Data set name.
- one_hot : bool, optional
Whether the data set labels are one-hot encoded, by default False.
- cacheable : bool, optional
Whether the data set can be downloaded and cached, by default False.
- presplit : bool, optional
Whether the data set was presplit, by default False.
Warns#
- Warning
Register keeps track of all registered data set names, which must be unique; if there are any duplicates, the user is warned.
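Examples#
A sketch registering a new data set from hypothetical in-memory arrays; the name "my_dataset" must be unique:
>>> import numpy as np
>>> from opendataval.dataloader.register import Register
>>> from opendataval.dataloader.fetcher import DataFetcher
>>> covar = np.random.rand(100, 5)
>>> labels = np.random.randint(0, 2, size=(100,))
>>> Register("my_dataset").from_data(covar, labels, one_hot=False)
>>> fetcher = DataFetcher("my_dataset")  # now visible to the DataFetcher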
- CACHE_DIR = 'data_files'#
Default directory to cache downloads to.
- Datasets: ClassVar[dict[str, Self]]#
Directory of all registered/downloadable data set functions, keyed by name: '2dplanes', 'MiniBooNE', 'adult', 'bbc', 'bbc-embeddings', 'breast_cancer', 'challenge-iris', 'cifar10', 'cifar10-embeddings', 'cifar100', 'creditcard', 'diabetes', 'digits', 'echoMonths', 'election', 'electricity', 'fashion', 'fried', 'gaussian_classifier', 'gaussian_classifier_high_dim', 'imdb', 'imdb-embeddings', 'iris', 'linnerud', 'lowbwt', 'mnist', 'mv', 'nomao', 'pol', 'stl10-embeddings', 'stock', 'svhn-embeddings', 'wave_energy'.
- add_covar_transform(transform: Callable[[ndarray], ndarray])#
Add covariate transform after data is fetched.
- add_label_transform(transform: Callable[[ndarray], ndarray])#
Add label transform after data is fetched.
- from_covar_func(func: Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]], *args, **kwargs) → Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]]#
Register a data set from two Callables; this one registers the covariates Callable.
- from_covar_label_func(func: Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]], *args, **kwargs) → Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]]#
Register a data set from a single Callable returning (covariates, labels).
- from_csv(filepath: str, label_columns: str | list)#
Register a data set from a csv file.
- from_data(covar: ndarray, label: ndarray, one_hot: bool = False)#
Register a data set from covariate and label numpy arrays.
- from_label_func(func: Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]], *args, **kwargs) → Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]]#
Register a data set from two Callables; this one registers the labels Callable.
- from_numpy(array: ndarray, label_columns: int | Sequence[int])#
Register a data set from a numpy array, with label_columns selecting the label columns.
- from_pandas(df: DataFrame, label_columns: str | list)#
Register a data set from a pandas data frame.
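Examples#
A sketch registering from a hypothetical pandas data frame whose "target" column holds the labels:
>>> import pandas as pd
>>> from opendataval.dataloader.register import Register
>>> df = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "f2": [1.0, 0.5, 0.2], "target": [0, 1, 0]})
>>> Register("my_pandas_dataset").from_pandas(df, label_columns="target")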
- load_data(cache_dir: str | None = None, force_download: bool = False) → tuple[Dataset, ndarray]#
Retrieve data from the specified data input functions.
Loads the covariates and labels from the registered callables, applies transformations, and returns the covariates and labels.
Parameters#
- cache_dir : str, optional
Directory where the loaded data is cached, by default None, which uses Register.CACHE_DIR.
- force_download : bool, optional
Forces download from the source URL, by default False.
Returns#
- (np.ndarray | Dataset, np.ndarray)
Transformed covariates and labels of the data set.
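Examples#
A sketch loading a registered data set's raw covariates and labels directly; "iris" is one of the names in Register.Datasets above:
>>> covar, labels = Register.Datasets["iris"].load_data()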
- opendataval.dataloader.register.cache(url: str, cache_dir: Path, file_name: str | None = None, force_download: bool = False) → Path#
Download a file if it is not present and return the filepath.
Parameters#
- url : str
URL of the file to be downloaded.
- cache_dir : str
Directory to cache downloaded files.
- file_name : str, optional
File name within the cache directory for the downloaded file, by default None.
- force_download : bool, optional
Forces a download regardless of whether the file is present, by default False.
Returns#
- Path
File path to the downloaded file.
Raises#
- HTTPError
An HTTP error occurred while downloading the data set.
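Examples#
A sketch with a hypothetical URL; the file is downloaded only if it is not already cached:
>>> from pathlib import Path
>>> from opendataval.dataloader.register import cache
>>> file_path = cache(
...     url="https://example.com/data.csv",
...     cache_dir=Path("data_files"),
...     file_name="data.csv",
... )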
opendataval.dataloader.util module#
- class opendataval.dataloader.util.CatDataset(*datasets: list[Dataset[Any]])#
Bases: Dataset[tuple[Dataset, ...]]
Data set wrapping indexable Datasets.
Parameters#
- datasets : tuple[Dataset]
Tuple of data sets we would like to concatenate together; they must be the same length.
Raises#
- ValueError
If the input data sets are not all the same length.
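Examples#
A sketch pairing hypothetical covariate and label tensors so that indexing yields them together:
>>> import torch
>>> from torch.utils.data import DataLoader, TensorDataset
>>> from opendataval.dataloader.util import CatDataset
>>> covariates = TensorDataset(torch.rand(100, 5))
>>> labels = TensorDataset(torch.randint(0, 2, (100,)))
>>> paired = CatDataset(covariates, labels)  # both must be the same length
>>> loader = DataLoader(paired, batch_size=32)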
- class opendataval.dataloader.util.FolderDataset(folder_path: Path, sizes: list[int] | None = None)#
Bases: Dataset
Dataset for tensors within a folder.
- BATCH_CACHE = 5#
- static exists(path: Path)#
- format_batch_path(batch_index: int) → str#
- get_batch(batch_index: int) → Tensor#
- classmethod load(path: Path)#
Loads existing FolderDataset metadata from path/.metadata.pkl.
- property metadata: dict[str, Any]#
Important metadata defining a FolderDataset, used for loading.
- save()#
Saves metadata to disk, allowing us to load the FolderDataset as needed.
- property shape: tuple[int, ...]#
- write(batch_number: int, data: Tensor)#
- class opendataval.dataloader.util.IndexTransformDataset(dataset: Dataset[T_co], index_transformation: Callable[[T_co, int], T_co] | None = None)#
Bases: Dataset[T_co]
Data set wrapper that allows a per-index transform to be applied.
Primarily useful when adding noise to a specific subset of indices. If a transform is defined, it applies the transformation and also passes in the index (what is passed into __getitem__).
Parameters#
- dataset : Dataset[T_co]
Data set with the transform to be applied.
- index_transformation : Callable[[T_co, int], T_co], optional
Function that takes the data and its index and applies the transform for that index, by default None, which means no transform.
- property transform: Callable[[T_co, int], T_co]#
Gets the transform function; if None, no transformation is applied.
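Examples#
A sketch applying gaussian noise only to a hypothetical subset of indices; the transform receives the item and its index:
>>> import torch
>>> from torch.utils.data import TensorDataset
>>> from opendataval.dataloader.util import IndexTransformDataset
>>> base = TensorDataset(torch.rand(100, 5))
>>> noisy = {3, 7, 42}  # indices to perturb
>>> def perturb(item, index):
...     # add noise only for flagged indices; pass others through unchanged
...     return tuple(t + torch.randn_like(t) for t in item) if index in noisy else item
>>> dataset = IndexTransformDataset(base, perturb)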
- class opendataval.dataloader.util.ListDataset(input_list: Sequence[T_co])#
Bases: Dataset[T_co]
Data set wrapping a list.
ListDataset is primarily useful when you want to pass back a list but also want to get around the type checks of Datasets. It is intended to be used with NLP data sets, as the axis-1 dimension is variable and BERT tokenizers take inputs only as lists.
Parameters#
- input_list : Sequence[T_co]
Input sequence to be used as the data set.
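Examples#
A sketch wrapping hypothetical variable-length text inputs for a tokenizer:
>>> from opendataval.dataloader.util import ListDataset
>>> texts = ListDataset(["a short sentence", "another, somewhat longer sentence"])
>>> first = texts[0]  # indexes like the wrapped list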
- opendataval.dataloader.util.load_tensor(tensor_path: Path) → Tensor#
Module contents#
Create data sets and load them with the DataFetcher.
Data Loader#
Provides an API to add new data sets and load them with the data loader.
To create a new data set, create a Register object to register the data set with a name, then load it with the DataFetcher. This gives us the flexibility to call the data set later and to define separate functions/classes for the covariates and labels of a data set.
Creating/Loading data sets#
- Register
Register a data set by defining its name and adding functions to retrieve data.
- DataFetcher
Load data for an experiment from an input data set name.
- datasets
Data sets registered with Register.
Utils#
- cache
Download a file if it is not present and return the filepath.
- mix_labels
Mixes y_train labels of a DataFetcher, adding noise to data.
- one_hot_encode
One hot encodes a numpy array.
- CatDataset
Data set wrapping indexable Datasets.