opendataval.dataloader package#

Subpackages#

Submodules#

opendataval.dataloader.fetcher module#

class opendataval.dataloader.fetcher.DataFetcher(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None)#

Bases: object

Load data for an experiment from an input data set name.

Facade for a Register object; prepares the data and provides an API for subsequent splitting, adding noise, and transforming into a tensor.

Parameters#

dataset_name : str

Name of the data set; must be registered with Register

cache_dir : str, optional

Directory in which to cache the loaded data, by default None, which uses Register.CACHE_DIR

force_download : bool, optional

Forces download from the source URL, by default False

random_state : RandomState, optional

Initial random state, by default None

Attributes#

datapoints : tuple[torch.Tensor, …]

Train+Valid+Test covariates and labels

covar_dim : tuple[int, …]

Covariate dimensions of the loaded data set.

label_dim : tuple[int, …]

Label dimensions of the loaded data set.

num_points : int

Number of data points in the total data set

one_hot : bool

If True, the data set has categorical labels as one hot encodings

[train/valid/test]_indices : np.ndarray[int]

The indices of the original data set used to make the corresponding train/valid/test data set.

noisy_train_indices : np.ndarray[int]

The indices of training data points with noise added to them.

covar : Dataset | np.ndarray

Covariate data set returned by a data set function

labels : np.ndarray

Corresponding labels for the covariates from a data set function

[x/y]_[train/valid/test] : np.ndarray

Access to the raw [covariate/label] [train/valid/test] split prior to being transformed into a tensor. Useful for functions that add noise.

Raises#

KeyError

In order to use a data set, you must register it by creating a Register

ValueError

Loaded data set covariates and labels must be of the same length.

ValueError

All covariates must be of the same dimension. All labels must be of the same dimension.

ValueError

Splits must not exceed the length of the data set: if the splits are ints, their total must be less than the length; if they are floats, they must sum to less than 1.0; any other type raises this error.

ValueError

Specified indices must not repeat and must not fall outside the range of the data set.
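Examples#

A minimal sketch of the workflow above, assuming the “iris” data set is registered (it appears in the Datasets registry below):

>>> from opendataval.dataloader import DataFetcher
>>> fetcher = DataFetcher(dataset_name="iris")  # fetch a registered data set by name
>>> fetcher = fetcher.split_dataset_by_count(train_count=50, valid_count=30, test_count=20)
>>> # Split tensors, ready to be passed to a DataEvaluator
>>> x_train, y_train, x_valid, y_valid, x_test, y_test = fetcher.datapoints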

property covar_dim: tuple[int, ...]#

Get covar dimensions.

property datapoints#

Return split data points to be input into a DataEvaluator as tensors.

Returns#

(torch.Tensor | Dataset, torch.Tensor)

Training Covariates, Training Labels

(torch.Tensor | Dataset, torch.Tensor)

Validation Covariates, Valid Labels

(torch.Tensor | Dataset, torch.Tensor)

Test Covariates, Test Labels

static datasets_available() set[str]#

Get set of available data set names.

export_dataset(covariates_names: list[str], labels_names: list[str], output_directory: Path = PosixPath('/home/runner/work/opendataval/opendataval/docs'))#
classmethod from_data(covar: Dataset | ndarray, labels: ndarray, one_hot: bool, random_state: RandomState | None = None)#

Return DataFetcher from input Covariates and Labels.

Parameters#

covar : Union[Dataset, np.ndarray]

Input covariates

labels : np.ndarray

Input labels; no transformation is applied, so if the labels should be one hot encoded, encode them before passing them in

one_hot : bool

Whether the input labels have already been one hot encoded. This is just a flag; no transform will be applied

random_state : RandomState, optional

Initial random state, by default None

Raises#

ValueError

Input covariates and labels are of different length, no 1-to-1 mapping.
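Examples#

A sketch with synthetic numpy data; the shapes and values are illustrative only:

>>> import numpy as np
>>> from opendataval.dataloader import DataFetcher
>>> covar = np.random.rand(100, 10)  # 100 points, 10 features
>>> labels = np.random.randint(2, size=(100,))  # binary labels, not one hot encoded
>>> fetcher = DataFetcher.from_data(covar, labels, one_hot=False)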

classmethod from_data_splits(x_train: Dataset | ndarray, y_train: ndarray, x_valid: Dataset | ndarray, y_valid: ndarray, x_test: Dataset | ndarray, y_test: ndarray, one_hot: bool, random_state: RandomState | None = None)#

Return DataFetcher from already split data.

Parameters#

x_train : Union[Dataset, np.ndarray]

Input training covariates

y_train : np.ndarray

Input training labels

x_valid : Union[Dataset, np.ndarray]

Input validation covariates

y_valid : np.ndarray

Input validation labels

x_test : Union[Dataset, np.ndarray]

Input testing covariates

y_test : np.ndarray

Input testing labels

one_hot : bool

Whether the label data has already been one hot encoded. This is just a flag; no transform will be applied

random_state : RandomState, optional

Initial random state, by default None

Raises#

ValueError

Loaded data set covariates and labels must be of the same length.

ValueError

All covariates must be of the same dimension. All labels must be of the same dimension.
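Examples#

A sketch with pre-split synthetic arrays (shapes illustrative only):

>>> import numpy as np
>>> from opendataval.dataloader import DataFetcher
>>> x_train, y_train = np.random.rand(80, 5), np.random.randint(2, size=(80,))
>>> x_valid, y_valid = np.random.rand(10, 5), np.random.randint(2, size=(10,))
>>> x_test, y_test = np.random.rand(10, 5), np.random.randint(2, size=(10,))
>>> fetcher = DataFetcher.from_data_splits(
...     x_train, y_train, x_valid, y_valid, x_test, y_test, one_hot=False
... )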

property label_dim: tuple[int, ...]#

Get label dimensions.

noisify(add_noise: Callable[[Self, Any, ...], dict[str, Any]] | str | None = None, *noise_args, **noise_kwargs)#

Add noise to the data points.

Adds noise to the data set and saves the indices of the noisy data. The return value of add_noise is a dict whose keys signify how the data are updated: {‘x_train’, ‘y_train’, ‘x_valid’, ‘y_valid’, ‘x_test’, ‘y_test’, ‘noisy_train_indices’}

Parameters#

add_noise : Callable | str, optional

If None, no changes are made. The callable takes the DataFetcher as its required argument and adds noise to its data points as needed. It returns a dict[str, np.ndarray] of the updated arrays, used to update the data loader, with the following keys:

  • “x_train” – Updated training covariates with noise, optional

  • “y_train” – Updated training labels with noise, optional

  • “x_valid” – Updated validation covariates with noise, optional

  • “y_valid” – Updated validation labels with noise, optional

  • “x_test” – Updated testing covariates with noise, optional

  • “y_test” – Updated testing labels with noise, optional

  • “noisy_train_indices” – Indices of training data set with noise.

args : tuple[Any]

Additional positional arguments passed to add_noise

kwargs : dict[str, Any]

Additional keyword arguments passed to add_noise

Returns#

self : object

Returns a DataFetcher with noise added to the data set.
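Examples#

A sketch applying the package’s mix_labels noise function (documented below) to an already-split fetcher; extra keyword arguments are forwarded to add_noise:

>>> from opendataval.dataloader import DataFetcher, mix_labels
>>> fetcher = DataFetcher(dataset_name="iris").split_dataset_by_count(50, 30, 20)
>>> fetcher = fetcher.noisify(mix_labels, noise_rate=0.1)
>>> corrupted = fetcher.noisy_train_indices  # indices that received noise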

property num_points: int#

Get total number of data points.

classmethod setup(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None, train_count: int | float = 0, valid_count: int | float = 0, test_count: int | float = 0, add_noise: Callable[[Self, Any, ...], dict[str, Any]] | None = None, noise_kwargs: dict[str, Any] | None = None)#

Create, split, and add noise to DataFetcher from input arguments.
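Examples#

A sketch of the one-call equivalent of constructing, splitting, and noisifying a fetcher, assuming the “iris” data set is registered:

>>> from opendataval.dataloader import DataFetcher, mix_labels
>>> fetcher = DataFetcher.setup(
...     dataset_name="iris",
...     train_count=50, valid_count=30, test_count=20,
...     add_noise=mix_labels, noise_kwargs={"noise_rate": 0.1},
... )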

split_dataset_by_count(train_count: int = 0, valid_count: int = 0, test_count: int = 0)#

Split the covariates and labels to the specified counts.

Parameters#

train_count : int

Number/proportion of training points

valid_count : int

Number/proportion of validation points

test_count : int

Number/proportion of test points

Returns#

self : object

Returns a DataFetcher with covariates, labels split into train/valid/test.

Raises#

ValueError

Invalid input for splitting the data set: either a proportion is more than 1 or the total of the splits is greater than len(dataset)

split_dataset_by_indices(train_indices: Sequence[int] | None = None, valid_indices: Sequence[int] | None = None, test_indices: Sequence[int] | None = None)#

Split the covariates and labels to the specified indices.

Parameters#

train_indices : Sequence[int]

Indices of the training data set

valid_indices : Sequence[int]

Indices of the validation data set

test_indices : Sequence[int]

Indices of the test data set

Returns#

self : object

Returns a DataFetcher with covariates, labels split into train/valid/test.

Raises#

ValueError

Invalid indices for the train, valid, or test data set, e.g. at least one data point leaks across (is shared between) the splits.

split_dataset_by_prop(train_prop: float = 0.0, valid_prop: float = 0.0, test_prop: float = 0.0)#

Split the covariates and labels to the specified proportions.
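Examples#

A sketch splitting a registered data set into 70/15/15 proportions:

>>> from opendataval.dataloader import DataFetcher
>>> fetcher = DataFetcher(dataset_name="iris")
>>> fetcher = fetcher.split_dataset_by_prop(train_prop=0.7, valid_prop=0.15, test_prop=0.15)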

opendataval.dataloader.noisify module#

class opendataval.dataloader.noisify.NoiseFunc(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)#

Bases: FuncEnum

ADD_GAUSS_NOISE = add_gauss_noise#
MIX_LABELS = mix_labels#
opendataval.dataloader.noisify.add_gauss_noise(fetcher: DataFetcher, noise_rate: float = 0.2, mu: float = 0.0, sigma: float = 1.0) dict[str, Dataset | ndarray]#

Add Gaussian noise to covariates.

Parameters#

fetcher : DataFetcher

DataFetcher object housing the data to have noise added to

noise_rate : float

Proportion of data points to add noise to

mu : float, optional

Center of the Gaussian distribution from which noise is generated, by default 0

sigma : float, optional

Standard deviation of the Gaussian distribution, by default 1

Returns#

dict[str, np.ndarray]

dictionary of updated data points

  • “x_train” – Updated training covariates with added Gaussian noise

  • “noisy_train_indices” – Indices of training data points with added noise
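Examples#

A sketch routing this function through DataFetcher.noisify; mu and sigma are forwarded as keyword arguments:

>>> from opendataval.dataloader import DataFetcher
>>> from opendataval.dataloader.noisify import add_gauss_noise
>>> fetcher = DataFetcher(dataset_name="iris").split_dataset_by_count(50, 30, 20)
>>> fetcher = fetcher.noisify(add_gauss_noise, noise_rate=0.2, mu=0.0, sigma=1.0)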

opendataval.dataloader.noisify.mix_labels(fetcher: DataFetcher, noise_rate: float = 0.2) dict[str, ndarray]#

Mixes y_train labels of a DataFetcher, adding noise to data.

For a data set with n unique labels, each corrupted label is shifted forward between 1 and n-1 steps (modulo n). This guarantees the new label differs from the original.

Parameters#

fetcher : DataFetcher

DataFetcher object housing the data to have noise added to

noise_rate : float

Proportion of labels to add noise to

Returns#

dict[str, np.ndarray]

dictionary of updated data points

  • “y_train” – Updated training labels mixed

  • “y_valid” – Updated validation labels mixed

  • “noisy_train_indices” – Indices of training data set with mixed labels
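Examples#

A sketch calling mix_labels directly on a split fetcher; in normal use it is passed to DataFetcher.noisify instead:

>>> from opendataval.dataloader import DataFetcher, mix_labels
>>> fetcher = DataFetcher(dataset_name="iris").split_dataset_by_count(50, 30, 20)
>>> updates = mix_labels(fetcher, noise_rate=0.2)
>>> sorted(updates)  # keys per the Returns section above
['noisy_train_indices', 'y_train', 'y_valid']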

opendataval.dataloader.register module#

class opendataval.dataloader.register.Register(dataset_name: str, one_hot: bool = False, cacheable: bool = False, presplit: bool = False)#

Bases: object

Register a data set by defining its name and adding functions to retrieve data.

Registers data sets to be fetched by the DataFetcher. Also allows specific transformations to be applied to a data set. Creating separate Register objects makes it possible to distinguish separate data sets.

Parameters#

dataset_name : str

Data set name

one_hot : bool, optional

Whether the data set labels are one hot encoded, by default False

cacheable : bool, optional

Whether the data set can be downloaded and cached, by default False

presplit : bool, optional

Whether the data set was presplit, by default False

Warns#

Warning

Register keeps track of all registered data set names, which must be unique. If there are any duplicates, the user is warned.
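Examples#

A sketch registering a synthetic data set under the hypothetical name "gaussian_example" and fetching it back by name; the generator function is illustrative only:

>>> import numpy as np
>>> from opendataval.dataloader import DataFetcher, Register
>>> def make_data():
...     covar = np.random.rand(100, 3)
...     labels = (covar.sum(axis=1) > 1.5).astype(int)
...     return covar, labels
>>> loader = Register("gaussian_example").from_covar_label_func(make_data)
>>> fetcher = DataFetcher("gaussian_example")  # now loadable by name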

CACHE_DIR = 'data_files'#

Default directory to cache downloads to.

Datasets: ClassVar[dict[str, Self]]#

Registry of all registered/downloadable data set functions, keyed by name. Currently registered: '2dplanes', 'MiniBooNE', 'adult', 'bbc', 'bbc-embeddings', 'breast_cancer', 'challenge-iris', 'cifar10', 'cifar10-embeddings', 'cifar100', 'creditcard', 'diabetes', 'digits', 'echoMonths', 'election', 'electricity', 'fashion', 'fried', 'gaussian_classifier', 'gaussian_classifier_high_dim', 'imdb', 'imdb-embeddings', 'iris', 'linnerud', 'lowbwt', 'mnist', 'mv', 'nomao', 'pol', 'stl10-embeddings', 'stock', 'svhn-embeddings', 'wave_energy'.

add_covar_transform(transform: Callable[[ndarray], ndarray])#

Add covariate transform after data is fetched.

add_label_transform(transform: Callable[[ndarray], ndarray])#

Add label transform after data is fetched.

from_covar_func(func: Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]], *args, **kwargs) Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]]#

Register a data set from two Callables; this registers the covariates Callable.

from_covar_label_func(func: Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]], *args, **kwargs) Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]]#

Register data set from Callable -> (covariates, labels).

from_csv(filepath: str, label_columns: str | list)#

Register data set from csv file.

from_data(covar: ndarray, label: ndarray, one_hot: bool = False)#

Register data set from covariate and label numpy array.

from_label_func(func: Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]], *args, **kwargs) Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]]#

Register a data set from two Callables; this registers the labels Callable.

from_numpy(array: ndarray, label_columns: int | Sequence[int])#

Register data set from covariate and label numpy array.

from_pandas(df: DataFrame, label_columns: str | list)#

Register data set from pandas data frame.
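Examples#

A sketch registering a data set from an in-memory DataFrame; the frame and name are illustrative only:

>>> import pandas as pd
>>> from opendataval.dataloader import Register
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "label": [0, 1, 0]})
>>> reg = Register("frame_example").from_pandas(df, label_columns="label")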

load_data(cache_dir: str | None = None, force_download: bool = False) tuple[Dataset, ndarray]#

Retrieve data from specified data input functions.

Loads the covariates and labels from the registered callables, applies transformations, and returns the covariates and labels.

Parameters#

cache_dir : str, optional

Directory in which to cache the loaded data, by default None, which uses Register.CACHE_DIR

force_download : bool, optional

Forces download from the source URL, by default False

Returns#

(np.ndarray | Dataset, np.ndarray)

Transformed covariates and labels of the data set

opendataval.dataloader.register.cache(url: str, cache_dir: Path, file_name: str | None = None, force_download: bool = False) Path#

Download a file if it is not present and return the filepath.

Parameters#

url : str

URL of the file to be downloaded

cache_dir : Path

Directory to cache downloaded files

file_name : str, optional

File name within the cache directory for the downloaded file, by default None

force_download : bool, optional

Forces a download regardless of whether the file is present, by default False

Returns#

Path

File path to the downloaded file

Raises#

HTTPError

HTTP error occurred while downloading the data set.

opendataval.dataloader.register.one_hot_encode(data: ndarray) ndarray#

One hot encodes a numpy array.

Raises#

ValueError

When the input array is not of shape (N,), (N,1), (N,1,1)…
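Examples#

A sketch: class indices of shape (N,) become a 2-D array with one column per class:

>>> import numpy as np
>>> from opendataval.dataloader import one_hot_encode
>>> encoded = one_hot_encode(np.array([0, 2, 1]))  # one row per point, one column per class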

opendataval.dataloader.util module#

class opendataval.dataloader.util.CatDataset(*datasets: list[Dataset[Any]])#

Bases: Dataset[tuple[Dataset, …]]

Data set wrapping indexable Datasets.

Parameters#

datasets : tuple[Dataset]

Tuple of data sets to concatenate together; all must be the same length

Raises#

ValueError

If the input data sets are not all the same length
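Examples#

A sketch pairing covariates with labels so they can be indexed together; the tensors are illustrative only:

>>> import torch
>>> from opendataval.dataloader import CatDataset
>>> covar, labels = torch.rand(10, 3), torch.rand(10, 1)
>>> dataset = CatDataset(covar, labels)
>>> x0, y0 = dataset[0]  # each item is a tuple with one entry per wrapped data set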

class opendataval.dataloader.util.FolderDataset(folder_path: Path, sizes: list[int] | None = None)#

Bases: Dataset

Dataset for tensors within a folder.

BATCH_CACHE = 5#
static exists(path: Path)#
format_batch_path(batch_index: int) str#
get_batch(batch_index: int) Tensor#
classmethod load(path: Path)#

Loads existing gradient dataset metadata from path/.metadata.pkl

property metadata: dict[str, Any]#

Important metadata defining a GradientDataset, used for loading.

save()#

Saves metadata to disk, allows us to load GradientDataset as needed.

property shape: tuple[int, ...]#
write(batch_number: int, data: Tensor)#
class opendataval.dataloader.util.IndexTransformDataset(dataset: Dataset[T_co], index_transformation: Callable[[T_co, int], T_co] | None = None)#

Bases: Dataset[T_co]

Data set wrapper that allows a per-index transform to be applied.

Primarily useful when adding noise to a specific subset of indices. If a transform is defined, it is applied to each datum along with its index (what is passed into __getitem__).

Parameters#

dataset : Dataset[T_co]

Data set with a transform to be applied

index_transformation : Callable[[T_co, int], T_co], optional

Function that takes a datum and its index and applies the transform for that index, by default None, which applies no transform.

property transform: Callable[[T_co, int], T_co]#

Gets the transform function; if None, no transformation is applied.
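Examples#

A sketch adding noise only to a chosen set of indices; the wrapped tensor stands in for a Dataset and the index set is illustrative:

>>> import torch
>>> from opendataval.dataloader.util import IndexTransformDataset
>>> base = torch.arange(10.0).unsqueeze(1)
>>> noisy_indices = {2, 5, 7}
>>> def maybe_noise(datum, idx):
...     return datum + torch.randn_like(datum) if idx in noisy_indices else datum
>>> wrapped = IndexTransformDataset(base, maybe_noise)
>>> item = wrapped[2]  # transformed; indices outside noisy_indices pass through unchanged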

class opendataval.dataloader.util.ListDataset(input_list: Sequence[T_co])#

Bases: Dataset[T_co]

Data set wrapping a list.

ListDataset is primarily useful when you want to pass back a list but also want to get around the type checks of Datasets. It is intended for NLP data sets, where the axis 1 dimension is variable and BERT tokenizers take inputs only as lists.

Parameters#

input_list : Sequence[T_co]

Input sequence to be used as data set.

opendataval.dataloader.util.load_tensor(tensor_path: Path) Tensor#

Module contents#

Create data sets and load them with DataFetcher.

Data Loader#

Provides an API to add new data sets and load them with the data loader. To create a new data set, create a Register object to register the data set with a name, then load the data set with DataFetcher. This allows the flexibility to call the data set later and to define separate functions/classes for the covariates and labels of a data set.

Creating/Loading data sets#

Register(dataset_name[, one_hot, cacheable, ...])

Register a data set by defining its name and adding functions to retrieve data.

DataFetcher(dataset_name[, cache_dir, ...])

Load data for an experiment from an input data set name.

datasets

Data sets registered with Register.

Utils#

cache(url, cache_dir[, file_name, ...])

Download a file if it is not present and return the filepath.

mix_labels(fetcher[, noise_rate])

Mixes y_train labels of a DataFetcher, adding noise to data.

one_hot_encode(data)

One hot encodes a numpy array.

CatDataset(*datasets)

Data set wrapping indexable Datasets.