opendataval.dataloader package#
Subpackages#
Submodules#
opendataval.dataloader.fetcher module#
- class opendataval.dataloader.fetcher.DataFetcher(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None)#
Bases: object
Load data for an experiment from an input data set name.
Facade for a Register object; prepares the data and provides an API for subsequent splitting, adding noise, and transforming into a tensor.
Parameters#
- dataset_name : str
Name of the data set; must be registered with Register.
- cache_dir : str, optional
Directory where the loaded data is cached, by default None, which uses Register.CACHE_DIR.
- force_download : bool, optional
Forces download from the source URL, by default False.
- random_state : RandomState, optional
Random initial state, by default None.
Attributes#
- datapoints : tuple[torch.Tensor, ...]
Train + valid + test covariates and labels.
- covar_dim : tuple[int, ...]
Covariates dimension of the loaded data set.
- label_dim : tuple[int, ...]
Label dimension of the loaded data set.
- num_points : int
Number of data points in the total data set.
- one_hot : bool
If True, the data set has categorical labels as one-hot encodings.
- [train/valid/test]_indices : np.ndarray[int]
The indices of the original data set used to make each split.
- noisy_train_indices : np.ndarray[int]
The indices of training data points with noise added to them.
- covar : Dataset | np.ndarray
Covariate dataset from a dataset function.
- labels : np.ndarray
Corresponding labels for the covariates from a dataset function.
- [x/y]_[train/valid/test] : np.ndarray
Access to the raw [covariate/label] [train/valid/test] split prior to being transformed into a tensor. Useful for noise-adding functions.
Raises#
- KeyError
To use a data set, it must first be registered by creating a Register object with its name.
- ValueError
Loaded data set covariates and labels must be of the same length.
- ValueError
All covariates must be of the same dimension; all labels must be of the same dimension.
- ValueError
Splits must not exceed the length of the data set: if the splits are ints, the values must be less than the length; if they are floats, they must be less than 1.0; anything else raises an error.
- ValueError
Specified indices must not repeat and must not be outside the range of the data set.
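Examples#
A minimal usage sketch; the split counts are illustrative, and "iris" is one of the registered names listed under Register.Datasets below. The final unpacking assumes datapoints yields the six train/valid/test arrays in order, as described in the attribute list above.
>>> from opendataval.dataloader.fetcher import DataFetcher
>>> fetcher = DataFetcher(dataset_name="iris")
>>> fetcher = fetcher.split_dataset_by_count(train_count=100, valid_count=25, test_count=25)
>>> # datapoints yields the train/valid/test covariates and labels in order
>>> x_train, y_train, x_valid, y_valid, x_test, y_test = fetcher.datapoints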
- property covar_dim: tuple[int, ...]#
Get covar dimensions.
- property datapoints#
Return split data points to be input into a DataEvaluator as tensors.
Returns#
- (torch.Tensor | Dataset, torch.Tensor)
Training covariates, training labels
- (torch.Tensor | Dataset, torch.Tensor)
Validation covariates, validation labels
- (torch.Tensor | Dataset, torch.Tensor)
Test covariates, test labels
- static datasets_available() → set[str]#
Get set of available data set names.
- export_dataset(covariates_names: list[str], labels_names: list[str], output_directory: Path = PosixPath('/home/runner/work/opendataval/opendataval/docs'))#
- classmethod from_data(covar: Dataset | ndarray, labels: ndarray, one_hot: bool, random_state: RandomState | None = None)#
Return a DataFetcher from input covariates and labels.
Parameters#
- covar : Union[Dataset, np.ndarray]
Input covariates.
- labels : np.ndarray
Input labels. No transformation is applied, so if the labels should be one-hot encoded, encode them before passing them in.
- one_hot : bool
Whether the input labels have already been one-hot encoded. This is just a flag; no transform will be applied.
- random_state : RandomState, optional
Initial random state, by default None.
Raises#
- ValueError
Input covariates and labels are of different lengths, so there is no 1-to-1 mapping.
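Examples#
A sketch with hypothetical numpy arrays (100 points, 10 features, integer class labels):
>>> import numpy as np
>>> from opendataval.dataloader.fetcher import DataFetcher
>>> covar = np.random.rand(100, 10)
>>> labels = np.random.randint(0, 3, size=(100,))
>>> fetcher = DataFetcher.from_data(covar, labels, one_hot=False)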
- classmethod from_data_splits(x_train: Dataset | ndarray, y_train: ndarray, x_valid: Dataset | ndarray, y_valid: ndarray, x_test: Dataset | ndarray, y_test: ndarray, one_hot: bool, random_state: RandomState | None = None)#
Return a DataFetcher from already-split data.
Parameters#
- x_train : Union[Dataset, np.ndarray]
Input training covariates.
- y_train : np.ndarray
Input training labels.
- x_valid : Union[Dataset, np.ndarray]
Input validation covariates.
- y_valid : np.ndarray
Input validation labels.
- x_test : Union[Dataset, np.ndarray]
Input testing covariates.
- y_test : np.ndarray
Input testing labels.
- one_hot : bool
Whether the label data has already been one-hot encoded. This is just a flag; no transform will be applied.
- random_state : RandomState, optional
Initial random state, by default None.
Raises#
- ValueError
Loaded data set covariates and labels must be of the same length.
- ValueError
All covariates must be of the same dimension; all labels must be of the same dimension.
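Examples#
A sketch with hypothetical pre-split arrays; no further splitting is needed afterwards:
>>> import numpy as np
>>> from opendataval.dataloader.fetcher import DataFetcher
>>> rng = np.random.default_rng(0)
>>> x_train, x_valid, x_test = rng.random((80, 5)), rng.random((10, 5)), rng.random((10, 5))
>>> y_train, y_valid, y_test = rng.integers(0, 2, 80), rng.integers(0, 2, 10), rng.integers(0, 2, 10)
>>> fetcher = DataFetcher.from_data_splits(
...     x_train, y_train, x_valid, y_valid, x_test, y_test, one_hot=False
... )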
- property label_dim: tuple[int, ...]#
Get label dimensions.
- noisify(add_noise: Callable[[Self, Any, ...], dict[str, Any]] | str | None = None, *noise_args, **noise_kwargs)#
Add noise to the data points.
Adds noise to the data set and saves the indices of the noisy data. The return value of add_noise is a dict whose keys signify how the data are updated: {'x_train', 'y_train', 'x_valid', 'y_valid', 'x_test', 'y_test', 'noisy_train_indices'}.
Parameters#
- add_noise : Callable
If None, no changes are made. Takes the DataFetcher as its required argument and adds noise to its data points as needed. Returns a dict[str, np.ndarray] of the updated arrays, used to update the data loader, with the following keys:
"x_train" – Updated training covariates with noise, optional
"y_train" – Updated training labels with noise, optional
"x_valid" – Updated validation covariates with noise, optional
"y_valid" – Updated validation labels with noise, optional
"x_test" – Updated testing covariates with noise, optional
"y_test" – Updated testing labels with noise, optional
"noisy_train_indices" – Indices of the training data points with noise.
- args : tuple[Any]
Additional positional arguments passed to add_noise.
- kwargs : dict[str, Any]
Additional keyword arguments passed to add_noise.
Returns#
- self : object
Returns the DataFetcher with noise added to the data set.
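Examples#
A sketch using the mix_labels noise function documented below; the noise rate is illustrative, and fetcher is an already-split DataFetcher:
>>> from opendataval.dataloader.noisify import mix_labels
>>> fetcher = fetcher.noisify(mix_labels, noise_rate=0.1)
>>> corrupted = fetcher.noisy_train_indices  # indices that received noise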
- property num_points: int#
Get total number of data points.
- classmethod setup(dataset_name: str, cache_dir: str | None = None, force_download: bool = False, random_state: RandomState | None = None, train_count: int | float = 0, valid_count: int | float = 0, test_count: int | float = 0, add_noise: Callable[[Self, Any, ...], dict[str, Any]] | None = None, noise_kwargs: dict[str, Any] | None = None)#
Create, split, and add noise to DataFetcher from input arguments.
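Examples#
A sketch equivalent to constructing, splitting, and noisifying in separate steps; the data set name, counts, and noise rate are illustrative:
>>> from opendataval.dataloader.fetcher import DataFetcher
>>> from opendataval.dataloader.noisify import mix_labels
>>> fetcher = DataFetcher.setup(
...     dataset_name="iris",
...     train_count=100,
...     valid_count=25,
...     test_count=25,
...     add_noise=mix_labels,
...     noise_kwargs={"noise_rate": 0.1},
... )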
- split_dataset_by_count(train_count: int = 0, valid_count: int = 0, test_count: int = 0)#
Split the covariates and labels into the specified counts.
Parameters#
- train_count : int
Number/proportion of training points.
- valid_count : int
Number/proportion of validation points.
- test_count : int
Number/proportion of test points.
Returns#
- self : object
Returns the DataFetcher with covariates and labels split into train/valid/test.
Raises#
- ValueError
Invalid input for splitting the data set: either a proportion is more than 1 or the total of the splits is greater than len(dataset).
- split_dataset_by_indices(train_indices: Sequence[int] | None = None, valid_indices: Sequence[int] | None = None, test_indices: Sequence[int] | None = None)#
Split the covariates and labels at the specified indices.
Parameters#
- train_indices : Sequence[int]
Indices of the training data set.
- valid_indices : Sequence[int]
Indices of the validation data set.
- test_indices : Sequence[int]
Indices of the test data set.
Returns#
- self : object
Returns the DataFetcher with covariates and labels split into train/valid/test.
Raises#
- ValueError
Invalid indices for the train, valid, or test data set: at least one data point appears in more than one split.
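Examples#
A sketch with explicit, non-overlapping index splits over a hypothetical 100-point data set:
>>> fetcher = fetcher.split_dataset_by_indices(
...     train_indices=range(0, 80),
...     valid_indices=range(80, 90),
...     test_indices=range(90, 100),
... )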
- split_dataset_by_prop(train_prop: float = 0.0, valid_prop: float = 0.0, test_prop: float = 0.0)#
Split the covariates and labels into the specified proportions.
opendataval.dataloader.noisify module#
- class opendataval.dataloader.noisify.NoiseFunc(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)#
Bases: FuncEnum
- ADD_GAUSS_NOISE = add_gauss_noise#
- MIX_LABELS = mix_labels#
- opendataval.dataloader.noisify.add_gauss_noise(fetcher: DataFetcher, noise_rate: float = 0.2, mu: float = 0.0, sigma: float = 1.0) → dict[str, Dataset | ndarray]#
Add gaussian noise to covariates.
Parameters#
- fetcher : DataFetcher
DataFetcher object housing the data to add noise to.
- noise_rate : float
Proportion of training covariates to add noise to.
- mu : float, optional
Center of the gaussian distribution the noise is generated from, by default 0.
- sigma : float, optional
Standard deviation of the gaussian distribution, by default 1.
Returns#
- dict[str, np.ndarray]
Dictionary of updated data points:
"x_train" – Updated training covariates with added gaussian noise
"noisy_train_indices" – Indices of the training data points with added noise
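Examples#
A sketch calling the function directly on an already-split fetcher; it can also be passed to DataFetcher.noisify. The rate, mu, and sigma are illustrative:
>>> from opendataval.dataloader.noisify import add_gauss_noise
>>> updates = add_gauss_noise(fetcher, noise_rate=0.2, mu=0.0, sigma=1.0)
>>> x_train_noisy = updates["x_train"]
>>> noisy_idx = updates["noisy_train_indices"]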
- opendataval.dataloader.noisify.mix_labels(fetcher: DataFetcher, noise_rate: float = 0.2) → dict[str, ndarray]#
Mixes the y_train labels of a DataFetcher, adding noise to the data.
For a given set of n unique labels, each noisy label is shifted forward by 1 to n-1 steps, which prevents selecting the same label when noise is added.
Parameters#
- fetcher : DataFetcher
DataFetcher object housing the data to add noise to.
- noise_rate : float
Proportion of labels to add noise to.
Returns#
- dict[str, np.ndarray]
Dictionary of updated data points:
"y_train" – Updated training labels, mixed
"y_valid" – Updated validation labels, mixed
"noisy_train_indices" – Indices of the training data points with mixed labels
opendataval.dataloader.register module#
- class opendataval.dataloader.register.Register(dataset_name: str, one_hot: bool = False, cacheable: bool = False, presplit: bool = False)#
Bases: object
Register a data set by defining its name and adding functions to retrieve data.
Registers data sets to be fetched by the DataFetcher and allows specific transformations to be applied to a data set. This gives the benefit of creating distinct Register objects to distinguish separate data sets.
Parameters#
- dataset_name : str
Data set name.
- one_hot : bool, optional
Whether the data set labels are one-hot encoded, by default False.
- cacheable : bool, optional
Whether the data set can be downloaded and cached, by default False.
- presplit : bool, optional
Whether the data set was presplit, by default False.
Warns#
- Warning
Register keeps track of all registered data set names, which must be unique; if there are any duplicates, the user is warned.
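Examples#
A sketch registering a new data set from hypothetical in-memory arrays; the name "my_dataset" must be unique:
>>> import numpy as np
>>> from opendataval.dataloader.register import Register
>>> from opendataval.dataloader.fetcher import DataFetcher
>>> covar = np.random.rand(100, 5)
>>> labels = np.random.randint(0, 2, size=(100,))
>>> Register("my_dataset").from_data(covar, labels, one_hot=False)
>>> fetcher = DataFetcher("my_dataset")  # now visible to the DataFetcher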
- CACHE_DIR = 'data_files'#
Default directory to cache downloads to.
- Datasets: ClassVar[dict[str, Self]]#
Directory of all registered/downloadable data set functions, keyed by name: '2dplanes', 'MiniBooNE', 'adult', 'bbc', 'bbc-embeddings', 'breast_cancer', 'challenge-iris', 'cifar10', 'cifar10-embeddings', 'cifar100', 'creditcard', 'diabetes', 'digits', 'echoMonths', 'election', 'electricity', 'fashion', 'fried', 'gaussian_classifier', 'gaussian_classifier_high_dim', 'imdb', 'imdb-embeddings', 'iris', 'linnerud', 'lowbwt', 'mnist', 'mv', 'nomao', 'pol', 'stl10-embeddings', 'stock', 'svhn-embeddings', 'wave_energy'.
- add_covar_transform(transform: Callable[[ndarray], ndarray])#
Add covariate transform after data is fetched.
- add_label_transform(transform: Callable[[ndarray], ndarray])#
Add label transform after data is fetched.
- from_covar_func(func: Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]], *args, **kwargs) → Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]]#
Register a data set from two Callables; this one registers the covariates Callable.
- from_covar_label_func(func: Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]], *args, **kwargs) → Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]]#
Register a data set from a single Callable returning (covariates, labels).
- from_csv(filepath: str, label_columns: str | list)#
Register a data set from a csv file.
- from_data(covar: ndarray, label: ndarray, one_hot: bool = False)#
Register a data set from covariate and label numpy arrays.
- from_label_func(func: Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]], *args, **kwargs) → Callable[[...], Dataset | ndarray | tuple[ndarray, ndarray]]#
Register a data set from two Callables; this one registers the labels Callable.
- from_numpy(array: ndarray, label_columns: int | Sequence[int])#
Register a data set from a numpy array, with label_columns selecting the label columns.
- from_pandas(df: DataFrame, label_columns: str | list)#
Register a data set from a pandas data frame.
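Examples#
A sketch registering from a hypothetical pandas data frame whose "target" column holds the labels:
>>> import pandas as pd
>>> from opendataval.dataloader.register import Register
>>> df = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "f2": [1.0, 0.5, 0.2], "target": [0, 1, 0]})
>>> Register("my_pandas_dataset").from_pandas(df, label_columns="target")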
- load_data(cache_dir: str | None = None, force_download: bool = False) → tuple[Dataset, ndarray]#
Retrieve data from the specified data input functions.
Loads the covariates and labels from the registered callables, applies transformations, and returns the covariates and labels.
Parameters#
- cache_dir : str, optional
Directory where the loaded data is cached, by default None, which uses Register.CACHE_DIR.
- force_download : bool, optional
Forces download from the source URL, by default False.
Returns#
- (np.ndarray | Dataset, np.ndarray)
Transformed covariates and labels of the data set.
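Examples#
A sketch loading a registered data set's raw covariates and labels directly; "iris" is one of the names in Register.Datasets above:
>>> covar, labels = Register.Datasets["iris"].load_data()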
- opendataval.dataloader.register.cache(url: str, cache_dir: Path, file_name: str | None = None, force_download: bool = False) → Path#
Download a file if it is not present and return the filepath.
Parameters#
- url : str
URL of the file to be downloaded.
- cache_dir : str
Directory to cache downloaded files.
- file_name : str, optional
File name within the cache directory for the downloaded file, by default None.
- force_download : bool, optional
Forces a download regardless of whether the file is present, by default False.
Returns#
- Path
File path to the downloaded file.
Raises#
- HTTPError
An HTTP error occurred while downloading the data set.
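Examples#
A sketch with a hypothetical URL; the file is downloaded only if it is not already cached:
>>> from pathlib import Path
>>> from opendataval.dataloader.register import cache
>>> file_path = cache(
...     url="https://example.com/data.csv",
...     cache_dir=Path("data_files"),
...     file_name="data.csv",
... )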
opendataval.dataloader.util module#
- class opendataval.dataloader.util.CatDataset(*datasets: list[Dataset[Any]])#
Bases: Dataset[tuple[Dataset, ...]]
Data set wrapping indexable Datasets.
Parameters#
- datasets : tuple[Dataset]
Tuple of data sets we would like to concatenate together; they must be the same length.
Raises#
- ValueError
If the input data sets are not all the same length.
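Examples#
A sketch pairing hypothetical covariate and label tensors so that indexing yields them together:
>>> import torch
>>> from torch.utils.data import DataLoader, TensorDataset
>>> from opendataval.dataloader.util import CatDataset
>>> covariates = TensorDataset(torch.rand(100, 5))
>>> labels = TensorDataset(torch.randint(0, 2, (100,)))
>>> paired = CatDataset(covariates, labels)  # both must be the same length
>>> loader = DataLoader(paired, batch_size=32)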
- class opendataval.dataloader.util.FolderDataset(folder_path: Path, sizes: list[int] | None = None)#
Bases: Dataset
Dataset for tensors within a folder.
- BATCH_CACHE = 5#
- static exists(path: Path)#
- format_batch_path(batch_index: int) → str#
- get_batch(batch_index: int) → Tensor#
- classmethod load(path: Path)#
Loads existing FolderDataset metadata from path/.metadata.pkl.
- property metadata: dict[str, Any]#
Important metadata defining a FolderDataset, used for loading.
- save()#
Saves metadata to disk, allowing us to load the FolderDataset as needed.
- property shape: tuple[int, ...]#
- write(batch_number: int, data: Tensor)#
- class opendataval.dataloader.util.IndexTransformDataset(dataset: Dataset[T_co], index_transformation: Callable[[T_co, int], T_co] | None = None)#
Bases: Dataset[T_co]
Data set wrapper that allows a per-index transform to be applied.
Primarily useful when adding noise to a specific subset of indices. If a transform is defined, it applies the transformation and also passes in the index (what is passed into __getitem__).
Parameters#
- dataset : Dataset[T_co]
Data set with the transform to be applied.
- index_transformation : Callable[[T_co, int], T_co], optional
Function that takes the data and its index and applies the transform for that index, by default None, which means no transform.
- property transform: Callable[[T_co, int], T_co]#
Gets the transform function; if None, no transformation is applied.
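Examples#
A sketch applying gaussian noise only to a hypothetical subset of indices; the transform receives the item and its index:
>>> import torch
>>> from torch.utils.data import TensorDataset
>>> from opendataval.dataloader.util import IndexTransformDataset
>>> base = TensorDataset(torch.rand(100, 5))
>>> noisy = {3, 7, 42}  # indices to perturb
>>> def perturb(item, index):
...     # add noise only for flagged indices; pass others through unchanged
...     return tuple(t + torch.randn_like(t) for t in item) if index in noisy else item
>>> dataset = IndexTransformDataset(base, perturb)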
- class opendataval.dataloader.util.ListDataset(input_list: Sequence[T_co])#
Bases: Dataset[T_co]
Data set wrapping a list.
ListDataset is primarily useful when you want to pass back a list but also want to get around the type checks of Datasets. It is intended to be used with NLP data sets, as the axis-1 dimension is variable and BERT tokenizers take inputs only as lists.
Parameters#
- input_list : Sequence[T_co]
Input sequence to be used as the data set.
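Examples#
A sketch wrapping hypothetical variable-length text inputs for a tokenizer:
>>> from opendataval.dataloader.util import ListDataset
>>> texts = ListDataset(["a short sentence", "another, somewhat longer sentence"])
>>> first = texts[0]  # indexes like the wrapped list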
- opendataval.dataloader.util.load_tensor(tensor_path: Path) → Tensor#
Module contents#
Create data sets and load them with the DataFetcher.
Data Loader#
Provides an API to add new data sets and load them with the data loader.
To create a new data set, create a Register object to register the data set with a name, then load it with the DataFetcher. This gives us the flexibility to call the data set later and to define separate functions/classes for the covariates and labels of a data set.
Creating/Loading data sets#
- Register
Register a data set by defining its name and adding functions to retrieve data.
- DataFetcher
Load data for an experiment from an input data set name.
- datasets
Data sets registered with Register.
Utils#
- cache
Download a file if it is not present and return the filepath.
- mix_labels
Mixes y_train labels of a DataFetcher, adding noise to data.
- one_hot_encode
One hot encodes a numpy array.
- CatDataset
Data set wrapping indexable Datasets.