opendataval.dataloader.datasets package#

Submodules#

opendataval.dataloader.datasets.challenge module#

opendataval.dataloader.datasets.challenge.CHALLENGE_URL = 'https://opendataval.yongchanstat.com/challenge'#

Backend URL that opendataval queries for the drive IDs of the challenge data sets.

opendataval.dataloader.datasets.challenge.basename(file_name: str)#

Get basename of file.

opendataval.dataloader.datasets.challenge.download_drive(name: str, drive_id: str, cache_dir: Path, force_download: bool)#

Downloads a file from Google Drive with a set number of retry attempts.
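The retry behavior described above can be sketched as follows. This is a minimal illustration using only the standard library, not the library's implementation: the function name `download_with_retries`, the `fetch` callable, and the `max_attempts` and `delay_seconds` parameters are all hypothetical, and the cache check is an assumption about what `force_download` implies.

```python
import time
from pathlib import Path
from typing import Callable


def download_with_retries(
    fetch: Callable[[str, Path], None],  # hypothetical: performs one download attempt
    drive_id: str,
    dest: Path,
    force_download: bool = False,
    max_attempts: int = 3,  # hypothetical retry budget
    delay_seconds: float = 0.0,  # hypothetical pause between attempts
) -> Path:
    """Call fetch(drive_id, dest) until it succeeds or attempts run out."""
    # Assumption: an existing file is reused unless a fresh download is forced.
    if dest.exists() and not force_download:
        return dest

    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            fetch(drive_id, dest)
            return dest
        except OSError as err:
            # Remember the failure and try again if attempts remain.
            last_error = err
            if attempt < max_attempts:
                time.sleep(delay_seconds)

    raise RuntimeError(
        f"download failed after {max_attempts} attempts"
    ) from last_error
```

Passing the download step in as a callable keeps the retry logic independent of any particular Google Drive client.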

opendataval.dataloader.datasets.challenge.iris_challenge(cache_dir: str, force_download: bool)#

opendataval.dataloader.datasets.datasets module#

Default data sets.

opendataval.dataloader.datasets.datasets.download_2dplanes()#

Categorical data set registered as "2dplanes".

opendataval.dataloader.datasets.datasets.download_MiniBooNE()#

Categorical data set registered as "MiniBooNE".

opendataval.dataloader.datasets.datasets.download_adult(cache_dir: str, force_download: bool = False)#

Binary category data set registered as "adult". Adult Income data set.

Implementation from DVRL repository.

opendataval.dataloader.datasets.datasets.download_breast_cancer()#

Categorical data set registered as "breast_cancer".

opendataval.dataloader.datasets.datasets.download_creditcard()#

Categorical data set registered as "creditcard".

opendataval.dataloader.datasets.datasets.download_diabetes()#

Regression data set registered as "diabetes".

opendataval.dataloader.datasets.datasets.download_digits()#

Categorical data set registered as "digits".

opendataval.dataloader.datasets.datasets.download_echoMonths()#

Regression data set registered as "echoMonths".

opendataval.dataloader.datasets.datasets.download_election(cache_dir: str, force_download: bool)#

Categorical data set registered as "election".

Presidential election results by MIT Election Data and Science Lab.

opendataval.dataloader.datasets.datasets.download_electricity()#

Categorical data set registered as "electricity".

opendataval.dataloader.datasets.datasets.download_fried()#

Categorical data set registered as "fried".

opendataval.dataloader.datasets.datasets.download_iris()#

Categorical data set registered as "iris".

opendataval.dataloader.datasets.datasets.download_linnerud()#

Regression data set registered as "linnerud".

opendataval.dataloader.datasets.datasets.download_lowbwt()#

Regression data set registered as "lowbwt".

opendataval.dataloader.datasets.datasets.download_mv()#

Regression data set registered as "mv".

opendataval.dataloader.datasets.datasets.download_nomao()#

Categorical data set registered as "nomao".

opendataval.dataloader.datasets.datasets.download_pol()#

Categorical data set registered as "pol".

opendataval.dataloader.datasets.datasets.download_stock()#

Regression data set registered as "stock".

opendataval.dataloader.datasets.datasets.download_wave_energy()#

Regression data set registered as "wave_energy".

opendataval.dataloader.datasets.datasets.gaussian_classifier(n: int = 10000, input_dim: int = 10)#

Binary category data set registered as "gaussian_classifier".

Artificially generated gaussian noise data set.
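One plausible way to build such a data set is sketched below. It is an assumption, not something this page confirms: covariates are drawn from a standard normal distribution and binary labels are sampled from a logistic model with random coefficients. The function name `gaussian_classifier_sketch` and the `seed` parameter are hypothetical, and the standard library `random` module stands in for whatever array library the real function uses.

```python
import math
import random


def gaussian_classifier_sketch(n=1000, input_dim=10, seed=0):
    """Return (covariates, labels) for a synthetic binary task."""
    rng = random.Random(seed)
    # Hypothetical "true" coefficients of the logistic model.
    beta = [rng.gauss(0.0, 1.0) for _ in range(input_dim)]

    covariates, labels = [], []
    for _ in range(n):
        # Each covariate vector is pure Gaussian noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
        logit = sum(b * v for b, v in zip(beta, x))
        p = 1.0 / (1.0 + math.exp(-logit))
        covariates.append(x)
        # Sample the label from a Bernoulli distribution with probability p.
        labels.append(1 if rng.random() < p else 0)
    return covariates, labels
```

Calling `gaussian_classifier_sketch(n=100, input_dim=5)` returns 100 covariate vectors of length 5 and 100 labels in {0, 1}.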

opendataval.dataloader.datasets.datasets.load_openml(data_id: int, is_classification=True)#

Load OpenML data sets.

A helper function that loads an OpenML data set by its OpenML ID.

opendataval.dataloader.datasets.imagesets module#

TorchVision data sets.

Uses torchvision as a dependency.

opendataval.dataloader.datasets.imagesets.ResnetEmbeding(dataset_class: type[VisionDataset], size: tuple[int, int] = (224, 224), batch_size: int = 128)#

Converts PIL color images into embeddings with a ResNet50 model.

Given PIL images, passes them through ResNet50 (as done in prior data valuation papers) and saves the vector embeddings. The embeddings are extracted from the avgpool layer of ResNet50 using PyTorch's forward hook feature.

Parameters#

dataset_class : type[VisionDataset]

Class of Dataset to compute the embeddings of.

size : tuple[int, int], optional

Size to resize images to, by default (224, 224)

Returns#

Callable

Wrapped function that, when called, returns a covariate embedding array and a label array.

class opendataval.dataloader.datasets.imagesets.VisionAdapter(dataset_class: type[VisionDataset])#

Bases: Dataset

Adapter for PyTorch vision data sets. __call__ is called by Register.

Adapter for MNIST data sets. __init__ inputs the class and __call__ initializes the Dataset and extracts labels. __call__ returns tuple[Self, np.array] where Self is a Dataset of covariates and np.array is an array of labels.

Parameters#

dataset_class : type[VisionDataset]

Torchvision data set class provided.
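The adapter pattern described above (store the class in `__init__`, build the data set and pull out labels in `__call__`, return the adapter itself together with the labels) can be sketched in plain Python. Everything here is illustrative: `AdapterSketch` and `FakeVisionDataset` are hypothetical stand-ins, and the `__call__` arguments are an assumption modeled on the `cache_dir` and `force_download` parameters used elsewhere on this page.

```python
class FakeVisionDataset:
    """Hypothetical stand-in for a torchvision data set of (image, label) pairs."""

    def __init__(self, root, download):
        self.items = [("img0", 3), ("img1", 7), ("img2", 3)]

    def __getitem__(self, index):
        return self.items[index]

    def __len__(self):
        return len(self.items)


class AdapterSketch:
    """Holds a data set class; calling it builds the data set and splits off labels."""

    def __init__(self, dataset_class):
        self.dataset_class = dataset_class
        self.dataset = None

    def __call__(self, cache_dir, force_download):
        # Instantiate lazily, only when the data are actually requested.
        self.dataset = self.dataset_class(root=cache_dir, download=force_download)
        labels = [label for _, label in self.dataset]
        # The adapter itself serves as the covariate data set.
        return self, labels

    def __getitem__(self, index):
        covariate, _ = self.dataset[index]
        return covariate

    def __len__(self):
        return len(self.dataset)


adapter = AdapterSketch(FakeVisionDataset)
covariates, labels = adapter("data_files", False)
```

Deferring construction to `__call__` means that defining the adapter at import time does not trigger any download.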

opendataval.dataloader.datasets.imagesets.cifar10 = <opendataval.dataloader.datasets.imagesets.VisionAdapter object>#

Vision Classification data set registered as "cifar10", from TorchVision.

opendataval.dataloader.datasets.imagesets.cifar100 = <opendataval.dataloader.datasets.imagesets.VisionAdapter object>#

Vision Classification data set registered as "cifar100", from TorchVision.

opendataval.dataloader.datasets.imagesets.cifar10_embed(cache_dir: str, force_download: bool, *args, **kwargs) tuple[Tensor, ndarray]#

Vision Classification data set registered as "cifar10-embeddings", ResNet50 embeddings.

opendataval.dataloader.datasets.imagesets.fashion = <opendataval.dataloader.datasets.imagesets.VisionAdapter object>#

Vision Classification data set registered as "fashion", from TorchVision.

opendataval.dataloader.datasets.imagesets.numbers = <opendataval.dataloader.datasets.imagesets.VisionAdapter object>#

Vision Classification data set registered as "mnist", from TorchVision.

opendataval.dataloader.datasets.imagesets.show_image(imgs: list[Image] | Image) None#

Displays an image or a list of images.

opendataval.dataloader.datasets.imagesets.stl10_embed(cache_dir: str, force_download: bool, *args, **kwargs) tuple[Tensor, ndarray]#

Vision Classification data set registered as "stl10-embeddings", ResNet50 embeddings.

opendataval.dataloader.datasets.imagesets.svhn_embed(cache_dir: str, force_download: bool, *args, **kwargs) tuple[Tensor, ndarray]#

Vision Classification data set registered as "svhn-embeddings", ResNet50 embeddings.

opendataval.dataloader.datasets.nlpsets module#

NLP data sets.

Uses HuggingFace transformers as a dependency.

opendataval.dataloader.datasets.nlpsets.BertEmbeddings(func: Callable[[str, bool], tuple[Sequence[str], ndarray]], batch_size: int = 128)#

Convert text data into pooled embeddings with DistilBERT model.

Given a data set containing a list of strings, such as one returned by an NLP data set function (see below), converts the sentences into embeddings. This is equivalent to training a downstream task with BERT while all the BERT layers are frozen. It is advised to instead train on the raw strings with the BERT model located in models/bert.py, or to define your own model. DistilBERT is a faster version of BERT.

opendataval.dataloader.datasets.nlpsets.bbc_embedding(cache_dir: str, force_download: bool, *args, **kwargs) tuple[Tensor, ndarray]#

Classification data set registered as "bbc-embeddings", BERT text embeddings.

opendataval.dataloader.datasets.nlpsets.download_bbc(cache_dir: str, force_download: bool)#

Classification data set registered as "bbc".

Predicts the type of an article from its text. Used in NLP data valuation tasks.

opendataval.dataloader.datasets.nlpsets.download_imdb(cache_dir: str, force_download: bool)#

Binary category sentiment analysis data set registered as "imdb".

Predicts the sentiment of the review as either positive (1) or negative (0). Used in NLP data valuation tasks.

opendataval.dataloader.datasets.nlpsets.imdb_embedding(cache_dir: str, force_download: bool, *args, **kwargs) tuple[Tensor, ndarray]#

Classification data set registered as "imdb-embeddings", BERT text embeddings.

Module contents#

Data sets registered with Register.

Data sets#

datasets

Default data sets.

imagesets

TorchVision data sets.

nlpsets

NLP data sets.

Catalog of registered data sets that can be used with DataFetcher. Pass the string name under which a data set is registered to load it as needed.
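The idea of registering a loader under a string name and fetching it by that name can be sketched as follows. This is a generic registry pattern, not the actual Register or DataFetcher API: `REGISTRY`, `register`, `fetch`, and the "toy" data set are all hypothetical names chosen for illustration.

```python
REGISTRY = {}  # hypothetical catalog mapping names to loader functions


def register(name):
    """Decorator that records a loader function under a string name."""

    def decorator(func):
        if name in REGISTRY:
            raise KeyError(f"data set {name!r} is already registered")
        REGISTRY[name] = func
        return func

    return decorator


@register("toy")
def download_toy():
    """Return (covariates, labels) for a tiny made-up data set."""
    return [[0.0], [1.0], [2.0]], [0, 1, 0]


def fetch(name):
    """Look up a loader by name and call it only when the data are needed."""
    try:
        loader = REGISTRY[name]
    except KeyError:
        raise KeyError(
            f"unknown data set {name!r}; available: {sorted(REGISTRY)}"
        ) from None
    return loader()


covariates, labels = fetch("toy")
```

Because loaders are only called inside `fetch`, registering many data sets at import time costs nothing until one is requested by name.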