opendataval.dataloader.datasets package
Submodules
opendataval.dataloader.datasets.challenge module
- opendataval.dataloader.datasets.challenge.CHALLENGE_URL = 'https://opendataval.yongchanstat.com/challenge'
Backend URL from which opendataval retrieves the Google Drive IDs of the challenge data set.
- opendataval.dataloader.datasets.challenge.basename(file_name: str)
Gets the base name of a file.
- opendataval.dataloader.datasets.challenge.download_drive(name: str, drive_id: str, cache_dir: Path, force_download: bool)
Downloads a file from Google Drive, retrying up to a set number of attempts.
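The retry behavior described for download_drive can be illustrated with a minimal sketch. The helper name download_with_retries, the fetch callable, and the max_attempts and delay parameters are hypothetical and are not taken from opendataval; the library's actual implementation may differ.

```python
import time
from pathlib import Path
from typing import Callable


def download_with_retries(
    fetch: Callable[[], bytes],
    target: Path,
    force_download: bool = False,
    max_attempts: int = 3,
    delay: float = 0.0,
) -> Path:
    """Write the bytes returned by ``fetch`` to ``target``, retrying on failure.

    Hypothetical helper: ``fetch`` stands in for whatever performs the
    actual Google Drive request.
    """
    # Reuse the cached file unless the caller forces a fresh download.
    if target.exists() and not force_download:
        return target

    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            data = fetch()
        except OSError as error:  # network-style failures
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay)
            continue
        # Success: create the cache directory if needed and store the file.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        return target

    raise RuntimeError(f"download failed after {max_attempts} attempts") from last_error
```

Checking the cache before the first attempt means a repeated call with force_download=False never touches the network.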
- opendataval.dataloader.datasets.challenge.iris_challenge(cache_dir: str, force_download: bool)
opendataval.dataloader.datasets.datasets module
Default data sets.
- opendataval.dataloader.datasets.datasets.download_2dplanes()
Categorical data set registered as "2dplanes".
- opendataval.dataloader.datasets.datasets.download_MiniBooNE()
Categorical data set registered as "MiniBooNE".
- opendataval.dataloader.datasets.datasets.download_adult(cache_dir: str, force_download: bool = False)
Binary classification data set registered as "adult". Adult Income data set. Implementation from the DVRL repository.
References
- opendataval.dataloader.datasets.datasets.download_breast_cancer()
Categorical data set registered as "breast_cancer".
- opendataval.dataloader.datasets.datasets.download_creditcard()
Categorical data set registered as "creditcard".
- opendataval.dataloader.datasets.datasets.download_diabetes()
Regression data set registered as "diabetes".
- opendataval.dataloader.datasets.datasets.download_digits()
Categorical data set registered as "digits".
- opendataval.dataloader.datasets.datasets.download_echoMonths()
Regression data set registered as "echoMonths".
- opendataval.dataloader.datasets.datasets.download_election(cache_dir: str, force_download: bool)
Categorical data set registered as "election". Presidential election results from the MIT Election Data and Science Lab.
References
[1] MIT Election Data and Science Lab, U.S. President 1976-2020. Harvard Dataverse, 2017. doi: 10.7910/DVN/42MVDX.
- opendataval.dataloader.datasets.datasets.download_electricity()
Categorical data set registered as "electricity".
- opendataval.dataloader.datasets.datasets.download_fried()
Categorical data set registered as "fried".
- opendataval.dataloader.datasets.datasets.download_iris()
Categorical data set registered as "iris".
- opendataval.dataloader.datasets.datasets.download_linnerud()
Regression data set registered as "linnerud".
- opendataval.dataloader.datasets.datasets.download_lowbwt()
Regression data set registered as "lowbwt".
- opendataval.dataloader.datasets.datasets.download_mv()
Regression data set registered as "mv".
- opendataval.dataloader.datasets.datasets.download_nomao()
Categorical data set registered as "nomao".
- opendataval.dataloader.datasets.datasets.download_pol()
Categorical data set registered as "pol".
- opendataval.dataloader.datasets.datasets.download_stock()
Regression data set registered as "stock".
- opendataval.dataloader.datasets.datasets.download_wave_energy()
Regression data set registered as "wave_energy".
- opendataval.dataloader.datasets.datasets.gaussian_classifier(n: int = 10000, input_dim: int = 10)
Binary classification data set registered as "gaussian_classifier". Artificially generated Gaussian noise data set.
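One plausible way to generate such a data set is to draw Gaussian covariates and sample binary labels from a logistic model. The sketch below is only an illustration: the function name make_gaussian_classifier, the seed parameter, and the logistic labeling rule are assumptions and are not confirmed by this page.

```python
import numpy as np


def make_gaussian_classifier(n: int = 10000, input_dim: int = 10, seed: int = 0):
    """Return (covariates, labels) for a synthetic binary classification task."""
    rng = np.random.default_rng(seed)
    # Covariates are i.i.d. standard Gaussian noise.
    covariates = rng.normal(size=(n, input_dim))
    # Assumed labeling rule: a random linear score passed through a sigmoid
    # gives the probability that each label equals 1.
    beta = rng.normal(size=input_dim)
    prob = 1.0 / (1.0 + np.exp(-(covariates @ beta)))
    labels = rng.binomial(n=1, p=prob)
    return covariates, labels


covariates, labels = make_gaussian_classifier(n=1000, input_dim=10)
print(covariates.shape, labels.shape)
```

The n and input_dim arguments mirror the signature above, so the returned covariate array has n rows and input_dim columns.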
- opendataval.dataloader.datasets.datasets.load_openml(data_id: int, is_classification=True)
Loads OpenML data sets.
A helper function to load an OpenML data set by its OpenML ID.
opendataval.dataloader.datasets.imagesets module
TorchVision data sets.
Uses torchvision as a dependency.
- opendataval.dataloader.datasets.imagesets.ResnetEmbeding(dataset_class: type[VisionDataset], size: tuple[int, int] = (224, 224), batch_size: int = 128)
Converts PIL color images into embeddings with a ResNet50 model.
Given PIL images, passes them through ResNet50 (as done in prior data valuation papers) and saves the vector embeddings. The embeddings are extracted from the avgpool layer of ResNet50 using PyTorch's forward hook feature.
References
[1] K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, doi: https://doi.org/10.1109/cvpr.2016.90.
[2] A. Ghorbani and J. Zou, Data Shapley: Equitable Valuation of Data for Machine Learning, arXiv.org, 2019. Available: https://arxiv.org/abs/1904.02868.
Parameters
- dataset_class : type[VisionDataset]
Class of Dataset to compute the embeddings of.
- size : tuple[int, int], optional
Size to resize images to, by default (224, 224)
Returns
- Callable
Wrapped function that, when called, returns a covariate embedding array and a label array.
- class opendataval.dataloader.datasets.imagesets.VisionAdapter(dataset_class: type[VisionDataset])
Bases: Dataset
Adapter for PyTorch vision data sets. __call__ is called by Register.
Adapter for MNIST data sets. __init__ takes the data set class, and __call__ initializes the Dataset and extracts the labels. __call__ returns tuple[Self, np.array], where Self is a Dataset of covariates and np.array is an array of labels.
Parameters
- dataset_class : type[VisionDataset]
Torchvision data set class provided.
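The two-step pattern described for VisionAdapter (store the class in __init__, then build the data set and extract its labels in __call__) can be sketched without torchvision. ToyAdapter and ToyVisionDataset are hypothetical stand-ins; the targets attribute, the constructor arguments, and the mapping of force_download to download are assumptions modeled on common torchvision data sets, not the library's code.

```python
class ToyVisionDataset:
    """Hypothetical stand-in for a torchvision VisionDataset subclass."""

    def __init__(self, root: str, download: bool = False):
        self.root = root
        self.data = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]]
        self.targets = [0, 1, 0]

    def __getitem__(self, index: int):
        return self.data[index], self.targets[index]

    def __len__(self) -> int:
        return len(self.data)


class ToyAdapter:
    """Defers construction of the data set until the adapter is called."""

    def __init__(self, dataset_class: type):
        self.dataset_class = dataset_class  # only the class is stored here
        self.dataset = None

    def __call__(self, cache_dir: str, force_download: bool):
        # Instantiate lazily so nothing is loaded when the adapter is created.
        # Passing force_download as ``download`` is an assumed mapping.
        self.dataset = self.dataset_class(root=cache_dir, download=force_download)
        labels = list(self.dataset.targets)
        return self, labels

    def __getitem__(self, index: int):
        covariates, _label = self.dataset[index]
        return covariates  # the adapter exposes covariates only

    def __len__(self) -> int:
        return len(self.dataset)


adapter = ToyAdapter(ToyVisionDataset)
covariates, labels = adapter("data_files", False)
print(len(covariates), labels)  # 3 [0, 1, 0]
```

Returning the adapter itself as the covariate container keeps the labels separate from the images, which matches the tuple[Self, np.array] return type described above.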
- opendataval.dataloader.datasets.imagesets.cifar10 = <opendataval.dataloader.datasets.imagesets.VisionAdapter object>
Vision classification data set registered as "cifar10", from TorchVision.
- opendataval.dataloader.datasets.imagesets.cifar100 = <opendataval.dataloader.datasets.imagesets.VisionAdapter object>
Vision classification data set registered as "cifar100", from TorchVision.
- opendataval.dataloader.datasets.imagesets.cifar10_embed(cache_dir: str, force_download: bool, *args, **kwargs) -> tuple[Tensor, ndarray]
Vision classification data set registered as "cifar10-embeddings", ResNet50 embeddings.
- opendataval.dataloader.datasets.imagesets.fashion = <opendataval.dataloader.datasets.imagesets.VisionAdapter object>
Vision classification data set registered as "fashion", from TorchVision.
- opendataval.dataloader.datasets.imagesets.numbers = <opendataval.dataloader.datasets.imagesets.VisionAdapter object>
Vision classification data set registered as "mnist", from TorchVision.
- opendataval.dataloader.datasets.imagesets.show_image(imgs: list[Image] | Image) -> None
Displays an image or a list of images.
- opendataval.dataloader.datasets.imagesets.stl10_embed(cache_dir: str, force_download: bool, *args, **kwargs) -> tuple[Tensor, ndarray]
Vision classification data set registered as "stl10-embeddings", ResNet50 embeddings.
- opendataval.dataloader.datasets.imagesets.svhn_embed(cache_dir: str, force_download: bool, *args, **kwargs) -> tuple[Tensor, ndarray]
Vision classification data set registered as "svhn-embeddings", ResNet50 embeddings.
opendataval.dataloader.datasets.nlpsets module
NLP data sets.
Uses HuggingFace transformers as a dependency.
- opendataval.dataloader.datasets.nlpsets.BertEmbeddings(func: Callable[[str, bool], tuple[Sequence[str], ndarray]], batch_size: int = 128)
Converts text data into pooled embeddings with a DistilBERT model.
Given a data set consisting of a list of strings, such as one returned by the NLP data set functions (see below), converts the sentences into pooled embeddings. This is equivalent to training a downstream task with BERT while all the BERT layers are frozen. It is advised to train on the raw strings instead, either with the BERT model located in models/bert.py or with a model you define yourself. DistilBERT is a faster version of BERT.
References
[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv.org, 2018. Available: https://arxiv.org/abs/1810.04805.
[2] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv.org, 2019. Available: https://arxiv.org/abs/1910.01108.
- opendataval.dataloader.datasets.nlpsets.bbc_embedding(cache_dir: str, force_download: bool, *args, **kwargs) -> tuple[Tensor, ndarray]
Classification data set registered as "bbc-embeddings", BERT text embeddings.
- opendataval.dataloader.datasets.nlpsets.download_bbc(cache_dir: str, force_download: bool)
Classification data set registered as "bbc". The task is to predict the type of an article from its text. Used in NLP data valuation tasks.
References
[1] D. Greene and P. Cunningham, Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering, Proc. ICML 2006.
- opendataval.dataloader.datasets.nlpsets.download_imdb(cache_dir: str, force_download: bool)
Binary sentiment analysis classification data set registered as "imdb". The task is to predict the sentiment of a review as either positive (1) or negative (0). Used in NLP data valuation tasks.
References
[1] A. Maas, R. Daly, P. Pham, D. Huang, A. Ng, and C. Potts, Learning Word Vectors for Sentiment Analysis, The 49th Annual Meeting of the Association for Computational Linguistics, 2011.
- opendataval.dataloader.datasets.nlpsets.imdb_embedding(cache_dir: str, force_download: bool, *args, **kwargs) -> tuple[Tensor, ndarray]
Classification data set registered as "imdb-embeddings", BERT text embeddings.
Module contents
Data sets registered with Register.
Data sets
Catalog of registered data sets that can be used with DataFetcher. To load a data set, pass in the str name under which it was registered.
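The idea of registering a loader under a string name and loading the data only when that name is requested can be sketched as follows. ToyRegister, its load and available methods, and the "toy-iris" name are hypothetical and exist only for illustration; they are not the interface of opendataval's Register or DataFetcher.

```python
from typing import Callable, Dict, List, Tuple

Loader = Callable[[], Tuple[List[List[float]], List[int]]]


class ToyRegister:
    """Hypothetical name-to-loader registry, not opendataval's Register."""

    _loaders: Dict[str, Loader] = {}

    def __init__(self, name: str):
        self.name = name

    def __call__(self, func: Loader) -> Loader:
        # Used as a decorator: remember the loader under the given name.
        self._loaders[self.name] = func
        return func

    @classmethod
    def load(cls, name: str):
        # The data is only produced when the name is requested.
        return cls._loaders[name]()

    @classmethod
    def available(cls) -> List[str]:
        return sorted(cls._loaders)


@ToyRegister("toy-iris")
def download_toy_iris():
    covariates = [[5.1, 3.5], [6.2, 2.9]]
    labels = [0, 1]
    return covariates, labels


covariates, labels = ToyRegister.load("toy-iris")
print(ToyRegister.available())  # ['toy-iris']
```

Keeping the loaders behind names means a caller only needs the string to obtain the covariates and labels, and no data is materialized until it is asked for.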