opendataval.model package#
Submodules#
opendataval.model.api module#
- class opendataval.model.api.ClassifierSkLearnWrapper(base_model, num_classes: int, *args, **kwargs)#
Bases: Model
Wrapper for sk-learn classifiers whose fit methods accept sample weights.
Example:
wrapped = ClassifierSkLearnWrapper(LinearRegression, 2)
Parameters#
- base_model : BaseModel
Any sk-learn model that supports sample_weights
- num_classes : int
Label dimensionality
- fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weight: Tensor | None = None, **kwargs)#
Fits the model on the training data.
Fits an sk-learn wrapped classifier Model. If there are fewer classes in the sample than num_classes, a dummy model is used instead.
wrapped = ClassifierSkLearnWrapper(MLPClassifier, 2)
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- args : tuple[Any]
Additional positional args
- sample_weight : torch.Tensor, optional
Weights associated with each data point, must be passed in as a keyword arg, by default None
- kwargs : dict[str, Any]
Additional keyword args
- predict(x: Tensor | Dataset) → Tensor #
Predict labels from the sk-learn model.
Makes a prediction based on the input tensor using the .predict_proba(x) method of sk-learn classifiers. The output dimension will match the labels passed to the .fit(x_train, y_train) method.
Parameters#
- x : torch.Tensor | Dataset
Input tensor
Returns#
- torch.Tensor
Output tensor
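Example (a minimal sketch, assuming base_model is passed as a class that the wrapper instantiates, as in the MLPClassifier example above, and that labels are one-hot encoded with dimensionality num_classes):
import torch
from sklearn.linear_model import LogisticRegression
from opendataval.model.api import ClassifierSkLearnWrapper

x_train = torch.randn(100, 10)
y_train = torch.nn.functional.one_hot(torch.randint(0, 2, (100,)), num_classes=2).float()

wrapped = ClassifierSkLearnWrapper(LogisticRegression, 2)
wrapped.fit(x_train, y_train, sample_weight=torch.rand(100))  # keyword-only weights
probs = wrapped.predict(x_train)  # (100, 2) tensor via predict_proba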
- class opendataval.model.api.ClassifierUnweightedSkLearnWrapper(base_model, num_classes: int, *args, **kwargs)#
Bases: ClassifierSkLearnWrapper
Wrapper for sk-learn classifiers that don't have weighted fit methods.
Example:
wrapped = ClassifierUnweightedSkLearnWrapper(KNeighborsClassifier, 2)
Parameters#
- base_model : BaseModel
Any sk-learn model; its fit method does not need to support sample_weights
- num_classes : int
Label dimensionality
- fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weight: Tensor | None = None, **kwargs)#
Fits the model on the training data.
Fits an sk-learn wrapped classifier Model without sample weights.
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- args : tuple[Any]
Additional positional args
- sample_weight : torch.Tensor, optional
Weights associated with each data point, must be passed in as a keyword arg, by default None
- kwargs : dict[str, Any]
Additional keyword args
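Example (a minimal sketch; KNeighborsClassifier's fit does not accept sample_weight, which is why the unweighted wrapper is used, and extra constructor kwargs such as n_neighbors are assumed to be forwarded to base_model):
import torch
from sklearn.neighbors import KNeighborsClassifier
from opendataval.model.api import ClassifierUnweightedSkLearnWrapper

x_train = torch.randn(100, 10)
y_train = torch.nn.functional.one_hot(torch.randint(0, 2, (100,)), num_classes=2).float()

wrapped = ClassifierUnweightedSkLearnWrapper(KNeighborsClassifier, 2, n_neighbors=5)
wrapped.fit(x_train, y_train)  # fits without passing sample weights through
probs = wrapped.predict(x_train)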
- class opendataval.model.api.Model#
Bases: ABC
Abstract class of Models. Provides a template for models.
- Models: ClassVar[dict[str, Self]] = {'bertclassifier': <class 'opendataval.model.bert.BertClassifier'>, 'classifiermlp': <class 'opendataval.model.mlp.ClassifierMLP'>, 'classifiersklearnwrapper': <class 'opendataval.model.api.ClassifierSkLearnWrapper'>, 'classifierunweightedsklearnwrapper': <class 'opendataval.model.api.ClassifierUnweightedSkLearnWrapper'>, 'gradientmodel': <class 'opendataval.model.grad.GradientModel'>, 'lenet': <class 'opendataval.model.lenet.LeNet'>, 'logisticregression': <class 'opendataval.model.logistic_regression.LogisticRegression'>, 'regressionmlp': <class 'opendataval.model.mlp.RegressionMLP'>, 'regressionsklearnwrapper': <class 'opendataval.model.api.RegressionSkLearnWrapper'>, 'torchclassmixin': <class 'opendataval.model.api.TorchClassMixin'>, 'torchgradmixin': <class 'opendataval.model.grad.TorchGradMixin'>, 'torchmodel': <class 'opendataval.model.api.TorchModel'>, 'torchpredictmixin': <class 'opendataval.model.api.TorchPredictMixin'>, 'torchregressmixin': <class 'opendataval.model.api.TorchRegressMixin'>}#
- clone() → Self #
Clone Model object.
Copies and returns an object representing the current state. We often take a base model and train it several times, so we need the same initial conditions each time; this is the default clone implementation.
Returns#
- self : object
Returns a deep copy of the model.
- abstract fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weights: Tensor | None = None, **kwargs) → Self #
Fits the model on the training data.
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- args : tuple[Any]
Additional positional args
- sample_weights : torch.Tensor, optional
Weights associated with each data point, must be passed in as a keyword arg, by default None
- kwargs : dict[str, Any]
Additional keyword args
Returns#
- selfobject
Returns self for API consistency with sklearn.
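Example (a minimal sketch; MeanClassifier is a hypothetical toy subclass implementing the abstract interface above):
import torch
from opendataval.model.api import Model

class MeanClassifier(Model):
    """Toy model that always predicts the mean label seen during fit."""

    def fit(self, x_train, y_train, *args, sample_weights=None, **kwargs):
        # Store the mean (one-hot) label; this toy model ignores the covariates
        self.label_mean = torch.as_tensor(y_train).float().mean(dim=0)
        return self  # return self for API consistency with sklearn

    def predict(self, x):
        # Broadcast the stored mean label to one prediction per input row
        return self.label_mean.expand(len(x), -1)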
- class opendataval.model.api.RegressionSkLearnWrapper(base_model, *args, **kwargs)#
Bases: Model
Wrapper for sk-learn regression models.
Example:
wrapped = RegressionSkLearnWrapper(LinearRegression)
Parameters#
- base_model : BaseModel
Any sk-learn model that supports sample_weights
- fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weight: Tensor | None = None, **kwargs)#
Fits the model on the training data.
Fits an sk-learn wrapped regression Model. If there is insufficient data to fit a regression (e.g., len(x_train) == 0), a DummyRegressor is used that predicts np.zeros((num_samples, self.num_classes)).
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- args : tuple[Any]
Additional positional args
- sample_weight : torch.Tensor, optional
Weights associated with each data point, must be passed in as a keyword arg, by default None
- kwargs : dict[str, Any]
Additional keyword args
- predict(x: Tensor | Dataset) → Tensor #
Predict values from the sk-learn regression model.
Makes a prediction based on the input tensor using the .predict(x) method of sk-learn regression models. The output dimension will match self.num_classes.
Parameters#
- x : torch.Tensor | Dataset
Input tensor
Returns#
- torch.Tensor
Output tensor
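Example (a minimal sketch, assuming the wrapper instantiates base_model and defaults to a single output dimension):
import torch
from sklearn.linear_model import LinearRegression
from opendataval.model.api import RegressionSkLearnWrapper

x_train = torch.randn(50, 3)
y_train = x_train.sum(dim=1, keepdim=True)  # single regression target

wrapped = RegressionSkLearnWrapper(LinearRegression)
wrapped.fit(x_train, y_train, sample_weight=torch.ones(50))
y_hat = wrapped.predict(x_train)  # (50, 1) tensor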
- class opendataval.model.api.TorchClassMixin(*args, **kwargs)#
Bases: TorchModel
Classifier Mixin for Torch Neural Networks.
- fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, sample_weight: Tensor | None = None, batch_size: int = 32, epochs: int = 1, lr: float = 0.01)#
Fits the model on the training data.
Fits a torch classifier Model object using the Adam optimizer and categorical cross-entropy loss.
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- batch_size : int, optional
Training batch size, by default 32
- epochs : int, optional
Number of training epochs, by default 1
- sample_weight : torch.Tensor, optional
Weights associated with each data point, by default None
- lr : float, optional
Learning rate for the Model, by default 0.01
- training: bool#
- class opendataval.model.api.TorchModel(*args, **kwargs)#
Bases: Model, Module
Torch Models have a device they belong to and share common behavior.
- property device#
- class opendataval.model.api.TorchPredictMixin(*args, **kwargs)#
Bases: TorchModel
Torch .predict() method mixin for Torch Neural Networks.
- predict(x: Tensor | Dataset) → Tensor #
Predict output from input tensor/data set.
Parameters#
- x : torch.Tensor
Input covariates
Returns#
- torch.Tensor
Predicted tensor output
- training: bool#
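Example (a minimal sketch of composing these mixins into a trainable classifier, mirroring how ClassifierMLP and LeNet are built below; TinyClassifier and its num_classes attribute are illustrative assumptions):
import torch
import torch.nn as nn
from opendataval.model.api import TorchClassMixin, TorchPredictMixin

class TinyClassifier(TorchClassMixin, TorchPredictMixin):
    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.num_classes = num_classes
        self.net = nn.Sequential(nn.Linear(input_dim, num_classes), nn.Softmax(dim=-1))

    def forward(self, x):
        return self.net(x)

x_train = torch.randn(64, 10)
y_train = torch.nn.functional.one_hot(torch.randint(0, 2, (64,)), num_classes=2).float()

model = TinyClassifier(10, 2)
model.fit(x_train, y_train, epochs=5, lr=0.01)  # Adam + cross-entropy from the mixin
probs = model.predict(x_train)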
- class opendataval.model.api.TorchRegressMixin(*args, **kwargs)#
Bases: TorchModel
Regressor Mixin for Torch Neural Networks.
- fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, sample_weight: Tensor | None = None, batch_size: int = 32, epochs: int = 1, lr: float = 0.01)#
Fits the regression model on the training data.
Fits a torch regression Model object using ADAM optimizer and MSE loss.
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- batch_size : int, optional
Training batch size, by default 32
- epochs : int, optional
Number of training epochs, by default 1
- sample_weight : torch.Tensor, optional
Weights associated with each data point, by default None
- lr : float, optional
Learning rate for the Model, by default 0.01
- training: bool#
- opendataval.model.api.to_numpy(tensors: tuple[Tensor]) → tuple[Tensor] #
Mini function to move tensors to the CPU for sk-learn.
opendataval.model.bert module#
- class opendataval.model.bert.BertClassifier(pretrained_model_name: str = 'distilbert-base-uncased', num_classes: int = 2, dropout_rate: float = 0.2, num_train_layers: int = 2)#
Bases: Model, Module
Fine-tune a pre-trained DistilBERT model on a classification task.
DistilBERT is a smaller/lighter version of BERT meant to be fine-tuned on other language tasks.
Parameters#
- pretrained_model_name : str, optional
Huggingface model directory containing the pretrained model for BERT, by default "distilbert-base-uncased" [2]
- num_classes : int, optional
Number of prediction classes, by default 2
- dropout_rate : float, optional
Dropout rate for the embeddings of BERT, helps in fine-tuning, by default 0.2
- num_train_layers : int, optional
Number of BERT layers to fine-tune. Minimum number is 1, by default 2
- fit(x_train: Dataset[str | list[str]], y_train: Tensor, sample_weight: Tensor | None = None, batch_size: int = 32, epochs: int = 1, lr: float = 0.001)#
Fit the model on the training data.
Fine-tunes a pre-trained BERT model on an input Sequence[str] by tokenizing the inputs and then fine-tuning the last few layers of BERT and the classifier head.
Parameters#
- x_train : Dataset[str]
Training data set of sentences or list[str] to be classified
- y_train : torch.Tensor
Data labels
- sample_weight : torch.Tensor, optional
Weights associated with each data point, must be passed in as a keyword arg, by default None
- batch_size : int, optional
Training batch size, by default 32
- epochs : int, optional
Number of training epochs, by default 1
- lr : float, optional
Learning rate for the Model, by default 0.001
Returns#
- self : object
Trained BERT classifier
- forward(input_ids: Tensor, attention_mask: Tensor | None = None)#
Forward pass through DistilBert with inputs from DistilBERT tokenizer output.
NOTE: this is only applicable for a DistilBERT model that doesn't require token_type_ids.
Parameters#
- input_ids : torch.Tensor
List of token ids to be fed to a model. [Input IDs?](https://huggingface.co/transformers/glossary#input-ids)
- attention_mask : torch.Tensor, optional
List of indices specifying which tokens should be attended to by the model, by default None. [Attention?](https://huggingface.co/transformers/glossary#attention-mask)
Returns#
- torch.Tensor
Predicted labels for the classification problem
- predict(x: Dataset[str | list[str]])#
Predict output from input sentences/tokens.
Parameters#
- x : Dataset[str | list[str]]
Input data set of sentences or list[str]
Returns#
- torch.Tensor
Predicted labels as a tensor
- tokenize(sentences: Sequence[str | list[str]]) → TensorDataset #
Convert sequence of sentences or tokens into DistilBERT inputs.
Given a sequence of sentences or tokens, computes the input_ids and attention_masks and loads them onto their respective tensor device. Any changes made to the tokenizer should be reflected here and in the .forward() method.
Parameters#
- sentences : Sequence[str | list[str]]
Sequence of sentences or tokens to be transformed into inputs for BERT.
Returns#
- TensorDataset
Two tensors representing input_ids and attention_masks. In more depth, each of these represents:
- input_ids – List of token ids to be fed to a model. [Input IDs?](https://huggingface.co/transformers/glossary#input-ids)
- attention_mask – List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names). [Mask?](https://huggingface.co/transformers/glossary#attention-mask)
If using a non-DistilBERT tokenizer, a third output also exists; token type ids aren't needed for DistilBERT models:
- token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or if "token_type_ids" is in self.model_input_names). [Type IDs?](https://huggingface.co/transformers/glossary#token-type-ids)
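Example (a minimal sketch on toy data; it assumes a plain list of strings is accepted where a Dataset[str] is expected and that labels are one-hot encoded):
import torch
from opendataval.model.bert import BertClassifier

sentences = ["great movie", "terrible plot", "loved it", "awful pacing"]
labels = torch.nn.functional.one_hot(torch.tensor([1, 0, 1, 0]), num_classes=2).float()

model = BertClassifier(num_classes=2, num_train_layers=1)
model.fit(sentences, labels, batch_size=2, epochs=1)
probs = model.predict(["what a film"])  # tensor of predicted class probabilities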
opendataval.model.grad module#
- class opendataval.model.grad.GradientModel#
Bases: Model
Provides access to gradients of a Model.
TODO: Some data evaluators may benefit from higher-order gradients or Hessians.
- abstract grad(x_data: Tensor | Dataset, y_train: Tensor | Dataset, *args, **kwargs) → Iterator[tuple[Tensor, ...]] #
Given input data, iterates through the computed gradients of the model.
Yields a tuple with gradients for each layer of the model for each input data point. The data the underlying model is trained on does not have to be the data for which the model's gradients are computed. An iterator is used because storing the computed gradients for every data point would use a lot of memory.
Parameters#
- x_data : Union[torch.Tensor, Dataset]
Data covariates
- y_train : Union[torch.Tensor, Dataset]
Data labels
Yields#
- Iterator[tuple[torch.Tensor, ...]]
Computed gradients (as a tuple, one entry per layer) yielded per data point, in order
- class opendataval.model.grad.TorchGradMixin(*args, **kwargs)#
Bases: GradientModel, TorchModel
Gradient Mixin for Torch Neural Networks.
- grad(x_data: Tensor | Dataset, y_data: Tensor | Dataset) → Iterator[tuple[Tensor, ...]] #
Given input data, yields the computed gradients for a torch model.
Parameters#
- x_data : Union[torch.Tensor, Dataset]
Data covariates
- y_data : Union[torch.Tensor, Dataset]
Data labels
Yields#
- Iterator[tuple[torch.Tensor, ...]]
Computed gradients (as a tuple, one entry per layer) yielded per data point, in order
- training: bool#
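Example (a minimal sketch that ranks training points by per-sample gradient norm, using this package's LogisticRegression, which mixes in TorchGradMixin):
import torch
from opendataval.model.logistic_regression import LogisticRegression

x = torch.randn(20, 10)
y = torch.nn.functional.one_hot(torch.randint(0, 2, (20,)), num_classes=2).float()

model = LogisticRegression(input_dim=10, num_classes=2)
model.fit(x, y, epochs=2)

# grad() yields one tuple of per-layer gradients per data point
grad_norms = [torch.cat([g.flatten() for g in layer_grads]).norm() for layer_grads in model.grad(x, y)]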
opendataval.model.lenet module#
- class opendataval.model.lenet.LeNet(num_classes: int, gray_scale: bool = True)#
Bases: TorchClassMixin, TorchPredictMixin
LeNet-5 convolutional neural net classifier.
Consists of two 5x5 convolution kernels and an MLP classifier. LeNet-5 was one of the earliest CNNs and was typically applied to digit recognition. LeNet-5 can be applied to higher-dimensional (such as color) images but doesn't generalize to them particularly well.
References#
[1] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998, doi: 10.1109/5.726791.
Parameters#
- num_classes : int
Number of prediction classes
- gray_scale : bool, optional
Whether the input image is gray-scaled. LeNet has been noted to not perform as well with color, so disable gray_scale at your own risk, by default True
- forward(x: Tensor)#
Forward pass of LeNet-5.
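Example (a minimal sketch; the 28x28 single-channel input shape is an assumption about the expected image size):
import torch
from opendataval.model.lenet import LeNet

model = LeNet(num_classes=10, gray_scale=True)
x = torch.randn(8, 1, 28, 28)  # batch of 8 grayscale images (assumed shape)
probs = model.predict(x)  # inference via TorchPredictMixin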
opendataval.model.logistic_regression module#
- class opendataval.model.logistic_regression.LogisticRegression(input_dim: int, num_classes: int)#
Bases: TorchClassMixin, TorchPredictMixin, TorchGradMixin
Initializes a LogisticRegression model.
Parameters#
- input_dim : int
Size of the input dimension of the LogisticRegression
- num_classes : int
Size of the output dimension of the LR, outputs selection probabilities
opendataval.model.mlp module#
- class opendataval.model.mlp.ClassifierMLP(input_dim: int, num_classes: int, layers: int = 5, hidden_dim: int = 25, act_fn: Callable | None = None)#
Bases: TorchClassMixin, TorchPredictMixin, TorchGradMixin
Initializes the Multilayer Perceptron Classifier.
Parameters#
- input_dim : int
Size of the input dimension of the MLP
- num_classes : int
Size of the output dimension of the MLP, outputs selection probabilities
- layers : int, optional
Number of layers for the MLP, by default 5
- hidden_dim : int, optional
Hidden dimension for the MLP, by default 25
- act_fn : Callable, optional
Activation function for the MLP; if None, set to nn.ReLU, by default None
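Example (a minimal sketch; passing the activation as a module instance is an assumption about what act_fn accepts):
import torch
import torch.nn as nn
from opendataval.model.mlp import ClassifierMLP

x = torch.randn(64, 10)
y = torch.nn.functional.one_hot(torch.randint(0, 3, (64,)), num_classes=3).float()

model = ClassifierMLP(input_dim=10, num_classes=3, layers=3, hidden_dim=16, act_fn=nn.Tanh())
model.fit(x, y, epochs=5)  # Adam + cross-entropy via TorchClassMixin
probs = model.predict(x)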
- class opendataval.model.mlp.RegressionMLP(input_dim: int, num_classes: int = 1, layers: int = 5, hidden_dim: int = 25, act_fn: Callable | None = None)#
Bases: TorchRegressMixin, TorchPredictMixin, TorchGradMixin
Initializes the Multilayer Perceptron Regressor.
Parameters#
- input_dim : int
Size of the input dimension of the MLP
- num_classes : int
Size of the output dimension of the MLP; >1 means multi-dimensional output
- layers : int, optional
Number of layers for the MLP, by default 5
- hidden_dim : int, optional
Hidden dimension for the MLP, by default 25
- act_fn : Callable, optional
Activation function for the MLP; if None, set to nn.ReLU, by default None
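Example (a minimal sketch of the regression counterpart, trained with MSE loss via TorchRegressMixin):
import torch
from opendataval.model.mlp import RegressionMLP

x = torch.randn(64, 5)
y = x.sum(dim=1, keepdim=True)  # single-dimensional target

model = RegressionMLP(input_dim=5, num_classes=1)
model.fit(x, y, epochs=5, lr=0.01)
y_hat = model.predict(x)  # (64, 1) tensor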
Module contents#
Prediction models to be trained, used for prediction, and evaluated.
Models#
Model is an ABC used to take an existing model and make it compatible with the DataEvaluator and other related objects.
API#
- Model – Abstract class of Models.
- GradientModel – Provides access to gradients of a Model.
- ModelFactory – Factory to create prediction models from specified presets.
Torch Mixins#
- TorchClassMixin – Classifier Mixin for Torch Neural Networks.
- TorchRegressMixin – Regressor Mixin for Torch Neural Networks.
- TorchPredictMixin – Torch .predict() method mixin for Torch Neural Networks.
- TorchGradMixin – Gradient Mixin for Torch Neural Networks.
Sci-kit learn wrappers#
- ClassifierSkLearnWrapper – Wrapper for sk-learn classifiers whose fit methods accept sample weights.
- ClassifierUnweightedSkLearnWrapper – Wrapper for sk-learn classifiers that don't have weighted fit methods.
- RegressionSkLearnWrapper – Wrapper for sk-learn regression models.
Default Hyperparameters#
- opendataval.model.ModelFactory(model_name: str, fetcher: DataFetcher | None = None, device: device = device(type='cpu'), *args, **kwargs) → Model #
Factory to create prediction models from specified presets.
Model Factory that creates a specified model based on the input parameters. It is recommended to import the specific model and specify additional arguments instead of relying on the factory.
Parameters#
- model_name : str
Name of prediction model
- fetcher : DataFetcher, optional
DataFetcher whose covariate and label dimensions (typically the shape besides the first dimension) are used to size the model, by default None
- device : torch.device, optional
Tensor device for acceleration; some models do not use this argument, by default torch.device("cpu")
- args : tuple[Any]
Additional positional arguments passed to the Model constructor
- kwargs : dict[str, Any]
Additional keyword arguments passed to the Model constructor
Returns#
- Model
Preset model with the specified dimensions on the specified tensor device
Raises#
- ValueError
Raised when the model name is not matched
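Example (a minimal sketch; "iris" is assumed to be a dataset name registered with DataFetcher, and "classifiermlp" is one of the keys in Model.Models above):
from opendataval.dataloader import DataFetcher
from opendataval.model import ModelFactory

fetcher = DataFetcher(dataset_name="iris")  # assumed registered dataset name
model = ModelFactory("classifiermlp", fetcher=fetcher)
# model is a ClassifierMLP sized to the fetcher's covariate and label dimensions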