opendataval.model package#
Submodules#
opendataval.model.api module#
- class opendataval.model.api.ClassifierSkLearnWrapper(base_model, num_classes: int, *args, **kwargs)#
Bases: Model
Wrapper for sk-learn classifiers whose fit methods accept sample weights.
Example:
wrapped = ClassifierSkLearnWrapper(LinearRegression, 2)
Parameters#
- base_model : BaseModel
Any sk-learn model that supports sample_weights
- num_classes : int
Label dimensionality
- fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weight: Tensor | None = None, **kwargs)#
Fits the model on the training data.
Fits an sk-learn wrapped classifier Model. If there are fewer classes in the sample than num_classes, a dummy model is used instead.
wrapped = ClassifierSkLearnWrapper(MLPClassifier, 2)
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- args : tuple[Any]
Additional positional args
- sample_weight : torch.Tensor, optional
Weights associated with each data point, must be passed in as a keyword arg, by default None
- kwargs : dict[str, Any]
Additional keyword args
- predict(x: Tensor | Dataset) → Tensor #
Predict labels from the sk-learn model.
Makes a prediction based on the input tensor using the .predict_proba(x) method of sk-learn classifiers. The output dimension will match the labels passed to the .fit(x_train, y_train) method.
Parameters#
- x : torch.Tensor | Dataset
Input tensor
Returns#
- torch.Tensor
Output tensor
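Example (a minimal sketch, assuming base_model is passed as a class that the wrapper instantiates, as in the MLPClassifier example above, and that labels are one-hot encoded with dimensionality num_classes):
import torch
from sklearn.linear_model import LogisticRegression
from opendataval.model.api import ClassifierSkLearnWrapper

x_train = torch.randn(100, 10)
y_train = torch.nn.functional.one_hot(torch.randint(0, 2, (100,)), num_classes=2).float()

wrapped = ClassifierSkLearnWrapper(LogisticRegression, 2)
wrapped.fit(x_train, y_train, sample_weight=torch.rand(100))  # keyword-only weights
probs = wrapped.predict(x_train)  # (100, 2) tensor via predict_proba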
- class opendataval.model.api.ClassifierUnweightedSkLearnWrapper(base_model, num_classes: int, *args, **kwargs)#
Bases: ClassifierSkLearnWrapper
Wrapper for sk-learn classifiers that don't have weighted fit methods.
Example:
wrapped = ClassifierUnweightedSkLearnWrapper(KNeighborsClassifier, 2)
Parameters#
- base_model : BaseModel
Any sk-learn model; its fit method does not need to support sample_weights
- num_classes : int
Label dimensionality
- fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weight: Tensor | None = None, **kwargs)#
Fits the model on the training data.
Fits an sk-learn wrapped classifier Model without sample weights.
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- args : tuple[Any]
Additional positional args
- sample_weight : torch.Tensor, optional
Weights associated with each data point, must be passed in as a keyword arg, by default None
- kwargs : dict[str, Any]
Additional keyword args
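Example (a minimal sketch; KNeighborsClassifier's fit does not accept sample_weight, which is why the unweighted wrapper is used, and extra constructor kwargs such as n_neighbors are assumed to be forwarded to base_model):
import torch
from sklearn.neighbors import KNeighborsClassifier
from opendataval.model.api import ClassifierUnweightedSkLearnWrapper

x_train = torch.randn(100, 10)
y_train = torch.nn.functional.one_hot(torch.randint(0, 2, (100,)), num_classes=2).float()

wrapped = ClassifierUnweightedSkLearnWrapper(KNeighborsClassifier, 2, n_neighbors=5)
wrapped.fit(x_train, y_train)  # fits without passing sample weights through
probs = wrapped.predict(x_train)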
- class opendataval.model.api.Model#
Bases: ABC
Abstract class of Models. Provides a template for models.
- Models: ClassVar[dict[str, Self]] = {'bertclassifier': <class 'opendataval.model.bert.BertClassifier'>, 'classifiermlp': <class 'opendataval.model.mlp.ClassifierMLP'>, 'classifiersklearnwrapper': <class 'opendataval.model.api.ClassifierSkLearnWrapper'>, 'classifierunweightedsklearnwrapper': <class 'opendataval.model.api.ClassifierUnweightedSkLearnWrapper'>, 'gradientmodel': <class 'opendataval.model.grad.GradientModel'>, 'lenet': <class 'opendataval.model.lenet.LeNet'>, 'logisticregression': <class 'opendataval.model.logistic_regression.LogisticRegression'>, 'regressionmlp': <class 'opendataval.model.mlp.RegressionMLP'>, 'regressionsklearnwrapper': <class 'opendataval.model.api.RegressionSkLearnWrapper'>, 'torchclassmixin': <class 'opendataval.model.api.TorchClassMixin'>, 'torchgradmixin': <class 'opendataval.model.grad.TorchGradMixin'>, 'torchmodel': <class 'opendataval.model.api.TorchModel'>, 'torchpredictmixin': <class 'opendataval.model.api.TorchPredictMixin'>, 'torchregressmixin': <class 'opendataval.model.api.TorchRegressMixin'>}#
- clone() → Self #
Clone Model object.
Copies and returns an object representing the current state. We often take a base model and train it several times, so we need the same initial conditions each time; this is the default clone implementation.
Returns#
- self : object
Returns a deep copy of the model.
- abstract fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weights: Tensor | None = None, **kwargs) → Self #
Fits the model on the training data.
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- args : tuple[Any]
Additional positional args
- sample_weights : torch.Tensor, optional
Weights associated with each data point, must be passed in as a keyword arg, by default None
- kwargs : dict[str, Any]
Additional keyword args
Returns#
- selfobject
Returns self for API consistency with sklearn.
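Example (a minimal sketch; MeanClassifier is a hypothetical toy subclass implementing the abstract interface above):
import torch
from opendataval.model.api import Model

class MeanClassifier(Model):
    """Toy model that always predicts the mean label seen during fit."""

    def fit(self, x_train, y_train, *args, sample_weights=None, **kwargs):
        # Store the mean (one-hot) label; this toy model ignores the covariates
        self.label_mean = torch.as_tensor(y_train).float().mean(dim=0)
        return self  # return self for API consistency with sklearn

    def predict(self, x):
        # Broadcast the stored mean label to one prediction per input row
        return self.label_mean.expand(len(x), -1)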
- class opendataval.model.api.RegressionSkLearnWrapper(base_model, *args, **kwargs)#
Bases: Model
Wrapper for sk-learn regression models.
Example:
wrapped = RegressionSkLearnWrapper(LinearRegression)
Parameters#
- base_model : BaseModel
Any sk-learn model that supports sample_weights
- fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weight: Tensor | None = None, **kwargs)#
Fits the model on the training data.
Fits an sk-learn wrapped regression Model. If there is insufficient data to fit a regression (e.g., len(x_train) == 0), a DummyRegressor is used that predicts np.zeros((num_samples, self.num_classes)).
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- args : tuple[Any]
Additional positional args
- sample_weight : torch.Tensor, optional
Weights associated with each data point, must be passed in as a keyword arg, by default None
- kwargs : dict[str, Any]
Additional keyword args
- predict(x: Tensor | Dataset) → Tensor #
Predict values from the sk-learn regression model.
Makes a prediction based on the input tensor using the .predict(x) method of sk-learn regression models. The output dimension will match self.num_classes.
Parameters#
- x : torch.Tensor | Dataset
Input tensor
Returns#
- torch.Tensor
Output tensor
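Example (a minimal sketch, assuming the wrapper instantiates base_model and defaults to a single output dimension):
import torch
from sklearn.linear_model import LinearRegression
from opendataval.model.api import RegressionSkLearnWrapper

x_train = torch.randn(50, 3)
y_train = x_train.sum(dim=1, keepdim=True)  # single regression target

wrapped = RegressionSkLearnWrapper(LinearRegression)
wrapped.fit(x_train, y_train, sample_weight=torch.ones(50))
y_hat = wrapped.predict(x_train)  # (50, 1) tensor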
- class opendataval.model.api.TorchClassMixin(*args, **kwargs)#
Bases: TorchModel
Classifier Mixin for Torch Neural Networks.
- fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, sample_weight: Tensor | None = None, batch_size: int = 32, epochs: int = 1, lr: float = 0.01)#
Fits the model on the training data.
Fits a torch classifier Model object using the Adam optimizer and categorical cross-entropy loss.
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- batch_size : int, optional
Training batch size, by default 32
- epochs : int, optional
Number of training epochs, by default 1
- sample_weight : torch.Tensor, optional
Weights associated with each data point, by default None
- lr : float, optional
Learning rate for the Model, by default 0.01
- training: bool#
- class opendataval.model.api.TorchModel(*args, **kwargs)#
Bases: Model, Module
Torch Models have a device they belong to and share common behavior.
- property device#
- class opendataval.model.api.TorchPredictMixin(*args, **kwargs)#
Bases: TorchModel
Torch .predict() method mixin for Torch Neural Networks.
- predict(x: Tensor | Dataset) → Tensor #
Predict output from input tensor/data set.
Parameters#
- x : torch.Tensor
Input covariates
Returns#
- torch.Tensor
Predicted tensor output
- training: bool#
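Example (a minimal sketch of composing these mixins into a trainable classifier, mirroring how ClassifierMLP and LeNet are built below; TinyClassifier and its num_classes attribute are illustrative assumptions):
import torch
import torch.nn as nn
from opendataval.model.api import TorchClassMixin, TorchPredictMixin

class TinyClassifier(TorchClassMixin, TorchPredictMixin):
    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.num_classes = num_classes
        self.net = nn.Sequential(nn.Linear(input_dim, num_classes), nn.Softmax(dim=-1))

    def forward(self, x):
        return self.net(x)

x_train = torch.randn(64, 10)
y_train = torch.nn.functional.one_hot(torch.randint(0, 2, (64,)), num_classes=2).float()

model = TinyClassifier(10, 2)
model.fit(x_train, y_train, epochs=5, lr=0.01)  # Adam + cross-entropy from the mixin
probs = model.predict(x_train)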
- class opendataval.model.api.TorchRegressMixin(*args, **kwargs)#
Bases: TorchModel
Regressor Mixin for Torch Neural Networks.
- fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, sample_weight: Tensor | None = None, batch_size: int = 32, epochs: int = 1, lr: float = 0.01)#
Fits the regression model on the training data.
Fits a torch regression Model object using ADAM optimizer and MSE loss.
Parameters#
- x_train : torch.Tensor | Dataset
Data covariates
- y_train : torch.Tensor | Dataset
Data labels
- batch_size : int, optional
Training batch size, by default 32
- epochs : int, optional
Number of training epochs, by default 1
- sample_weight : torch.Tensor, optional
Weights associated with each data point, by default None
- lr : float, optional
Learning rate for the Model, by default 0.01
- training: bool#
- opendataval.model.api.to_numpy(tensors: tuple[Tensor]) → tuple[Tensor] #
Mini function to move tensors to the CPU for sk-learn.
opendataval.model.bert module#
- class opendataval.model.bert.BertClassifier(pretrained_model_name: str = 'distilbert-base-uncased', num_classes: int = 2, dropout_rate: float = 0.2, num_train_layers: int = 2)#
Bases: Model, Module
Fine-tune a pre-trained DistilBERT model on a classification task.
DistilBERT is a smaller/lighter version of BERT meant to be fine-tuned on other language tasks.
Parameters#
- pretrained_model_name : str, optional
Huggingface model directory containing the pretrained model for BERT, by default "distilbert-base-uncased" [2]
- num_classes : int, optional
Number of prediction classes, by default 2
- dropout_rate : float, optional
Dropout rate for the embeddings of BERT, helps in fine-tuning, by default 0.2
- num_train_layers : int, optional
Number of BERT layers to fine-tune. Minimum number is 1, by default 2
- fit(x_train: Dataset[str | list[str]], y_train: Tensor, sample_weight: Tensor | None = None, batch_size: int = 32, epochs: int = 1, lr: float = 0.001)#
Fit the model on the training data.
Fine-tunes a pre-trained BERT model on an input Sequence[str] by tokenizing the inputs and then fine-tuning the last few layers of BERT and the classifier head.
Parameters#
- x_train : Dataset[str]
Training data set of sentences or list[str] to be classified
- y_train : torch.Tensor
Data labels
- sample_weight : torch.Tensor, optional
Weights associated with each data point, must be passed in as a keyword arg, by default None
- batch_size : int, optional
Training batch size, by default 32
- epochs : int, optional
Number of training epochs, by default 1
- lr : float, optional
Learning rate for the Model, by default 0.001
Returns#
- self : object
Trained BERT classifier
- forward(input_ids: Tensor, attention_mask: Tensor | None = None)#
Forward pass through DistilBert with inputs from DistilBERT tokenizer output.
NOTE: this is only applicable for a DistilBERT model that doesn't require token_type_ids.
Parameters#
- input_ids : torch.Tensor
List of token ids to be fed to a model. [Input IDs?](https://huggingface.co/transformers/glossary#input-ids)
- attention_mask : torch.Tensor, optional
List of indices specifying which tokens should be attended to by the model, by default None. [Attention?](https://huggingface.co/transformers/glossary#attention-mask)
Returns#
- torch.Tensor
Predicted labels for the classification problem
- predict(x: Dataset[str | list[str]])#
Predict output from input sentences/tokens.
Parameters#
- x : Dataset[str | list[str]]
Input data set of sentences or list[str]
Returns#
- torch.Tensor
Predicted labels as a tensor
- tokenize(sentences: Sequence[str | list[str]]) → TensorDataset #
Convert sequence of sentences or tokens into DistilBERT inputs.
Given a sequence of sentences or tokens, computes the input_ids and attention_masks and loads them onto their respective tensor device. Any changes made to the tokenizer should be reflected here and in the .forward() method.
Parameters#
- sentences : Sequence[str | list[str]]
Sequence of sentences or tokens to be transformed into inputs for BERT.
Returns#
- TensorDataset
Two tensors representing input_ids and attention_masks. In more depth, each of these represents:
- input_ids – List of token ids to be fed to a model. [Input IDs?](https://huggingface.co/transformers/glossary#input-ids)
- attention_mask – List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names). [Mask?](https://huggingface.co/transformers/glossary#attention-mask)
If using a non-DistilBERT tokenizer, a third output also exists; token type ids aren't needed for DistilBERT models:
- token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or if "token_type_ids" is in self.model_input_names). [Type IDs?](https://huggingface.co/transformers/glossary#token-type-ids)
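Example (a minimal sketch on toy data; it assumes a plain list of strings is accepted where a Dataset[str] is expected and that labels are one-hot encoded):
import torch
from opendataval.model.bert import BertClassifier

sentences = ["great movie", "terrible plot", "loved it", "awful pacing"]
labels = torch.nn.functional.one_hot(torch.tensor([1, 0, 1, 0]), num_classes=2).float()

model = BertClassifier(num_classes=2, num_train_layers=1)
model.fit(sentences, labels, batch_size=2, epochs=1)
probs = model.predict(["what a film"])  # tensor of predicted class probabilities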
opendataval.model.grad module#
- class opendataval.model.grad.GradientModel#
Bases: Model
Provides access to gradients of a Model.
TODO: Some data evaluators may benefit from higher-order gradients or Hessians.
- abstract grad(x_data: Tensor | Dataset, y_train: Tensor | Dataset, *args, **kwargs) → Iterator[tuple[Tensor, ...]] #
Given input data, iterates through the computed gradients of the model.
Yields a tuple with gradients for each layer of the model for each input data point. The data the underlying model is trained on does not have to be the data for which the model's gradients are computed. An iterator is used because storing the computed gradients for every data point would use a lot of memory.
Parameters#
- x_data : Union[torch.Tensor, Dataset]
Data covariates
- y_train : Union[torch.Tensor, Dataset]
Data labels
Yields#
- Iterator[tuple[torch.Tensor, ...]]
Computed gradients (as a tuple, one entry per layer) yielded per data point, in order
- class opendataval.model.grad.TorchGradMixin(*args, **kwargs)#
Bases: GradientModel, TorchModel
Gradient Mixin for Torch Neural Networks.
- grad(x_data: Tensor | Dataset, y_data: Tensor | Dataset) → Iterator[tuple[Tensor, ...]] #
Given input data, yields the computed gradients for a torch model.
Parameters#
- x_data : Union[torch.Tensor, Dataset]
Data covariates
- y_data : Union[torch.Tensor, Dataset]
Data labels
Yields#
- Iterator[tuple[torch.Tensor, ...]]
Computed gradients (as a tuple, one entry per layer) yielded per data point, in order
- training: bool#
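Example (a minimal sketch that ranks training points by per-sample gradient norm, using this package's LogisticRegression, which mixes in TorchGradMixin):
import torch
from opendataval.model.logistic_regression import LogisticRegression

x = torch.randn(20, 10)
y = torch.nn.functional.one_hot(torch.randint(0, 2, (20,)), num_classes=2).float()

model = LogisticRegression(input_dim=10, num_classes=2)
model.fit(x, y, epochs=2)

# grad() yields one tuple of per-layer gradients per data point
grad_norms = [torch.cat([g.flatten() for g in layer_grads]).norm() for layer_grads in model.grad(x, y)]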
opendataval.model.lenet module#
- class opendataval.model.lenet.LeNet(num_classes: int, gray_scale: bool = True)#
Bases: TorchClassMixin, TorchPredictMixin
LeNet-5 convolutional neural net classifier.
Consists of two 5x5 convolution kernels and an MLP classifier. LeNet-5 was one of the earliest CNNs and was typically applied to digit recognition. LeNet-5 can be applied to higher-dimensional (such as color) images but doesn't generalize to them particularly well.
References#
[1] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998, doi: 10.1109/5.726791.
Parameters#
- num_classes : int
Number of prediction classes
- gray_scale : bool, optional
Whether the input image is gray-scaled. LeNet has been noted to not perform as well with color, so disable gray_scale at your own risk, by default True
- forward(x: Tensor)#
Forward pass of LeNet-5.
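Example (a minimal sketch; the 28x28 single-channel input shape is an assumption about the expected image size):
import torch
from opendataval.model.lenet import LeNet

model = LeNet(num_classes=10, gray_scale=True)
x = torch.randn(8, 1, 28, 28)  # batch of 8 grayscale images (assumed shape)
probs = model.predict(x)  # inference via TorchPredictMixin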
opendataval.model.logistic_regression module#
- class opendataval.model.logistic_regression.LogisticRegression(input_dim: int, num_classes: int)#
Bases: TorchClassMixin, TorchPredictMixin, TorchGradMixin
Initializes a LogisticRegression model.
Parameters#
- input_dim : int
Size of the input dimension of the LogisticRegression
- num_classes : int
Size of the output dimension of the LR, outputs selection probabilities
opendataval.model.mlp module#
- class opendataval.model.mlp.ClassifierMLP(input_dim: int, num_classes: int, layers: int = 5, hidden_dim: int = 25, act_fn: Callable | None = None)#
Bases: TorchClassMixin, TorchPredictMixin, TorchGradMixin
Initializes the Multilayer Perceptron Classifier.
Parameters#
- input_dim : int
Size of the input dimension of the MLP
- num_classes : int
Size of the output dimension of the MLP, outputs selection probabilities
- layers : int, optional
Number of layers for the MLP, by default 5
- hidden_dim : int, optional
Hidden dimension for the MLP, by default 25
- act_fn : Callable, optional
Activation function for the MLP; if None, set to nn.ReLU, by default None
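Example (a minimal sketch; passing the activation as a module instance is an assumption about what act_fn accepts):
import torch
import torch.nn as nn
from opendataval.model.mlp import ClassifierMLP

x = torch.randn(64, 10)
y = torch.nn.functional.one_hot(torch.randint(0, 3, (64,)), num_classes=3).float()

model = ClassifierMLP(input_dim=10, num_classes=3, layers=3, hidden_dim=16, act_fn=nn.Tanh())
model.fit(x, y, epochs=5)  # Adam + cross-entropy via TorchClassMixin
probs = model.predict(x)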
- class opendataval.model.mlp.RegressionMLP(input_dim: int, num_classes: int = 1, layers: int = 5, hidden_dim: int = 25, act_fn: Callable | None = None)#
Bases: TorchRegressMixin, TorchPredictMixin, TorchGradMixin
Initializes the Multilayer Perceptron Regressor.
Parameters#
- input_dim : int
Size of the input dimension of the MLP
- num_classes : int
Size of the output dimension of the MLP; >1 means multi-dimensional output
- layers : int, optional
Number of layers for the MLP, by default 5
- hidden_dim : int, optional
Hidden dimension for the MLP, by default 25
- act_fn : Callable, optional
Activation function for the MLP; if None, set to nn.ReLU, by default None
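Example (a minimal sketch of the regression counterpart, trained with MSE loss via TorchRegressMixin):
import torch
from opendataval.model.mlp import RegressionMLP

x = torch.randn(64, 5)
y = x.sum(dim=1, keepdim=True)  # single-dimensional target

model = RegressionMLP(input_dim=5, num_classes=1)
model.fit(x, y, epochs=5, lr=0.01)
y_hat = model.predict(x)  # (64, 1) tensor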
Module contents#
Prediction models to be trained, used for prediction, and evaluated.
Models#
Model is an ABC used to take an existing model and make it compatible with the DataEvaluator and other related objects.
API#
- Model – Abstract class of Models.
- GradientModel – Provides access to gradients of a Model.
- ModelFactory – Factory to create prediction models from specified presets.
Torch Mixins#
- TorchClassMixin – Classifier Mixin for Torch Neural Networks.
- TorchRegressMixin – Regressor Mixin for Torch Neural Networks.
- TorchPredictMixin – Torch .predict() method mixin for Torch Neural Networks.
- TorchGradMixin – Gradient Mixin for Torch Neural Networks.
Sci-kit learn wrappers#
- ClassifierSkLearnWrapper – Wrapper for sk-learn classifiers whose fit methods accept sample weights.
- ClassifierUnweightedSkLearnWrapper – Wrapper for sk-learn classifiers that don't have weighted fit methods.
- RegressionSkLearnWrapper – Wrapper for sk-learn regression models.
Default Hyperparameters#
- opendataval.model.ModelFactory(model_name: str, fetcher: DataFetcher | None = None, device: device = device(type='cpu'), *args, **kwargs) → Model #
Factory to create prediction models from specified presets.
Model Factory that creates a specified model based on the input parameters. It is recommended to import the specific model and specify additional arguments instead of relying on the factory.
Parameters#
- model_name : str
Name of prediction model
- fetcher : DataFetcher, optional
DataFetcher whose covariate and label dimensions (typically the shape besides the first dimension) are used to size the model, by default None
- device : torch.device, optional
Tensor device for acceleration; some models do not use this argument, by default torch.device("cpu")
- args : tuple[Any]
Additional positional arguments passed to the Model constructor
- kwargs : dict[str, Any]
Additional keyword arguments passed to the Model constructor
Returns#
- Model
Preset model with the specified dimensions on the specified tensor device
Raises#
- ValueError
Raised when the model name is not matched
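Example (a minimal sketch; "iris" is assumed to be a dataset name registered with DataFetcher, and "classifiermlp" is one of the keys in Model.Models above):
from opendataval.dataloader import DataFetcher
from opendataval.model import ModelFactory

fetcher = DataFetcher(dataset_name="iris")  # assumed registered dataset name
model = ModelFactory("classifiermlp", fetcher=fetcher)
# model is a ClassifierMLP sized to the fetcher's covariate and label dimensions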