opendataval.model package#

Submodules#

opendataval.model.api module#

class opendataval.model.api.ClassifierSkLearnWrapper(base_model, num_classes: int, *args, **kwargs)#

Bases: Model

Wrapper for sk-learn classifiers that can have weighted fit methods.

Example:

wrapped = ClassifierSkLearnWrapper(LogisticRegression, 2)

Parameters#

base_model : BaseModel

Any sk-learn model whose fit method supports sample_weight

num_classes : int

Label dimensionality

fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weight: Tensor | None = None, **kwargs)#

Fits the model on the training data.

Fits a sk-learn wrapped classifier Model. If there are fewer classes in the sample than num_classes, uses a dummy model.

wrapped = ClassifierSkLearnWrapper(MLPClassifier, 2)

Parameters#

x_train : torch.Tensor | Dataset

Data covariates

y_train : torch.Tensor | Dataset

Data labels

args : tuple[Any]

Additional positional args

sample_weight : torch.Tensor, optional

Weights associated with each data point, must be passed in as a keyword argument, by default None

kwargs : dict[str, Any]

Additional keyword args
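The fallback described for fit can be pictured with a small stand-alone sketch. Everything in it is hypothetical: WrapperSketch, WeightedPriorClassifier, and ConstantClassifier are plain-Python stand-ins, and the majority-label dummy is an assumption rather than a description of what opendataval's dummy model actually predicts.

```python
from collections import Counter


class ConstantClassifier:
    """Hypothetical dummy model: always predicts the same one-hot label."""

    def __init__(self, label: int, num_classes: int):
        self.row = [1.0 if c == label else 0.0 for c in range(num_classes)]

    def predict_proba(self, x):
        return [list(self.row) for _ in x]


class WeightedPriorClassifier:
    """Toy base model with an sk-learn-like weighted fit (ignores the features)."""

    def fit(self, x, y, sample_weight=None):
        weights = sample_weight if sample_weight is not None else [1.0] * len(y)
        totals = Counter()
        for label, w in zip(y, weights):
            totals[label] += w
        norm = sum(totals.values())
        self.classes_ = sorted(totals)
        self.probs_ = [totals[c] / norm for c in self.classes_]
        return self

    def predict_proba(self, x):
        return [list(self.probs_) for _ in x]


class WrapperSketch:
    """Falls back to a dummy model when the sample lacks some classes."""

    def __init__(self, base_model, num_classes: int, *args, **kwargs):
        self.model = base_model(*args, **kwargs)  # assumes a class is passed in
        self.num_classes = num_classes

    def fit(self, x_train, y_train, *args, sample_weight=None, **kwargs):
        if len(set(y_train)) < self.num_classes:
            # Too few classes observed: predict the most common label instead
            # (label 0 if there is no data at all). This choice is an assumption.
            majority = Counter(y_train).most_common(1)[0][0] if y_train else 0
            self.model = ConstantClassifier(majority, self.num_classes)
        else:
            self.model.fit(x_train, y_train, *args, sample_weight=sample_weight, **kwargs)
        return self

    def predict(self, x):
        return self.model.predict_proba(x)


# Both classes present: the weighted base model is fitted.
full = WrapperSketch(WeightedPriorClassifier, 2).fit(
    [[0.0], [1.0], [2.0]], [0, 1, 1], sample_weight=[2.0, 1.0, 1.0]
)
# Only class 1 present: the dummy model takes over.
partial = WrapperSketch(WeightedPriorClassifier, 2).fit([[0.0], [1.0]], [1, 1])
```

Keeping the output width fixed at num_classes even in the fallback case means downstream code that compares predictions across differently sized subsets never sees a shape change.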

predict(x: Tensor | Dataset) Tensor#

Predict labels from sk-learn model.

Makes a prediction based on the input tensor. Uses the .predict_proba(x) method on sk-learn classifiers. The output dimension will match that of the labels passed to the .fit(x_train, y_train) method.

Parameters#

x : torch.Tensor | Dataset

Input tensor

Returns#

torch.Tensor

Output tensor

class opendataval.model.api.ClassifierUnweightedSkLearnWrapper(base_model, num_classes: int, *args, **kwargs)#

Bases: ClassifierSkLearnWrapper

Wrapper for sk-learn classifiers that don’t have weighted fit methods.

Example:

wrapped = ClassifierUnweightedSkLearnWrapper(KNeighborsClassifier, 2)

Parameters#

base_model : BaseModel

Any sk-learn classifier whose fit method does not support sample_weight

num_classes : int

Label dimensionality

fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weight: Tensor | None = None, **kwargs)#

Fits the model on the training data.

Fits a sk-learn wrapped classifier Model without sample weight.

Parameters#

x_train : torch.Tensor | Dataset

Data covariates

y_train : torch.Tensor | Dataset

Data labels

args : tuple[Any]

Additional positional args

sample_weight : torch.Tensor, optional

Weights associated with each data point, must be passed in as a keyword argument, by default None

kwargs : dict[str, Any]

Additional keyword args

class opendataval.model.api.Model#

Bases: ABC

Abstract class of Models. Provides a template for models.

Models: ClassVar[dict[str, Self]] = {'bertclassifier': <class 'opendataval.model.bert.BertClassifier'>, 'classifiermlp': <class 'opendataval.model.mlp.ClassifierMLP'>, 'classifiersklearnwrapper': <class 'opendataval.model.api.ClassifierSkLearnWrapper'>, 'classifierunweightedsklearnwrapper': <class 'opendataval.model.api.ClassifierUnweightedSkLearnWrapper'>, 'gradientmodel': <class 'opendataval.model.grad.GradientModel'>, 'lenet': <class 'opendataval.model.lenet.LeNet'>, 'logisticregression': <class 'opendataval.model.logistic_regression.LogisticRegression'>, 'regressionmlp': <class 'opendataval.model.mlp.RegressionMLP'>, 'regressionsklearnwrapper': <class 'opendataval.model.api.RegressionSkLearnWrapper'>, 'torchclassmixin': <class 'opendataval.model.api.TorchClassMixin'>, 'torchgradmixin': <class 'opendataval.model.grad.TorchGradMixin'>, 'torchmodel': <class 'opendataval.model.api.TorchModel'>, 'torchpredictmixin': <class 'opendataval.model.api.TorchPredictMixin'>, 'torchregressmixin': <class 'opendataval.model.api.TorchRegressMixin'>}#
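The keys of Models are the lowercase class names of every Model subclass. How the registry is filled is not stated on this page; one common way to get this behavior is `__init_subclass__`, sketched below with a hypothetical re-creation of the base class (the two subclasses are empty placeholders, not the real LeNet or RegressionMLP).

```python
from abc import ABC, abstractmethod
from typing import ClassVar


class Model(ABC):
    """Hypothetical re-creation of a self-registering base class."""

    Models: ClassVar[dict[str, type]] = {}

    def __init_subclass__(cls, **kwargs):
        # Runs once per subclass definition; the key is the lowercase class name.
        super().__init_subclass__(**kwargs)
        cls.Models[cls.__name__.lower()] = cls

    @abstractmethod
    def predict(self, x): ...


class LeNet(Model):  # placeholder, not the real LeNet
    def predict(self, x):
        return x


class RegressionMLP(Model):  # placeholder, not the real RegressionMLP
    def predict(self, x):
        return x


registered = sorted(Model.Models)
```

With this pattern a model can be looked up by a case-insensitive name, for example `Model.Models["lenet"]`, without maintaining a separate list by hand.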
clone() Self#

Clone Model object.

Copies and returns an object representing the current state. We often take a base model and train it several times, so each run needs the same initial conditions. This is the default clone implementation.

Returns#

self : object

Returns deep copy of model.
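A minimal sketch of why cloning matters when one base model is trained many times. TinyModel is a hypothetical stand-in written for this example, and using copy.deepcopy is an assumption consistent with the "deep copy" wording above, not a quote of the library's code.

```python
import copy
import random


class TinyModel:
    """Hypothetical stand-in for a Model subclass (not part of opendataval)."""

    def __init__(self, seed: int = 0):
        # Randomly initialized weight: the "initial condition" to preserve.
        self.weight = random.Random(seed).uniform(-1.0, 1.0)

    def clone(self) -> "TinyModel":
        # Deep copy so the clone shares no mutable state with the original.
        return copy.deepcopy(self)

    def fit(self, x_train, y_train) -> "TinyModel":
        # One pass of gradient descent on squared error for y ~ weight * x.
        for x, y in zip(x_train, y_train):
            self.weight -= 0.1 * 2 * (self.weight * x - y) * x
        return self


base = TinyModel(seed=42)
initial_weight = base.weight

# Train two clones on different subsets; each starts from the same weight.
model_a = base.clone().fit([1.0, 2.0], [2.0, 4.0])
model_b = base.clone().fit([1.0], [3.0])
```

Because fit is only ever called on clones, base keeps its untrained weight and every subset is evaluated from an identical starting point.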

abstract fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weights: Tensor | None = None, **kwargs) Self#

Fits the model on the training data.

Parameters#

x_train : torch.Tensor | Dataset

Data covariates

y_train : torch.Tensor | Dataset

Data labels

args : tuple[Any]

Additional positional args

sample_weights : torch.Tensor, optional

Weights associated with each data point, must be passed in as a keyword argument, by default None

kwargs : dict[str, Any]

Additional keyword args

Returns#

self : object

Returns self for api consistency with sklearn.

abstract predict(x: Tensor | Dataset, *args, **kwargs) Tensor#

Predict the label from the input covariates data.

Parameters#

x : torch.Tensor | Dataset

Input data covariates

Returns#

torch.Tensor

Output predictions based on the input

class opendataval.model.api.RegressionSkLearnWrapper(base_model, *args, **kwargs)#

Bases: Model

Wrapper for sk-learn regression models.

Example:

wrapped = RegressionSkLearnWrapper(LinearRegression)

Parameters#

base_model : BaseModel

Any sk-learn model whose fit method supports sample_weight

fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, *args, sample_weight: Tensor | None = None, **kwargs)#

Fits the model on the training data.

Fits a sk-learn wrapped regression Model. If there is insufficient data to fit a regression (such as len(x_train)==0), uses a DummyRegressor that predicts np.zeros((num_samples, self.num_classes)).

Parameters#

x_train : torch.Tensor | Dataset

Data covariates

y_train : torch.Tensor | Dataset

Data labels

args : tuple[Any]

Additional positional args

sample_weight : torch.Tensor, optional

Weights associated with each data point, must be passed in as a keyword argument, by default None

kwargs : dict[str, Any]

Additional keyword args

predict(x: Tensor | Dataset) Tensor#

Predict values from sk-learn regression model.

Makes a prediction based on the input tensor. Uses the .predict(x) method on sk-learn regression models. Output dim will match self.num_classes

Parameters#

x : torch.Tensor | Dataset

Input tensor

Returns#

torch.Tensor

Output tensor

class opendataval.model.api.TorchClassMixin(*args, **kwargs)#

Bases: TorchModel

Classifier Mixin for Torch Neural Networks.

fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, sample_weight: Tensor | None = None, batch_size: int = 32, epochs: int = 1, lr: float = 0.01)#

Fits the model on the training data.

Fits a torch classifier Model object using the ADAM optimizer and categorical cross-entropy loss.

Parameters#

x_train : torch.Tensor | Dataset

Data covariates

y_train : torch.Tensor | Dataset

Data labels

sample_weight : torch.Tensor, optional

Weights associated with each data point, by default None

batch_size : int, optional

Training batch size, by default 32

epochs : int, optional

Number of training epochs, by default 1

lr : float, optional

Learning rate for the Model, by default 0.01
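To make the role of sample_weight concrete, here is a plain-Python sketch of a per-sample weighted cross-entropy loss for one batch. The function names are hypothetical, and averaging over the batch size (rather than over the sum of the weights) is an assumption; this page does not say how the library normalizes the weighted loss.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, batch_labels, sample_weight=None) -> float:
    """Mean over the batch of weight_i * -log(softmax(logits_i)[label_i])."""
    if sample_weight is None:
        sample_weight = [1.0] * len(batch_labels)
    losses = [
        -w * math.log(softmax(logits)[label])
        for logits, label, w in zip(batch_logits, batch_labels, sample_weight)
    ]
    return sum(losses) / len(losses)


logits = [[2.0, 0.0], [0.0, 0.0]]
labels = [0, 1]
unweighted = weighted_cross_entropy(logits, labels)
# A zero weight removes the second point's contribution entirely.
first_only = weighted_cross_entropy(logits, labels, sample_weight=[1.0, 0.0])
```

Setting a weight to zero is equivalent to dropping that point from the loss, which is why weighted fitting is a convenient way to express "train on this subset" without rebuilding the data set.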

training: bool#
class opendataval.model.api.TorchModel(*args, **kwargs)#

Bases: Model, Module

Torch Models have a device they belong to and shared behavior.

property device#
class opendataval.model.api.TorchPredictMixin(*args, **kwargs)#

Bases: TorchModel

Torch .predict() method mixin for Torch Neural Networks.

predict(x: Tensor | Dataset) Tensor#

Predict output from input tensor/data set.

Parameters#

x : torch.Tensor | Dataset

Input covariates

Returns#

torch.Tensor

Predicted tensor output

training: bool#
class opendataval.model.api.TorchRegressMixin(*args, **kwargs)#

Bases: TorchModel

Regressor Mixin for Torch Neural Networks.

fit(x_train: Tensor | Dataset, y_train: Tensor | Dataset, sample_weight: Tensor | None = None, batch_size: int = 32, epochs: int = 1, lr: float = 0.01)#

Fits the regression model on the training data.

Fits a torch regression Model object using ADAM optimizer and MSE loss.

Parameters#

x_train : torch.Tensor | Dataset

Data covariates

y_train : torch.Tensor | Dataset

Data labels

sample_weight : torch.Tensor, optional

Weights associated with each data point, by default None

batch_size : int, optional

Training batch size, by default 32

epochs : int, optional

Number of training epochs, by default 1

lr : float, optional

Learning rate for the Model, by default 0.01
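The batch_size and epochs parameters together determine how many optimizer steps a call to fit performs. The sketch below counts them with a hypothetical iterate_batches helper; it assumes no shuffling and that the final partial batch is kept, neither of which is confirmed by this page.

```python
import math


def iterate_batches(num_points: int, batch_size: int = 32, epochs: int = 1):
    """Yield (epoch, indices) pairs covering every data point once per epoch."""
    for epoch in range(epochs):
        for start in range(0, num_points, batch_size):
            yield epoch, list(range(start, min(start + batch_size, num_points)))


batches = list(iterate_batches(num_points=100, batch_size=32, epochs=2))
steps = len(batches)                      # one optimizer step per batch
last_batch_size = len(batches[-1][1])     # the remainder batch is smaller
expected_steps = 2 * math.ceil(100 / 32)  # epochs * ceil(n / batch_size)
```

With the defaults (batch_size=32, epochs=1) a data set of 100 points therefore yields only four updates, which is worth keeping in mind when comparing against models trained to convergence.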

training: bool#
opendataval.model.api.to_numpy(tensors: tuple[Tensor]) tuple[Tensor]#

Mini function to move tensor to CPU for sk-learn.

opendataval.model.bert module#

class opendataval.model.bert.BertClassifier(pretrained_model_name: str = 'distilbert-base-uncased', num_classes: int = 2, dropout_rate: float = 0.2, num_train_layers: int = 2)#

Bases: Model, Module

Fine tune a pre-trained DistilBERT model on a classification task.

DistilBERT is a smaller, lighter version of BERT meant to be fine-tuned on other language tasks.

Parameters#

pretrained_model_name : str

Hugging Face model directory containing the pretrained BERT model, by default “distilbert-base-uncased” [2]

num_classes : int, optional

Number of prediction classes, by default 2

dropout_rate : float, optional

Dropout rate for the embeddings of BERT, helps in fine-tuning, by default 0.2

num_train_layers : int, optional

Number of BERT layers to fine-tune. Minimum number is 1, by default 2

fit(x_train: Dataset[str | list[str]], y_train: Tensor, sample_weight: Tensor | None = None, batch_size: int = 32, epochs: int = 1, lr: float = 0.001)#

Fit the model on the training data.

Fine tunes a pre-trained BERT model on an input Sequence[str] by tokenizing the inputs and then fine tuning the last few layers of BERT and the classifier.

Parameters#

x_train : Dataset[str | list[str]]

Training data set of sentences or list[str] to be classified

y_train : torch.Tensor

Data labels

sample_weight : torch.Tensor, optional

Weights associated with each data point, by default None

batch_size : int, optional

Training batch size, by default 32

epochs : int, optional

Number of training epochs, by default 1

lr : float, optional

Learning rate for the Model, by default 0.001

Returns#

self : object

Trained BERT classifier

forward(input_ids: Tensor, attention_mask: Tensor | None = None)#

Forward pass through DistilBert with inputs from DistilBERT tokenizer output.

NOTE this is only applicable for a DistilBERT model that doesn’t require token_type_ids.

Parameters#

input_ids : torch.Tensor

List of token ids to be fed to a model. [Input IDs?](https://huggingface.co/transformers/glossary#input-ids)

attention_mask : torch.Tensor, optional

List of indices specifying which tokens should be attended to by the model, by default None. [Attention?](https://huggingface.co/transformers/glossary#attention-mask)

Returns#

torch.Tensor

Predicted labels for the classification problem

predict(x: Dataset[str | list[str]])#

Predict output from input sentences/tokens.

Parameters#

x : Dataset[str | list[str]]

Input data set of sentences or list[str]

Returns#

torch.Tensor

Predicted labels as a tensor

tokenize(sentences: Sequence[str | list[str]]) TensorDataset#

Convert sequence of sentences or tokens into DistilBERT inputs.

Given a sequence of sentences or tokens, computes the input_ids, and attention_masks and loads them on their respective tensor device. Any changes made to the tokenizer should be reflected here and the .forward() method.

Parameters#

sentences : Sequence[str | list[str]]

Sequence of sentences or tokens to be transformed into inputs for BERT.

Returns#

TensorDataset

Two tensors representing input_ids and attention_masks. For more detail on what each of these represents, see the glossary links under .forward().

If using a non-DistilBERT tokenizer, see below. The token type ids aren’t needed for DistilBERT models.

- token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or if “token_type_ids” is in self.model_input_names). [Type IDs?](https://huggingface.co/transformers/glossary#token-type-ids)
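To illustrate what input_ids and attention_masks contain, here is a toy whitespace tokenizer in plain Python. It is not the DistilBERT tokenizer: toy_tokenize, the vocabulary, and the PAD_ID and UNK_ID values are all hypothetical, chosen only to show how padding and the mask line up.

```python
PAD_ID = 0  # hypothetical id reserved for padding
UNK_ID = 1  # hypothetical id for out-of-vocabulary tokens


def toy_tokenize(sentences, vocab):
    """Return (input_ids, attention_mask) padded to the longest sentence."""
    # Accept either raw strings or pre-split token lists.
    token_lists = [
        s.lower().split() if isinstance(s, str) else list(s) for s in sentences
    ]
    max_len = max(len(tokens) for tokens in token_lists)
    input_ids, attention_mask = [], []
    for tokens in token_lists:
        ids = [vocab.get(tok, UNK_ID) for tok in tokens]
        padding = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * padding)
    return input_ids, attention_mask


vocab = {"data": 2, "valuation": 3, "is": 4, "useful": 5}
input_ids, attention_mask = toy_tokenize(
    ["Data valuation is useful", ["data", "quality"]], vocab
)
```

Both outputs have the same rectangular shape, which is what allows them to be stacked into tensors and paired in a TensorDataset.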

opendataval.model.grad module#

class opendataval.model.grad.GradientModel#

Bases: Model

Provides access to gradients of a Model

TODO Some data evaluators may benefit from higher-order gradients or hessians.

abstract grad(x_data: Tensor | Dataset, y_train: Tensor | Dataset, *args, **kwargs) Iterator[tuple[Tensor, ...]]#

Given input data, iterates through the computed gradients of the model.

Yields a tuple with the gradients for each layer of the model, one tuple per input data point. The data the underlying model is trained on does not have to be the data the gradients are computed for. An iterator is used because storing the computed gradients for every data point uses a lot of memory.

Parameters#

x_data : Union[torch.Tensor, Dataset]

Data covariates

y_data : Union[torch.Tensor, Dataset]

Data labels

Yields#

Iterator[tuple[torch.Tensor, …]]

Computed gradients (for each layer as tuple) yielded by data point in order
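The iterator contract can be sketched without torch. The generator below is a hypothetical stand-in that uses a linear model with squared-error loss, so the "layers" are just the weight vector and the bias; it yields one gradient tuple per data point and computes nothing until it is asked for.

```python
from typing import Iterator


def grad(weights, bias, x_data, y_data) -> Iterator[tuple[list[float], float]]:
    """Lazily yield (d loss / d weights, d loss / d bias) per data point.

    Uses a linear model with squared-error loss: loss = (w . x + b - y) ** 2.
    """
    for x, y in zip(x_data, y_data):
        residual = sum(w * xi for w, xi in zip(weights, x)) + bias - y
        yield [2 * residual * xi for xi in x], 2 * residual


x_data = [[1.0, 2.0], [0.0, 1.0]]
y_data = [1.0, 3.0]
gradients = grad(weights=[1.0, 1.0], bias=0.0, x_data=x_data, y_data=y_data)

first = next(gradients)   # only the first data point has been processed so far
second = next(gradients)  # the second is computed on demand
```

A consumer that only needs a running statistic, such as a sum of gradient norms, can iterate once and discard each tuple, keeping memory use independent of the data set size.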

class opendataval.model.grad.TorchGradMixin(*args, **kwargs)#

Bases: GradientModel, TorchModel

Gradient Mixin for Torch Neural Networks.

grad(x_data: Tensor | Dataset, y_data: Tensor | Dataset) Iterator[tuple[Tensor, ...]]#

Given input data, yields the computed gradients for a torch model.

Parameters#

x_data : Union[torch.Tensor, Dataset]

Data covariates

y_data : Union[torch.Tensor, Dataset]

Data labels

Yields#

Iterator[tuple[torch.Tensor, …]]

Computed gradients (for each layer as tuple) yielded by data point in order

training: bool#

opendataval.model.lenet module#

class opendataval.model.lenet.LeNet(num_classes: int, gray_scale: bool = True)#

Bases: TorchClassMixin, TorchPredictMixin

LeNet-5 convolutional neural net classifier.

Consists of two 5x5 convolution kernels and an MLP classifier. LeNet-5 was one of the earliest CNNs and was typically applied to digit analysis. LeNet-5 can be applied to higher-dimensional (such as color) images but doesn’t generalize particularly well to them.

Parameters#

num_classes : int

Number of prediction classes

gray_scale : bool, optional

Whether the input image is grayscale. LeNet has been noted to not perform as well with color, so disable gray_scale at your own risk, by default True

forward(x: Tensor)#

Forward pass of LeNet-5.

opendataval.model.logistic_regression module#

class opendataval.model.logistic_regression.LogisticRegression(input_dim: int, num_classes: int)#

Bases: TorchClassMixin, TorchPredictMixin, TorchGradMixin

Initialize LogisticRegression

Parameters#

input_dim : int

Size of the input dimension of the LogisticRegression

num_classes : int

Size of the output dimension of the LR, outputs selection probabilities

forward(x: Tensor) Tensor#

Forward pass of Logistic Regression.

Parameters#

x : torch.Tensor

Input tensor

Returns#

torch.Tensor

Output Tensor of logistic regression

opendataval.model.mlp module#

class opendataval.model.mlp.ClassifierMLP(input_dim: int, num_classes: int, layers: int = 5, hidden_dim: int = 25, act_fn: Callable | None = None)#

Bases: TorchClassMixin, TorchPredictMixin, TorchGradMixin

Initializes the Multilayer Perceptron Classifier.

Parameters#

input_dim : int

Size of the input dimension of the MLP

num_classes : int

Size of the output dimension of the MLP, outputs selection probabilities

layers : int, optional

Number of layers for the MLP, by default 5

hidden_dim : int, optional

Hidden dimension for the MLP, by default 25

act_fn : Callable, optional

Activation function for MLP, if None, set to nn.ReLU, by default None

forward(x: Tensor) Tensor#

Forward pass of MLP Neural Network.

Parameters#

x : torch.Tensor

Input tensor

Returns#

torch.Tensor

Output Tensor of MLP

class opendataval.model.mlp.RegressionMLP(input_dim: int, num_classes: int = 1, layers: int = 5, hidden_dim: int = 25, act_fn: Callable | None = None)#

Bases: TorchRegressMixin, TorchPredictMixin, TorchGradMixin

Initializes the Multilayer Perceptron Regression.

Parameters#

input_dim : int

Size of the input dimension of the MLP

num_classes : int, optional

Size of the output dimension of the MLP, >1 means multi-dimensional output, by default 1

layers : int, optional

Number of layers for the MLP, by default 5

hidden_dim : int, optional

Hidden dimension for the MLP, by default 25

act_fn : Callable, optional

Activation function for MLP, if None, set to nn.ReLU, by default None

forward(x: Tensor) Tensor#

Forward pass of Multilayer Perceptron.

Parameters#

x : torch.Tensor

Input tensor

Returns#

torch.Tensor

Output Tensor of MLP

Module contents#

Prediction models to be trained, used for prediction, and evaluated.

Models#

Model is an ABC used to take an existing model and make it compatible with the DataEvaluator and other related objects.

API#

Model()

Abstract class of Models.

GradientModel()

Provides access to gradients of a Model

ModelFactory(model_name[, fetcher, device])

Factory to create prediction models from specified presets

Torch Mixins#

TorchClassMixin(*args, **kwargs)

Classifier Mixin for Torch Neural Networks.

TorchRegressMixin(*args, **kwargs)

Regressor Mixin for Torch Neural Networks.

TorchPredictMixin(*args, **kwargs)

Torch .predict() method mixin for Torch Neural Networks.

TorchGradMixin(*args, **kwargs)

Gradient Mixin for Torch Neural Networks.

Scikit-learn wrappers#

ClassifierSkLearnWrapper(base_model, ...)

Wrapper for sk-learn classifiers that can have weighted fit methods.

ClassifierUnweightedSkLearnWrapper(...)

Wrapper for sk-learn classifiers that don't have weighted fit methods.

RegressionSkLearnWrapper(base_model, *args, ...)

Wrapper for sk-learn regression models.

Default Hyperparameters#

\[
\begin{array}{llll}
\hline
\textbf{Algorithm} & \textbf{Hyperparameter} & \textbf{Default Value} & \textbf{Keyword argument} \\
\hline
\mbox{Logistic Regression} & \mbox{epochs} & 1 & \mbox{yes} \\
 & \mbox{batch size} & 32 & \mbox{yes} \\
 & \mbox{learning rate} & 0.01 & \mbox{yes} \\
 & \mbox{optimizer} & \mbox{ADAM} & \mbox{no} \\
 & \mbox{loss function} & \mbox{Cross Entropy} & \mbox{no} \\
\hline
\mbox{MLP Classification} & \mbox{epochs} & 1 & \mbox{yes} \\
 & \mbox{batch size} & 32 & \mbox{yes} \\
 & \mbox{learning rate} & 0.01 & \mbox{yes} \\
 & \mbox{optimizer} & \mbox{ADAM} & \mbox{no} \\
 & \mbox{loss function} & \mbox{Cross Entropy} & \mbox{no} \\
\hline
\mbox{BERT Classification} & \mbox{epochs} & 1 & \mbox{yes} \\
 & \mbox{batch size} & 32 & \mbox{yes} \\
 & \mbox{learning rate} & 0.001 & \mbox{yes} \\
 & \mbox{optimizer} & \mbox{ADAMW} & \mbox{no} \\
 & \mbox{loss function} & \mbox{Cross Entropy} & \mbox{no} \\
\hline
\mbox{LeNet-5 Classification} & \mbox{epochs} & 1 & \mbox{yes} \\
 & \mbox{batch size} & 32 & \mbox{yes} \\
 & \mbox{learning rate} & 0.01 & \mbox{yes} \\
 & \mbox{optimizer} & \mbox{ADAM} & \mbox{no} \\
 & \mbox{loss function} & \mbox{Cross Entropy} & \mbox{no} \\
\hline
\mbox{MLP Regression} & \mbox{epochs} & 1 & \mbox{yes} \\
 & \mbox{batch size} & 32 & \mbox{yes} \\
 & \mbox{learning rate} & 0.01 & \mbox{yes} \\
 & \mbox{optimizer} & \mbox{ADAM} & \mbox{no} \\
 & \mbox{loss function} & \mbox{Mean Square Error} & \mbox{no} \\
\hline
\end{array}
\]
opendataval.model.ModelFactory(model_name: str, fetcher: DataFetcher | None = None, device: device = device(type='cpu'), *args, **kwargs) Model#

Factory to create prediction models from specified presets

Model factory that creates the specified model based on the input parameters. It is recommended to import the specific model and specify additional arguments instead of relying on the factory.

Parameters#

model_name : str

Name of prediction model

fetcher : DataFetcher, optional

Data fetcher that supplies the dimensions of the covariates and labels (typically the shape besides the first dimension), by default None

device : torch.device, optional

Tensor device for acceleration, some models do not use this argument, by default torch.device(“cpu”)

args : tuple[Any]

Additional positional arguments passed to the Model constructor

kwargs : dict[str, Any]

Additional keyword arguments passed to the Model constructor

Returns#

Model

Preset model with the specified dimensions on the specified tensor device

Raises#

ValueError

Raises exception when model name is not matched
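The name lookup and the ValueError behavior can be sketched as follows. This is not the library's factory: model_factory_sketch, the PRESETS table, the two placeholder classes, and the error message text are all hypothetical, and the real presets and their arguments are not shown on this page.

```python
class LogisticRegressionSketch:
    """Placeholder with the same constructor arguments as LogisticRegression."""

    def __init__(self, input_dim: int, num_classes: int):
        self.input_dim, self.num_classes = input_dim, num_classes


class RegressionMLPSketch:
    """Placeholder with a subset of RegressionMLP's constructor arguments."""

    def __init__(self, input_dim: int, num_classes: int = 1, layers: int = 5):
        self.input_dim, self.num_classes, self.layers = input_dim, num_classes, layers


# Hypothetical preset table keyed by lowercase model name.
PRESETS = {
    "logisticregression": LogisticRegressionSketch,
    "regressionmlp": RegressionMLPSketch,
}


def model_factory_sketch(model_name: str, *args, **kwargs):
    """Look up a preset by name and forward the remaining arguments to it."""
    try:
        model_cls = PRESETS[model_name.lower()]
    except KeyError:
        raise ValueError(f"{model_name} is not a valid predefined model") from None
    return model_cls(*args, **kwargs)


model = model_factory_sketch("RegressionMLP", 8, num_classes=1, layers=3)

try:
    model_factory_sketch("not-a-model")
    error_message = None
except ValueError as err:
    error_message = str(err)
```

Importing a model class directly, as the description above recommends, makes every constructor argument explicit at the call site, whereas a factory hides which defaults were applied.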