deepmol.models package

Submodules

deepmol.models.base_models module

basic_multitask_dnn(input_shape, task_names, losses, metrics)[source]

create_dense_model(input_dim: int = 1024, n_hidden_layers: int = 1, layers_units: Optional[List[int]] = None, dropouts: Optional[List[float]] = None, activations: Optional[List[str]] = None, batch_normalization: Optional[List[bool]] = None, l1_l2: Optional[List[float]] = None, loss: str = 'binary_crossentropy', optimizer: str = 'adam', metrics: Optional[List[str]] = None)[source]

Builds a dense neural network model.

Parameters
  • input_dim (int) – Number of features.

  • n_hidden_layers (int) – Number of hidden layers.

  • layers_units (List[int]) – Number of units in each hidden layer.

  • dropouts (List[float]) – Dropout rate in each hidden layer.

  • activations (List[str]) – Activation function in each hidden layer.

  • batch_normalization (List[bool]) – Whether to use batch normalization in each hidden layer.

  • l1_l2 (List[float]) – L1 and L2 regularization in each hidden layer.

  • loss (str) – Loss function.

  • optimizer (str) – Optimizer.

  • metrics (List[str]) – Metrics to be evaluated by the model during training and testing.

Returns

model – Dense neural network model.

Return type

Sequential

make_cnn_model(input_dim: int = 1024, g_noise: float = 0.05, DENSE: int = 128, DROPOUT: float = 0.5, C1_K: int = 8, C1_S: int = 32, C2_K: int = 16, C2_S: int = 32, activation: str = 'relu', loss: str = 'binary_crossentropy', optimizer: str = 'adadelta', learning_rate: float = 0.01, metrics: Union[str, List[str]] = 'accuracy')[source]

Builds a 1D convolutional neural network model.

Parameters
  • input_dim (int) – Number of features.

  • g_noise (float) – Gaussian noise.

  • DENSE (int) – Number of units in the dense layer.

  • DROPOUT (float) – Dropout rate.

  • C1_K (int) – The dimensionality of the output space (i.e. the number of output filters in the convolution) of the first convolutional layer.

  • C1_S (int) – Kernel size specifying the length of the 1D convolution window of the first convolutional layer.

  • C2_K (int) – The dimensionality of the output space (i.e. the number of output filters in the convolution) of the second convolutional layer.

  • C2_S (int) – Kernel size specifying the length of the 1D convolution window of the second convolutional layer.

  • activation (str) – Activation function of the Conv1D and Dense layers.

  • loss (str) – Loss function.

  • optimizer (str) – Optimizer.

  • learning_rate (float) – Learning rate.

  • metrics (Union[str, List[str]]) – Metrics to be evaluated by the model during training and testing.
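The split between filter count (C1_K, C2_K) and kernel size (C1_S, C2_S) is easy to mix up, so a small worked example may help. The sketch below applies the standard Keras Conv1D output-length rules. The helper name conv1d_output_length is hypothetical, and the stride of 1 and 'valid' padding are assumptions, since the text above does not state which values make_cnn_model uses.

```python
import math


def conv1d_output_length(input_length: int, kernel_size: int,
                         stride: int = 1, padding: str = "valid") -> int:
    """Length of a 1D convolution output along the sequence axis."""
    if padding == "valid":
        # No padding: the window must fit entirely inside the input.
        return (input_length - kernel_size) // stride + 1
    if padding == "same":
        # Zero padding keeps the length (divided by the stride, rounded up).
        return math.ceil(input_length / stride)
    raise ValueError(f"unsupported padding: {padding!r}")


# With the documented defaults (input_dim=1024, C1_K=8, C1_S=32) and ASSUMING
# stride 1 and 'valid' padding, the first convolution would produce
# first_len positions with C1_K = 8 channels each.
first_len = conv1d_output_length(1024, 32)
```

Under these assumptions the kernel size controls how much the sequence axis shrinks, while the filter count only sets the number of output channels.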

rf_model_builder(n_estimators: int = 100, max_features: Union[int, float, str] = 'auto', class_weight: Optional[dict] = None)[source]

Builds a random forest model.

Parameters
  • n_estimators (int) – Number of trees in the forest.

  • max_features (Union[int, float, str]) –

    The number of features to consider when looking for the best split.
    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If “auto”, then max_features=sqrt(n_features).

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • class_weight (dict) – Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.

Returns

rf_model – Random forest model.

Return type

RandomForestClassifier
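The max_features rules listed above can be sketched as a small resolver. This is an illustration of the documented rules only, not the scikit-learn or DeepMol implementation; the function name resolve_max_features is hypothetical, and the lower bound of 1 for the "sqrt" and "log2" cases is an assumption.

```python
import math
from typing import Union


def resolve_max_features(max_features: Union[int, float, str, None],
                         n_features: int) -> int:
    """Turn a max_features setting into a concrete number of features."""
    if max_features is None:
        return n_features
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            # Assumption: never go below one feature.
            return max(1, int(math.sqrt(n_features)))
        if max_features == "log2":
            return max(1, int(math.log2(n_features)))
        raise ValueError(f"unknown max_features value: {max_features!r}")
    if isinstance(max_features, int):
        return max_features
    # Float: interpreted as a fraction of the available features.
    return int(max_features * n_features)
```

Note that "auto" and "sqrt" resolve to the same value here. Newer scikit-learn releases have deprecated "auto", so "sqrt" is likely the safer spelling when the installed version is unknown.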

svm_model_builder(C: float = 1.0, gamma: Union[str, float] = 'auto', kernel: Union[str, callable] = 'rfb')[source]

Builds a support vector machine model.

Parameters
  • C (float) – Penalty parameter C of the error term.

  • gamma (Union[str, float]) –

    Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.
    • if ‘scale’ is passed then it uses 1 / (n_features * X.var()) as value of gamma;

    • if ‘auto’, uses 1 / n_features.

  • kernel (str) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable.

Returns

svm_model – Support vector machine model.

Return type

SVC
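The two string options for gamma can be checked by hand. The sketch below reproduces the formulas quoted above in plain Python; resolve_gamma is a hypothetical helper, and X is assumed to be a list of equally long feature rows. The variance is taken over all entries of X, which is what X.var() computes for a NumPy array.

```python
from statistics import pvariance
from typing import List, Union


def resolve_gamma(gamma: Union[str, float], X: List[List[float]]) -> float:
    """Compute the kernel coefficient implied by a gamma setting."""
    n_features = len(X[0])
    if gamma == "scale":
        # Population variance over every entry of X.
        flat = [value for row in X for value in row]
        return 1.0 / (n_features * pvariance(flat))
    if gamma == "auto":
        return 1.0 / n_features
    # A numeric gamma is used as given.
    return float(gamma)


X = [[0.0, 1.0], [2.0, 3.0]]  # 2 samples, 2 features, variance 1.25
gamma_scale = resolve_gamma("scale", X)  # 1 / (2 * 1.25)
gamma_auto = resolve_gamma("auto", X)    # 1 / 2
```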

deepmol.models.deepchem_models module

class DeepChemModel(model: Model, model_dir: Optional[str] = None, **kwargs)[source]

Bases: Model

Wrapper around DeepChem models. It allows DeepChem models to be trained on Dataset objects and evaluated with the metrics in Metrics.

cross_validate(dataset: Dataset, metric: Metric, splitter: Splitter, transformers: Optional[List[NormalizationTransformer]] = None, folds: int = 3)[source]

Cross validates the model on the specified dataset.

Parameters
  • dataset (Dataset) – Dataset to cross validate on.

  • metric (Metric) – Metric to evaluate the model on.

  • splitter (Splitter) – Splitter to split the dataset into train and test sets.

  • transformers (List[Transformer]) – Transformers that the input data has been transformed by.

  • folds (int) – Number of folds to use for cross validation.

Returns

A tuple of seven elements: the best model, the train score of the best model, the test score of the best model, the train scores of all folds, the test scores of all folds, the average train score across folds and the average test score across folds.

Return type

Tuple[DeepChemModel, float, float, List[float], List[float], float, float]
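To make the seven-element return type concrete, here is a minimal, library-free sketch of k-fold cross-validation that returns a tuple of the same shape. It is not DeepMol's implementation: MeanRegressor, neg_mae and the interleaved fold assignment are hypothetical stand-ins, and picking the best model by test score is an assumption.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


class MeanRegressor:
    """Toy stand-in model: always predicts the mean of the training labels."""

    def fit(self, X: Sequence, y: Sequence[float]) -> "MeanRegressor":
        self.mean_ = mean(y)
        return self

    def predict(self, X: Sequence) -> List[float]:
        return [self.mean_ for _ in X]


def neg_mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Negative mean absolute error, so that higher is better."""
    return -mean(abs(t - p) for t, p in zip(y_true, y_pred))


def cross_validate(make_model: Callable[[], MeanRegressor], X: Sequence,
                   y: Sequence[float], score: Callable, folds: int = 3
                   ) -> Tuple[MeanRegressor, float, float,
                              List[float], List[float], float, float]:
    indices = list(range(len(X)))
    # Interleaved fold assignment: fold i holds samples i, i + folds, ...
    fold_ids = [indices[i::folds] for i in range(folds)]
    train_scores, test_scores = [], []
    best = None
    for test_idx in fold_ids:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        model = make_model().fit(X_tr, y_tr)
        train_score = score(y_tr, model.predict(X_tr))
        test_score = score(y_te, model.predict(X_te))
        train_scores.append(train_score)
        test_scores.append(test_score)
        # Assumption: the "best" model is the one with the highest test score.
        if best is None or test_score > best[2]:
            best = (model, train_score, test_score)
    return (best[0], best[1], best[2], train_scores, test_scores,
            mean(train_scores), mean(test_scores))


X = [[float(i)] for i in range(6)]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
result = cross_validate(MeanRegressor, X, y, neg_mae, folds=3)
```

Unpacking the result as best_model, best_train, best_test, train_scores, test_scores, avg_train, avg_test keeps call sites readable.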

evaluate(dataset: Dataset, metrics: List[Metric], per_task_metrics: bool = False)[source]

Evaluates the performance of the model on the provided dataset.

Parameters
  • dataset (Dataset) – Dataset to evaluate the model on.

  • metrics (List[Metric]) – Metrics to evaluate the model on.

  • per_task_metrics (bool) – If true, return computed metric for each task on multitask dataset.

Returns

multitask_scores: dict

Dictionary mapping names of metrics to metric scores.

all_task_scores: dict

If per_task_metrics == True, then returns a second dictionary of scores for each task separately.

Return type

Tuple[Dict, Dict]

fit(dataset: Dataset) None[source]

Fits DeepChemModel to data.

Parameters

dataset (Dataset) – The Dataset to train this model on.

fit_on_batch(X: Sequence, y: Sequence, w: Sequence)[source]

Fits the model on a batch of data.

Parameters
  • X (Sequence) – The input data.

  • y (Sequence) – The output data.

  • w (Sequence) – The weights for the data.

get_num_tasks() int[source]

Returns the number of tasks of the model.

Returns

The number of tasks of the model.

Return type

int

get_task_type() str[source]

Returns the task type of the model.

Returns

The task type of the model.

Return type

str

predict(dataset: Dataset, transformers: Optional[List[NormalizationTransformer]] = None) ndarray[source]

Makes predictions on dataset.

Parameters
  • dataset (Dataset) – Dataset to make prediction on.

  • transformers (List[Transformer]) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.

Returns

The return value of the predict method of the wrapped DeepChem model.

Return type

np.ndarray
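The role of the transformers argument, undoing transformations on the model output, can be illustrated without any library. The sketch below is an assumption-laden toy: ZScoreTransformer and undo_transforms are hypothetical names, and applying the transformers in reverse order mirrors how inverse operations are usually composed rather than a confirmed detail of this method.

```python
from statistics import mean, pstdev
from typing import List, Sequence


class ZScoreTransformer:
    """Toy label normalizer: y -> (y - mean) / std, with an inverse."""

    def __init__(self, y: Sequence[float]) -> None:
        self.mean = mean(y)
        self.std = pstdev(y)

    def transform(self, y: Sequence[float]) -> List[float]:
        return [(value - self.mean) / self.std for value in y]

    def untransform(self, y: Sequence[float]) -> List[float]:
        return [value * self.std + self.mean for value in y]


def undo_transforms(predictions: Sequence[float],
                    transformers: Sequence[ZScoreTransformer]) -> List[float]:
    """Map predictions back to the original label scale."""
    restored = list(predictions)
    # The last transformation applied is the first one undone.
    for transformer in reversed(transformers):
        restored = transformer.untransform(restored)
    return restored


y = [1.0, 2.0, 3.0, 4.0]
scaler = ZScoreTransformer(y)
normalized = scaler.transform(y)  # what a model trained on scaled labels sees
restored = undo_transforms(normalized, [scaler])
```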

predict_on_batch(dataset: Dataset) ndarray[source]

Makes predictions on batch of data.

Parameters

dataset (Dataset) – Dataset to make prediction on.

reload()[source]

Loads deepchem model from joblib file on disk.

save()[source]

Saves deepchem model to disk using joblib.

generate_sequences(epochs: int, train_smiles: List[Union[str, int]])[source]

Function to generate the input/output pairs for SeqToSeq model. Taken from DeepChem tutorials.

Parameters
  • epochs (int) – Number of epochs to train the model.

  • train_smiles (List[str]) – The ids of the samples in the dataset (smiles)

Return type

Generator that yields pairs of SMILES strings, epochs × len(train_smiles) pairs in total.
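A minimal sketch of such a generator is shown below. It follows the pattern used in the DeepChem SeqToSeq tutorial, where each SMILES string serves as both input and target; that DeepMol's function behaves identically is an assumption.

```python
from typing import Iterator, List, Tuple


def generate_sequences(epochs: int,
                       train_smiles: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (input, output) pairs for an autoencoder-style SeqToSeq model."""
    for _ in range(epochs):
        for smiles in train_smiles:
            # Input and target are the same string.
            yield smiles, smiles


pairs = list(generate_sequences(2, ["CCO", "C"]))
```

Because it is a generator, the full list of epochs × len(train_smiles) pairs is never held in memory unless the caller materializes it, as the last line does for illustration.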

deepmol.models.ensembles module

class Ensemble(models: List[Model])[source]

Bases: ABC

Abstract class for ensembles of models.

evaluate(dataset: Dataset, metrics: List[Metric], per_task_metrics: bool = False, n_classes: int = 2)[source]

Evaluates the performance of this model on specified dataset.

Parameters
  • dataset (Dataset) – Dataset object.

  • metrics (List[Metric]) – The set of metrics provided.

  • per_task_metrics (bool) – If true, return computed metric for each task on multitask dataset.

  • n_classes (int) – If specified, will use n_classes as the number of unique classes.

Returns

  • multitask_scores (dict) – Dictionary mapping names of metrics to metric scores.

  • all_task_scores (dict, optional) – If per_task_metrics == True is passed as a keyword argument, then returns a second dictionary of scores for each task separately.

fit(dataset: Dataset)[source]

Fits the models to the specified dataset.

abstract predict(dataset: Dataset)[source]

Predicts the labels for the specified dataset.

class VotingClassifier(models: List[Model], voting: str = 'soft')[source]

Bases: Ensemble

VotingClassifier Ensemble. It uses a voting strategy to predict the labels of a dataset.

predict(dataset: Dataset, proba: bool = False)[source]

Predicts the labels for the specified dataset.

Parameters
  • dataset (Dataset) – Dataset object.

  • proba (bool) – If true, returns the probabilities instead of class labels.

Returns

final_result – Predicted labels or probabilities.

Return type

np.ndarray
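The difference between the two common voting strategies can be sketched in plain Python. This is a generic illustration of soft and hard voting, not the DeepMol implementation; soft_vote, hard_vote and the probas[model][sample][class] layout are hypothetical.

```python
from collections import Counter
from typing import List


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def soft_vote(probas: List[List[List[float]]]) -> List[List[float]]:
    """Average class probabilities across models, per sample."""
    n_models = len(probas)
    return [[sum(per_class) / n_models for per_class in zip(*per_sample)]
            for per_sample in zip(*probas)]


def hard_vote(probas: List[List[List[float]]]) -> List[int]:
    """Majority vote over each model's most likely class, per sample."""
    labels = []
    for per_sample in zip(*probas):
        votes = Counter(argmax(list(p)) for p in per_sample)
        labels.append(votes.most_common(1)[0][0])
    return labels


# Three models, one sample, two classes: one confident model disagrees
# with two lukewarm ones.
probas = [[[0.9, 0.1]], [[0.4, 0.6]], [[0.4, 0.6]]]
soft_label = argmax(soft_vote(probas)[0])  # averaged probabilities favor class 0
hard_label = hard_vote(probas)[0]          # two of three votes favor class 1
```

The example is chosen so the strategies disagree: soft voting lets one confident model outweigh two uncertain ones, while hard voting counts each model equally.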

deepmol.models.keras_models module

class KerasModel(model_builder: callable, mode: str = 'classification', model_dir: Optional[str] = None, loss: str = 'binary_crossentropy', optimizer: str = 'adam', learning_rate: float = 0.001, epochs: int = 150, batch_size: int = 10, verbose: int = 0, **kwargs)[source]

Bases: Model

Wrapper around Keras models. It allows these models to be trained on Dataset objects.

cross_validate(dataset: Dataset, metric: Metric, folds: int = 3)[source]

Cross validates the model on a dataset.

Parameters
  • dataset (Dataset) – The Dataset to cross validate on.

  • metric (Metric) – The metric to use for cross validation.

  • folds (int) – The number of folds to use for cross validation.

Returns

A tuple of seven elements: the best model, the train score of the best model, the test score of the best model, the train scores of all folds, the test scores of all folds, the average train score across folds and the average test score across folds.

Return type

Tuple[KerasModel, float, float, List[float], List[float], float, float]

fit(dataset: Dataset, **kwargs) None[source]

Fits keras model to data.

Parameters
  • dataset (Dataset) – The Dataset to train this model on.

  • kwargs – Additional arguments to pass to fit method of the keras model.

fit_on_batch(X: Sequence, y: Sequence)[source]

Fits model on batch of data.

get_num_tasks() int[source]

Returns the number of tasks of the model.

get_task_type() str[source]

Returns the task type of the model.

predict(dataset: Dataset) ndarray[source]

Makes predictions on dataset.

Parameters

dataset (Dataset) – Dataset to make prediction on.

Returns

The predictions returned by the wrapped Keras model.

Return type

np.ndarray

predict_on_batch(X: Dataset) ndarray[source]

Makes predictions on batch of data.

Parameters

X (Dataset) – Dataset to make prediction on.

Returns

numpy array of predictions.

Return type

np.ndarray

reload() None[source]

Reloads the model from disk.

save() None[source]

Saves the model to disk.

deepmol.models.models module

class Model(model: Optional[BaseEstimator] = None, model_dir: Optional[str] = None, **kwargs)[source]

Bases: BaseEstimator

Abstract base class for ML/DL models.

evaluate(dataset: Dataset, metrics: Union[List[Metric], Metric], per_task_metrics: bool = False) Tuple[Dict, Union[None, Dict]][source]

Evaluates the performance of this model on specified dataset.

Parameters
  • dataset (Dataset) – Dataset object.

  • metrics (Union[List[Metric], Metric]) – The set of metrics provided.

  • per_task_metrics (bool) – If true, return computed metric for each task on multitask dataset.

  • kwargs – Additional keyword arguments to pass to Evaluator.compute_model_performance.

Returns

  • multitask_scores (dict) – Dictionary mapping names of metrics to metric scores.

  • all_task_scores (dict, optional) – If per_task_metrics == True is passed as a keyword argument, then returns a second dictionary of scores for each task separately.
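The shape of the two returned dictionaries can be sketched as follows. This is a simplified stand-in that works on plain lists rather than Dataset and Metric objects; the accuracy helper is hypothetical, and averaging per-task scores to obtain the multitask score is an assumption.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(y_true_by_task: List[Sequence[int]],
             y_pred_by_task: List[Sequence[int]],
             metrics: Dict[str, Callable],
             per_task_metrics: bool = False
             ) -> Tuple[Dict[str, float], Optional[Dict[str, List[float]]]]:
    multitask_scores: Dict[str, float] = {}
    all_task_scores: Dict[str, List[float]] = {}
    for name, metric in metrics.items():
        # One score per task, then a single aggregate per metric.
        task_scores = [metric(t, p)
                       for t, p in zip(y_true_by_task, y_pred_by_task)]
        all_task_scores[name] = task_scores
        # Assumption: the multitask score is the mean over tasks.
        multitask_scores[name] = mean(task_scores)
    return multitask_scores, (all_task_scores if per_task_metrics else None)


y_true = [[1, 0, 1, 1], [0, 0, 1, 1]]  # two tasks, four samples each
y_pred = [[1, 0, 0, 1], [0, 0, 1, 1]]
scores, per_task = evaluate(y_true, y_pred, {"accuracy": accuracy},
                            per_task_metrics=True)
```

With per_task_metrics left at False, the second element is None, matching the Tuple[Dict, Union[None, Dict]] annotation in the signature.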

fit(dataset: Dataset)[source]

Fits a model on data in a Dataset object.

Parameters

dataset (Dataset) – the Dataset to train on

fit_on_batch(X: Sequence, y: Sequence)[source]

Perform a single step of training.

Parameters
  • X (np.ndarray) – the inputs for the batch

  • y (np.ndarray) – the labels for the batch

static get_model_filename(model_dir: str) str[source]

Given model directory, obtain filename for the model itself.

Parameters

model_dir (str) – Path to directory where model is stored.

Returns

Path to model file.

Return type

str

get_num_tasks() int[source]

Get number of tasks.

static get_params_filename(model_dir: str) str[source]

Given model directory, obtain filename for the model parameters.

Parameters

model_dir (str) – Path to directory where model is stored.

Returns

Path to file where model parameters are stored.

Return type

str
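Both static helpers map a model directory to a file path inside it. The sketch below shows the general idea; the file names model.joblib and model_params.joblib are assumptions borrowed from the convention used by DeepChem and are not confirmed by the text above.

```python
import os

# Hypothetical file names; the real ones may differ.
MODEL_FILE = "model.joblib"
PARAMS_FILE = "model_params.joblib"


def get_model_filename(model_dir: str) -> str:
    """Path of the serialized model inside model_dir."""
    return os.path.join(model_dir, MODEL_FILE)


def get_params_filename(model_dir: str) -> str:
    """Path of the serialized model parameters inside model_dir."""
    return os.path.join(model_dir, PARAMS_FILE)


model_path = get_model_filename("my_model")
params_path = get_params_filename("my_model")
```

Keeping the two paths separate lets save and reload handle the fitted model and its construction parameters independently.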

get_task_type() str[source]

Currently models can only be classifiers or regressors.

predict(dataset: Dataset) ndarray[source]

Uses self to make predictions on provided Dataset object.

Parameters

dataset (Dataset) – Dataset to make prediction on

Returns

A numpy array of predictions.

Return type

np.ndarray

predict_on_batch(X: Sequence)[source]

Makes predictions on given batch of new data.

Parameters

X (np.ndarray) – array of features

predict_proba(dataset: Dataset) ndarray[source]

reload() None[source]

Reload trained model from disk.

save() None[source]

Function for saving models. Each subclass is responsible for overriding this method.

deepmol.models.sklearn_models module

class SklearnModel(model: BaseEstimator, mode: Optional[str] = None, model_dir: Optional[str] = None, **kwargs)[source]

Bases: Model

Wrapper around scikit-learn models. It allows scikit-learn models to be trained on Dataset objects and evaluated with the metrics in Metrics.

cross_validate(dataset: Dataset, metric: Metric, folds: int = 3)[source]

Performs cross-validation on a dataset.

Parameters
  • dataset (Dataset) – Dataset to perform cross-validation on.

  • metric (Metric) – Metric to evaluate model performance.

  • folds (int) – Number of folds to use for cross-validation.

Returns

A tuple of seven elements: the best model, the train score of the best model, the test score of the best model, the train scores of all folds, the test scores of all folds, the average train score across folds and the average test score across folds.

Return type

Tuple[SklearnModel, float, float, List[float], List[float], float, float]

fit(dataset: Dataset) None[source]

Fits scikit-learn model to data.

Parameters

dataset (Dataset) – The Dataset to train this model on.

Returns

The trained scikit-learn model.

Return type

BaseEstimator

fit_on_batch(X: Sequence, y: Sequence)[source]

Fits model on batch of data.

get_num_tasks() int[source]

Returns the number of tasks.

get_task_type() str[source]

Returns the task type of the model.

predict(dataset: Dataset) ndarray[source]

Makes predictions on dataset.

Parameters

dataset (Dataset) – Dataset to make prediction on.

Returns

The value is a return value of predict_proba or predict method of the scikit-learn model. If the scikit-learn model has both methods, the value is always a return value of predict_proba.

Return type

np.ndarray

predict_on_batch(dataset: Dataset) ndarray[source]

Makes predictions on batch of data.

Parameters

dataset (Dataset) – Dataset to make prediction on.

Returns

numpy array of predictions.

Return type

np.ndarray

reload()[source]

Loads scikit-learn model from joblib file on disk.

save()[source]

Saves scikit-learn model to disk using joblib.

Module contents