deepmol.models package

Submodules

deepmol.models.base_models module

basic_multitask_dnn(input_shape, task_names, losses, metrics)

create_dense_model(input_dim: int = 1024, n_hidden_layers: int = 1, layers_units: Optional[List[int]] = None, dropouts: Optional[List[float]] = None, activations: Optional[List[str]] = None, batch_normalization: Optional[List[bool]] = None, l1_l2: Optional[List[float]] = None, loss: str = 'binary_crossentropy', optimizer: str = 'adam', metrics: Optional[List[str]] = None)

Builds a dense neural network model.

Parameters

input_dim (int) – Number of features.
n_hidden_layers (int) – Number of hidden layers.
layers_units (List[int]) – Number of units in each hidden layer.
dropouts (List[float]) – Dropout rate in each hidden layer.
activations (List[str]) – Activation function in each hidden layer.
batch_normalization (List[bool]) – Whether to use batch normalization in each hidden layer.
l1_l2 (List[float]) – L1 and L2 regularization in each hidden layer.
loss (str) – Loss function.
optimizer (str) – Optimizer.
metrics (List[str]) – Metrics to be evaluated by the model during training and testing.

Returns

model – Dense neural network model.

Return type

Sequential

make_cnn_model(input_dim: int = 1024, g_noise: float = 0.05, DENSE: int = 128, DROPOUT: float = 0.5, C1_K: int = 8, C1_S: int = 32, C2_K: int = 16, C2_S: int = 32, activation: str = 'relu', loss: str = 'binary_crossentropy', optimizer: str = 'adadelta', learning_rate: float = 0.01, metrics: Union[str, List[str]] = 'accuracy')

Builds a 1D convolutional neural network model.

Parameters

input_dim (int) – Number of features.
g_noise (float) – Gaussian noise.
DENSE (int) – Number of units in the dense layer.
DROPOUT (float) – Dropout rate.
C1_K (int) – The dimensionality of the output space (i.e. the number of output filters in the convolution) of the first convolutional layer.
C1_S (int) – Kernel size specifying the length of the 1D convolution window of the first convolutional layer.
C2_K (int) – The dimensionality of the output space (i.e. the number of output filters in the convolution) of the second convolutional layer.
C2_S (int) – Kernel size specifying the length of the 1D convolution window of the second convolutional layer.
activation (str) – Activation function of the Conv1D and Dense layers.
loss (str) – Loss function.
optimizer (str) – Optimizer.
learning_rate (float) – Learning rate.
metrics (Union[str, List[str]]) – Metrics to be evaluated by the model during training and testing.

rf_model_builder(n_estimators: int = 100, max_features: Union[int, float, str] = 'auto', class_weight: Optional[dict] = None)

Builds a random forest model.

Parameters

n_estimators (int) – Number of trees in the forest.
max_features (Union[int, float, str]) –
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
- If “auto”, then max_features=sqrt(n_features).
- If “sqrt”, then max_features=sqrt(n_features).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features=n_features.
class_weight (dict) – Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.

Returns

rf_model – Random forest model.

Return type

RandomForestClassifier
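
The max_features rules listed for rf_model_builder can be made concrete with a small sketch. The helper resolve_max_features below is hypothetical; it is not part of deepmol or scikit-learn and only mirrors the rules as documented here. The lower bound of one feature is an added assumption.

```python
import math
from typing import Union


def resolve_max_features(max_features: Union[int, float, str, None],
                         n_features: int) -> int:
    """Hypothetical helper: number of features considered at each split."""
    if max_features is None:
        return n_features
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            return max(1, int(math.sqrt(n_features)))
        if max_features == "log2":
            return max(1, int(math.log2(n_features)))
        raise ValueError(f"unknown max_features: {max_features!r}")
    if isinstance(max_features, int):
        # An int is used as-is.
        return max_features
    # A float is treated as a fraction of the total number of features.
    return max(1, int(max_features * n_features))


# With 1024 input features (the default input_dim used elsewhere on this page):
n_sqrt = resolve_max_features("sqrt", 1024)
n_log2 = resolve_max_features("log2", 1024)
n_half = resolve_max_features(0.5, 1024)
```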

svm_model_builder(C: float = 1.0, gamma: Union[str, float] = 'auto', kernel: Union[str, callable] = 'rfb')

Builds a support vector machine model.

Parameters

C (float) – Penalty parameter C of the error term.
gamma (Union[str, float]) –
Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.
- if ‘scale’ is passed then it uses 1 / (n_features * X.var()) as value of gamma;
- if ‘auto’, uses 1 / n_features.
kernel (str) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable.

Returns

svm_model – Support vector machine model.

Return type

SVC
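
The two string options for gamma documented for svm_model_builder can be worked through by hand. The sketch below uses only the standard library; resolve_gamma is a hypothetical helper, and X.var() is taken to mean the population variance over all entries of X, which is an assumption about the formula quoted in the parameter description.

```python
from statistics import pvariance
from typing import List, Union


def resolve_gamma(gamma: Union[str, float], X: List[List[float]]) -> float:
    """Hypothetical helper: numeric kernel coefficient for a feature matrix X."""
    n_features = len(X[0])
    if gamma == "scale":
        # Variance over every entry of X, not per column.
        flat = [value for row in X for value in row]
        return 1.0 / (n_features * pvariance(flat))
    if gamma == "auto":
        return 1.0 / n_features
    # A float is used directly.
    return float(gamma)


X = [[0.0, 2.0], [2.0, 4.0]]  # 2 samples, 2 features, variance 2.0
gamma_scale = resolve_gamma("scale", X)
gamma_auto = resolve_gamma("auto", X)
```

With this toy matrix, 'scale' gives 1 / (2 * 2.0) and 'auto' gives 1 / 2, so the two settings only coincide when the data has unit variance.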

deepmol.models.deepchem_models module

class DeepChemModel(model: Model, model_dir: Optional[str] = None, **kwargs)

Bases: Model

Wrapper class for DeepChem models. It allows DeepChem models to be trained on Dataset objects and evaluated with the metrics in Metrics.

cross_validate(dataset: Dataset, metric: Metric, splitter: Splitter, transformers: Optional[List[NormalizationTransformer]] = None, folds: int = 3)

Cross validates the model on the specified dataset.

Parameters

dataset (Dataset) – Dataset to cross validate on.
metric (Metric) – Metric to evaluate the model on.
splitter (Splitter) – Splitter to split the dataset into train and test sets.
transformers (List[Transformer]) – Transformers that the input data has been transformed by.
folds (int) – Number of folds to use for cross validation.

Returns

A tuple with seven elements: the best model, the train score of the best model, the test score of the best model, the train scores of all folds, the test scores of all folds, the average train score over all folds and the average test score over all folds.

Return type

Tuple[DeepChemModel, float, float, List[float], List[float], float, float]
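
The seven-element return type suggests the bookkeeping sketched below: one model per fold, per-fold train and test scores, and their averages. This is a minimal sketch and not deepmol's implementation. The function names, the contiguous fold split and the rule of picking the best model by test score are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, folds: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into contiguous (train, test) index pairs."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, folds)
    splits, start = [], 0
    for i in range(folds):
        stop = start + fold_size + (1 if i < remainder else 0)
        splits.append((indices[:start] + indices[stop:], indices[start:stop]))
        start = stop
    return splits


def cross_validate(fit: Callable, score: Callable, X: Sequence, y: Sequence,
                   folds: int = 3):
    """Return (best model, its train score, its test score, all train scores,
    all test scores, mean train score, mean test score)."""
    train_scores, test_scores = [], []
    best = None  # (model, train_score, test_score)
    for train_idx, test_idx in k_fold_indices(len(X), folds):
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        model = fit(X_tr, y_tr)
        tr, te = score(model, X_tr, y_tr), score(model, X_te, y_te)
        train_scores.append(tr)
        test_scores.append(te)
        # Assumption: the "best" model is the one with the highest test score.
        if best is None or te > best[2]:
            best = (model, tr, te)
    return (*best, train_scores, test_scores, mean(train_scores), mean(test_scores))


def fit_majority(X, y):
    """Toy model: the majority class seen during training."""
    return max(set(y), key=y.count)


def accuracy(model, X, y):
    """Fraction of labels equal to the constant prediction."""
    return sum(model == label for label in y) / len(y)


result = cross_validate(fit_majority, accuracy, list(range(6)), [0, 1, 1, 1, 1, 1], folds=3)
```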

evaluate(dataset: Dataset, metrics: List[Metric], per_task_metrics: bool = False)

Evaluates the performance of the model on the provided dataset.

Parameters

dataset (Dataset) – Dataset to evaluate the model on.
metrics (List[Metric]) – Metrics to evaluate the model on.
per_task_metrics (bool) – If true, return computed metric for each task on multitask dataset.

Returns

multitask_scores (dict) – Dictionary mapping names of metrics to metric scores.
all_task_scores (dict) – If per_task_metrics == True, a second dictionary of scores for each task separately.

Return type

Tuple[Dict, Dict]

fit(dataset: Dataset) → None

Fits DeepChemModel to data.

Parameters: dataset (Dataset) – The Dataset to train this model on.

fit_on_batch(X: Sequence, y: Sequence, w: Sequence)

Fits the model on a batch of data.

Parameters

X (Sequence) – The input data.
y (Sequence) – The output data.
w (Sequence) – The weights for the data.

get_num_tasks() → int

Returns the number of tasks of the model.

Returns: The number of tasks of the model.
Return type: int

get_task_type() → str

Returns the task type of the model.

Returns: The task type of the model.
Return type: str

predict(dataset: Dataset, transformers: Optional[List[NormalizationTransformer]] = None) → ndarray

Makes predictions on dataset.

Parameters

dataset (Dataset) – Dataset to make prediction on.
transformers (List[Transformer]) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.

Returns

The return value of the predict method of the wrapped DeepChem model.

Return type

np.ndarray
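
The transformers argument of predict is described as undoing the transformations applied to the data. A minimal sketch of that idea follows. ToyNormalizer and undo_transforms are hypothetical stand-ins written for this example, and applying the inverse transformations in reverse order is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyNormalizer:
    """Hypothetical stand-in for a normalization transformer on labels."""
    mean: float
    std: float

    def transform(self, y: List[float]) -> List[float]:
        return [(value - self.mean) / self.std for value in y]

    def untransform(self, y: List[float]) -> List[float]:
        return [value * self.std + self.mean for value in y]


def undo_transforms(y_pred: List[float], transformers: List[ToyNormalizer]) -> List[float]:
    """Map predictions back to the original scale, last transformer first."""
    for transformer in reversed(transformers):
        y_pred = transformer.untransform(y_pred)
    return y_pred


transformers = [ToyNormalizer(mean=10.0, std=2.0), ToyNormalizer(mean=1.0, std=4.0)]

# Labels as a model would see them after both transformations.
y_original = [8.0, 10.0, 14.0]
y_scaled = y_original
for transformer in transformers:
    y_scaled = transformer.transform(y_scaled)

# Pretend the model predicts the scaled labels perfectly, then undo the scaling.
y_restored = undo_transforms(y_scaled, transformers)
```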

predict_on_batch(dataset: Dataset) → ndarray

Makes predictions on batch of data.

Parameters: dataset (Dataset) – Dataset to make prediction on.

reload()[source]: Loads deepchem model from joblib file on disk.

save()[source]: Saves deepchem model to disk using joblib.

generate_sequences(epochs: int, train_smiles: List[Union[str, int]])

Function to generate the input/output pairs for SeqToSeq model. Taken from DeepChem tutorials.

Parameters

epochs (int) – Number of epochs to train the model.
train_smiles (List[str]) – The ids of the samples in the dataset (SMILES strings).

Yields

A pair of SMILES strings for each training sample in each epoch, epochs × len(train_smiles) pairs in total.
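
A generator with this behaviour can be sketched in a few lines. The version below assumes, in the style of the DeepChem SeqToSeq tutorial, that each pair uses the same SMILES string as input and output. It is an illustration and not necessarily the exact body of the deepmol function.

```python
from typing import Iterator, List, Tuple


def generate_sequences(epochs: int, train_smiles: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (input, output) pairs, one pass over train_smiles per epoch."""
    for _ in range(epochs):
        for smiles in train_smiles:
            # Assumption: the model is trained to reproduce its input.
            yield smiles, smiles


pairs = list(generate_sequences(2, ["CCO", "c1ccccc1"]))
```

Because the function is a generator, no pair is built until it is requested, so the full list of epochs × len(train_smiles) pairs never has to be held in memory during training.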

deepmol.models.ensembles module

class Ensemble(models: List[Model])

Bases: ABC

Abstract class for ensembles of models.

evaluate(dataset: Dataset, metrics: List[Metric], per_task_metrics: bool = False, n_classes: int = 2)

Evaluates the performance of this model on specified dataset.

Parameters

dataset (Dataset) – Dataset object.
metrics (List[Metric]) – The set of metrics provided.
per_task_metrics (bool) – If true, return computed metric for each task on multitask dataset.
n_classes (int) – If specified, will use n_classes as the number of unique classes.

Returns

multitask_scores (dict) – Dictionary mapping names of metrics to metric scores.
all_task_scores (dict, optional) – If per_task_metrics == True is passed as a keyword argument, then returns a second dictionary of scores for each task separately.

fit(dataset: Dataset)[source]: Fits the models to the specified dataset.

abstract predict(dataset: Dataset)[source]: Predicts the labels for the specified dataset.

class VotingClassifier(models: List[Model], voting: str = 'soft')

Bases: Ensemble

VotingClassifier Ensemble. It uses a voting strategy to predict the labels of a dataset.

predict(dataset: Dataset, proba: bool = False)

Predicts the labels for the specified dataset.

Parameters

dataset (Dataset) – Dataset object.
proba (bool) – If true, returns the probabilities instead of class labels.

Returns

final_result – Predicted labels or probabilities.

Return type

np.ndarray
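
The difference between the two usual voting strategies can be shown without any model at all. The sketch below works on plain lists of class probabilities. The functions soft_vote, hard_vote and vote are hypothetical and only illustrate the general technique; how deepmol's VotingClassifier breaks ties, or what it returns for proba=True under hard voting, is not stated here.

```python
from collections import Counter
from typing import List, Sequence


def soft_vote(probas: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Average class probabilities over models; probas[model][sample][class]."""
    n_models = len(probas)
    return [[sum(per_class) / n_models for per_class in zip(*sample)]
            for sample in zip(*probas)]


def hard_vote(labels: Sequence[Sequence[int]]) -> List[int]:
    """Majority label per sample; labels[model][sample]."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*labels)]


def vote(probas, voting: str = "soft", proba: bool = False):
    """Combine per-model probabilities into labels (or averaged probabilities)."""
    if voting == "soft":
        averaged = soft_vote(probas)
        return averaged if proba else [row.index(max(row)) for row in averaged]
    # Hard voting: each model first commits to its most probable class.
    labels = [[row.index(max(row)) for row in model] for model in probas]
    return hard_vote(labels)


# Three models, one sample, two classes. One confident model disagrees
# with two hesitant ones.
probas = [[[0.9, 0.1]], [[0.4, 0.6]], [[0.45, 0.55]]]
soft_labels = vote(probas, voting="soft")
hard_labels = vote(probas, voting="hard")
soft_probas = vote(probas, voting="soft", proba=True)
```

In this example soft voting follows the confident model, while hard voting follows the majority, so the two strategies return different labels for the same inputs.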

deepmol.models.keras_models module

class KerasModel(model_builder: callable, mode: str = 'classification', model_dir: Optional[str] = None, loss: str = 'binary_crossentropy', optimizer: str = 'adam', learning_rate: float = 0.001, epochs: int = 150, batch_size: int = 10, verbose: int = 0, **kwargs)

Bases: Model

Wrapper class for keras models. It allows these models to be trained on Dataset objects.

cross_validate(dataset: Dataset, metric: Metric, folds: int = 3)

Cross validates the model on a dataset.

Parameters

dataset (Dataset) – The Dataset to cross validate on.
metric (Metric) – The metric to use for cross validation.
folds (int) – The number of folds to use for cross validation.

Returns

A tuple with seven elements: the best model, the train score of the best model, the test score of the best model, the train scores of all folds, the test scores of all folds, the average train score over all folds and the average test score over all folds.

Return type

Tuple[KerasModel, float, float, List[float], List[float], float, float]

fit(dataset: Dataset, **kwargs) → None

Fits keras model to data.

Parameters

dataset (Dataset) – The Dataset to train this model on.
kwargs – Additional arguments to pass to fit method of the keras model.

fit_on_batch(X: Sequence, y: Sequence)[source]: Fits model on batch of data.

get_num_tasks() → int[source]: Returns the number of tasks of the model.

get_task_type() → str[source]: Returns the task type of the model.

predict(dataset: Dataset) → ndarray

Makes predictions on dataset.

Parameters: dataset (Dataset) – Dataset to make prediction on.
Returns: The return value of the predict_proba or predict method of the wrapped model. If the wrapped model has both methods, the return value of predict_proba is always used.
Return type: np.ndarray

predict_on_batch(X: Dataset) → ndarray

Makes predictions on batch of data.

Parameters: X (Dataset) – Dataset to make prediction on.
Returns: numpy array of predictions.
Return type: np.ndarray

reload() → None[source]: Reloads the model from disk.

save() → None[source]: Saves the model to disk.

deepmol.models.models module

class Model(model: Optional[BaseEstimator] = None, model_dir: Optional[str] = None, **kwargs)

Bases: BaseEstimator

Abstract base class for ML/DL models.

evaluate(dataset: Dataset, metrics: Union[List[Metric], Metric], per_task_metrics: bool = False) → Tuple[Dict, Union[None, Dict]]

Evaluates the performance of this model on specified dataset.

Parameters

dataset (Dataset) – Dataset object.
metrics (Union[List[Metric], Metric]) – The set of metrics provided.
per_task_metrics (bool) – If true, return computed metric for each task on multitask dataset.
kwargs – Additional keyword arguments to pass to Evaluator.compute_model_performance.

Returns

multitask_scores (dict) – Dictionary mapping names of metrics to metric scores.
all_task_scores (dict, optional) – If per_task_metrics == True is passed as a keyword argument, then returns a second dictionary of scores for each task separately.
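
The relation between the two returned dictionaries can be sketched on plain lists. In the example below the aggregated score of each metric is the mean of its per-task scores, and the second element is None unless per_task_metrics is set. Both choices are assumptions consistent with the return type above, not a statement of how deepmol computes them. The evaluate and accuracy functions are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple

MetricFn = Callable[[Sequence[int], Sequence[int]], float]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of matching labels for a single task."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(y_true: List[List[int]], y_pred: List[List[int]],
             metrics: Dict[str, MetricFn],
             per_task_metrics: bool = False
             ) -> Tuple[Dict[str, float], Optional[Dict[str, List[float]]]]:
    """Score multitask predictions; rows are samples, columns are tasks."""
    n_tasks = len(y_true[0])
    multitask_scores, all_task_scores = {}, {}
    for name, metric in metrics.items():
        per_task = [metric([row[t] for row in y_true], [row[t] for row in y_pred])
                    for t in range(n_tasks)]
        # Assumption: tasks are aggregated with an unweighted mean.
        multitask_scores[name] = mean(per_task)
        all_task_scores[name] = per_task
    return multitask_scores, (all_task_scores if per_task_metrics else None)


y_true = [[1, 0], [0, 0], [1, 1], [1, 1]]
y_pred = [[1, 0], [0, 0], [1, 1], [0, 1]]
scores, task_scores = evaluate(y_true, y_pred, {"accuracy": accuracy}, per_task_metrics=True)
```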

fit(dataset: Dataset)

Fits a model on data in a Dataset object.

Parameters: dataset (Dataset) – the Dataset to train on

fit_on_batch(X: Sequence, y: Sequence)

Perform a single step of training.

Parameters

X (np.ndarray) – the inputs for the batch
y (np.ndarray) – the labels for the batch

static get_model_filename(model_dir: str) → str

Given model directory, obtain filename for the model itself.

Parameters: model_dir (str) – Path to directory where model is stored.
Returns: Path to model file.
Return type: str

get_num_tasks() → int[source]: Get number of tasks.

static get_params_filename(model_dir: str) → str

Given model directory, obtain filename for the model parameters.

Parameters: model_dir (str) – Path to directory where model is stored.
Returns: Path to file where model parameters are stored.
Return type: str
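
Both static helpers map a model directory to a fixed file inside it. A minimal sketch of that pattern is shown below. The file names "model.joblib" and "model_params.joblib" are assumptions chosen for illustration; the names actually used by deepmol are not given on this page.

```python
import os


def get_model_filename(model_dir: str) -> str:
    """Path of the serialized model inside model_dir (file name assumed)."""
    return os.path.join(model_dir, "model.joblib")


def get_params_filename(model_dir: str) -> str:
    """Path of the serialized parameters inside model_dir (file name assumed)."""
    return os.path.join(model_dir, "model_params.joblib")


model_path = get_model_filename("my_model")
params_path = get_params_filename("my_model")
```

Keeping the model and its parameters in separate files under one directory lets save and reload work from model_dir alone.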

get_task_type() → str[source]: Currently models can only be classifiers or regressors.

predict(dataset: Dataset) → ndarray

Uses this model to make predictions on the provided Dataset object.

Parameters: dataset (Dataset) – Dataset to make prediction on
Returns: A numpy array of predictions.
Return type: np.ndarray

predict_on_batch(X: Sequence)

Makes predictions on given batch of new data.

Parameters: X (np.ndarray) – array of features

predict_proba(dataset: Dataset) → ndarray

reload() → None: Reload trained model from disk.

save() → None: Function for saving models. Each subclass is responsible for overriding this method.

deepmol.models.sklearn_models module

class SklearnModel(model: BaseEstimator, mode: Optional[str] = None, model_dir: Optional[str] = None, **kwargs)

Bases: Model

Wrapper class for scikit-learn models. It allows scikit-learn models to be trained on Dataset objects and evaluated with the metrics in Metrics.

cross_validate(dataset: Dataset, metric: Metric, folds: int = 3)

Performs cross-validation on a dataset.

Parameters

dataset (Dataset) – Dataset to perform cross-validation on.
metric (Metric) – Metric to evaluate model performance.
folds (int) – Number of folds to use for cross-validation.

Returns

A tuple with seven elements: the best model, the train score of the best model, the test score of the best model, the train scores of all folds, the test scores of all folds, the average train score over all folds and the average test score over all folds.

Return type

Tuple[SklearnModel, float, float, List[float], List[float], float, float]

fit(dataset: Dataset) → None

Fits scikit-learn model to data.

Parameters: dataset (Dataset) – The Dataset to train this model on.
Returns: The trained scikit-learn model.
Return type: BaseEstimator

fit_on_batch(X: Sequence, y: Sequence)[source]: Fits model on batch of data.

get_num_tasks() → int[source]: Returns the number of tasks.

get_task_type() → str[source]: Returns the task type of the model.

predict(dataset: Dataset) → ndarray

Makes predictions on dataset.

Parameters: dataset (Dataset) – Dataset to make prediction on.
Returns: The return value of the predict_proba or predict method of the scikit-learn model. If the model has both methods, the return value of predict_proba is always used.
Return type: np.ndarray
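
The preference for predict_proba over predict can be illustrated with duck typing. The two toy classes and the predict function below are hypothetical stand-ins, not scikit-learn estimators. The sketch only shows the dispatch rule described for the return value.

```python
from typing import List, Sequence


class LabelsOnly:
    """Toy estimator that can only return hard labels."""

    def predict(self, X: Sequence[float]) -> List[int]:
        return [int(x > 0) for x in X]


class WithProba(LabelsOnly):
    """Toy estimator that additionally exposes class probabilities."""

    def predict_proba(self, X: Sequence[float]) -> List[List[float]]:
        return [[0.2, 0.8] if x > 0 else [0.7, 0.3] for x in X]


def predict(model, X: Sequence[float]):
    """Use predict_proba when the estimator has it, otherwise fall back to predict."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)
    return model.predict(X)


labels = predict(LabelsOnly(), [-1.0, 2.0])
probabilities = predict(WithProba(), [-1.0, 2.0])
```

A practical consequence is that the shape of the result depends on the wrapped estimator: one value per sample for label-only models, one row of class probabilities per sample otherwise.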

predict_on_batch(dataset: Dataset) → ndarray

Makes predictions on batch of data.

Parameters: dataset (Dataset) – Dataset to make prediction on.
Returns: numpy array of predictions.
Return type: np.ndarray

reload()[source]: Loads scikit-learn model from joblib file on disk.

save()[source]: Saves scikit-learn model to disk using joblib.
