DeepMol Pipeline Optimization (AutoML)

In DeepMol, pipelines can be optimized using the PipelineOptimization class, which relies on the Optuna library to tune the pipeline's hyperparameters. It is possible to optimize over different Transformers (e.g. featurizers, scalers, feature selectors) and Predictors (i.e. models), as well as their hyperparameters.

To do this, we define an objective function to be optimized by Optuna. This function receives an Optuna Trial object and returns the steps (i.e. Transformers and Predictors) of the pipeline. Alternatively, we can use one of the predefined objective functions (presets) already implemented in DeepMol. The following presets are available:

  • 'sklearn': optimizes between all available Standardizers, Featurizers, Scalers and Feature Selectors in DeepMol and all available models in Sklearn.

  • 'deepchem': optimizes between all available Standardizers and all available models in DeepChem (depending on the model, different Featurizers, Scalers and Feature Selectors may also be optimized).

  • 'keras': optimizes between Standardizers, Featurizers, Scalers and Feature Selectors in DeepMol and some base Keras models (1D CNNs, RNNs, Bidirectional RNNs and FCNNs).

  • 'all': optimizes between all the above presets.

When using a custom objective function, it must return a list of tuples where the first element of each tuple is the name of the step and the second is the object that implements it. The Pipeline class respects the order of the steps returned by the objective function. For example, if you return the following list of tuples:

steps = [('featurizer', featurizer), ('scaler', scaler), ('model', model)]

The pipeline will first apply the featurizer, then the scaler and finally the model.
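
Transformers can be optimized in the same way as models. As a minimal sketch (the featurizer classes and parameter names below are illustrative and assume the MorganFingerprint and MACCSkeysFingerprint featurizers from deepmol.compound_featurization; check the API of your DeepMol version), a steps function could also let the trial choose the featurizer:

from sklearn.ensemble import RandomForestClassifier

from deepmol.compound_featurization import MACCSkeysFingerprint, MorganFingerprint
from deepmol.models import SklearnModel

def steps(trial):
    # let the trial pick the featurizer (illustrative choice of two DeepMol featurizers)
    featurizer_name = trial.suggest_categorical('featurizer', ['morgan', 'maccs'])
    if featurizer_name == 'morgan':
        featurizer = MorganFingerprint(radius=trial.suggest_int('featurizer__radius', 2, 4))
    else:
        featurizer = MACCSkeysFingerprint()
    model = SklearnModel(model=RandomForestClassifier(), model_dir='model')
    # order matters: the featurizer is applied before the model
    return [('featurizer', featurizer), ('model', model)]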

Example using a custom steps function

In the following example we will assume that the data is already processed (ready for training), so we will only use DeepMol’s PipelineOptimization class to optimize the final model. Here we optimize between a Random Forest and an SVC model, as well as their hyperparameters.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

from deepmol.loaders import CSVLoader
from deepmol.metrics import Metric
from deepmol.models import SklearnModel
from deepmol.pipeline_optimization import PipelineOptimization
from deepmol.splitters import RandomSplitter

# DEFINE THE OBJECTIVE FUNCTION
def steps(trial):
    # let the trial choose between a Random Forest and an SVC,
    # then suggest the corresponding hyperparameters
    model = trial.suggest_categorical('model', ['RandomForestClassifier', 'SVC'])
    if model == 'RandomForestClassifier':
        n_estimators = trial.suggest_int('model__n_estimators', 10, 100, step=10)
        model = RandomForestClassifier(n_estimators=n_estimators)
    elif model == 'SVC':
        kernel = trial.suggest_categorical('model__kernel', ['linear', 'poly', 'rbf', 'sigmoid'])
        model = SVC(kernel=kernel)
    # wrap the scikit-learn estimator so it can be used as a pipeline step
    model = SklearnModel(model=model, model_dir='model')
    return [('model', model)]
 
# LOAD THE DATA   
loader = CSVLoader('data_path...',
                   smiles_field='mols',
                   id_field='ids',
                   labels_fields=['y'],
                   features_fields=['f1', 'f2', 'f3', '...'],
                   mode='classification')
dataset_descriptors = loader.create_dataset(sep=",")
   
# OPTIMIZE THE PIPELINE 
po = PipelineOptimization(direction='maximize', study_name='test_predictor_pipeline')
metric = Metric(accuracy_score)
train, test = RandomSplitter().train_test_split(dataset_descriptors, seed=123)
po.optimize(train_dataset=train, test_dataset=test, objective_steps=steps, 
            metric=metric, n_trials=5, save_top_n=3, trial_timeout=600)

This will run the optimization using the objective function and save the top 3 pipelines (save_top_n=3) in the study_name folder. The direction parameter indicates whether we want to maximize or minimize the metric; in this case we want to maximize the accuracy. The n_trials parameter sets the number of trials Optuna will perform. Moreover, if the trial_timeout parameter is set, each trial is stopped after the given number of seconds.
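
The saved pipelines can later be reloaded from disk. A minimal sketch, assuming a Pipeline.load method in deepmol.pipeline and that the top pipelines are stored inside the study folder (the exact path below is hypothetical; inspect the study_name folder for the actual layout):

from deepmol.pipeline import Pipeline

# illustrative path; check the actual folder structure created by the optimization
pipeline = Pipeline.load('test_predictor_pipeline/trial_1/')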

Additionally, we can provide a storage (see: https://optuna.readthedocs.io/en/stable/reference/storages.html), a sampler (https://optuna.readthedocs.io/en/stable/reference/samplers/index.html) and a pruner (https://optuna.readthedocs.io/en/stable/reference/pruners.html) to use in the optimization.
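
For instance (a sketch; the sampler and storage arguments also appear in the preset example below, while the pruner argument is an assumption based on the statement above):

import optuna

po = PipelineOptimization(direction='maximize',
                          study_name='test_predictor_pipeline',
                          storage='sqlite:///my_study.db',
                          sampler=optuna.samplers.TPESampler(seed=42),
                          pruner=optuna.pruners.MedianPruner())  # pruner support is an assumption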

The best parameters can be retrieved using the best_params property:

best_params = po.best_params

The best trial can be retrieved using the best_trial property:

best_trial = po.best_trial

The best score can be retrieved using the best_value property:

best_score = po.best_value

All the trials can be retrieved using the trials property:

trials = po.trials

The best pipeline can be retrieved using the best_pipeline property:

best_pipeline = po.best_pipeline

More information about the trials and respective scores and parameters can be retrieved using the trials_dataframe method.

trials_df = po.trials_dataframe()

Information about the importance of the optimized parameters can be retrieved using the get_param_importances method.

param_importances = po.get_param_importances()

Example using a preset

The following example assumes that you are providing raw SMILES data and respective labels. We will use the 'all' preset.

from deepmol.loaders import CSVLoader
from deepmol.metrics import Metric
from deepmol.pipeline_optimization import PipelineOptimization
from deepmol.splitters import RandomSplitter
from sklearn.metrics import mean_squared_error
import optuna

# LOAD THE DATA
loader = CSVLoader('dataset_regression_path',
                   smiles_field='smiles',
                   labels_fields=['pIC50'],
                   mode='regression')
dataset_regression = loader.create_dataset(sep=",")

# OPTIMIZE THE PIPELINE
po = PipelineOptimization(direction='minimize', study_name='test_pipeline', sampler=optuna.samplers.TPESampler(seed=42),
                          storage='sqlite:///my_experience.db')
metric = Metric(mean_squared_error)
train, test = RandomSplitter().train_test_split(dataset_regression, seed=123)
po.optimize(train_dataset=train, test_dataset=test, objective_steps='all', 
            metric=metric, n_trials=10, data=train, save_top_n=2, trial_timeout=600)

Here we are optimizing between all the available steps in DeepMol and, since this is a regression task, we want to minimize the mean squared error.

The best params and best pipeline can be obtained as in the previous example.
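
The best pipeline can then be applied to new data. A minimal sketch, assuming DeepMol pipelines expose predict and evaluate methods (check the Pipeline API of your DeepMol version):

best_pipeline = po.best_pipeline

# predict on the held-out set (illustrative)
predictions = best_pipeline.predict(test)

# evaluate with the same metric used during the optimization
results = best_pipeline.evaluate(test, [metric])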

Implement a custom objective class

In the following example we will implement a custom objective class. By default, the objective class is ObjectiveTrainEval, which trains the pipeline on the train dataset and evaluates it on the validation dataset. A new objective class must inherit from Objective and implement the __call__ method, which receives a Trial object from Optuna and returns the score of that trial. The constructor receives the following arguments:

  • objective_steps: a callable that returns the steps of the pipeline for a given trial (a list of tuples where the first element is the name of the step and the second is the object implementing it).

  • study: the Study object from Optuna that stores the optimization history.

  • direction: the direction of the optimization (i.e. maximize or minimize).

  • save_top_n: the number of top pipelines to save.

  • trial_timeout: the maximum number of seconds a trial is allowed to run.

Here is an example of a custom objective class:

from optuna import Trial

from deepmol.pipeline_optimization.objective_wrapper import Objective  # import path may vary with the DeepMol version
from deepmol.utils.decorators import timeout

class MyObjective(Objective):
    """
    Custom objective class for pipeline optimization.

    Parameters
    ----------
    objective_steps : callable
        Function that returns the steps of the pipeline for a given trial.
    study : optuna.study.Study
        Study object that stores the optimization history.
    direction : str or optuna.study.StudyDirection
        Direction of the optimization (minimize or maximize).
    save_top_n : int
        Number of best pipelines to save.
    trial_timeout : int
        Maximum number of seconds a trial is allowed to run.
    **kwargs
        Additional keyword arguments passed to the objective_steps function.
    """

    def __init__(self, objective_steps, study, direction, save_top_n,
                 trial_timeout=86400, **kwargs):
        """
        Initialize the objective function.
        """
        super().__init__(objective_steps, study, direction, save_top_n)
        self.metric = kwargs.pop('metric')
        self.trial_timeout = trial_timeout
        self.kwargs = kwargs

    def __call__(self, trial: Trial):
        """
        Call the objective function and return the score of the trial.
        """
        try:
            @timeout(self.trial_timeout)
            def run_with_timeout():
                # build the pipeline from self.objective_steps(trial, **self.kwargs),
                # train it and compute the score with self.metric
                ...
                return score
            return run_with_timeout()
        except Exception as e:
            # failed or timed-out trials get the worst possible score
            print(e)
            return float('inf') if self.direction == 'minimize' else float('-inf')
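
The custom class can then be plugged into the optimization. A sketch, assuming PipelineOptimization.optimize accepts the objective class through an objective parameter (the parameter name is an assumption; verify it in your DeepMol version):

po = PipelineOptimization(direction='maximize', study_name='custom_objective_pipeline')
po.optimize(train_dataset=train, test_dataset=test, objective_steps=steps,
            metric=metric, n_trials=5, save_top_n=3, trial_timeout=600,
            objective=MyObjective)  # 'objective' parameter name is an assumption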