Data Scaling with DeepMol

In machine learning and deep learning, feature scaling is important because it can significantly affect the performance of the algorithms. This is also true in chemoinformatics, which uses machine learning and other computational methods to analyze chemical data.

One reason scaling matters is that many machine learning algorithms rely on distance-based measures to compute similarities between data points. If the features are not scaled, the algorithm may give more weight to features with larger values, even when they are no more informative for the analysis. This can lead to biased results and suboptimal model performance.
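
To make this concrete, here is a small self-contained sketch (plain NumPy, not DeepMol; the feature names and scale factors are made up for illustration) of how a large-scale feature dominates a Euclidean distance:

import numpy as np

# Two compounds described by two hypothetical features:
# molecular weight (hundreds) and LogP (single digits).
a = np.array([300.0, 1.2])
b = np.array([310.0, 3.9])

# The raw distance is dominated by molecular weight, even though the
# LogP difference is, relatively speaking, much larger.
print(np.linalg.norm(a - b))  # ~10.4, almost entirely the weight term

# After dividing by (assumed) per-feature standard deviations,
# both features contribute on comparable scales.
scale = np.array([50.0, 1.0])
print(np.linalg.norm(a / scale - b / scale))  # ~2.7, now driven by LogP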

Another reason is that scaling can speed up training. When features are on very different scales, the optimization algorithm may take much longer to converge, or fail to converge at all. This is especially problematic in deep learning, where large amounts of data and complex models are common.
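
The optimization effect can be illustrated with a toy experiment (again plain NumPy, with synthetic data and illustrative numbers): gradient descent on a least-squares problem where one feature's scale is a thousand times the other's. The unscaled problem is badly conditioned, forcing a tiny learning rate, and stalls; after standardization it converges quickly:

import numpy as np

rng = np.random.default_rng(0)
# the second feature lives on a scale ~1000x the first
X = np.column_stack([rng.normal(size=500), 1000 * rng.normal(size=500)])
y = X @ np.array([2.0, 0.003]) + rng.normal(scale=0.1, size=500)

def gd_iterations(X, y, lr, tol=1e-6, max_iter=50000):
    # gradient descent on mean squared error; returns iterations used
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        grad = X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return i
        w -= lr * grad
    return max_iter  # did not converge within the budget

# Unscaled: stability caps the learning rate around 1e-6, and progress
# along the small-scale direction is so slow the run hits the cap.
print(gd_iterations(X, y, lr=1e-6))   # 50000 (no convergence)

# Standardized: a learning rate of 0.1 converges in a few hundred steps.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(gd_iterations(Xs, y, lr=0.1))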

As we will see below, DeepMol offers a wide variety of scalers.

Let’s start by loading some data:

from deepmol.compound_featurization import TwoDimensionDescriptors
from deepmol.loaders import CSVLoader

# Load data from CSV file
loader = CSVLoader(dataset_path='../data/CHEMBL217_reduced.csv',
                   smiles_field='SMILES',
                   id_field='Original_Entry_ID',
                   labels_fields=['Activity_Flag'],
                   mode='auto',
                   shard_size=2500)
# create the dataset
data = loader.create_dataset(sep=',', header=0)
# create the features
TwoDimensionDescriptors().featurize(data, inplace=True)
# view the features
data.X # the feature values span very different scales
array([[ 1.2918277e+01,  1.7210670e-02,  1.2918277e+01, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       [ 1.3306055e+01, -5.6191501e-03,  1.3306055e+01, ...,
         0.0000000e+00,  0.0000000e+00,  1.0000000e+00],
       [ 1.1760469e+01, -2.3133385e-01,  1.1760469e+01, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       ...,
       [ 1.2702195e+01, -3.6049552e+00,  1.2702195e+01, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       [ 2.4115279e+00,  4.3334645e-01,  2.4115279e+00, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       [ 6.1730037e+00,  6.9678569e-01,  6.1730037e+00, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00]], dtype=float32)

StandardScaler

Standardize features by removing the mean and scaling to unit variance.

from copy import deepcopy
from deepmol.scalers import StandardScaler

d1 = deepcopy(data)
scaler = StandardScaler() # Standardize features by removing the mean and scaling to unit variance.
d1 = scaler.fit_transform(d1) # you can scale only a portion of the data by passing a columns argument with the indexes of the columns to scale
d1.X # the data is much more homogeneous
array([[ 0.6015701 ,  0.5215607 ,  0.6015701 , ..., -0.2591845 ,
        -0.32493442, -0.24450661],
       [ 0.7382216 ,  0.5070287 ,  0.7382216 , ..., -0.2591845 ,
        -0.32493442,  4.0300846 ],
       [ 0.19356354,  0.36335236,  0.19356354, ..., -0.2591845 ,
        -0.32493442, -0.24450661],
       ...,
       [ 0.5254238 , -1.7840909 ,  0.5254238 , ..., -0.2591845 ,
        -0.32493442, -0.24450661],
       [-3.1009648 ,  0.78644764, -3.1009648 , ..., -0.2591845 ,
        -0.32493442, -0.24450661],
       [-1.7754363 ,  0.9541371 , -1.7754363 , ..., -0.2591845 ,
        -0.32493442, -0.24450661]], dtype=float32)
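
As the comment above notes, scaling can be restricted to a subset of the features. A minimal sketch, assuming the columns keyword is passed to fit_transform and using arbitrary column indexes for illustration:

from copy import deepcopy
from deepmol.scalers import StandardScaler

d1b = deepcopy(data)
scaler = StandardScaler()
# scale only the first five descriptor columns; the rest are left untouched
d1b = scaler.fit_transform(d1b, columns=[0, 1, 2, 3, 4])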

MinMaxScaler

Transform features by scaling each feature to a given range.

from deepmol.scalers import MinMaxScaler
from copy import deepcopy

d2 = deepcopy(data)
scaler = MinMaxScaler(feature_range=(-2, 2))
d2 = scaler.fit_transform(d2)
d2.X # data is scaled between -2 and 2
array([[ 1.2589104 ,  1.5757873 ,  1.2589104 , ..., -2.        ,
        -2.        , -2.        ],
       [ 1.3791888 ,  1.567372  ,  1.3791888 , ..., -2.        ,
        -2.        ,  0.        ],
       [ 0.8997896 ,  1.4841713 ,  0.8997896 , ..., -2.        ,
        -2.        , -2.        ],
       ...,
       [ 1.1918876 ,  0.24062085,  1.1918876 , ..., -2.        ,
        -2.        , -2.        ],
       [-2.        ,  1.729179  , -2.        , ..., -2.        ,
        -2.        , -2.        ],
       [-0.83329153,  1.8262854 , -0.83329153, ..., -2.        ,
        -2.        , -2.        ]], dtype=float32)

MaxAbsScaler

Scale each feature by its maximum absolute value.

from deepmol.scalers import MaxAbsScaler
from copy import deepcopy

d3 = deepcopy(data)
scaler = MaxAbsScaler()
d3 = scaler.fit_transform(d3)
d3.X
array([[ 8.4391510e-01,  1.7773149e-03,  8.4391510e-01, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       [ 8.6924756e-01, -5.8027951e-04,  8.6924756e-01, ...,
         0.0000000e+00,  0.0000000e+00,  5.0000000e-01],
       [ 7.6827878e-01, -2.3889430e-02,  7.6827878e-01, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       ...,
       [ 8.2979906e-01, -3.7227723e-01,  8.2979906e-01, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       [ 1.5753841e-01,  4.4750907e-02,  1.5753841e-01, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       [ 4.0326515e-01,  7.1955800e-02,  4.0326515e-01, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00]], dtype=float32)

RobustScaler

Scale features using statistics that are robust to outliers.

from deepmol.scalers import RobustScaler
from copy import deepcopy

d4 = deepcopy(data)
scaler = RobustScaler()
d4 = scaler.fit_transform(d4)
d4.X # scaled data
array([[ 0.22212751,  0.29477584,  0.22212751, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.3872069 ,  0.26957947,  0.3872069 , ...,  0.        ,
         0.        ,  1.        ],
       [-0.27075756,  0.02046694, -0.27075756, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.13014036, -3.7028677 ,  0.13014036, ...,  0.        ,
         0.        ,  0.        ],
       [-4.250654  ,  0.75404876, -4.250654  , ...,  0.        ,
         0.        ,  0.        ],
       [-2.649373  ,  1.0447963 , -2.649373  , ...,  0.        ,
         0.        ,  0.        ]], dtype=float32)

Normalizer

Normalize samples individually to unit norm.

from deepmol.scalers import Normalizer
from copy import deepcopy

d5 = deepcopy(data)
scaler = Normalizer(norm='l2') # One of 'l1', 'l2' or 'max'. The norm to use to normalize each non-zero sample.
d5 = scaler.fit_transform(d5)
d5.X # scaled data
array([[ 2.4346070e-07,  3.2435607e-10,  2.4346070e-07, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       [ 3.5423795e-08, -1.4959476e-11,  3.5423795e-08, ...,
         0.0000000e+00,  0.0000000e+00,  2.6622313e-09],
       [ 7.4276683e-04, -1.4610566e-05,  7.4276683e-04, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       ...,
       [ 2.1571793e-06, -6.1221971e-07,  2.1571793e-06, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       [ 7.5876460e-06,  1.3634839e-06,  7.5876460e-06, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
       [ 1.9041399e-05,  2.1493222e-06,  1.9041399e-05, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00]], dtype=float32)

Binarizer

Binarize data (set feature values to 0 or 1) according to a threshold.

from deepmol.scalers import Binarizer
from copy import deepcopy

d6 = deepcopy(data)
scaler = Binarizer(threshold=1) # features greater than 1 are set to 1, features less than or equal to 1 are set to 0
d6 = scaler.fit_transform(d6)
d6.X
array([[1., 0., 1., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.],
       ...,
       [1., 0., 1., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.]], dtype=float32)

QuantileTransformer

The QuantileTransformer is a preprocessing method that transforms the input data to follow a specified probability distribution. It maps the data to a uniform or normal distribution using the quantiles of the input data.

This transformer is often useful with machine learning algorithms that are sensitive to the scale and distribution of the input data, such as neural networks. It is particularly helpful when the input data is highly skewed, since it can map a skewed feature to a uniform distribution (the default) or to an approximately Gaussian one.

from deepmol.scalers import QuantileTransformer
from copy import deepcopy

d7 = deepcopy(data)
scaler = QuantileTransformer()
d7 = scaler.fit_transform(d7)
d7.X # scaled data (uniform distribution by default)
array([[0.72686756, 0.7179994 , 0.72686756, ..., 0.        , 0.        ,
        0.        ],
       [0.86711156, 0.7053635 , 0.86711156, ..., 0.        , 0.        ,
        0.9714715 ],
       [0.32533428, 0.51655453, 0.32533428, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.62756836, 0.12748909, 0.62756836, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.8474799 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.14120819, 0.92178524, 0.14120819, ..., 0.        , 0.        ,
        0.        ]], dtype=float32)
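
By default the transformer maps each feature to a uniform distribution on [0, 1], as in the output above. Assuming DeepMol forwards scikit-learn's output_distribution parameter, mapping to an approximately normal distribution is a one-argument change:

from deepmol.scalers import QuantileTransformer
from copy import deepcopy

d7b = deepcopy(data)
# 'normal' maps each feature to an approximately standard normal
# distribution instead of the default 'uniform'
scaler = QuantileTransformer(output_distribution='normal')
d7b = scaler.fit_transform(d7b)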

PowerTransformer

The PowerTransformer is a preprocessing method that applies a power transformation to make the data more Gaussian-like. It is particularly useful for features with skewed distributions.

The PowerTransformer applies either a Box-Cox or a Yeo-Johnson transformation to the input data. Box-Cox is only applicable to strictly positive data, while Yeo-Johnson also handles zero and negative values.

from deepmol.scalers import PowerTransformer
from copy import deepcopy

d8 = deepcopy(data)
scaler = PowerTransformer(method='yeo-johnson', # 'yeo-johnson' works with positive and negative values; 'box-cox' only works with strictly positive values
                          standardize=True) # apply zero-mean, unit-variance normalization to the transformed output
d8 = scaler.fit_transform(d8)
d8.X # scaled data
array([[ 0.6452927 ,  0.36367184,  0.6452927 , ..., -0.26236013,
        -0.4512646 , -0.24539872],
       [ 0.93890953,  0.33411208,  0.93890953, ..., -0.26236013,
        -0.4512646 ,  4.075001  ],
       [-0.0993756 ,  0.07504444, -0.0993756 , ..., -0.26236013,
        -0.4512646 , -0.24539872],
       ...,
       [ 0.49173132, -1.5438824 ,  0.49173132, ..., -0.26236013,
        -0.4512646 , -0.24539872],
       [-1.921834  ,  1.0273771 , -1.921834  , ..., -0.26236013,
        -0.4512646 , -0.24539872],
       [-1.7411773 ,  1.5712001 , -1.7411773 , ..., -0.26236013,
        -0.4512646 , -0.24539872]], dtype=float32)
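
A closing practical note: all the examples above fit and apply the scaler on the same dataset. When evaluating a model, the scaler should instead be fit on the training split only and its statistics reused on held-out data. A sketch, assuming DeepMol scalers expose a separate transform method alongside fit_transform (mirroring the scikit-learn convention):

from copy import deepcopy
from deepmol.scalers import StandardScaler

# stand-ins for real train/test splits (use a DeepMol splitter in practice)
train = deepcopy(data)
test = deepcopy(data)

scaler = StandardScaler()
train = scaler.fit_transform(train)  # learn mean/variance from training data only
test = scaler.transform(test)        # reuse those statistics to avoid data leakage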