deepmol.splitters package

Submodules

deepmol.splitters.multitask_splitter module

class MultiTaskStratifiedSplitter

Bases: Splitter

k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]

Split the dataset into k folds using stratified sampling.

Parameters:
  • dataset (Dataset) – The dataset to split.

  • k (int) – The number of folds.

  • seed (int, optional) – The seed to use for the random number generator.

Returns:

folds – A list of tuples (train_dataset, test_dataset) containing the k folds.

Return type:

List[Tuple[Dataset, Dataset]]

split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, **kwargs) → Tuple[List[int], List[int], List[int]]

Splits a dataset into train/validation/test using a stratified split.

Parameters:
  • dataset (Dataset) – Dataset to split

  • frac_train (float) – Fraction of dataset to use for training

  • frac_valid (float) – Fraction of dataset to use for validation

  • frac_test (float) – Fraction of dataset to use for testing

  • seed (int) – Seed for the random number generator

  • **kwargs (Dict[str, Any]) – Other arguments.

Returns:

  • train_indexes (List[int]) – Indexes of the training set

  • valid_indexes (List[int]) – Indexes of the validation set

  • test_indexes (List[int]) – Indexes of the test set
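
Because split returns plain index lists, one way to check that a split is stratified is to compare the per-task positive rate in each partition. The sketch below is illustrative only: positive_rate_per_task, the toy label matrix, and the index lists are hypothetical and are not part of DeepMol.

```python
from typing import List, Sequence


def positive_rate_per_task(labels: Sequence[Sequence[int]], indexes: List[int]) -> List[float]:
    """Fraction of positive labels per task among the selected rows."""
    n_tasks = len(labels[0])
    return [sum(labels[i][task] for i in indexes) / len(indexes) for task in range(n_tasks)]


# Hypothetical binary label matrix: 10 samples x 2 tasks.
labels = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 1],
          [0, 0], [1, 0], [0, 1], [1, 0], [0, 1]]

# Hypothetical index lists of the kind split() returns.
train_indexes = [0, 1, 2, 3, 4, 5]
held_out_indexes = [6, 7, 8, 9]

train_rates = positive_rate_per_task(labels, train_indexes)        # [0.5, 0.5]
held_out_rates = positive_rate_per_task(labels, held_out_indexes)  # [0.5, 0.5]
```

Matching rates across partitions suggest that each task's class balance was preserved.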

deepmol.splitters.splitters module

class ButinaSplitter(cutoff: float = 0.6)

Bases: Splitter

Splitter based on the Butina clustering algorithm.

k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]

Splits the dataset into k folds based on Butina clustering.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • k (int) – Number of folds.

  • seed (int) – Random seed.

Returns:

A list of k (train, test) dataset pairs.

Return type:

List[Tuple[Dataset, Dataset]]

split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, homogenous_datasets: bool = True) → Tuple[List[int], List[int], List[int]]

Splits compounds into train/validation/test sets based on the Butina clustering algorithm. The dataset is expected to be a classification dataset. The algorithm is designed to generate validation data containing novel chemotypes. A small cutoff value produces smaller, finer clusters of highly similar compounds, whereas a large cutoff value produces larger, coarser clusters of less similar compounds.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • frac_train (float) – The fraction of data to be used for the training split.

  • frac_valid (float) – The fraction of data to be used for the validation split.

  • frac_test (float) – The fraction of data to be used for the test split.

  • seed (int) – Random seed to use.

  • homogenous_datasets (bool) – Whether the resulting datasets should be homogeneous.

Returns:

A tuple of train indices, valid indices, and test indices.

Return type:

Tuple[List[int], List[int], List[int]]
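
The effect of the cutoff can be illustrated with a small, self-contained sketch of Butina-style clustering over a precomputed distance matrix. This is not DeepMol's implementation: butina_clusters and the toy distances are hypothetical, and distances are assumed to be of the form 1 - similarity.

```python
from typing import List, Sequence


def butina_clusters(distances: Sequence[Sequence[float]], cutoff: float) -> List[List[int]]:
    """Cluster items so that every member lies within `cutoff` of its centroid."""
    n = len(distances)
    # Neighbours of i are all other items no further away than the cutoff.
    neighbors = [[j for j in range(n) if j != i and distances[i][j] <= cutoff]
                 for i in range(n)]
    # Visit candidate centroids from most to fewest neighbours (ties by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = [False] * n
    clusters = []
    for centroid in order:
        if assigned[centroid]:
            continue
        members = [centroid] + [j for j in neighbors[centroid] if not assigned[j]]
        for member in members:
            assigned[member] = True
        clusters.append(members)
    return clusters


# Hypothetical distances: items 0-2 are close, items 3-4 are moderately close.
toy_distances = [
    [0.0, 0.1, 0.2, 0.9, 0.9],
    [0.1, 0.0, 0.2, 0.9, 0.9],
    [0.2, 0.2, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.3],
    [0.9, 0.9, 0.9, 0.3, 0.0],
]

coarse = butina_clusters(toy_distances, cutoff=0.6)   # [[0, 1, 2], [3, 4]]
fine = butina_clusters(toy_distances, cutoff=0.25)    # [[0, 1, 2], [3], [4]]
```

Lowering the cutoff from 0.6 to 0.25 breaks the looser pair (3, 4) into singletons, which matches the behaviour described above.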

class RandomSplitter

Bases: Splitter

Class for doing random data splits.

k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]

Split a dataset into k folds for cross-validation.

Parameters:
  • dataset (Dataset) – Dataset to split into k folds.

  • k (int) – Number of folds to split dataset into.

  • seed (int, optional) – Random seed to use for reproducibility.

Returns:

A list of k tuples (train, test), where train and test are both Dataset objects.

Return type:

List[Tuple[Dataset, Dataset]]
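
As a rough illustration of random k-fold splitting, the sketch below works on indexes only. It assumes nothing about DeepMol's internals: k_fold_indexes is a hypothetical helper, and the real method returns Dataset pairs rather than index lists.

```python
import random
from typing import List, Optional, Tuple


def k_fold_indexes(n_samples: int, k: int,
                   seed: Optional[int] = None) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indexes, test_indexes) pairs over range(n_samples)."""
    indexes = list(range(n_samples))
    random.Random(seed).shuffle(indexes)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indexes[i::k] for i in range(k)]
    pairs = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, test))
    return pairs


fold_pairs = k_fold_indexes(10, k=5, seed=42)
```

Each sample appears in exactly one test fold, and passing the same seed reproduces the same folds.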

split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, **kwargs) → Tuple[List[int], List[int], List[int]]

Splits randomly into train/validation/test.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • frac_train (float) – The fraction of data to be used for the training split.

  • frac_valid (float) – The fraction of data to be used for the validation split.

  • frac_test (float) – The fraction of data to be used for the test split.

  • seed (int) – Random seed to use.

  • **kwargs (Dict[str, Any]) – Other arguments.

Returns:

A tuple of train indices, valid indices, and test indices.

Return type:

Tuple[List[int], List[int], List[int]]
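
A random train/validation/test split over indexes can be sketched as follows. The helper random_split_indexes is hypothetical and only mirrors the documented parameters. It does not reproduce DeepMol's code.

```python
import random
from typing import List, Optional, Tuple


def random_split_indexes(n_samples: int, frac_train: float = 0.8, frac_valid: float = 0.1,
                         frac_test: float = 0.1,
                         seed: Optional[int] = None) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle range(n_samples) and cut it into three consecutive slices."""
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("fractions must sum to 1")
    indexes = list(range(n_samples))
    random.Random(seed).shuffle(indexes)
    # round() avoids losing a sample to floating-point truncation.
    train_end = round(frac_train * n_samples)
    valid_end = round((frac_train + frac_valid) * n_samples)
    return indexes[:train_end], indexes[train_end:valid_end], indexes[valid_end:]


train_idx, valid_idx, test_idx = random_split_indexes(100, seed=0)
```

The three lists are disjoint and together cover every index exactly once.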

class ScaffoldSplitter

Bases: Splitter

Class for splitting the dataset based on scaffolds.

static generate_scaffolds(mols: ndarray, indexes: List[int]) → List[List[int]]

Returns all scaffolds from the dataset.

Parameters:
  • mols (ndarray) – Array of RDKit Mol objects used for scaffold generation.

  • indexes (List[int]) – Molecules’ indexes.

Returns:

scaffold_sets – A list with one entry per scaffold, each holding the indexes of the molecules that share that scaffold.

Return type:

List[List[int]]
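
The shape of the returned value can be illustrated without RDKit by grouping indexes on precomputed scaffold strings. In this sketch group_by_scaffold and the scaffold SMILES are hypothetical, and sorting the groups from largest to smallest is an assumption rather than documented behaviour.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_scaffold(scaffolds: Sequence[str], indexes: List[int]) -> List[List[int]]:
    """Group molecule indexes by scaffold string, largest group first."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for scaffold, index in zip(scaffolds, indexes):
        groups[scaffold].append(index)
    # Assumed ordering: bigger groups first, ties broken by the first index.
    return sorted(groups.values(), key=lambda group: (-len(group), group[0]))


# Hypothetical precomputed scaffold SMILES for six molecules.
toy_scaffolds = ["c1ccccc1", "c1ccncc1", "c1ccccc1", "C1CCCCC1", "c1ccccc1", "c1ccncc1"]
scaffold_sets = group_by_scaffold(toy_scaffolds, list(range(6)))
# [[0, 2, 4], [1, 5], [3]]
```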

k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]

Splits the dataset into k folds based on scaffolds.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • k (int) – Number of folds.

  • seed (int) – Random seed.

Returns:

A list of k (train, test) dataset pairs.

Return type:

List[Tuple[Dataset, Dataset]]

split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, homogenous_datasets: bool = True) → Tuple[List[int], List[int], List[int]]

Splits compounds into train/validation/test sets by scaffold.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • frac_train (float) – The fraction of data to be used for the training split.

  • frac_valid (float) – The fraction of data to be used for the validation split.

  • frac_test (float) – The fraction of data to be used for the test split.

  • seed (int) – Random seed to use.

  • homogenous_datasets (bool) – Whether the resulting datasets should be homogeneous.

Returns:

A tuple of train indices, valid indices, and test indices.

Return type:

Tuple[List[int], List[int], List[int]]
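
One common way to split by scaffold is to assign whole scaffold groups to a partition so that no scaffold is shared between partitions. The greedy sketch below assumes that approach. split_groups is a hypothetical helper, and it does not model the homogenous_datasets option.

```python
from typing import List, Tuple


def split_groups(groups: List[List[int]], frac_train: float = 0.8,
                 frac_valid: float = 0.1) -> Tuple[List[int], List[int], List[int]]:
    """Greedily place whole groups into train, then valid, then test."""
    n_samples = sum(len(group) for group in groups)
    train_cutoff = round(frac_train * n_samples)
    valid_cutoff = round((frac_train + frac_valid) * n_samples)
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for group in groups:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold groups for ten molecules, largest group first.
toy_groups = [[0, 2, 4, 6, 8], [1, 5, 9], [3], [7]]
train_idx, valid_idx, test_idx = split_groups(toy_groups)
# train_idx == [0, 2, 4, 6, 8, 1, 5, 9], valid_idx == [3], test_idx == [7]
```

Because groups are never broken up, the achieved fractions only approximate the requested ones when group sizes are uneven.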

class SimilaritySplitter

Bases: Splitter

Class for doing data splits based on fingerprint similarity.

k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]

Splits the dataset into k folds based on similarity.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • k (int) – Number of folds.

  • seed (int) – Random seed.

Returns:

A list of k (train, test) dataset pairs.

Return type:

List[Tuple[Dataset, Dataset]]

split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, homogenous_threshold: float = 0.7) → Tuple[List[int], List[int], List[int]]

Splits compounds into train/validation/test sets based on similarity. It can generate both homogeneous and heterogeneous train and test sets.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • frac_train (float) – Fraction of dataset put into training data.

  • frac_valid (float) – Fraction of dataset put into validation data.

  • frac_test (float) – Fraction of dataset put into test data.

  • seed (int) – Random seed to use.

  • homogenous_threshold (float) – Similarity threshold. Compounds with a similarity lower than this threshold will be separated between the training set and the test set. The higher the threshold, the more heterogeneous the split will be.

Returns:

A tuple of train indices, valid indices, and test indices.

Return type:

Tuple[List[int], List[int], List[int]]
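
To inspect how heterogeneous a split is, one option is to compute, for each test compound, its highest similarity to any training compound. The sketch below assumes Tanimoto similarity over fingerprints represented as sets of "on" bit positions. The helpers and the toy fingerprints are hypothetical and are not taken from DeepMol.

```python
from typing import List, Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarity_to_train(fingerprints: Sequence[Set[int]], train_indexes: List[int],
                            test_indexes: List[int]) -> List[float]:
    """For each test compound, the highest similarity to any training compound."""
    return [max(tanimoto(fingerprints[t], fingerprints[r]) for r in train_indexes)
            for t in test_indexes]


# Hypothetical fingerprints: compound 2 duplicates compound 0, compound 3 is unrelated.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4}, {7, 8, 9}]
nearest = max_similarity_to_train(toy_fps, train_indexes=[0, 1], test_indexes=[2, 3])
# [1.0, 0.0]
```

Low values indicate test compounds that are structurally distant from everything in the training set.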

class SingletaskStratifiedSplitter

Bases: Splitter

Class for doing data splits by stratification on a single task.

k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]

Splits compounds into k-folds using stratified sampling.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • k (int) – Number of folds to split dataset into.

  • seed (int) – Random seed to use.

Returns:

fold_datasets – A list of k tuples (train, test), where train and test are both Dataset objects.

Return type:

List[Tuple[Dataset, Dataset]]

split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, force_split: bool = False, **kwargs) → Tuple[List[int], List[int], List[int]]

Splits compounds into train/validation/test using stratified sampling.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • frac_train (float) – Fraction of dataset put into training data.

  • frac_valid (float) – Fraction of dataset put into validation data.

  • frac_test (float) – Fraction of dataset put into test data.

  • seed (int) – Random seed to use.

  • force_split (bool) – If True, will force the split without checking if it is a regression or classification label.

Returns:

A tuple of train indices, valid indices, and test indices.

Return type:

Tuple[List[int], List[int], List[int]]
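
Stratified sampling on a single classification task can be sketched by splitting each class separately and merging the results. stratified_split_indexes below is a hypothetical helper operating on a plain label list. It does not model force_split or regression labels.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def stratified_split_indexes(labels: Sequence[int], frac_train: float = 0.8,
                             frac_valid: float = 0.1,
                             seed: Optional[int] = None) -> Tuple[List[int], List[int], List[int]]:
    """Split indexes so each class keeps roughly the same share in every partition."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_train = round(frac_train * len(members))
        n_valid = round(frac_valid * len(members))
        train.extend(members[:n_train])
        valid.extend(members[n_train:n_train + n_valid])
        test.extend(members[n_train + n_valid:])  # remainder goes to test
    return train, valid, test


# Hypothetical imbalanced labels: 80 negatives and 20 positives.
toy_labels = [0] * 80 + [1] * 20
train_idx, valid_idx, test_idx = stratified_split_indexes(toy_labels, seed=1)
```

With these labels, every partition keeps the original 20% positive rate.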

class Splitter

Bases: ABC

Splitters split up datasets into pieces for training/validation/testing. In machine learning applications, it’s often necessary to split a dataset into training/validation/test sets, or to k-fold split a dataset for cross-validation.

abstract k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]

Split a dataset into k folds for cross-validation.

Parameters:
  • dataset (Dataset) – Dataset to split into k folds.

  • k (int) – Number of folds to split dataset into.

  • seed (int, optional) – Random seed to use for reproducibility.

Returns:

A list of k tuples (train, test), where train and test are both Dataset objects.

Return type:

List[Tuple[Dataset, Dataset]]

abstract split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, **kwargs) → Tuple[List[int], List[int], List[int]]

Return indices for specified splits.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • frac_train (float) – The fraction of data to be used for the training split.

  • frac_valid (float) – The fraction of data to be used for the validation split.

  • frac_test (float) – The fraction of data to be used for the test split.

  • seed (int) – Random seed to use.

  • **kwargs (Dict[str, Any]) – Other arguments.

Returns:

A tuple (train_inds, valid_inds, test_inds) of the indices for the various splits.

Return type:

Tuple[List[int], List[int], List[int]]

train_test_split(dataset: Dataset, frac_train: float = 0.8, seed: int | None = None, **kwargs) → Tuple[Dataset, Dataset]

Splits a Dataset into train/test sets. Returns Dataset objects for train and test.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • frac_train (float) – The fraction of data to be used for the training split.

  • seed (int) – Random seed to use.

  • **kwargs (Dict[str, Any]) – Other arguments.

Returns:

A tuple of train and test datasets as Dataset objects.

Return type:

Tuple[Dataset, Dataset]
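
The relationship between the abstract split method, which returns indexes, and the convenience methods, which return datasets, can be sketched with a toy base class. Everything here is hypothetical: ToySplitter and ToyOrderedSplitter are not DeepMol classes, the dataset is a plain list, and delegating to split with frac_valid=0 is an assumption about how such a wrapper could work.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence, Tuple


class ToySplitter(ABC):
    """Toy stand-in for a splitter base class (not the DeepMol Splitter)."""

    @abstractmethod
    def split(self, dataset: Sequence, frac_train: float = 0.8, frac_valid: float = 0.1,
              frac_test: float = 0.1,
              seed: Optional[int] = None) -> Tuple[List[int], List[int], List[int]]:
        """Return (train, valid, test) index lists."""

    def train_test_split(self, dataset: Sequence, frac_train: float = 0.8,
                         seed: Optional[int] = None) -> Tuple[list, list]:
        # Assumed delegation: request no validation set, then select by index.
        train_idx, _, test_idx = self.split(dataset, frac_train=frac_train, frac_valid=0.0,
                                            frac_test=1.0 - frac_train, seed=seed)
        return [dataset[i] for i in train_idx], [dataset[i] for i in test_idx]


class ToyOrderedSplitter(ToySplitter):
    """Splits in the original order, so the result is deterministic."""

    def split(self, dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None):
        n_samples = len(dataset)
        n_train = round(frac_train * n_samples)
        n_valid = round(frac_valid * n_samples)
        indexes = list(range(n_samples))
        return (indexes[:n_train], indexes[n_train:n_train + n_valid],
                indexes[n_train + n_valid:])


train, test = ToyOrderedSplitter().train_test_split(list("abcdefghij"), frac_train=0.8)
# train holds 'a' through 'h'; test holds 'i' and 'j'
```

Under this design, a concrete splitter only has to implement the index logic once.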

train_valid_test_split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float | None = None, frac_test: float | None = None, seed: int | None = None, **kwargs) → Tuple[Dataset, Dataset, Dataset]

Splits a Dataset into train/validation/test sets. Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (Dataset) – Dataset to be split.

  • frac_train (float) – The fraction of data to be used for the training split.

  • frac_valid (float) – The fraction of data to be used for the validation split.

  • frac_test (float) – The fraction of data to be used for the test split.

  • seed (int) – Random seed to use.

  • **kwargs (Dict[str, Any]) – Other arguments.

Returns:

A tuple of train, valid and test datasets as Dataset objects.

Return type:

Tuple[Dataset, Dataset, Dataset]

Module contents