deepmol.splitters package
Submodules
deepmol.splitters.multitask_splitter module
- class MultiTaskStratifiedSplitter
Bases: Splitter
- k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]
Split the dataset into k folds using stratified sampling.
- split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, **kwargs) → Tuple[List[int], List[int], List[int]]
Splits a dataset into train/validation/test using a stratified split.
- Parameters:
dataset (Dataset) – Dataset to split
frac_train (float) – Fraction of dataset to use for training
frac_valid (float) – Fraction of dataset to use for validation
frac_test (float) – Fraction of dataset to use for testing
seed (int) – Seed for the random number generator
**kwargs (Dict[str, Any]) – Other arguments
- Returns:
train_indexes (List[int]) – Indexes of the training set
valid_indexes (List[int]) – Indexes of the validation set
test_indexes (List[int]) – Indexes of the test set
deepmol.splitters.splitters module
- class ButinaSplitter(cutoff: float = 0.6)
Bases: Splitter
Splitter based on the Butina clustering algorithm.
- k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]
Splits the dataset into k folds based on Butina clustering.
- split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, homogenous_datasets: bool = True) → Tuple[List[int], List[int], List[int]]
Splits compounds into train/validation/test sets based on the Butina clustering algorithm. The dataset is expected to be a classification dataset. This algorithm is designed to generate validation data that are novel chemotypes. A small cutoff value generates smaller, finer clusters of highly similar compounds, whereas a large cutoff value generates larger, coarser clusters of less similar compounds.
- Parameters:
dataset (Dataset) – Dataset to be split.
frac_train (float) – The fraction of data to be used for the training split.
frac_valid (float) – The fraction of data to be used for the validation split.
frac_test (float) – The fraction of data to be used for the test split.
seed (int) – Random seed to use.
homogenous_datasets (bool) – Whether the resulting datasets should be homogeneous.
- Returns:
A tuple of train indices, valid indices, and test indices.
- Return type:
Tuple[List[int], List[int], List[int]]
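The effect of the cutoff on cluster granularity can be illustrated with a minimal, self-contained sketch of Butina-style clustering. It is not DeepMol's implementation: the function name `butina_clusters` is hypothetical, one-dimensional toy points with absolute difference stand in for molecules and a fingerprint distance, and the variant shown recounts neighbours after each cluster is removed.

```python
from typing import List, Sequence


def butina_clusters(points: Sequence[float], cutoff: float) -> List[List[int]]:
    """Cluster items Butina-style: the largest neighbourhood becomes a cluster first."""
    n = len(points)
    # Neighbours of i: every item (including i itself) within `cutoff` distance.
    # Assumption: absolute difference replaces a real fingerprint distance.
    neighbours = [
        {j for j in range(n) if abs(points[i] - points[j]) <= cutoff}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters: List[List[int]] = []
    while unassigned:
        # The unassigned item with the most unassigned neighbours is the centroid;
        # ties go to the smallest index because candidates are visited in order.
        centroid = max(
            sorted(unassigned), key=lambda i: len(neighbours[i] & unassigned)
        )
        members = sorted(neighbours[centroid] & unassigned)
        clusters.append(members)
        unassigned -= set(members)
    return clusters


toy_points = [0, 1, 2, 10, 11, 50]
fine = butina_clusters(toy_points, cutoff=1)     # small cutoff: more, tighter clusters
coarse = butina_clusters(toy_points, cutoff=10)  # large cutoff: fewer, looser clusters
print(len(fine), len(coarse))
```

Whole clusters, rather than individual compounds, would then be assigned to the train, validation, and test sets, which is what keeps similar compounds on the same side of the split.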
- class RandomSplitter
Bases: Splitter
Class for doing random data splits.
- k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]
Split a dataset into k folds for cross-validation.
- Parameters:
dataset (Dataset) – Dataset to do a k-fold split
k (int) – Number of folds to split dataset into.
seed (int, optional) – Random seed to use for reproducibility.
- Returns:
A list of k tuples (train, test), where train and test are both Dataset objects.
- Return type:
List[Tuple[Dataset, Dataset]]
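As a rough illustration of a random k-fold split, the sketch below produces index pairs rather than Dataset objects. The helper `k_fold_indices` is hypothetical and only shows the idea under the assumption that samples are shuffled once and each fold serves as the test set exactly once; it is not taken from the DeepMol source.

```python
import random
from typing import List, Optional, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: Optional[int] = None
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs; each sample is tested once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i, test in enumerate(folds):
        # Training indices are all samples outside the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((sorted(train), sorted(test)))
    return pairs


folds = k_fold_indices(n_samples=10, k=5, seed=42)
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```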
- split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, **kwargs) → Tuple[List[int], List[int], List[int]]
Splits randomly into train/validation/test.
- Parameters:
dataset (Dataset) – Dataset to be split.
frac_train (float) – The fraction of data to be used for the training split.
frac_valid (float) – The fraction of data to be used for the validation split.
frac_test (float) – The fraction of data to be used for the test split.
seed (int) – Random seed to use.
**kwargs (Dict[str, Any]) – Other arguments.
- Returns:
A tuple of train indices, valid indices, and test indices.
- Return type:
Tuple[List[int], List[int], List[int]]
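The following sketch shows one way a seeded random split into three index lists could work. It is an assumption-laden stand-in, not the library's code: `random_split_indices` is a hypothetical name, and rounding the cut points is a choice made here to avoid floating-point surprises.

```python
import random
from typing import List, Optional, Tuple


def random_split_indices(
    n_samples: int,
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
    frac_test: float = 0.1,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and cut them into train/valid/test lists."""
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    train_end = round(frac_train * n_samples)
    valid_end = round((frac_train + frac_valid) * n_samples)
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


train_idx, valid_idx, test_idx = random_split_indices(100, seed=7)
print(len(train_idx), len(valid_idx), len(test_idx))
```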
- class ScaffoldSplitter
Bases: Splitter
Class for splitting the dataset based on scaffolds.
- static generate_scaffolds(mols: ndarray, indexes: List[int]) → List[List[int]]
Returns all scaffolds from the dataset.
- Parameters:
mols (ndarray) – Array of RDKit Mol objects for scaffold generation
indexes (List[int]) – Indexes of the molecules.
- Returns:
scaffold_sets – For each scaffold, the list of indexes of the molecules that share it.
- Return type:
List[List[int]]
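The shape of the returned value can be pictured with a small grouping sketch. It assumes the scaffold of each molecule is already available as a string (in practice such strings are commonly derived with RDKit's Murcko scaffold utilities, which this page does not confirm DeepMol uses). The helper `group_by_scaffold`, the example strings, and the largest-first ordering are all illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_scaffold(
    scaffolds: Sequence[str], indexes: Sequence[int]
) -> List[List[int]]:
    """Group molecule indexes by scaffold string, largest group first."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for scaffold, index in zip(scaffolds, indexes):
        groups[scaffold].append(index)
    # Sort by descending group size; ties are broken by the first index in a group.
    return sorted(groups.values(), key=lambda g: (-len(g), g[0]))


# Hypothetical precomputed scaffold strings, one per molecule.
scaffold_strings = [
    "c1ccccc1", "c1ccncc1", "c1ccccc1", "C1CCCCC1", "c1ccccc1", "c1ccncc1",
]
scaffold_sets = group_by_scaffold(scaffold_strings, indexes=list(range(6)))
print(scaffold_sets)
```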
- k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]
Splits the dataset into k folds based on scaffolds.
- split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, homogenous_datasets: bool = True) → Tuple[List[int], List[int], List[int]]
Splits compounds into train/validation/test by scaffold.
- Parameters:
dataset (Dataset) – Dataset to be split.
frac_train (float) – The fraction of data to be used for the training split.
frac_valid (float) – The fraction of data to be used for the validation split.
frac_test (float) – The fraction of data to be used for the test split.
seed (int) – Random seed to use.
homogenous_datasets (bool) – Whether the resulting datasets should be homogeneous.
- Returns:
A tuple of train indices, valid indices, and test indices.
- Return type:
Tuple[List[int], List[int], List[int]]
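A scaffold-based split keeps all molecules that share a scaffold in the same subset. The sketch below shows one common greedy strategy for doing so, assuming the scaffold groups are already computed and ordered largest first. The function `assign_scaffold_groups` is hypothetical, and the page does not state that DeepMol assigns groups this way.

```python
from typing import List, Sequence, Tuple


def assign_scaffold_groups(
    groups: Sequence[Sequence[int]],
    n_samples: int,
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
) -> Tuple[List[int], List[int], List[int]]:
    """Greedily place whole scaffold groups into train, then valid, then test."""
    train_cutoff = round(frac_train * n_samples)
    valid_cutoff = round((frac_train + frac_valid) * n_samples)
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for group in groups:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            # Anything that no longer fits in train or valid goes to test.
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold groups for ten molecules, largest group first.
groups = [[0, 2, 4, 6, 7, 8], [1, 5], [3], [9]]
train_idx, valid_idx, test_idx = assign_scaffold_groups(
    groups, n_samples=10, frac_train=0.6, frac_valid=0.2
)
print(train_idx, valid_idx, test_idx)
```

Because groups are never divided, the realized fractions only approximate the requested ones when group sizes do not line up with the cut points.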
- class SimilaritySplitter
Bases: Splitter
Class for doing data splits based on fingerprint similarity.
- k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]
Splits the dataset into k folds based on similarity.
- split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, homogenous_threshold: float = 0.7) → Tuple[List[int], List[int], List[int]]
Splits compounds into train/validation/test based on similarity. It can generate both homogeneous and heterogeneous train and test sets.
- Parameters:
dataset (Dataset) – Dataset to be split.
frac_train (float) – Fraction of dataset put into training data.
frac_valid (float) – Fraction of dataset put into validation data.
frac_test (float) – Fraction of dataset put into test data.
seed (int) – Random seed to use.
homogenous_threshold (float) – Similarity threshold. Compounds whose similarity is lower than this threshold are separated between the training set and the test set. The higher the threshold, the more heterogeneous the split.
- Returns:
A tuple of train indices, valid indices, and test indices.
- Return type:
Tuple[List[int], List[int], List[int]]
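To make the notion of fingerprint similarity concrete, here is a minimal sketch of the Tanimoto coefficient, a widely used similarity measure for binary fingerprints. Whether DeepMol uses this particular measure is an assumption, as are the `tanimoto` helper, the toy fingerprints (sets of "on" bit positions), and the convention of returning 0.0 for two empty fingerprints.

```python
from typing import Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto similarity: shared 'on' bits divided by all 'on' bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen for this sketch only
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Hypothetical fingerprints given as sets of 'on' bit positions.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
similarity = tanimoto(fp1, fp2)

# Compare against a threshold such as the default homogenous_threshold of 0.7.
below_threshold = similarity < 0.7
print(similarity, below_threshold)
```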
- class SingletaskStratifiedSplitter
Bases: Splitter
Class for doing data splits by stratification on a single task.
- k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]
Splits compounds into k folds using stratified sampling.
- Parameters:
dataset (Dataset) – Dataset to be split.
k (int) – Number of folds to split dataset into.
seed (int) – Random seed to use.
- Returns:
fold_datasets – A list of k tuples of train and test datasets as Dataset objects.
- Return type:
List[Tuple[Dataset, Dataset]]
- split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, force_split: bool = False, **kwargs) → Tuple[List[int], List[int], List[int]]
Splits compounds into train/validation/test using stratified sampling.
- Parameters:
dataset (Dataset) – Dataset to be split.
frac_train (float) – Fraction of dataset put into training data.
frac_valid (float) – Fraction of dataset put into validation data.
frac_test (float) – Fraction of dataset put into test data.
seed (int) – Random seed to use.
force_split (bool) – If True, forces the split without checking whether the labels are regression or classification labels.
- Returns:
A tuple of train indices, valid indices, and test indices.
- Return type:
Tuple[List[int], List[int], List[int]]
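Stratified sampling on a single classification task means each subset keeps roughly the same class proportions as the full dataset. The sketch below shows the idea by splitting each class separately and merging the results. The name `stratified_split_indices` and the toy labels are hypothetical, and the approach is a generic one rather than DeepMol's code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def stratified_split_indices(
    labels: Sequence[int],
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Split indices class by class so every subset keeps the class ratio."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Apply the same fractions inside each class.
        train_end = round(frac_train * len(members))
        valid_end = round((frac_train + frac_valid) * len(members))
        train += members[:train_end]
        valid += members[train_end:valid_end]
        test += members[valid_end:]
    return train, valid, test


# Toy binary labels: 80 negatives followed by 20 positives.
labels = [0] * 80 + [1] * 20
train_idx, valid_idx, test_idx = stratified_split_indices(labels, seed=0)
positives = [sum(labels[i] for i in part) for part in (train_idx, valid_idx, test_idx)]
print(positives)
```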
- class Splitter
Bases: ABC
Splitters split up datasets into pieces for training/validation/testing. In machine learning applications, it is often necessary to split a dataset into training/validation/test sets, or to k-fold split a dataset for cross-validation.
- abstract k_fold_split(dataset: Dataset, k: int, seed: int | None = None) → List[Tuple[Dataset, Dataset]]
Split a dataset into k folds for cross-validation.
- Parameters:
dataset (Dataset) – Dataset to do a k-fold split
k (int) – Number of folds to split dataset into.
seed (int, optional) – Random seed to use for reproducibility.
- Returns:
A list of k tuples (train, test), where train and test are both Dataset objects.
- Return type:
List[Tuple[Dataset, Dataset]]
- abstract split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: int | None = None, **kwargs) → Tuple[List[int], List[int], List[int]]
Return indices for specified splits.
- Parameters:
dataset (Dataset) – Dataset to be split.
frac_train (float) – The fraction of data to be used for the training split.
frac_valid (float) – The fraction of data to be used for the validation split.
frac_test (float) – The fraction of data to be used for the test split.
seed (int) – Random seed to use.
**kwargs (Dict[str, Any]) – Other arguments.
- Returns:
A tuple (train_inds, valid_inds, test_inds) of the indices for the various splits.
- Return type:
Tuple[List[int], List[int], List[int]]
- train_test_split(dataset: Dataset, frac_train: float = 0.8, seed: int | None = None, **kwargs) → Tuple[Dataset, Dataset]
Splits a dataset into train/test sets. Returns Dataset objects for train/test.
- Parameters:
dataset (Dataset) – Dataset to be split.
frac_train (float) – The fraction of data to be used for the training split.
seed (int) – Random seed to use.
**kwargs (Dict[str, Any]) – Other arguments.
- Returns:
A tuple of train and test datasets as Dataset objects.
- Return type:
Tuple[Dataset, Dataset]
- train_valid_test_split(dataset: Dataset, frac_train: float = 0.8, frac_valid: float | None = None, frac_test: float | None = None, seed: int | None = None, **kwargs) → Tuple[Dataset, Dataset, Dataset]
Splits a Dataset into train/validation/test sets. Returns Dataset objects for train, valid, test.
- Parameters:
dataset (Dataset) – Dataset to be split.
frac_train (float) – The fraction of data to be used for the training split.
frac_valid (float) – The fraction of data to be used for the validation split.
frac_test (float) – The fraction of data to be used for the test split.
seed (int) – Random seed to use.
**kwargs (Dict[str, Any]) – Other arguments.
- Returns:
A tuple of train, valid and test datasets as Dataset objects.
- Return type:
Tuple[Dataset, Dataset, Dataset]
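The division of labour in the base class, where subclasses implement an index-returning split and the base class turns those indices into subsets, can be sketched with a stand-alone abstract class. Everything here is hypothetical and simplified: `SketchSplitter` and `ShuffleSplitter` are not DeepMol classes, a plain list stands in for a Dataset, and the fractions default to fixed values rather than None.

```python
import random
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence, Tuple

Indices = Tuple[List[int], List[int], List[int]]


class SketchSplitter(ABC):
    """Hypothetical stand-in for a splitter base class."""

    @abstractmethod
    def split(
        self,
        dataset: Sequence,
        frac_train: float = 0.8,
        frac_valid: float = 0.1,
        frac_test: float = 0.1,
        seed: Optional[int] = None,
    ) -> Indices:
        """Return (train, valid, test) index lists; subclasses decide how."""

    def train_valid_test_split(
        self,
        dataset: Sequence,
        frac_train: float = 0.8,
        frac_valid: float = 0.1,
        frac_test: float = 0.1,
        seed: Optional[int] = None,
    ) -> Tuple[list, list, list]:
        # Shared logic: delegate to split(), then materialize the subsets.
        index_lists = self.split(dataset, frac_train, frac_valid, frac_test, seed)
        train, valid, test = ([dataset[i] for i in idx] for idx in index_lists)
        return train, valid, test


class ShuffleSplitter(SketchSplitter):
    """Concrete subclass: a seeded random split."""

    def split(self, dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None):
        n = len(dataset)
        indices = list(range(n))
        random.Random(seed).shuffle(indices)
        train_end = round(frac_train * n)
        valid_end = round((frac_train + frac_valid) * n)
        return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


molecules = [f"mol_{i}" for i in range(10)]  # placeholder items, not real SMILES
train, valid, test = ShuffleSplitter().train_valid_test_split(molecules, seed=0)
print(len(train), len(valid), len(test))
```

With this layout, a new splitting strategy only needs to supply the index logic, and every strategy exposes the same dataset-level methods.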