deepmol.datasets package
Submodules
deepmol.datasets.datasets module
- class Dataset[source]
Bases:
ABC
Abstract base class for datasets. Subclasses must implement every abstract property and method declared here.
- abstract property X: ndarray
Get the features in the dataset.
- Returns:
X – The features in the dataset.
- Return type:
np.ndarray
- abstract property feature_names: ndarray
Get the feature labels of the molecules in the dataset.
- Returns:
feature_names – Feature names of the molecules.
- Return type:
np.ndarray
- abstract get_shape() tuple [source]
Get the shape of molecules, features and labels in the dataset.
- Returns:
shape – The shape of molecules, features and labels.
- Return type:
tuple
- abstract property ids: ndarray
Get the ids in the dataset.
- Returns:
ids – The ids in the dataset.
- Return type:
np.ndarray
- abstract property label_names: ndarray
Get the label names of the dataset. For a single-task dataset this is a list of length 1 with the name of the label. For a multi-task dataset this is a list of length n_tasks with the names of the labels.
- Returns:
label_names – Label names of the molecules.
- Return type:
np.ndarray
- abstract property mode: str | List[str]
Get the mode of the dataset.
- Returns:
mode – The mode of the dataset.
- Return type:
Union[str, List[str]]
- abstract property mols: ndarray
Get the molecules in the dataset.
- Returns:
mols – Molecules in the dataset.
- Return type:
np.ndarray
- abstract property n_tasks: int
Get the number of tasks in the dataset.
- Returns:
n_tasks – The number of tasks in the dataset.
- Return type:
int
- abstract remove_elements(indexes: List) None [source]
Remove the elements from the dataset.
- Parameters:
indexes (List[int]) – The indexes of the elements to remove.
- abstract remove_nan(axis: int = 0) None [source]
Remove the nan values from the dataset.
- Parameters:
axis (int) – The axis to remove the nan values.
- abstract property removed_elements: ndarray
Get the molecules that were removed from the dataset.
- Returns:
mols – Removed molecules in the dataset.
- Return type:
np.ndarray
- abstract select(indexes: List[int], axis: int = 0) None [source]
Select the elements from the dataset.
- Parameters:
indexes (List[int]) – The indexes of the elements to select.
axis (int) – The axis to select the elements.
- abstract select_features_by_index(indexes: List[int]) Dataset [source]
Select the features from the dataset.
- Parameters:
indexes (List[int]) – The indexes of the features to select.
- abstract select_features_by_name(names: List[str]) None [source]
Select features with specific names from the dataset.
- Parameters:
names (List[str]) – The names of the features to select from the dataset.
- abstract select_to_split(indexes: ndarray | List[int]) Dataset [source]
Select the elements from the dataset to split.
- Parameters:
indexes (Union[np.ndarray, List[int]]) – The indexes of the elements to select.
- abstract property smiles: ndarray
Get the SMILES strings in the dataset.
- Returns:
smiles – Molecule SMILES strings in the dataset.
- Return type:
np.ndarray
- abstract property y: ndarray
Get the labels in the dataset.
- Returns:
y – The labels in the dataset.
- Return type:
np.ndarray
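The contract above (abstract properties and methods that every subclass must provide) can be illustrated with a minimal sketch. It does not use deepmol itself: MiniDataset, ListDataset and can_instantiate are hypothetical names, and only two members are reproduced to keep the example short.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class MiniDataset(ABC):
    """Hypothetical, trimmed-down stand-in for an abstract dataset interface."""

    @property
    @abstractmethod
    def ids(self) -> List[str]:
        """Identifiers of the stored elements."""

    @abstractmethod
    def get_shape(self) -> Tuple:
        """Shape information for the stored elements."""


class ListDataset(MiniDataset):
    """Concrete subclass that keeps its ids in a plain Python list."""

    def __init__(self, ids: List[str]):
        self._ids = list(ids)

    @property
    def ids(self) -> List[str]:
        return self._ids

    def get_shape(self) -> Tuple:
        return (len(self._ids),)


def can_instantiate(cls, *args) -> bool:
    """Return True if cls(*args) succeeds, False if abstract members block it."""
    try:
        cls(*args)
        return True
    except TypeError:
        return False
```

Python refuses to instantiate a class that still has unimplemented abstract members, so can_instantiate(MiniDataset) is False while can_instantiate(ListDataset, ["a", "b"]) is True.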
- class SmilesDataset(smiles: ndarray | List[str], mols: ndarray | List[Mol] | None = None, ids: List | ndarray | None = None, X: List | ndarray | None = None, feature_names: List | ndarray | None = None, y: List | ndarray | None = None, label_names: List | ndarray | None = None, mode: str | List[str] = 'auto')[source]
Bases:
Dataset
A Dataset defined by in-memory numpy arrays. This subclass of ‘Dataset’ stores arrays for smiles strings, Mol objects, features X, labels y, and molecule ids in memory as numpy arrays.
- X
Get the features of the molecules in the dataset.
- Returns:
Features of the molecules in the dataset.
- Return type:
np.ndarray
- property feature_names: ndarray
Get the feature labels of the molecules in the dataset.
- Returns:
Feature names of the molecules in the dataset.
- Return type:
np.ndarray
- classmethod from_mols(mols: ndarray | List[Mol], ids: List | ndarray | None = None, X: List | ndarray | None = None, feature_names: List | ndarray | None = None, y: List | ndarray | None = None, label_names: List | ndarray | None = None, mode: str = 'auto') SmilesDataset [source]
Initialize a dataset from RDKit Mol objects.
- Parameters:
mols (Union[np.ndarray, List[Mol]]) – RDKit Mol objects of the molecules.
ids (Union[List, np.ndarray]) – IDs of the molecules.
X (Union[List, np.ndarray]) – Features of the molecules.
feature_names (Union[List, np.ndarray]) – Names of the features.
y (Union[List, np.ndarray]) – Labels of the molecules.
label_names (Union[List, np.ndarray]) – Names of the labels. If you have a single task this will be a list of length 1 with the name of the label. If you have a multi-task dataset this will be a list of length n_tasks with the names of the labels.
mode (str) – The mode of the dataset. If ‘auto’, the mode is inferred from the labels. If ‘classification’, the dataset is treated as a classification dataset. If ‘regression’, the dataset is treated as a regression dataset. If ‘multitask’, the dataset is treated as a multitask dataset.
- Returns:
The dataset instance.
- Return type:
SmilesDataset
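How the 'auto' mode is inferred from the labels is not specified here. The sketch below shows one plausible rule and is purely an assumption: infer_task_mode and infer_mode are hypothetical helpers, not deepmol functions, and the "only 0/1 values means classification" heuristic is chosen for illustration.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def infer_task_mode(column: Sequence[Number]) -> str:
    """Assumed rule: a column holding only 0/1 values is a classification task."""
    return "classification" if set(column) <= {0, 1} else "regression"


def infer_mode(y: Sequence) -> Union[str, List[str]]:
    """Return one mode for 1-D labels, or one mode per column for 2-D labels."""
    if len(y) > 0 and isinstance(y[0], (list, tuple)):
        columns = list(zip(*y))  # transpose rows into per-task columns
        return [infer_task_mode(column) for column in columns]
    return infer_task_mode(y)
```

Under this assumed rule, a multi-task label matrix yields a list with one entry per task, which matches the documented str | List[str] type of mode.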
- get_shape() Tuple[Tuple, Tuple | None, Tuple | None] [source]
Get the shape of the dataset. Returns three tuples, giving the shape of the smiles, X and y arrays.
- Returns:
smiles_shape (Tuple) – The shape of the smiles array.
X_shape (Union[Tuple, None]) – The shape of the X array.
y_shape (Union[Tuple, None]) – The shape of the y array.
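The three-tuple return can be sketched without deepmol or numpy. In this minimal illustration, shape_of and the free function get_shape are hypothetical helpers that operate on nested Python lists and assume the lists are rectangular; None stands for an array that has not been set.

```python
from typing import Optional, Sequence, Tuple


def shape_of(array: Optional[Sequence]) -> Optional[Tuple[int, ...]]:
    """Return the shape of a rectangular nested list, or None if it is absent."""
    if array is None:
        return None
    dims = []
    current = array
    # Walk down the first element of each level to collect one length per axis.
    while isinstance(current, (list, tuple)):
        dims.append(len(current))
        if not current:
            break
        current = current[0]
    return tuple(dims)


def get_shape(smiles, X=None, y=None):
    """Mimic the documented return: (smiles_shape, X_shape, y_shape)."""
    return shape_of(smiles), shape_of(X), shape_of(y)
```

With two SMILES strings, a 2 x 3 feature matrix and no labels, get_shape returns a shape for smiles and X and None for y.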
- property ids: ndarray
Get the IDs of the molecules in the dataset.
- Returns:
IDs of the molecules in the dataset.
- Return type:
np.ndarray
- property label_names: ndarray
Get the label names of the molecules in the dataset. If you have a single task this will be a list of length 1 with the name of the label. If you have a multi-task dataset this will be a list of length n_tasks with the names of the labels.
- Returns:
Label names in the dataset.
- Return type:
np.ndarray
- load_features(*args, inplace=False, **kwargs)
Method to make inplace.
- Parameters:
self (object) – Object to apply the method to.
args (list) – Arguments to pass to the method.
inplace (bool) – Whether to apply the method inplace.
kwargs (dict) – Keyword arguments to pass to the method.
- Returns:
result – Result of the method.
- Return type:
object
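Several methods in this class share the (*args, inplace=False, **kwargs) signature, which suggests they are wrapped by a common decorator. A minimal sketch of that pattern follows. The names make_inplace_optional and Records are hypothetical, and the copy-then-return behaviour for inplace=False is an assumption about how such a wrapper typically works, not a statement about deepmol's implementation.

```python
import copy
import functools


def make_inplace_optional(method):
    """Hypothetical wrapper adding an ``inplace`` keyword to a mutating method."""

    @functools.wraps(method)
    def wrapper(self, *args, inplace=False, **kwargs):
        if inplace:
            method(self, *args, **kwargs)  # mutate the caller's object
            return None
        clone = copy.deepcopy(self)  # leave the original untouched
        method(clone, *args, **kwargs)
        return clone

    return wrapper


class Records:
    """Toy container used only to exercise the wrapper."""

    def __init__(self, items):
        self.items = list(items)

    @make_inplace_optional
    def remove_elements(self, to_remove):
        self.items = [item for item in self.items if item not in to_remove]
```

In this sketch, records.remove_elements(["b"]) returns a modified copy and leaves records unchanged, while records.remove_elements(["b"], inplace=True) modifies records and returns None.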
- merge(datasets: List[Dataset]) SmilesDataset [source]
Merge the provided datasets with this dataset.
- Parameters:
datasets (List[Dataset]) – List of datasets to merge.
- Returns:
The merged dataset.
- Return type:
SmilesDataset
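Merging can be pictured as row-wise concatenation of the stored arrays. The sketch below assumes exactly that and nothing more: TinyDataset is a hypothetical stand-in with only smiles and y fields, and returning a new object rather than modifying self is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TinyDataset:
    """Hypothetical stand-in holding parallel lists of SMILES and labels."""

    smiles: List[str]
    y: List[float] = field(default_factory=list)

    def merge(self, datasets: List["TinyDataset"]) -> "TinyDataset":
        """Return a new dataset with self's rows followed by each dataset's rows."""
        smiles, y = list(self.smiles), list(self.y)
        for other in datasets:
            smiles.extend(other.smiles)
            y.extend(other.y)
        return TinyDataset(smiles, y)
```

The rows of self come first, followed by the rows of each dataset in the order given.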
- property mode: str | List[str]
Get the mode of the dataset.
- Returns:
mode – The mode of the dataset.
- Return type:
Union[str, List[str]]
- property mols: ndarray
Get the RDKit Mol objects of the molecules in the dataset.
- Returns:
RDKit Mol objects of the molecules in the dataset.
- Return type:
np.ndarray
- property n_tasks: int
Get the number of tasks in the dataset.
- Returns:
n_tasks – The number of tasks in the dataset.
- Return type:
int
- remove_duplicates(*args, inplace=False, **kwargs)
Method to make inplace.
- Parameters:
self (object) – Object to apply the method to.
args (list) – Arguments to pass to the method.
inplace (bool) – Whether to apply the method inplace.
kwargs (dict) – Keyword arguments to pass to the method.
- Returns:
result – Result of the method.
- Return type:
object
- remove_elements(*args, inplace=False, **kwargs)
Method to make inplace.
- Parameters:
self (object) – Object to apply the method to.
args (list) – Arguments to pass to the method.
inplace (bool) – Whether to apply the method inplace.
kwargs (dict) – Keyword arguments to pass to the method.
- Returns:
result – Result of the method.
- Return type:
object
- remove_elements_by_index(*args, inplace=False, **kwargs)
Method to make inplace.
- Parameters:
self (object) – Object to apply the method to.
args (list) – Arguments to pass to the method.
inplace (bool) – Whether to apply the method inplace.
kwargs (dict) – Keyword arguments to pass to the method.
- Returns:
result – Result of the method.
- Return type:
object
- remove_nan(*args, inplace=False, **kwargs)
Method to make inplace.
- Parameters:
self (object) – Object to apply the method to.
args (list) – Arguments to pass to the method.
inplace (bool) – Whether to apply the method inplace.
kwargs (dict) – Keyword arguments to pass to the method.
- Returns:
result – Result of the method.
- Return type:
object
- property removed_elements: ndarray
Get the molecules that were removed from the dataset.
- Returns:
mols – Removed molecules in the dataset.
- Return type:
np.ndarray
- save_features(path: str = 'features.csv') None [source]
Save the features to a CSV file.
- Parameters:
path (str) – Path to save the CSV file.
- select(*args, inplace=False, **kwargs)
Method to make inplace.
- Parameters:
self (object) – Object to apply the method to.
args (list) – Arguments to pass to the method.
inplace (bool) – Whether to apply the method inplace.
kwargs (dict) – Keyword arguments to pass to the method.
- Returns:
result – Result of the method.
- Return type:
object
- select_features_by_index(*args, inplace=False, **kwargs)
Method to make inplace.
- Parameters:
self (object) – Object to apply the method to.
args (list) – Arguments to pass to the method.
inplace (bool) – Whether to apply the method inplace.
kwargs (dict) – Keyword arguments to pass to the method.
- Returns:
result – Result of the method.
- Return type:
object
- select_features_by_name(*args, inplace=False, **kwargs)
Method to make inplace.
- Parameters:
self (object) – Object to apply the method to.
args (list) – Arguments to pass to the method.
inplace (bool) – Whether to apply the method inplace.
kwargs (dict) – Keyword arguments to pass to the method.
- Returns:
result – Result of the method.
- Return type:
object
- select_to_split(indexes: ndarray | List[int]) SmilesDataset [source]
Select the elements with the given indexes to split the dataset.
- Parameters:
indexes (Union[np.ndarray, List[int]]) – The indexes of the elements to select for the split.
- Returns:
The dataset with the selected elements.
- Return type:
SmilesDataset
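Index-based selection for a split can be sketched as follows. TinySplitDataset is a hypothetical stand-in, not the deepmol class; the sketch assumes the selection returns a new dataset and leaves the original intact, which is consistent with the documented return value but not confirmed by it.

```python
from typing import List, Optional, Sequence


class TinySplitDataset:
    """Hypothetical stand-in with parallel SMILES and optional labels."""

    def __init__(self, smiles: Sequence[str], y: Optional[Sequence] = None):
        self.smiles: List[str] = list(smiles)
        self.y: Optional[List] = None if y is None else list(y)

    def select_to_split(self, indexes: Sequence[int]) -> "TinySplitDataset":
        """Return a new dataset containing only the rows at ``indexes``."""
        smiles = [self.smiles[i] for i in indexes]
        y = None if self.y is None else [self.y[i] for i in indexes]
        return TinySplitDataset(smiles, y)
```

Calling the method once per index set (for example, train indexes and test indexes) yields one dataset per partition.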
- property smiles: ndarray
Get the SMILES strings of the molecules in the dataset.
- Returns:
SMILES strings of the molecules in the dataset.
- Return type:
np.ndarray
- to_csv(path: str, **kwargs) None [source]
Save the dataset to a CSV file.
- Parameters:
path (str) – Path to save the CSV file.
- to_sdf(path: str) None [source]
Save the dataset to an SDF file.
- Parameters:
path (str) – Path to save the SDF file.
- property y: ndarray
Get the labels of the molecules in the dataset.
- Returns:
Labels of the molecules in the dataset.
- Return type:
np.ndarray