deepmol.datasets package

Submodules

deepmol.datasets.datasets module

class Dataset

Bases: ABC

Abstract base class for datasets. Subclasses must implement the abstract properties and methods declared here.
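The abstract-base-class pattern described above can be sketched with the standard abc module. This is a minimal illustration, not DeepMol's code: ToyDataset and ListDataset are hypothetical names, and only two of the abstract members are mirrored.

```python
from abc import ABC, abstractmethod
from typing import List


class ToyDataset(ABC):
    """Hypothetical stand-in for an abstract dataset interface."""

    @property
    @abstractmethod
    def ids(self) -> List[str]:
        """Identifiers of the stored elements."""

    @abstractmethod
    def remove_elements(self, indexes: List[int]) -> None:
        """Drop the elements at the given positions."""


class ListDataset(ToyDataset):
    """Concrete subclass that keeps its ids in a plain list."""

    def __init__(self, ids: List[str]) -> None:
        self._ids = list(ids)

    @property
    def ids(self) -> List[str]:
        return self._ids

    def remove_elements(self, indexes: List[int]) -> None:
        drop = set(indexes)
        self._ids = [v for i, v in enumerate(self._ids) if i not in drop]


try:
    ToyDataset()  # abstract members are not implemented, so this raises
    abstract_instantiable = True
except TypeError:
    abstract_instantiable = False

toy = ListDataset(["a", "b", "c"])
toy.remove_elements([1])
remaining_ids = toy.ids
```

A subclass that leaves any abstract member unimplemented fails at instantiation time in the same way, which is how the interface is enforced.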

abstract property X: ndarray

Get the features in the dataset.

Returns

X – The features in the dataset.

Return type

np.ndarray

abstract property feature_names: ndarray

Get the feature labels of the molecules in the dataset.

Returns

feature_names – Feature names of the molecules.

Return type

np.ndarray

abstract get_shape() → tuple

Get the shape of molecules, features and labels in the dataset.

Returns

shape – The shape of molecules, features and labels.

Return type

tuple

abstract property ids: ndarray

Get the ids in the dataset.

Returns

ids – The ids in the dataset.

Return type

np.ndarray

abstract property label_names: ndarray

Get the label names of the molecules in the dataset.

Returns

label_names – Label names of the molecules.

Return type

np.ndarray

abstract property mode: str

Get the mode of the dataset.

Returns

mode – The mode of the dataset.

Return type

str

abstract property mols: ndarray

Get the molecules in the dataset.

Returns

mols – Molecules in the dataset.

Return type

np.ndarray

abstract property n_tasks: int

Get the number of tasks in the dataset.

Returns

n_tasks – The number of tasks in the dataset.

Return type

int

abstract remove_elements(indexes: List) → None

Remove the elements from the dataset.

Parameters

indexes (List[int]) – The indexes of the elements to remove.

abstract remove_nan(axis: int = 0) → None

Remove NaN values from the dataset.

Parameters

axis (int) – The axis along which to remove NaN values.

abstract select(indexes: List[int], axis: int = 0) → None

Select the elements from the dataset.

Parameters
  • indexes (List[int]) – The indexes of the elements to select.

  • axis (int) – The axis along which to select the elements.

abstract select_features_by_index(indexes: List[int]) → None

Select the features from the dataset.

Parameters

indexes (List[int]) – The indexes of the features to select.

abstract select_features_by_name(names: List[str]) → None

Select features with specific names from the dataset.

Parameters

names (List[str]) – The names of the features to select from the dataset.

abstract select_to_split(indexes: Union[ndarray, List[int]]) → Dataset

Select the elements from the dataset to split.

Parameters

indexes (Union[np.ndarray, List[int]]) – The indexes of the elements to select.

abstract property smiles: ndarray

Get the SMILES strings in the dataset.

Returns

smiles – Molecule SMILES in the dataset.

Return type

np.ndarray

abstract property y: ndarray

Get the labels in the dataset.

Returns

y – The labels in the dataset.

Return type

np.ndarray

class SmilesDataset(smiles: Union[ndarray, List[str]], mols: Optional[Union[ndarray, List[Mol]]] = None, ids: Optional[Union[List, ndarray]] = None, X: Optional[Union[List, ndarray]] = None, feature_names: Optional[Union[List, ndarray]] = None, y: Optional[Union[List, ndarray]] = None, label_names: Optional[Union[List, ndarray]] = None, mode: str = 'auto')

Bases: Dataset

A Dataset defined by in-memory numpy arrays. This subclass of ‘Dataset’ stores arrays for smiles strings, Mol objects, features X, labels y, and molecule ids in memory as numpy arrays.

property X: ndarray

Get the features of the molecules in the dataset.

Returns

Features of the molecules in the dataset.

Return type

np.ndarray

property feature_names: ndarray

Get the feature labels of the molecules in the dataset.

Returns

Feature names of the molecules in the dataset.

Return type

np.ndarray

classmethod from_mols(mols: Union[ndarray, List[Mol]], ids: Optional[Union[List, ndarray]] = None, X: Optional[Union[List, ndarray]] = None, feature_names: Optional[Union[List, ndarray]] = None, y: Optional[Union[List, ndarray]] = None, label_names: Optional[Union[List, ndarray]] = None, mode: str = 'auto') → SmilesDataset

Initialize a dataset from RDKit Mol objects.

Parameters
  • mols (Union[np.ndarray, List[Mol]]) – RDKit Mol objects of the molecules.

  • ids (Union[List, np.ndarray]) – IDs of the molecules.

  • X (Union[List, np.ndarray]) – Features of the molecules.

  • feature_names (Union[List, np.ndarray]) – Names of the features.

  • y (Union[List, np.ndarray]) – Labels of the molecules.

  • label_names (Union[List, np.ndarray]) – Names of the labels.

  • mode (str) – The mode of the dataset. If ‘auto’, the mode is inferred from the labels. If ‘classification’, the dataset is treated as a classification dataset. If ‘regression’, the dataset is treated as a regression dataset. If ‘multitask’, the dataset is treated as a multitask dataset.

Returns

The dataset instance.

Return type

SmilesDataset

get_shape() → Tuple[Tuple, Optional[Tuple], Optional[Tuple]]

Get the shape of the dataset. Returns three tuples, giving the shape of the smiles, X and y arrays.

Returns

  • smiles_shape (Tuple) – The shape of the smiles array.

  • X_shape (Union[Tuple, None]) – The shape of the X array.

  • y_shape (Union[Tuple, None]) – The shape of the y array.
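The three-part return value can be illustrated with plain Python lists standing in for the NumPy arrays. This is a rough sketch rather than DeepMol's implementation: shape_of and the module-level get_shape are hypothetical helpers, and returning None for an array that was never set is an assumption based on the Optional entries in the signature.

```python
from typing import Optional, Sequence, Tuple

Shape = Tuple[int, ...]


def shape_of(values: Optional[Sequence]) -> Optional[Shape]:
    """Shape of a list or list of lists, or None when nothing is stored."""
    if values is None:
        return None
    if len(values) > 0 and isinstance(values[0], (list, tuple)):
        return (len(values), len(values[0]))
    return (len(values),)


def get_shape(
    smiles: Sequence, X: Optional[Sequence] = None, y: Optional[Sequence] = None
) -> Tuple[Optional[Shape], Optional[Shape], Optional[Shape]]:
    """Return the shapes of the smiles, X and y containers, in that order."""
    return shape_of(smiles), shape_of(X), shape_of(y)


smiles = ["CCO", "CCN", "c1ccccc1"]
X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Three molecules, two features each, and no labels stored.
unlabeled_shape = get_shape(smiles, X)
```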

property ids: ndarray

Get the IDs of the molecules in the dataset.

Returns

IDs of the molecules in the dataset.

Return type

np.ndarray

property label_names: ndarray

Get the label names of the molecules in the dataset.

Returns

Label names of the molecules in the dataset.

Return type

np.ndarray

load_features(path: str, **kwargs) → None

Load features from a csv file.

Parameters
  • path (str) – Path to the csv file.

  • kwargs – Keyword arguments to pass to pandas.read_csv.

merge(datasets: List[Dataset]) → SmilesDataset

Merge the provided datasets with this dataset.

Parameters

datasets (List[Dataset]) – List of datasets to merge.

Returns

A merged SmilesDataset.

Return type

SmilesDataset

property mode: str

Get the mode of the dataset.

Returns

The mode of the dataset.

Return type

str

property mols: ndarray

Get the RDKit Mol objects of the molecules in the dataset.

Returns

RDKit Mol objects of the molecules in the dataset.

Return type

np.ndarray

property n_tasks: int

Get the number of tasks in the dataset.

Returns

n_tasks – The number of tasks in the dataset.

Return type

int

remove_duplicates() → None

Remove molecules with duplicated features from the dataset.
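Deduplicating on feature vectors can be sketched as follows, using plain lists in place of the NumPy arrays. The helper name drop_duplicate_features is hypothetical, and keeping the first molecule seen for each feature vector is an assumption; the documentation does not say which duplicate survives.

```python
from typing import Dict, List, Tuple


def drop_duplicate_features(
    smiles: List[str], X: List[List[float]]
) -> Tuple[List[str], List[List[float]]]:
    """Keep the first molecule seen for each distinct feature vector."""
    first_seen: Dict[Tuple[float, ...], int] = {}
    for i, row in enumerate(X):
        # setdefault only records the index the first time a row appears.
        first_seen.setdefault(tuple(row), i)
    keep = sorted(first_seen.values())
    return [smiles[i] for i in keep], [X[i] for i in keep]


smiles = ["CCO", "OCC", "CCN"]
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

unique_smiles, unique_X = drop_duplicate_features(smiles, X)
```

Here "CCO" and "OCC" share a feature vector, so only the first of the two is kept.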

remove_elements(ids: List[str]) → None

Remove elements with specific IDs from the dataset.

Parameters

ids (List[str]) – IDs of the elements to remove.

remove_elements_by_index(indexes: List[int]) → None

Remove elements with specific indexes from the dataset.

Parameters

indexes (List[int]) – Indexes of the elements to remove.

remove_nan(axis: int = 0) → None

Remove NaN values from the features. With axis = 0, remove the samples that have at least one NaN in their features. With axis = 1, remove the samples whose features are all NaN and the features that have at least one NaN.

Parameters

axis (int) – The axis to remove the NaNs from.
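The two axis modes can be made concrete with a small sketch on a list-of-lists feature matrix. It is not DeepMol's implementation: drop_nan is a hypothetical helper, and for axis = 1 it assumes the all-NaN samples are dropped before the NaN-containing features (otherwise a single all-NaN sample would remove every feature).

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def drop_nan(ids: List[str], X: Matrix, axis: int = 0) -> Tuple[List[str], Matrix]:
    """Illustrative NaN filter over samples (rows) and features (columns)."""
    if axis == 0:
        # Keep only the samples whose feature row contains no NaN.
        keep = [i for i, row in enumerate(X) if not any(math.isnan(v) for v in row)]
        return [ids[i] for i in keep], [X[i] for i in keep]
    if axis == 1:
        # Step 1: drop the samples whose features are all NaN.
        keep = [i for i, row in enumerate(X) if not all(math.isnan(v) for v in row)]
        ids, X = [ids[i] for i in keep], [X[i] for i in keep]
        # Step 2: drop every feature column that still contains a NaN.
        n_cols = len(X[0]) if X else 0
        cols = [j for j in range(n_cols) if not any(math.isnan(row[j]) for row in X)]
        return ids, [[row[j] for j in cols] for row in X]
    raise ValueError("axis must be 0 or 1")


nan = float("nan")
ids = ["m1", "m2", "m3"]
X = [
    [1.0, 2.0, 3.0],
    [nan, nan, nan],
    [4.0, nan, 6.0],
]

rows_ids, rows_X = drop_nan(ids, X, axis=0)
cols_ids, cols_X = drop_nan(ids, X, axis=1)
```

With axis = 0 only m1 survives. With axis = 1, m2 is dropped as an all-NaN sample and the middle feature is dropped because m3 still has a NaN there, so more samples are retained at the cost of one feature.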

save_features(path: str = 'features.csv') → None

Save the features to a csv file.

Parameters

path (str) – Path to save the csv file.

select(ids: Union[List[str], List[int]], axis: int = 0) → None

Creates a new sub-dataset of self from a selection of IDs or indexes.

Parameters
  • ids (Union[List[str], List[int]]) – List of IDs or indexes to select: IDs of the compounds when axis = 0, indexes of the columns when axis = 1.

  • axis (int) – Axis to select along. 0 selects along the first axis, 1 selects along the second axis.
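The difference between the two axes can be sketched with plain lists. The helper select_subset is hypothetical and returns new lists for clarity, whereas the documented method returns None; whether DeepMol preserves the dataset order or the order of the requested IDs is not stated, so keeping dataset order for axis = 0 is an assumption.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def select_subset(
    ids: List[str],
    X: Matrix,
    feature_names: List[str],
    keys: Sequence,
    axis: int = 0,
) -> Tuple[List[str], Matrix, List[str]]:
    """Select rows by ID (axis=0) or feature columns by index (axis=1)."""
    if axis == 0:
        wanted = set(keys)
        keep = [i for i, mol_id in enumerate(ids) if mol_id in wanted]
        return [ids[i] for i in keep], [X[i] for i in keep], feature_names
    if axis == 1:
        columns = list(keys)
        return (
            ids,
            [[row[j] for j in columns] for row in X],
            [feature_names[j] for j in columns],
        )
    raise ValueError("axis must be 0 or 1")


ids = ["a", "b", "c"]
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
names = ["f0", "f1", "f2"]

row_ids, row_X, _ = select_subset(ids, X, names, ["a", "c"], axis=0)
_, col_X, col_names = select_subset(ids, X, names, [0, 2], axis=1)
```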

select_features_by_index(indexes: List[int]) → None

Select features with specific indexes from the dataset.

Parameters

indexes (List[int]) – The indexes of the features to select from the dataset.

select_features_by_name(names: List[str]) → None

Select features with specific names from the dataset.

Parameters

names (List[str]) – The names of the features to select from the dataset.

select_to_split(indexes: Union[ndarray, List[int]]) → SmilesDataset

Select elements with specific indexes to split the dataset.

Parameters

indexes (Union[np.ndarray, List[int]]) – The indexes of the elements to split the dataset.

Returns

The dataset with the selected elements.

Return type

SmilesDataset
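Because this method returns a dataset instead of modifying the current one, it lends itself to building train and held-out subsets from index lists. The sketch below mimics that behaviour with a hypothetical ToySplitDataset dataclass holding plain lists; it is an illustration of the idea, not DeepMol's class.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ToySplitDataset:
    """Hypothetical in-memory dataset with SMILES strings and labels."""

    smiles: List[str]
    y: List[int]

    def select_to_split(self, indexes: Sequence[int]) -> "ToySplitDataset":
        # Build a new object; the original dataset is left untouched.
        return ToySplitDataset(
            smiles=[self.smiles[i] for i in indexes],
            y=[self.y[i] for i in indexes],
        )


full = ToySplitDataset(["CCO", "CCN", "CCC", "c1ccccc1"], [1, 0, 0, 1])

train = full.select_to_split([0, 2, 3])
held_out = full.select_to_split([1])
```

Since the original object keeps all four molecules, the same dataset can be reused for several different splits.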

property smiles: ndarray

Get the SMILES strings of the molecules in the dataset.

Returns

SMILES strings of the molecules in the dataset.

Return type

np.ndarray

to_csv(path: str) → None

Save the dataset to a csv file.

Parameters

path (str) – Path to save the csv file.

property y: ndarray

Get the labels of the molecules in the dataset.

Returns

Labels of the molecules in the dataset.

Return type

np.ndarray

Module contents