deepmol.compound_featurization package

Submodules

deepmol.compound_featurization.base_featurizer module

class MolecularFeaturizer(n_jobs: int = -1)[source]

Bases: ABC

Abstract class for calculating a set of features for a molecule. A MolecularFeaturizer uses SMILES strings or RDKit molecule objects to represent molecules.

Subclasses need to implement the _featurize method for calculating features for a single molecule.

featurize(dataset: Dataset, scaler: Optional[BaseScaler] = None, path_to_save_scaler: Optional[str] = None, remove_nans_axis: int = 0) Dataset[source]

Calculate features for molecules.

Parameters
  • dataset (Dataset) – The dataset containing the molecules to featurize in dataset.mols.

  • scaler (BaseScaler) – The scaler to use for scaling the generated features.

  • path_to_save_scaler (str) – The path to save the scaler to.

  • remove_nans_axis (int) – The axis to remove NaNs from. If None, no NaNs are removed.

Returns

dataset – The input Dataset containing a featurized representation of the molecules in Dataset.X.

Return type

Dataset
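The division of labour described above (a public featurize method that loops over molecules, and a per-molecule _featurize hook supplied by subclasses) can be sketched in plain Python. This is a minimal illustration, not DeepMol code: ToyFeaturizer and AtomCountFeaturizer are hypothetical names, molecules are plain SMILES strings instead of a Dataset, and dropping rows that contain NaN is only an assumed reading of remove_nans_axis=0.

```python
import math
from abc import ABC, abstractmethod
from typing import List, Sequence


class ToyFeaturizer(ABC):
    """Hypothetical stand-in for the base-class pattern (not DeepMol code)."""

    def featurize(self, smiles_list: Sequence[str]) -> List[List[float]]:
        # Apply the per-molecule hook to every molecule in turn.
        rows = [self._featurize(smiles) for smiles in smiles_list]
        # Assumption for illustration: drop rows (molecules) that contain NaN.
        return [row for row in rows if not any(math.isnan(v) for v in row)]

    @abstractmethod
    def _featurize(self, smiles: str) -> List[float]:
        """Compute the features of a single molecule."""


class AtomCountFeaturizer(ToyFeaturizer):
    """Naively counts the characters C, N and O in a SMILES string."""

    def _featurize(self, smiles: str) -> List[float]:
        if not smiles:
            # Signal a failed featurization with NaNs.
            return [math.nan] * 3
        upper = smiles.upper()
        return [float(upper.count(symbol)) for symbol in "CNO"]


features = AtomCountFeaturizer().featurize(["CCO", "", "c1ccncc1"])
print(features)
```

The character counting is deliberately crude (for example, it would count the C in "Cl" as a carbon); it only serves to show where the per-molecule logic lives.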

deepmol.compound_featurization.deepchem_featurizers module

class ConvMolFeat(master_atom: bool = False, use_chirality: bool = False, atom_properties: Optional[List[str]] = None, per_atom_fragmentation: bool = False, **kwargs)[source]

Bases: MolecularFeaturizer

Duvenaud graph convolution, adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#convmolfeaturizer). Each atom in a molecule is described by a vector of local descriptors, and the featurizer computes that vector.

References: Duvenaud, David K., et al. “Convolutional networks on graphs for learning molecular fingerprints.” Advances in neural information processing systems. 2015.

class CoulombEigFeat(max_atoms: int, remove_hydrogens: bool = False, randomize: bool = False, n_samples: int = 1, max_conformers: int = 1, seed: Optional[int] = None, **kwargs)[source]

Bases: MolecularFeaturizer

Calculate the eigenvalues of Coulomb matrices for molecules. Adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#coulombmatrixeig).

References: Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in neural information processing systems. 2012.

class CoulombFeat(max_atoms: int, remove_hydrogens: bool = False, randomize: bool = False, upper_tri: bool = False, n_samples: int = 1, max_conformers: int = 1, seed: Optional[int] = None, **kwargs)[source]

Bases: MolecularFeaturizer

Calculate Coulomb matrices for molecules. Adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#coulombmatrix).

References: Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in neural information processing systems. 2012.
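As a rough illustration of what a Coulomb matrix contains, the sketch below uses the usual definition: diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|, with nuclear charges Z and atomic positions R. It is a pure-Python toy, not the deepchem or DeepMol implementation. The function name coulomb_matrix is hypothetical, zero-padding up to max_atoms is an assumption about how that parameter is used, and options such as remove_hydrogens, randomize and upper_tri are not modelled.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def coulomb_matrix(charges: Sequence[int], coords: Sequence[Point],
                   max_atoms: int) -> List[List[float]]:
    """Build a zero-padded Coulomb matrix of shape (max_atoms, max_atoms)."""
    n_atoms = len(charges)
    if n_atoms > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    matrix = [[0.0] * max_atoms for _ in range(max_atoms)]
    for i in range(n_atoms):
        for j in range(n_atoms):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Z_i * Z_j / distance between atoms i and j
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Two hydrogen-like atoms (Z = 1) placed 2.0 length units apart, padded to 3 atoms.
matrix = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)], max_atoms=3)
for row in matrix:
    print(row)
```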

class MolGanFeat(max_atom_count: int = 9, kekulize: bool = True, bond_labels: Optional[List[Any]] = None, atom_labels: Optional[List[int]] = None, **kwargs)[source]

Bases: MolecularFeaturizer

Featurizer for the MolGAN de novo molecular generation model, adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html?highlight=CGCNN#molganfeaturizer). It is a wrapper for two matrices containing atom and bond type information.

References: Nicola De Cao et al. “MolGAN: An implicit generative model for small molecular graphs” (2018), https://arxiv.org/abs/1805.11973

class MolGraphConvFeat(use_edges: bool = False, use_chirality: bool = False, use_partial_charge: bool = False, **kwargs)[source]

Bases: MolecularFeaturizer

Featurizer of general graph convolution networks for molecules. Adapted from deepchem: (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#molgraphconvfeaturizer)

References: Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of computer-aided molecular design 30.8 (2016):595-608.

class SmileImageFeat(img_size: int = 80, res: float = 0.5, max_len: int = 250, img_spec: str = 'std', **kwargs)[source]

Bases: MolecularFeaturizer

Converts a SMILES string to an image. Adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#smilestoimage).

References: Goh, Garrett B., et al. “Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

class SmilesSeqFeat(char_to_idx: Optional[Dict[str, int]] = None, max_len: int = 250, pad_len: int = 10)[source]

Bases: object

Takes SMILES strings and turns them into sequences. Adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#smilestoseq).

References: Goh, Garrett B., et al. “Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

featurize(dataset: Dataset) Dataset[source]

Featurizes the molecules in a dataset.

Parameters

dataset (Dataset) – Dataset to featurize.

Returns

dataset – Featurized dataset.

Return type

Dataset
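The idea of turning a SMILES string into an integer sequence with a character-to-index mapping can be sketched as follows. This is a simplified toy under stated assumptions, not the deepchem or DeepMol implementation: build_vocab and encode are hypothetical helpers, the "<pad>" and "<unk>" tokens are made up for the example, every character is treated as one token, and padding is applied only on the right up to max_len (the pad_len parameter is not modelled).

```python
from typing import Dict, List, Sequence

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocab(smiles_list: Sequence[str]) -> Dict[str, int]:
    """Assign an integer index to every character seen in the SMILES strings."""
    vocab = {PAD: 0, UNK: 1}
    for char in sorted({ch for smiles in smiles_list for ch in smiles}):
        vocab[char] = len(vocab)
    return vocab


def encode(smiles: str, char_to_idx: Dict[str, int], max_len: int) -> List[int]:
    """Map characters to indices and right-pad the sequence to max_len."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    indices = [char_to_idx.get(ch, char_to_idx[UNK]) for ch in smiles]
    return indices + [char_to_idx[PAD]] * (max_len - len(indices))


vocab = build_vocab(["CCO", "C=O"])
encoded = encode("CCO", vocab, max_len=5)
unknown = encode("CN", vocab, max_len=5)  # "N" is not in the vocabulary
print(vocab, encoded, unknown)
```

Character-level tokenization splits two-letter atoms such as "Cl" into two tokens, so a real featurizer would need a more careful tokenizer.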

class WeaveFeat(graph_distance: bool = True, explicit_h: bool = False, use_chirality: bool = False, max_pair_distance: Optional[int] = None, **kwargs)[source]

Bases: MolecularFeaturizer

Weave convolution featurization, adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#weavefeaturizer). Weave convolutions require a matrix of interaction descriptors for each pair of atoms, so its size is quadratic in the number of atoms.

References: Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of computer-aided molecular design 30.8 (2016): 595-608.

deepmol.compound_featurization.mixed_descriptors module

class MixedFeaturizer(featurizers: Iterable[MolecularFeaturizer], **kwargs)[source]

Bases: MolecularFeaturizer

Class to apply multiple featurizers to the same molecules. The features produced by the different featurizers are concatenated.
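The concatenation step can be pictured with a small pure-Python sketch. It is not the MixedFeaturizer implementation: the toy featurizers below are hypothetical plain functions on SMILES strings, whereas the real class takes MolecularFeaturizer instances.

```python
from typing import Callable, List, Sequence

# In this sketch a "featurizer" is any function from a SMILES string to a list of floats.
FeatureFn = Callable[[str], List[float]]


def length_feature(smiles: str) -> List[float]:
    """One feature: the length of the SMILES string."""
    return [float(len(smiles))]


def bond_symbol_counts(smiles: str) -> List[float]:
    """Two features: the number of '=' and '#' characters."""
    return [float(smiles.count("=")), float(smiles.count("#"))]


def mixed_featurize(smiles_list: Sequence[str],
                    featurizers: Sequence[FeatureFn]) -> List[List[float]]:
    """Concatenate, per molecule, the outputs of all featurizers in order."""
    rows = []
    for smiles in smiles_list:
        row: List[float] = []
        for featurizer in featurizers:
            row.extend(featurizer(smiles))
        rows.append(row)
    return rows


features = mixed_featurize(["CC=O", "C#N"], [length_feature, bond_symbol_counts])
print(features)
```

Each output row has as many columns as the featurizers produce together, and the column order follows the order of the featurizers.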

deepmol.compound_featurization.mol2vec module

class Mol2Vec(pretrain_model_path: Optional[str] = None, radius: int = 1, unseen: str = 'UNK', gather_method: str = 'sum')[source]

Bases: MolecularFeaturizer

Mol2Vec fingerprint implementation from https://doi.org/10.1021/acs.jcim.7b00616

Inspired by natural language processing techniques, Mol2vec is an unsupervised machine learning approach that learns vector representations of molecular substructures. These vectors point in similar directions for chemically related substructures. Compounds can then be encoded as vectors by summing the vectors of their individual substructures and, for instance, be fed into supervised machine learning approaches to predict compound properties.

sentences2vec(sentences: Iterable, model: Word2Vec, unseen: Optional[str] = None)[source]

Generate vectors for each sentence (list) in a list of sentences. Vector is simply a sum of vectors for individual words.

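Parameters
  • sentences (Iterable) – The sentences to vectorize, each given as a list of words.

  • model (Word2Vec) – The Word2Vec model that provides the word vectors.

  • unseen (Optional[str]) – The keyword used for words not seen by the model.

Returns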

Array of vectors for each sentence.

Return type

np.array
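The summation described above can be sketched without any dependencies. In this toy version a plain dictionary stands in for the Word2Vec model, and sentences_to_vectors is a hypothetical name. The handling of unknown words (use the vector of the unseen keyword if one is given, otherwise skip the word) is an assumption made for the example.

```python
from typing import Dict, Iterable, List, Optional, Sequence


def sentences_to_vectors(sentences: Iterable[Sequence[str]],
                         vectors: Dict[str, List[float]],
                         unseen: Optional[str] = None) -> List[List[float]]:
    """Sum the word vectors of each sentence (toy stand-in for a Word2Vec lookup)."""
    dim = len(next(iter(vectors.values())))
    result = []
    for sentence in sentences:
        total = [0.0] * dim
        for word in sentence:
            if word in vectors:
                vec = vectors[word]
            elif unseen is not None:
                # Assumed behaviour: unknown words map to the 'unseen' vector.
                vec = vectors[unseen]
            else:
                # Assumed behaviour: without an 'unseen' keyword, skip the word.
                continue
            total = [t + v for t, v in zip(total, vec)]
        result.append(total)
    return result


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 2.0], "UNK": [0.5, 0.5]}
with_unseen = sentences_to_vectors([["a", "b"], ["a", "x"]], toy_vectors, unseen="UNK")
without_unseen = sentences_to_vectors([["a", "b"], ["a", "x"]], toy_vectors)
print(with_unseen, without_unseen)
```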

deepmol.compound_featurization.rdkit_descriptors module

class All3DDescriptors(mandatory_generation_of_conformers=True)[source]

Bases: MolecularFeaturizer

Class to generate all three-dimensional descriptors.

class Asphericity(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

Calculate molecular asphericity. A. Baumgaertner, “Shapes of flexible vesicles”, J. Chem. Phys. 98:7496 (1993). https://doi.org/10.1063/1.464689

class AutoCorr3D(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

AutoCorr3D. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37

class Eccentricity(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

Calculate molecular eccentricity. G. A. Arteca, “Molecular Shape Descriptors”, Reviews in Computational Chemistry, vol. 9. https://doi.org/10.1002/9780470125861.ch5

class InertialShapeFactor(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

Calculate the inertial shape factor. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37

class MORSE(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

Molecule Representation of Structures based on Electron diffraction descriptors. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37

class NormalizedPrincipalMomentsRatios(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

Normalized principal moments ratios. Sauer and Schwarz JCIM 43:987-1003 (2003)

class PlaneOfBestFit(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

Plane of best fit. Nicholas C. Firth, Nathan Brown, and Julian Blagg, JCIM 52:2516-25.

class PrincipalMomentsOfInertia(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

Calculate Principal Moments of Inertia

class RadialDistributionFunction(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

Radial distribution function. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37

class RadiusOfGyration(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

Calculate the radius of gyration. G. A. Arteca, “Molecular Shape Descriptors”, Reviews in Computational Chemistry, vol. 9. https://doi.org/10.1002/9780470125861.ch5
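For orientation, the radius of gyration of a conformer is commonly defined as R_g = sqrt(sum_i m_i |r_i - r_c|^2 / sum_i m_i), where r_c is the mass-weighted centre of the atoms. The sketch below evaluates that formula on plain coordinate tuples. It is an illustration of the definition only, not the RDKit or DeepMol code, and the function name radius_of_gyration is hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def radius_of_gyration(masses: Sequence[float], coords: Sequence[Point]) -> float:
    """Mass-weighted root-mean-square distance of the atoms from their centre of mass."""
    total_mass = sum(masses)
    # Mass-weighted centre of the atoms.
    center = tuple(
        sum(m * c[axis] for m, c in zip(masses, coords)) / total_mass
        for axis in range(3)
    )
    # Mass-weighted sum of squared distances from the centre.
    weighted_sq = sum(
        m * sum((c[axis] - center[axis]) ** 2 for axis in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total_mass)


# Two equal masses 2.0 units apart: each lies 1.0 from the centre.
rg = radius_of_gyration([1.0, 1.0], [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
print(rg)
```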

class SpherocityIndex(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

Calculate the molecular spherocity index. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37

class ThreeDimensionDescriptor(mandatory_generation_of_conformers: bool, **kwargs)[source]

Bases: MolecularFeaturizer

Class to generate three-dimensional descriptors.

property descriptor_function

Get the descriptor function.

generate_descriptor(mol)[source]

Generate the descriptors.

Parameters

mol (Mol) – Mol object from rdkit.

Returns

descriptors – Array with the descriptors.

Return type

np.ndarray

class ThreeDimensionalMoleculeGenerator(n_conformations: int = 5, max_iterations: int = 5, threads: int = 1, timeout_per_molecule: int = 40)[source]

Bases: object

Class to generate three-dimensional conformers and optimize them.

static check_if_mol_has_explicit_hydrogens(new_mol: Mol)[source]

Method to check if a molecule has explicit hydrogens.

Parameters

new_mol (Mol) – Mol object from rdkit.

Returns

True if molecule has explicit hydrogens and False if not.

Return type

bool

generate_conformers(new_mol: Mol, etkdg_version: int = 1, **kwargs)[source]

Method to generate three-dimensional conformers.

Parameters
  • new_mol (Mol) – Mol object from rdkit

  • etkdg_version (int) – version of the experimental-torsion-knowledge distance geometry (ETKDG) algorithm

  • kwargs (dict) – Parameters for the ETKDG algorithm.

Returns

new_mol – Mol object with three-dimensional conformers.

Return type

Mol

optimize_molecular_geometry(mol: Mol, mode: str = 'MMFF94')[source]

Method to optimize the molecular geometry.

Parameters
  • mol (Mol) – Mol object from rdkit.

  • mode (str) – mode for the molecular geometry optimization (MMFF or UFF variants).

Returns

mol – Mol object with optimized molecular geometry.

Return type

Mol

class TwoDimensionDescriptors(**kwargs)[source]

Bases: MolecularFeaturizer

Class to generate two-dimensional descriptors. It generates all descriptors from the RDKit library.

class WHIM(mandatory_generation_of_conformers=False)[source]

Bases: ThreeDimensionDescriptor

WHIM descriptors vector. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37

check_atoms_coordinates(mol)[source]

Function to check whether all atoms of a molecule have zero coordinates. Such a molecule must be eliminated.

Example

    from rdkit.Chem import PandasTools as pt
    from deepmol.compound_featurization.rdkit_descriptors import check_atoms_coordinates

    # Load test set to a frame
    sdf = 'miniset.sdf'
    df = pt.LoadSDF(sdf, molColName='mol3DProt')

    # Check whether each molecule contains only ZERO coordinates,
    # then remove those molecules from the dataset
    df['check_coordinates'] = [check_atoms_coordinates(x) for x in df.mol3DProt]
    df_eliminated_mols = df[df.check_coordinates == False]
    df = df[df.check_coordinates == True]
    df.drop(columns=['check_coordinates'], inplace=True)
    print('final minitest set:', df.shape[0])
    print('minitest eliminated:', df_eliminated_mols.shape[0])

Parameters

mol (Mol) – Molecule to check coordinates.

Returns

True if molecule is OK and False if molecule contains zero coordinates.

Return type

bool
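The check itself amounts to asking whether any atom has a non-zero coordinate. A dependency-free sketch of that idea is shown here. It works on plain coordinate tuples instead of an RDKit Mol, the name has_nonzero_coordinates is hypothetical, and it is not the library's implementation.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def has_nonzero_coordinates(coords: Sequence[Point]) -> bool:
    """Return True if at least one coordinate of any atom differs from zero."""
    return any(value != 0.0 for point in coords for value in point)


# A molecule with real 3D coordinates passes the check.
ok = has_nonzero_coordinates([(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)])
# A molecule whose atoms all sit at the origin fails it.
bad = has_nonzero_coordinates([(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
print(ok, bad)
```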

generate_conformers(generator: ThreeDimensionalMoleculeGenerator, new_mol: Union[Mol, str], etkg_version: int = 1, optimization_mode: str = 'MMFF94')[source]

Method to generate three-dimensional conformers and optimize them.

Parameters
  • generator (ThreeDimensionalMoleculeGenerator) – Class to generate three-dimensional conformers and optimize them.

  • new_mol (Union[Mol, str]) – Mol object from rdkit or SMILES string to generate conformers and optimize them.

  • etkg_version (int) – version of the experimental-torsion-knowledge distance geometry (ETKDG) algorithm.

  • optimization_mode (str) – mode for the molecular geometry optimization (MMFF or UFF variants).

Returns

new_mol – Mol object with three-dimensional conformers and optimized molecular geometry.

Return type

Mol

generate_conformers_to_sdf_file(dataset: Dataset, file_path: str, n_conformations: int = 20, max_iterations: int = 5, threads: int = 1, timeout_per_molecule: int = 12, etkg_version: int = 1, optimization_mode: str = 'MMFF94')[source]

Generate conformers using the experimental-torsion-knowledge distance geometry (ETKDG) algorithm from RDKit, optimize them and save in an SDF file.

Parameters
  • dataset (Dataset) – DeepMol Dataset object

  • file_path (str) – file_path where the conformers will be saved.

  • n_conformations (int) – The number of conformations per molecule.

  • max_iterations (int) – Maximum number of iterations for the molecule’s conformers optimization.

  • threads (int) – Number of threads.

  • timeout_per_molecule (int) – The number of seconds in which the conformers are to be generated.

  • etkg_version (int) – Version of the experimental-torsion-knowledge distance geometry (ETKDG) algorithm.

  • optimization_mode (str) – Mode for the molecular geometry optimization (MMFF or UFF).

get_all_3D_descriptors(mol)[source]

Method that applies all the 3D descriptor functions to a molecule and collects the results.

Parameters

mol (Mol) – Mol object from rdkit.

Returns

all_descriptors – List with all the 3D descriptors.

Return type

list

get_all_3D_descriptors_feature_names() List[str][source]

Method that lists all 3D featurizers feature names.

Returns

feature_names – List with all the 3D descriptors feature names.

Return type

List[str]

handler(signum, frame)[source]

deepmol.compound_featurization.rdkit_fingerprints module

class AtomPairFingerprint(nBits: int = 2048, minLength: int = 1, maxLength: int = 30, nBitsPerEntry: int = 4, includeChirality: bool = False, use2D: bool = True, confId: int = -1, **kwargs)[source]

Bases: MolecularFeaturizer

Atom pair fingerprints

Returns the atom-pair fingerprint for a molecule as an ExplicitBitVect

class AtomPairFingerprintCallbackHash(nBits: int = 2048, minLength: int = 1, maxLength: int = 30, includeChirality: bool = False, use2D: bool = True, confId: int = -1, **kwargs)[source]

Bases: MolecularFeaturizer

Atom pair fingerprints

Returns the atom-pair fingerprint for a molecule as an ExplicitBitVect

static hash_function(bit, value)[source]

Hash function for atom pair fingerprint.

Parameters
  • bit (int) – The bit to be hashed.

  • value (int) – The value to be hashed.

class LayeredFingerprint(layerFlags: int = 4294967295, minPath: int = 1, maxPath: int = 7, fpSize: int = 2048, atomCounts: Optional[list] = None, branchedPaths: bool = True, **kwargs)[source]

Bases: MolecularFeaturizer

Calculate layered fingerprint for a single molecule.

Layer definitions:

  • 0x01: pure topology

  • 0x02: bond order

  • 0x04: atom types

  • 0x08: presence of rings

  • 0x10: ring sizes

  • 0x20: aromaticity
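Because each layer is a single bit, layers can be combined with a bitwise OR, and the default layerFlags value of 4294967295 (0xFFFFFFFF) switches every layer on. The sketch below shows this with a hypothetical Layer enum that is defined only for this example and is not part of DeepMol or RDKit.

```python
from enum import IntFlag


class Layer(IntFlag):
    """Hypothetical names for the layer bits listed above."""
    TOPOLOGY = 0x01
    BOND_ORDER = 0x02
    ATOM_TYPES = 0x04
    RINGS = 0x08
    RING_SIZES = 0x10
    AROMATICITY = 0x20


# The documented default: all 32 bits set, so every layer is enabled.
ALL_LAYERS = 0xFFFFFFFF

# Select only topology, atom types and aromaticity.
selected = Layer.TOPOLOGY | Layer.ATOM_TYPES | Layer.AROMATICITY
layer_flags = int(selected)
print(layer_flags, hex(layer_flags))
```

Going by the constructor signature documented above, an integer built this way would be passed as the layerFlags argument.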

class MACCSkeysFingerprint(**kwargs)[source]

Bases: MolecularFeaturizer

MACCS Keys. SMARTS-based implementation of the 166 public MACCS keys.

class MorganFingerprint(radius: int = 2, size: int = 2048, chiral: bool = False, bonds: bool = True, features: bool = False, **kwargs)[source]

Bases: MolecularFeaturizer

Morgan fingerprints. Extended Connectivity Circular Fingerprints compute a bag-of-words style representation of a molecule by breaking it into local neighborhoods and hashing into a bit vector of the specified size.

class RDKFingerprint(minPath: int = 1, maxPath: int = 7, fpSize: int = 2048, nBitsPerHash: int = 2, useHs: bool = True, tgtDensity: float = 0.0, minSize: int = 128, branchedPaths: bool = True, useBondOrder: bool = True, **kwargs)[source]

Bases: MolecularFeaturizer

RDKit topological fingerprints

This algorithm functions by finding all subgraphs between minPath and maxPath in length. For each subgraph:
  • A hash is calculated.

  • The hash is used to seed a random-number generator.

  • nBitsPerHash random numbers are generated and used to set the corresponding bits in the fingerprint.
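The three steps (hash the subgraph, seed a random-number generator with the hash, draw nBitsPerHash bit positions) can be imitated in a few lines of standard-library Python. This is a toy under stated assumptions and does not reproduce RDKit's fingerprints: subgraphs are given as ready-made strings instead of being enumerated from a molecule, zlib.crc32 stands in for RDKit's hash, random.Random stands in for its generator, and toy_path_fingerprint is a hypothetical name.

```python
import random
import zlib
from typing import List, Sequence


def toy_path_fingerprint(subgraphs: Sequence[str], fp_size: int = 2048,
                         n_bits_per_hash: int = 2) -> List[int]:
    """Set n_bits_per_hash pseudo-random bits per subgraph in a bit vector."""
    bits = [0] * fp_size
    for subgraph in subgraphs:
        # Step 1: calculate a (stable) hash of the subgraph.
        seed = zlib.crc32(subgraph.encode("utf-8"))
        # Step 2: use the hash to seed a random-number generator.
        rng = random.Random(seed)
        # Step 3: draw bit positions and set the corresponding bits.
        for _ in range(n_bits_per_hash):
            bits[rng.randrange(fp_size)] = 1
    return bits


fingerprint = toy_path_fingerprint(["C-C", "C-O", "C-C-O"])
print(sum(fingerprint), "bits set out of", len(fingerprint))
```

Seeding the generator with the hash makes the chosen bit positions a deterministic function of the subgraph, so the same subgraph always sets the same bits.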

Module contents