deepmol.compound_featurization package
Submodules
deepmol.compound_featurization.base_featurizer module
- class MolecularFeaturizer(n_jobs: int = -1)[source]
Bases: ABC
Abstract class for calculating a set of features for a molecule. A MolecularFeaturizer uses SMILES strings or RDKit molecule objects to represent molecules.
Subclasses need to implement the _featurize method for calculating features for a single molecule.
- featurize(dataset: Dataset, scaler: Optional[BaseScaler] = None, path_to_save_scaler: Optional[str] = None, remove_nans_axis: int = 0) → Dataset[source]
Calculate features for molecules.
- Parameters
dataset (Dataset) – The dataset containing the molecules to featurize in dataset.mols.
scaler (BaseScaler) – The scaler to use for scaling the generated features.
path_to_save_scaler (str) – The path to save the scaler to.
remove_nans_axis (int) – The axis to remove NaNs from. If None, no NaNs are removed.
- Returns
dataset – The input Dataset containing a featurized representation of the molecules in Dataset.X.
- Return type
Dataset
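The subclassing contract described above (a public featurize that loops over molecules and delegates to a per-molecule _featurize hook) can be sketched in plain Python. Everything in this sketch is a hypothetical stand-in: `SketchFeaturizer`, `AtomCountFeaturizer`, the `remove_nans` flag, and the use of plain lists of SMILES strings are assumptions for illustration, not DeepMol's classes or its Dataset API.

```python
from abc import ABC, abstractmethod
import math


class SketchFeaturizer(ABC):
    """Hypothetical stand-in for an abstract featurizer base class."""

    def featurize(self, smiles_list, remove_nans=True):
        # Delegate each molecule to the subclass hook.
        rows = [self._featurize(smiles) for smiles in smiles_list]
        if remove_nans:
            # Drop molecules (rows) whose feature vector contains a NaN.
            rows = [r for r in rows if not any(math.isnan(v) for v in r)]
        return rows

    @abstractmethod
    def _featurize(self, smiles):
        """Return a list of floats for a single molecule."""


class AtomCountFeaturizer(SketchFeaturizer):
    """Toy subclass: counts the characters 'C' and 'O' in a SMILES string."""

    def _featurize(self, smiles):
        if not smiles:
            # Simulate a failed featurization with NaN values.
            return [float("nan"), float("nan")]
        return [float(smiles.count("C")), float(smiles.count("O"))]


features = AtomCountFeaturizer().featurize(["CCO", "", "C=O"])
print(features)  # [[2.0, 1.0], [1.0, 1.0]]
```

The empty string stands in for a molecule that could not be featurized; its row is dropped, which mirrors the idea of removing NaNs along the molecule axis.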
deepmol.compound_featurization.deepchem_featurizers module
- class ConvMolFeat(master_atom: bool = False, use_chirality: bool = False, atom_properties: Optional[List[str]] = None, per_atom_fragmentation: bool = False, **kwargs)[source]
Bases: MolecularFeaturizer
Duvenaud graph convolution, adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#convmolfeaturizer). The featurizer computes a vector of local descriptors for each atom in a molecule.
References: Duvenaud, David K., et al. “Convolutional networks on graphs for learning molecular fingerprints.” Advances in neural information processing systems. 2015.
- class CoulombEigFeat(max_atoms: int, remove_hydrogens: bool = False, randomize: bool = False, n_samples: int = 1, max_conformers: int = 1, seed: Optional[int] = None, **kwargs)[source]
Bases: MolecularFeaturizer
Calculate the eigenvalues of Coulomb matrices for molecules. Adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#coulombmatrixeig).
References: Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in neural information processing systems. 2012.
- class CoulombFeat(max_atoms: int, remove_hydrogens: bool = False, randomize: bool = False, upper_tri: bool = False, n_samples: int = 1, max_conformers: int = 1, seed: Optional[int] = None, **kwargs)[source]
Bases: MolecularFeaturizer
Calculate Coulomb matrices for molecules. Adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#coulombmatrix).
References: Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in neural information processing systems. 2012.
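For orientation, the Coulomb matrix of the cited reference has diagonal entries 0.5 · Z_i^2.4 and off-diagonal entries Z_i · Z_j / |R_i − R_j|, zero-padded to max_atoms. The sketch below computes it from plain lists. The function name `coulomb_matrix` and its arguments are hypothetical, coordinates are assumed to be in atomic units already, and options such as randomize or upper_tri are not covered.

```python
import math


def coulomb_matrix(charges, coords, max_atoms):
    """Build a zero-padded Coulomb matrix (hypothetical helper).

    charges: nuclear charges Z_i; coords: (x, y, z) tuples.
    Assumption: coordinates are already expressed in atomic units.
    """
    n = len(charges)
    if n > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    matrix = [[0.0] * max_atoms for _ in range(max_atoms)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Z_i * Z_j / |R_i - R_j|
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Two nuclei with Z = 1, one length unit apart, padded to three atoms.
cm = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], max_atoms=3)
print(cm)  # [[0.5, 1.0, 0.0], [1.0, 0.5, 0.0], [0.0, 0.0, 0.0]]
```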
- class MolGanFeat(max_atom_count: int = 9, kekulize: bool = True, bond_labels: Optional[List[Any]] = None, atom_labels: Optional[List[int]] = None, **kwargs)[source]
Bases: MolecularFeaturizer
Featurizer for the MolGAN de novo molecular generation model, adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html?highlight=CGCNN#molganfeaturizer). It is a wrapper for two matrices containing atom and bond type information.
References: Nicola De Cao et al. “MolGAN: An implicit generative model for small molecular graphs” (2018), https://arxiv.org/abs/1805.11973
- class MolGraphConvFeat(use_edges: bool = False, use_chirality: bool = False, use_partial_charge: bool = False, **kwargs)[source]
Bases: MolecularFeaturizer
Featurizer of general graph convolution networks for molecules. Adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#molgraphconvfeaturizer).
References: Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of computer-aided molecular design 30.8 (2016):595-608.
- class SmileImageFeat(img_size: int = 80, res: float = 0.5, max_len: int = 250, img_spec: str = 'std', **kwargs)[source]
Bases: MolecularFeaturizer
Converts a SMILES string to an image. Adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#smilestoimage).
References: Goh, Garrett B., et al. “Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
- class SmilesSeqFeat(char_to_idx: Optional[Dict[str, int]] = None, max_len: int = 250, pad_len: int = 10)[source]
Bases: object
Takes SMILES strings and turns them into a sequence. Adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#smilestoseq).
References: Goh, Garrett B., et al. “Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
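A minimal sketch of the SMILES-to-sequence idea, assuming character-level tokens, padding up to max_len, and pad_len extra padding tokens on each side. The function `smiles_to_sequence`, the `<pad>` and `<unk>` tokens, and the toy vocabulary are hypothetical and may differ from what SmilesSeqFeat does internally.

```python
def smiles_to_sequence(smiles, char_to_idx, max_len=250, pad_len=10,
                       pad_token="<pad>", unk_token="<unk>"):
    """Encode a SMILES string as a fixed-length list of integer indices."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    # Character-level tokens, padded on the right up to max_len.
    tokens = list(smiles) + [pad_token] * (max_len - len(smiles))
    # Extra padding on both sides.
    tokens = [pad_token] * pad_len + tokens + [pad_token] * pad_len
    unk = char_to_idx[unk_token]
    # Characters missing from the vocabulary map to the unknown index.
    return [char_to_idx.get(tok, unk) for tok in tokens]


vocab = {"<pad>": 0, "<unk>": 1, "C": 2, "O": 3, "=": 4}
seq = smiles_to_sequence("C=ON", vocab, max_len=6, pad_len=2)
print(seq)  # [0, 0, 2, 4, 3, 1, 0, 0, 0, 0]
```

Every output has length max_len + 2 * pad_len, so sequences can be stacked into a single array. Splitting by character also splits two-letter atoms such as Cl, which a real tokenizer may handle differently.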
- class WeaveFeat(graph_distance: bool = True, explicit_h: bool = False, use_chirality: bool = False, max_pair_distance: Optional[int] = None, **kwargs)[source]
Bases: MolecularFeaturizer
Weave convolution featurization, adapted from deepchem (https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#weavefeaturizer). Requires a quadratic matrix of interaction descriptors for each pair of atoms.
References: Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of computer-aided molecular design 30.8 (2016): 595-608.
deepmol.compound_featurization.mixed_descriptors module
- class MixedFeaturizer(featurizers: Iterable[MolecularFeaturizer], **kwargs)[source]
Bases: MolecularFeaturizer
Class to apply multiple featurizers to the same molecules. Features from the different featurizers are concatenated.
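The concatenation can be pictured with a small sketch in which each featurizer is just a function returning a list of floats. The names `mixed_featurize`, `length_feature`, and `element_counts` are hypothetical and only illustrate the idea, not the MixedFeaturizer implementation.

```python
def mixed_featurize(smiles, featurizers):
    """Apply several featurizer functions and concatenate their outputs."""
    combined = []
    for featurize in featurizers:
        combined.extend(featurize(smiles))
    return combined


def length_feature(smiles):
    # Toy featurizer: length of the SMILES string.
    return [float(len(smiles))]


def element_counts(smiles):
    # Toy featurizer: counts of the characters 'C' and 'N'.
    return [float(smiles.count("C")), float(smiles.count("N"))]


vector = mixed_featurize("CCN", [length_feature, element_counts])
print(vector)  # [3.0, 2.0, 1.0]
```

The order of the featurizers fixes the column order of the combined vector, so it should be kept identical between training and prediction.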
deepmol.compound_featurization.mol2vec module
- class Mol2Vec(pretrain_model_path: Optional[str] = None, radius: int = 1, unseen: str = 'UNK', gather_method: str = 'sum')[source]
Bases: MolecularFeaturizer
Mol2Vec fingerprint implementation from https://doi.org/10.1021/acs.jcim.7b00616
Inspired by natural language processing techniques, Mol2vec is an unsupervised machine learning approach that learns vector representations of molecular substructures. These vectors point in similar directions for chemically related substructures. Compounds can then be encoded as vectors by summing the vectors of their individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties.
- sentences2vec(sentences: Iterable, model: Word2Vec, unseen: Optional[str] = None)[source]
Generate a vector for each sentence (list) in a list of sentences. Each vector is simply the sum of the vectors of the individual words.
- Parameters
sentences (Iterable) – List with sentences
model (Word2Vec) – Gensim Word2Vec model
unseen (None, str) – Keyword for unseen words. If None, those words are skipped. See https://stats.stackexchange.com/questions/163005/how-to-set-the-dictionary-for-text-analysis-using-neural-networks/163032#163032
- Returns
Array of vectors for each sentence.
- Return type
np.array
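The summation and the handling of unseen words can be sketched with a plain dictionary standing in for the Word2Vec model. The function `sentences_to_vectors`, the word identifiers, and the toy vectors are hypothetical. The sketch returns nested lists rather than an np.array so that it needs no third-party packages.

```python
def sentences_to_vectors(sentences, word_vectors, unseen=None):
    """Sum the word vectors of each sentence.

    word_vectors: dict mapping a word to a list of floats (stand-in for a model).
    unseen: key used for unknown words; if None, unknown words are skipped.
    """
    dim = len(next(iter(word_vectors.values())))
    result = []
    for sentence in sentences:
        total = [0.0] * dim
        for word in sentence:
            if word in word_vectors:
                vec = word_vectors[word]
            elif unseen is not None:
                vec = word_vectors[unseen]
            else:
                continue  # skip words the model has never seen
            total = [a + b for a, b in zip(total, vec)]
        result.append(total)
    return result


toy_vectors = {"id_1": [1.0, 0.0], "id_2": [0.0, 2.0], "UNK": [0.5, 0.5]}
sentences = [["id_1", "id_2"], ["id_1", "id_9"]]

skipped = sentences_to_vectors(sentences, toy_vectors)
mapped = sentences_to_vectors(sentences, toy_vectors, unseen="UNK")
print(skipped)  # [[1.0, 2.0], [1.0, 0.0]]
print(mapped)   # [[1.0, 2.0], [1.5, 0.5]]
```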
deepmol.compound_featurization.rdkit_descriptors module
- class All3DDescriptors(mandatory_generation_of_conformers=True)[source]
Bases: MolecularFeaturizer
Class to generate all three-dimensional descriptors.
- class Asphericity(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
Calculate molecular asphericity. A. Baumgaertner, “Shapes of flexible vesicles”, J. Chem. Phys. 98:7496 (1993). https://doi.org/10.1063/1.464689
- class AutoCorr3D(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
AutoCorr3D descriptors. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37
- class Eccentricity(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
Calculate molecular eccentricity. G. A. Arteca, “Molecular Shape Descriptors”, Reviews in Computational Chemistry vol 9. https://doi.org/10.1002/9780470125861.ch5
- class InertialShapeFactor(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
Calculate the inertial shape factor. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37
- class MORSE(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
Molecule Representation of Structures based on Electron diffraction (MoRSE) descriptors. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37
- class NormalizedPrincipalMomentsRatios(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
Normalized principal moments ratios. Sauer and Schwarz, JCIM 43:987-1003 (2003)
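As a hedged illustration, the two normalized ratios are commonly defined as NPR1 = I1/I3 and NPR2 = I2/I3, where I1 ≤ I2 ≤ I3 are the principal moments of inertia. The helper `normalized_pmi_ratios` is hypothetical and takes the three moments directly instead of computing them from a conformer.

```python
def normalized_pmi_ratios(moments):
    """Return (NPR1, NPR2) from three principal moments of inertia."""
    i1, i2, i3 = sorted(moments)
    if i3 == 0:
        raise ValueError("largest principal moment must be non-zero")
    return i1 / i3, i2 / i3


# Idealized shapes at the corners of the usual triangular plot.
rod = normalized_pmi_ratios([0.0, 10.0, 10.0])
disc = normalized_pmi_ratios([5.0, 5.0, 10.0])
sphere = normalized_pmi_ratios([10.0, 10.0, 10.0])
print(rod, disc, sphere)  # (0.0, 1.0) (0.5, 0.5) (1.0, 1.0)
```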
- class PlaneOfBestFit(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
Plane of best fit. Nicholas C. Firth, Nathan Brown, and Julian Blagg, JCIM 52:2516-25
- class PrincipalMomentsOfInertia(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
Calculate the principal moments of inertia.
- class RadialDistributionFunction(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
Radial distribution function. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37
- class RadiusOfGyration(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
Calculate the radius of gyration. G. A. Arteca, “Molecular Shape Descriptors”, Reviews in Computational Chemistry vol 9. https://doi.org/10.1002/9780470125861.ch5
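For intuition, a mass-weighted radius of gyration is the square root of Σ m_i |r_i − r_cm|² / Σ m_i, where r_cm is the center of mass. The sketch below assumes mass weighting and plain coordinate tuples. `radius_of_gyration` is a hypothetical helper, not the function this class calls.

```python
import math


def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration for a list of (x, y, z) tuples."""
    total_mass = sum(masses)
    # Center of mass, one component at a time.
    center = tuple(
        sum(m * c[k] for m, c in zip(masses, coords)) / total_mass
        for k in range(3)
    )
    # Mass-weighted mean squared distance from the center of mass.
    weighted = sum(m * math.dist(c, center) ** 2 for m, c in zip(masses, coords))
    return math.sqrt(weighted / total_mass)


# Two equal masses at x = -1 and x = +1: each lies 1.0 from the center.
rg = radius_of_gyration([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])
print(rg)  # 1.0
```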
- class SpherocityIndex(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
Calculate the molecular spherocity index. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37
- class ThreeDimensionDescriptor(mandatory_generation_of_conformers: bool, **kwargs)[source]
Bases: MolecularFeaturizer
Class to generate three-dimensional descriptors.
- property descriptor_function
Get the descriptor function.
- class ThreeDimensionalMoleculeGenerator(n_conformations: int = 5, max_iterations: int = 5, threads: int = 1, timeout_per_molecule: int = 40)[source]
Bases: object
Class to generate three-dimensional conformers and optimize them.
- static check_if_mol_has_explicit_hydrogens(new_mol: Mol)[source]
Method to check if a molecule has explicit hydrogens.
- Parameters
new_mol (Mol) – Mol object from rdkit.
- Returns
True if molecule has explicit hydrogens and False if not.
- Return type
bool
- generate_conformers(new_mol: Mol, etkdg_version: int = 1, **kwargs)[source]
Method to generate three-dimensional conformers.
- Parameters
new_mol (Mol) – Mol object from rdkit
etkdg_version (int) – version of the experimental-torsion-knowledge distance geometry (ETKDG) algorithm
kwargs (dict) – Parameters for the ETKDG algorithm.
- Returns
new_mol – Mol object with three-dimensional conformers.
- Return type
Mol
- optimize_molecular_geometry(mol: Mol, mode: str = 'MMFF94')[source]
Method to optimize the molecular geometry of the generated conformers.
- Parameters
mol (Mol) – Mol object from rdkit.
mode (str) – mode for the molecular geometry optimization (MMFF or UFF variants).
- Returns
mol – Mol object with optimized molecular geometry.
- Return type
Mol
- class TwoDimensionDescriptors(**kwargs)[source]
Bases: MolecularFeaturizer
Class to generate two-dimensional descriptors. It generates all descriptors from the RDKit library.
- class WHIM(mandatory_generation_of_conformers=False)[source]
Bases: ThreeDimensionDescriptor
WHIM descriptors vector. Todeschini and Consonni, “Descriptors from Molecular Geometry”, Handbook of Chemoinformatics. https://doi.org/10.1002/9783527618279.ch37
- check_atoms_coordinates(mol)[source]
Function to check whether all atoms of a molecule have zero coordinates. Such a molecule must be eliminated from the dataset.
Example
    # Assumption: pt is RDKit's PandasTools module
    from rdkit.Chem import PandasTools as pt

    # Load the test set into a data frame
    sdf = 'miniset.sdf'
    df = pt.LoadSDF(sdf, molColName='mol3DProt')

    # Check whether each molecule contains only zero coordinates,
    # then remove those molecules from the dataset
    df['check_coordinates'] = [check_atoms_coordinates(x) for x in df.mol3DProt]
    df_eliminated_mols = df[df.check_coordinates == False]
    df = df[df.check_coordinates == True]
    df.drop(columns=['check_coordinates'], inplace=True)
    print('final minitest set:', df.shape[0])
    print('minitest eliminated:', df_eliminated_mols.shape[0])
- Parameters
mol (Mol) – Molecule to check coordinates.
- Returns
True if molecule is OK and False if molecule contains zero coordinates.
- Return type
bool
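The check itself can be sketched without RDKit by treating a conformer as a list of (x, y, z) tuples. `has_nonzero_coordinates` is a hypothetical helper that follows the return convention described above (True when the molecule is usable, False when every atom sits at the origin). How the real function reads coordinates from a Mol object is not shown here.

```python
def has_nonzero_coordinates(coords):
    """Return False when every atom is at (0, 0, 0), True otherwise."""
    return any(any(value != 0.0 for value in atom) for atom in coords)


embedded = [(0.0, 0.0, 0.0), (1.09, 0.0, 0.0), (-0.36, 1.03, 0.0)]
not_embedded = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

print(has_nonzero_coordinates(embedded))      # True
print(has_nonzero_coordinates(not_embedded))  # False
```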
- generate_conformers(generator: ThreeDimensionalMoleculeGenerator, new_mol: Union[Mol, str], etkg_version: int = 1, optimization_mode: str = 'MMFF94')[source]
Method to generate three-dimensional conformers and optimize them.
- Parameters
generator (ThreeDimensionalMoleculeGenerator) – Class to generate three-dimensional conformers and optimize them.
new_mol (Union[Mol, str]) – Mol object from rdkit or SMILES string to generate conformers and optimize them.
etkg_version (int) – version of the experimental-torsion-knowledge distance geometry (ETKDG) algorithm.
optimization_mode (str) – mode for the molecular geometry optimization (MMFF or UFF variants).
- Returns
new_mol – Mol object with three-dimensional conformers and optimized molecular geometry.
- Return type
Mol
- generate_conformers_to_sdf_file(dataset: Dataset, file_path: str, n_conformations: int = 20, max_iterations: int = 5, threads: int = 1, timeout_per_molecule: int = 12, etkg_version: int = 1, optimization_mode: str = 'MMFF94')[source]
Generate conformers using the experimental-torsion-knowledge distance geometry (ETKDG) algorithm from RDKit, optimize them and save in an SDF file.
- Parameters
dataset (Dataset) – DeepMol Dataset object
file_path (str) – file_path where the conformers will be saved.
n_conformations (int) – The number of conformations per molecule.
max_iterations (int) – Maximum number of iterations for the molecule’s conformers optimization.
threads (int) – Number of threads.
timeout_per_molecule (int) – The number of seconds in which the conformers are to be generated.
etkg_version (int) – Version of the experimental-torsion-knowledge distance geometry (ETKDG) algorithm.
optimization_mode (str) – Mode for the molecular geometry optimization (MMFF or UFF).
- get_all_3D_descriptors(mol)[source]
Method that lists all the methods and uses them to featurize the whole set.
- Parameters
mol (Mol) – Mol object from rdkit.
- Returns
all_descriptors – List with all the 3D descriptors.
- Return type
list
deepmol.compound_featurization.rdkit_fingerprints module
- class AtomPairFingerprint(nBits: int = 2048, minLength: int = 1, maxLength: int = 30, nBitsPerEntry: int = 4, includeChirality: bool = False, use2D: bool = True, confId: int = -1, **kwargs)[source]
Bases: MolecularFeaturizer
Atom pair fingerprints.
Returns the atom-pair fingerprint for a molecule as an ExplicitBitVect.
- class AtomPairFingerprintCallbackHash(nBits: int = 2048, minLength: int = 1, maxLength: int = 30, includeChirality: bool = False, use2D: bool = True, confId: int = -1, **kwargs)[source]
Bases: MolecularFeaturizer
Atom pair fingerprints.
Returns the atom-pair fingerprint for a molecule as an ExplicitBitVect.
- class LayeredFingerprint(layerFlags: int = 4294967295, minPath: int = 1, maxPath: int = 7, fpSize: int = 2048, atomCounts: Optional[list] = None, branchedPaths: bool = True, **kwargs)[source]
Bases: MolecularFeaturizer
Calculate layered fingerprint for a single molecule.
- Layer definitions:
0x01: pure topology
0x02: bond order
0x04: atom types
0x08: presence of rings
0x10: ring sizes
0x20: aromaticity
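The layer definitions are bit flags, so several layers are selected by OR-ing their values, and the default layerFlags of 4294967295 (0xFFFFFFFF) switches every layer on. A small sketch of that arithmetic follows. The dictionary keys and helper names are hypothetical labels for the layers listed above.

```python
# Hypothetical names for the documented layer bits.
LAYERS = {
    "topology": 0x01,
    "bond_order": 0x02,
    "atom_types": 0x04,
    "rings": 0x08,
    "ring_sizes": 0x10,
    "aromaticity": 0x20,
}


def combine_layers(*names):
    """OR together the bits of the requested layers."""
    flags = 0
    for name in names:
        flags |= LAYERS[name]
    return flags


def enabled_layers(flags):
    """List the layer names whose bit is set in flags."""
    return [name for name, bit in LAYERS.items() if flags & bit]


flags = combine_layers("topology", "atom_types")
print(flags)                       # 5
print(enabled_layers(flags))       # ['topology', 'atom_types']
print(enabled_layers(4294967295))  # all six layers
```

A value such as 5 could then be passed as layerFlags to restrict the fingerprint to pure topology and atom types.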
- class MACCSkeysFingerprint(**kwargs)[source]
Bases: MolecularFeaturizer
MACCS Keys. SMARTS-based implementation of the 166 public MACCS keys.
- class MorganFingerprint(radius: int = 2, size: int = 2048, chiral: bool = False, bonds: bool = True, features: bool = False, **kwargs)[source]
Bases: MolecularFeaturizer
Morgan fingerprints. Extended-connectivity circular fingerprints compute a bag-of-words style representation of a molecule by breaking it into local neighborhoods and hashing them into a bit vector of the specified size.
- class RDKFingerprint(minPath: int = 1, maxPath: int = 7, fpSize: int = 2048, nBitsPerHash: int = 2, useHs: bool = True, tgtDensity: float = 0.0, minSize: int = 128, branchedPaths: bool = True, useBondOrder: bool = True, **kwargs)[source]
Bases: MolecularFeaturizer
RDKit topological fingerprints.
This algorithm works by finding all subgraphs between minPath and maxPath in length. For each subgraph:
A hash is calculated.
The hash is used to seed a random-number generator.
nBitsPerHash random numbers are generated and used to set the corresponding bits in the fingerprint.
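The three steps above can be sketched in plain Python. This is only an illustration of the hash, seed, and set-bits pattern: subgraphs are represented as strings, `zlib.crc32` stands in for RDKit's own subgraph hash, and `set_bits_for_subgraphs` is a hypothetical name, so the resulting bits will not match a real RDKFingerprint.

```python
import random
import zlib


def set_bits_for_subgraphs(subgraphs, fp_size=2048, n_bits_per_hash=2):
    """Set n_bits_per_hash pseudo-random bits per subgraph."""
    bits = [0] * fp_size
    for subgraph in subgraphs:
        # Step 1: calculate a hash (CRC32 is an assumption of this sketch).
        seed = zlib.crc32(subgraph.encode("utf-8"))
        # Step 2: seed a random-number generator with the hash.
        rng = random.Random(seed)
        # Step 3: draw random positions and set the corresponding bits.
        for _ in range(n_bits_per_hash):
            bits[rng.randrange(fp_size)] = 1
    return bits


fp = set_bits_for_subgraphs(["C-C", "C-O", "C-C-O"], fp_size=64)
print(sum(fp))  # number of bits set, at most 3 * 2 = 6
```

Because the generator is seeded by the hash, the same subgraph always sets the same bits, which is what makes fingerprints comparable across molecules.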