deepmol.tokenizers package
Submodules
deepmol.tokenizers.atom_level_smiles_tokenizer module
- class AtomLevelSmilesTokenizer(n_jobs: int = -1)[source]
Bases: Tokenizer
A tokenizer that splits SMILES strings into atom-level tokens, using a regex based on the SMILES grammar.
Examples
>>> from deepmol.tokenizers import AtomLevelSmilesTokenizer
>>> from deepmol.loaders import CSVLoader
>>> loader = CSVLoader('data_path.csv', smiles_field='Smiles', labels_fields=['Class'])
>>> dataset = loader.create_dataset(sep=";")
>>> tokenizer = AtomLevelSmilesTokenizer().fit(dataset)
>>> tokens = tokenizer.tokenize(dataset)
- property max_length: int
Returns the maximum length of the SMILES strings.
- Returns:
max_length – The maximum length of the SMILES strings.
- Return type:
int
- property regex: str
Returns the regex used to tokenize SMILES strings.
- Returns:
regex – The regex used to tokenize SMILES strings.
- Return type:
str
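To illustrate what regex-based, atom-level tokenization looks like, here is a minimal standalone sketch. The pattern `SMILES_TOKEN_PATTERN` and the helper `atom_level_tokenize` are hypothetical names written for this example; the pattern is a simplified assumption and is not necessarily the regex that the `regex` property returns.

```python
import re

# Simplified, assumed pattern (not necessarily DeepMol's regex).
# Alternatives are ordered so that longer tokens are tried first:
# bracket atoms, two-letter halogens, single-letter atoms, aromatic atoms,
# two-digit ring closures, single-digit ring closures, bonds and branches.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:.~@*$])"
)


def atom_level_tokenize(smiles: str) -> list:
    """Split a SMILES string into atom-level tokens (hypothetical helper)."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # If any character was skipped by the pattern, the tokens will not
    # reassemble into the input, so report it instead of dropping it silently.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not fully tokenize: {smiles!r}")
    return tokens


print(atom_level_tokenize("CC(=O)Cl"))
print(atom_level_tokenize("C[C@H](N)C(=O)O"))
```

Note that a bracket atom such as `[C@H]` is kept as a single token, and `Cl` is not split into `C` and `l`.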
- property vocabulary: list
Returns the vocabulary of the tokenizer.
- Returns:
vocabulary – The vocabulary of the tokenizer.
- Return type:
list
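As a rough illustration of how a vocabulary and a maximum length can be derived from already-tokenized strings, consider the sketch below. The function `summarize_tokens` is a hypothetical helper; the sorted ordering, the absence of special tokens (padding, unknown), and measuring length in tokens are all assumptions, not documented behaviour of this class.

```python
from typing import Iterable, List, Tuple


def summarize_tokens(tokenized: Iterable[List[str]]) -> Tuple[List[str], int]:
    """Return (sorted unique tokens, longest token count). Hypothetical helper."""
    vocabulary = set()
    max_length = 0
    for tokens in tokenized:
        vocabulary.update(tokens)                  # collect every distinct token
        max_length = max(max_length, len(tokens))  # track the longest sequence
    return sorted(vocabulary), max_length


# Hand-written token lists standing in for tokenizer output (assumed data).
corpus = [
    ["C", "C", "O"],
    ["c", "1", "c", "c", "c", "c", "c", "1"],
    ["C", "Cl"],
]
vocabulary, max_length = summarize_tokens(corpus)
print(vocabulary, max_length)
```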
deepmol.tokenizers.kmer_smiles_tokenizer module
- class KmerSmilesTokenizer(size: int = 3, stride: int = 1, n_jobs: int = -1)[source]
Bases: Tokenizer
- property atom_level_tokenizer: AtomLevelSmilesTokenizer
Returns the fitted atom-level tokenizer used to tokenize the SMILES strings.
- Returns:
atom_level_tokenizer – The fitted atom-level tokenizer used to tokenize the SMILES strings.
- Return type:
AtomLevelSmilesTokenizer
- property max_length: int
Returns the maximum length (maximum number of tokens) of the SMILES strings.
- Returns:
max_length – The maximum length of the SMILES strings.
- Return type:
int
- property regex: str
Returns the regex used to tokenize the SMILES strings.
- Returns:
regex – The regex used to tokenize the SMILES strings.
- Return type:
str
- property size: int
Returns the size of the k-mers.
- Returns:
size – The size of the k-mers.
- Return type:
int
- property stride: int
Returns the stride of the k-mers (the step between the starts of consecutive k-mers).
- Returns:
stride – The stride of the k-mers.
- Return type:
int
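The interaction between `size` and `stride` can be sketched with a plain sliding window over a list of tokens. The function `kmers` is a hypothetical helper; joining tokens into one string and returning no k-mers for inputs shorter than `size` are assumptions made for this example, not confirmed behaviour of `KmerSmilesTokenizer`.

```python
from typing import List


def kmers(tokens: List[str], size: int = 3, stride: int = 1) -> List[str]:
    """Slide a window of `size` tokens, advancing `stride` tokens each step."""
    if size < 1 or stride < 1:
        raise ValueError("size and stride must be positive integers")
    # The last valid window starts at len(tokens) - size; shorter inputs
    # produce an empty list in this sketch.
    return [
        "".join(tokens[start:start + size])
        for start in range(0, len(tokens) - size + 1, stride)
    ]


tokens = ["C", "C", "O", "C", "N"]
print(kmers(tokens, size=3, stride=1))
print(kmers(tokens, size=3, stride=2))
```

With `stride=1`, consecutive k-mers share `size - 1` tokens; a larger stride reduces that overlap.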
- property vocabulary: list
Returns the vocabulary of the tokenizer.
- Returns:
vocabulary – The vocabulary of the tokenizer.
- Return type:
list
deepmol.tokenizers.tokenizer module
- class Tokenizer(n_jobs: int)[source]
Bases: Estimator, ABC
An abstract class for tokenizers. Tokenizers are used to tokenize strings. Child classes must implement the tokenize method.
- abstract property max_length: int
Returns the maximum length of a tokenized string.
- Returns:
max_length – The maximum length of a tokenized string.
- Return type:
int
- abstract property vocabulary: list
Returns the vocabulary.
- Returns:
vocabulary – The vocabulary.
- Return type:
list
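The abstract-property pattern described above can be sketched with the standard `abc` module. This is a standalone illustration that does not import DeepMol: `SketchTokenizer` and `CharTokenizer` are hypothetical classes, and the real `Tokenizer` also inherits from `Estimator`, which is omitted here.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class SketchTokenizer(ABC):
    """Hypothetical stand-in for an abstract tokenizer base class."""

    def __init__(self, n_jobs: int) -> None:
        self.n_jobs = n_jobs

    @property
    @abstractmethod
    def max_length(self) -> int:
        """Maximum length of a tokenized string."""

    @property
    @abstractmethod
    def vocabulary(self) -> List[str]:
        """All distinct tokens."""


class CharTokenizer(SketchTokenizer):
    """Toy subclass treating every character as a token (assumption)."""

    def __init__(self, strings: Iterable[str], n_jobs: int = 1) -> None:
        super().__init__(n_jobs)
        self._strings = list(strings)

    @property
    def max_length(self) -> int:
        return max((len(s) for s in self._strings), default=0)

    @property
    def vocabulary(self) -> List[str]:
        return sorted(set("".join(self._strings)))


toy = CharTokenizer(["CCO", "CN"])
print(toy.max_length, toy.vocabulary)
```

Because both properties are abstract, instantiating `SketchTokenizer` directly raises a `TypeError`; only subclasses that define them can be created.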