deepmol.tokenizers package

Submodules

deepmol.tokenizers.atom_level_smiles_tokenizer module

class AtomLevelSmilesTokenizer(n_jobs: int = -1)[source]

Bases: Tokenizer

A tokenizer that splits SMILES strings into atom-level tokens using a regular expression based on the SMILES grammar.

Examples

>>> from deepmol.tokenizers import AtomLevelSmilesTokenizer
>>> from deepmol.loaders import CSVLoader
>>> loader = CSVLoader('data_path.csv', smiles_field='Smiles', labels_fields=['Class'])
>>> dataset = loader.create_dataset(sep=";")
>>> tokenizer = AtomLevelSmilesTokenizer().fit(dataset)
>>> tokens = tokenizer.tokenize(dataset)
property max_length: int

Returns the maximum length of the SMILES strings.

Returns:

max_length – The maximum length of the SMILES strings.

Return type:

int

property regex: str

Returns the regex used to tokenize SMILES strings.

Returns:

regex – The regex used to tokenize SMILES strings.

Return type:

str

property vocabulary: list

Returns the vocabulary of the tokenizer.

Returns:

vocabulary – The vocabulary of the tokenizer.

Return type:

list
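The regex-based atom-level tokenization described above can be sketched in plain Python. The pattern below is a commonly used SMILES grammar regex (an assumption for illustration; the exact pattern deepmol uses is exposed via the `regex` property and may differ):

```python
import re

# A commonly used SMILES grammar regex (illustrative; the pattern
# AtomLevelSmilesTokenizer actually uses may differ).
SMILES_REGEX = (
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|"
    r"\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into atom-level tokens."""
    return re.findall(SMILES_REGEX, smiles)

# Aspirin: multi-character tokens such as "Cl" or bracket atoms stay whole.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

Note that two-letter elements such as `Cl` and `Br`, and bracket atoms such as `[NH4+]`, are matched as single tokens rather than being split character by character.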

deepmol.tokenizers.kmer_smiles_tokenizer module

class KmerSmilesTokenizer(size: int = 3, stride: int = 1, n_jobs: int = -1)[source]

Bases: Tokenizer

A tokenizer that tokenizes SMILES strings into k-mers: substrings of k consecutive atom-level tokens, extracted with a sliding window.

property atom_level_tokenizer: AtomLevelSmilesTokenizer

Returns the fitted atom-level tokenizer used to tokenize the SMILES strings.

Returns:

atom_level_tokenizer – The fitted atom-level tokenizer used to tokenize the SMILES strings.

Return type:

AtomLevelSmilesTokenizer

property max_length: int

Returns the maximum length (maximum number of tokens) of the SMILES strings.

Returns:

max_length – The maximum length of the SMILES strings.

Return type:

int

property regex: str

Returns the regex used to tokenize the SMILES strings.

Returns:

regex – The regex used to tokenize the SMILES strings.

Return type:

str

property size: int

Returns the size of the k-mers.

Returns:

size – The size of the k-mers.

Return type:

int

property stride: int

Returns the stride of the k-mers (the step between the starting positions of consecutive k-mers; the overlap between consecutive k-mers is size - stride).

Returns:

stride – The stride of the k-mers.

Return type:

int

property vocabulary: list

Returns the vocabulary of the tokenizer.

Returns:

vocabulary – The vocabulary of the tokenizer.

Return type:

list
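The size and stride parameters above can be illustrated with a minimal sketch of k-mer extraction over atom-level tokens (an assumption for illustration; `kmer_tokens` is a hypothetical helper, not a deepmol function):

```python
def kmer_tokens(atom_tokens: list, size: int = 3, stride: int = 1) -> list:
    """Join each window of `size` consecutive atom-level tokens into one
    k-mer token, advancing the window by `stride` tokens at a time."""
    return [
        "".join(atom_tokens[i:i + size])
        for i in range(0, len(atom_tokens) - size + 1, stride)
    ]

# Atom-level tokens for "CC(=O)O" (acetic acid).
atoms = ["C", "C", "(", "=", "O", ")", "O"]
print(kmer_tokens(atoms, size=3, stride=1))  # overlapping 3-mers
```

With stride=1 consecutive k-mers overlap by size - 1 tokens; with stride=size the windows are disjoint.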

deepmol.tokenizers.tokenizer module

class Tokenizer(n_jobs: int)[source]

Bases: Estimator, ABC

An abstract class for tokenizers. Tokenizers are used to tokenize strings. Child classes must implement the tokenize method.

abstract property max_length: int

Returns the maximum length of a tokenized string.

Returns:

max_length – The maximum length of a tokenized string.

Return type:

int

tokenize(dataset: Dataset) → list[source]

Tokenizes a dataset.

Parameters:

dataset (Dataset) – The dataset to tokenize.

Returns:

tokens – The tokens of the tokenized dataset.

Return type:

list

abstract property vocabulary: list

Returns the vocabulary.

Returns:

vocabulary – The vocabulary.

Return type:

list
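The abstract interface above can be sketched with a toy child class. This is a standalone illustration that mirrors only the documented members (`max_length`, `vocabulary`, `tokenize`, and the `n_jobs` constructor argument); the real `Tokenizer` also inherits from deepmol's `Estimator`, whose fit machinery is omitted here:

```python
from abc import ABC, abstractmethod

class Tokenizer(ABC):
    """Standalone sketch of the documented interface (the real class
    also inherits from deepmol's Estimator, omitted here)."""

    def __init__(self, n_jobs: int = -1):
        self.n_jobs = n_jobs

    @property
    @abstractmethod
    def max_length(self) -> int: ...

    @property
    @abstractmethod
    def vocabulary(self) -> list: ...

    @abstractmethod
    def tokenize(self, dataset) -> list: ...

class CharTokenizer(Tokenizer):
    """Toy child class: one token per character of each SMILES string."""

    def __init__(self, n_jobs: int = -1):
        super().__init__(n_jobs)
        self._vocab = []
        self._max_length = 0

    def fit(self, smiles: list) -> "CharTokenizer":
        token_lists = [list(s) for s in smiles]
        self._vocab = sorted({t for ts in token_lists for t in ts})
        self._max_length = max(len(ts) for ts in token_lists)
        return self

    @property
    def max_length(self) -> int:
        return self._max_length

    @property
    def vocabulary(self) -> list:
        return self._vocab

    def tokenize(self, smiles: list) -> list:
        return [list(s) for s in smiles]

tok = CharTokenizer().fit(["CCO", "CO"])
print(tok.vocabulary, tok.max_length)  # ['C', 'O'] 3
```

For simplicity this sketch takes a list of strings rather than a deepmol Dataset, and ignores `n_jobs`; the concrete deepmol tokenizers parallelize over the dataset's SMILES field.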

Module contents