evo_prot_grad.common.tokenizers

ExpertTokenizer

evo_prot_grad.common.tokenizers.ExpertTokenizer

Bases: abc.ABC

Base interface for custom Expert tokenizers.

__init__(alphabet: List[str]) -> None

Parameters:

Name      Type       Description                       Default
alphabet  List[str]  A list of amino acid characters.  required

get_vocab() -> Dict

Return the vocab, a mapping of amino acid characters to integers.
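
As a rough illustration of what such a vocab could look like, the sketch below builds a character-to-index mapping from an alphabet. The `build_vocab` helper and the choice of index order (position in the alphabet list) are assumptions made for illustration, not the library's confirmed implementation.

```python
from typing import Dict, List


def build_vocab(alphabet: List[str]) -> Dict[str, int]:
    """Hypothetical helper: map each amino acid character to its list position."""
    return {aa: idx for idx, aa in enumerate(alphabet)}


# The 20 canonical amino acids; the ordering here is an arbitrary choice.
alphabet = list("ACDEFGHIKLMNPQRSTVWY")
vocab = build_vocab(alphabet)
```

Under this assumption, the integer assigned to a character is also the position of its 1 in the one-hot encoding, so `vocab_size` equals `len(alphabet)`.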

__call__(seqs: List[str]) -> torch.FloatTensor abstractmethod

Convert seqs to one-hot tensors.

Parameters:

Name  Type       Description                                                   Default
seqs  List[str]  A list of protein sequence strings of len [parallel_chains].  required

Returns:

Name  Type               Description
ohs   torch.FloatTensor  of shape [parallel_chains, seq_len, vocab_size]

decode(ohs: torch.Tensor) -> List[str] abstractmethod

Convert one-hot tensors back to a list of string sequences.

Parameters:

Name  Type          Description                                   Default
ohs   torch.Tensor  shape [parallel_chains, seq_len, vocab_size]  required

Returns:

Name  Type       Description
seqs  List[str]  A list of protein sequence strings of len [parallel_chains].

OneHotTokenizer

evo_prot_grad.common.tokenizers.OneHotTokenizer

Bases: ExpertTokenizer

Converts amino acid sequence strings into one-hot tensors.

get_vocab() -> Dict

Return the vocab, a mapping of amino acid characters to integers.

__init__(alphabet: List[str])

Parameters:

Name      Type       Description                       Default
alphabet  List[str]  A list of amino acid characters.  required

__call__(seqs: List[str]) -> torch.FloatTensor

Convert seqs to one-hot tensors. Assumes all sequences have the same length. Handles sequences with spaces between amino acids.

Parameters:

Name  Type       Description                                                   Default
seqs  List[str]  A list of protein sequence strings of len [parallel_chains].  required

Returns:

Name  Type               Description
ohs   torch.FloatTensor  of shape [parallel_chains, seq_len, vocab_size]
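
The encoding step can be sketched without any dependencies. In the sketch below, nested Python lists stand in for the torch.FloatTensor, `one_hot_encode` is a hypothetical helper, and dropping spaces before lookup is an assumption about how spaced input is handled; none of this is taken from the library's source.

```python
from typing import Dict, List


def one_hot_encode(seqs: List[str], vocab: Dict[str, int]) -> List[List[List[float]]]:
    """Hypothetical stand-in: nested lists shaped [parallel_chains, seq_len, vocab_size]."""
    # Assumption: spaces between amino acids are dropped before lookup.
    cleaned = [seq.replace(" ", "") for seq in seqs]
    seq_len = len(cleaned[0])
    if any(len(seq) != seq_len for seq in cleaned):
        raise ValueError("all sequences must have the same length")
    ohs = []
    for seq in cleaned:
        rows = []
        for aa in seq:
            row = [0.0] * len(vocab)
            row[vocab[aa]] = 1.0  # single 1.0 at the character's vocab index
            rows.append(row)
        ohs.append(rows)
    return ohs


vocab = {aa: idx for idx, aa in enumerate("ACDE")}  # toy 4-letter alphabet
ohs = one_hot_encode(["ACD", "E A C"], vocab)
```

Here two chains of length 3 over a 4-letter alphabet give a result of shape [2, 3, 4], and the spaced input "E A C" is encoded the same way as "EAC".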

decode(ohs: torch.Tensor) -> List[str]

Convert one-hot tensors back to a list of string sequences with a space between each amino acid.

Parameters:

Name  Type          Description                                   Default
ohs   torch.Tensor  shape [parallel_chains, seq_len, vocab_size]  required

Returns:

Name  Type       Description
seqs  List[str]  A list of protein sequence strings of len [parallel_chains].
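
Decoding can be sketched in the same dependency-free way. Below, nested Python lists again stand in for the tensor, `one_hot_decode` is a hypothetical helper, and taking the index of the largest value at each position is an assumed decoding rule rather than the library's confirmed behavior.

```python
from typing import Dict, List


def one_hot_decode(ohs: List[List[List[float]]], vocab: Dict[str, int]) -> List[str]:
    """Hypothetical stand-in: argmax each position, then join characters with spaces."""
    index_to_aa = {idx: aa for aa, idx in vocab.items()}
    seqs = []
    for chain in ohs:
        # Assumption: the largest entry in each row marks the amino acid.
        chars = [index_to_aa[row.index(max(row))] for row in chain]
        seqs.append(" ".join(chars))
    return seqs


vocab = {aa: idx for idx, aa in enumerate("ACDE")}  # toy 4-letter alphabet
ohs = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],  # "A", "D"
    [[0.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0]],  # "E", "C"
]
decoded = one_hot_decode(ohs, vocab)
```

Because the decoded strings contain a space between each amino acid, they are longer than unspaced input; remove the spaces if the original compact form is needed.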