evo_prot_grad.common.tokenizers
ExpertTokenizer
evo_prot_grad.common.tokenizers.ExpertTokenizer
Bases: abc.ABC
Base interface for custom Expert tokenizers.
__init__(alphabet: List[str]) -> None
Parameters:
Name | Type | Description | Default |
---|---|---|---|
alphabet | List[str] | A list of amino acid characters. | required |
get_vocab() -> Dict
Return the vocab, a mapping of amino acid characters to integers.
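As an illustration of what such a mapping looks like, the sketch below builds a vocab from an alphabet by giving each character its position in the list. The helper name `build_vocab` and the position-based indexing are assumptions made for this example; the library's actual index assignment may differ.

```python
from typing import Dict, List


def build_vocab(alphabet: List[str]) -> Dict[str, int]:
    """Map each amino acid character to an integer index (hypothetical helper)."""
    # Assumption: indices follow the order of `alphabet`.
    return {aa: idx for idx, aa in enumerate(alphabet)}


alphabet = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
vocab = build_vocab(alphabet)
print(vocab["A"], vocab["Y"], len(vocab))
```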
__call__(seqs: List[str]) -> torch.FloatTensor
abstractmethod
Convert seqs to one-hot tensors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seqs | List[str] | A list of protein sequence strings; the list has length parallel_chains. | required |
Returns:
Name | Type | Description |
---|---|---|
ohs | torch.FloatTensor | One-hot tensor of shape [parallel_chains, seq_len, vocab_size]. |
decode(ohs: torch.Tensor) -> List[str]
abstractmethod
Convert one-hot tensors back to a list of string sequences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ohs | torch.Tensor | One-hot tensor of shape [parallel_chains, seq_len, vocab_size]. | required |
Returns:
Name | Type | Description |
---|---|---|
seqs | List[str] | A list of protein sequence strings; the list has length parallel_chains. |
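To show how an interface with these methods might be implemented, here is a dependency-free sketch. `TokenizerInterface` is a local stand-in that mirrors the documented methods, not the library class, and nested Python lists stand in for `torch.FloatTensor`. The attribute names and the position-based vocab are assumptions; a real subclass would inherit from `ExpertTokenizer` and return tensors.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

OneHot = List[List[List[float]]]  # [parallel_chains][seq_len][vocab_size]


class TokenizerInterface(ABC):
    """Local stand-in mirroring the documented ExpertTokenizer methods."""

    def __init__(self, alphabet: List[str]) -> None:
        self.alphabet = alphabet
        # Assumption: indices follow the order of `alphabet`.
        self.vocab = {aa: i for i, aa in enumerate(alphabet)}

    def get_vocab(self) -> Dict[str, int]:
        return self.vocab

    @abstractmethod
    def __call__(self, seqs: List[str]) -> OneHot: ...

    @abstractmethod
    def decode(self, ohs: OneHot) -> List[str]: ...


class ListTokenizer(TokenizerInterface):
    """Hypothetical subclass that fills in both abstract methods."""

    def __call__(self, seqs: List[str]) -> OneHot:
        size = len(self.vocab)
        return [
            [[1.0 if i == self.vocab[aa] else 0.0 for i in range(size)] for aa in seq]
            for seq in seqs
        ]

    def decode(self, ohs: OneHot) -> List[str]:
        # The position of the largest entry in each row identifies the character.
        return ["".join(self.alphabet[row.index(max(row))] for row in oh) for oh in ohs]


tokenizer = ListTokenizer(list("ACDE"))
ohs = tokenizer(["ACE", "DDA"])
print(len(ohs), len(ohs[0]), len(ohs[0][0]))  # parallel_chains, seq_len, vocab_size
print(tokenizer.decode(ohs))
```

Because `__call__` and `decode` are abstract, the stand-in base class cannot be instantiated directly; only a subclass that defines both can.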
OneHotTokenizer
evo_prot_grad.common.tokenizers.OneHotTokenizer
Bases: ExpertTokenizer
Converts amino acid sequence strings into one-hot tensors.
get_vocab() -> Dict
Return the vocab, a mapping of amino acid characters to integers.
__init__(alphabet: List[str])
Parameters:
Name | Type | Description | Default |
---|---|---|---|
alphabet | List[str] | A list of amino acid characters. | required |
__call__(seqs: List[str]) -> torch.FloatTensor
Convert seqs to one-hot tensors. Assumes all sequences have the same length. Handles sequences with spaces between amino acids.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seqs | List[str] | A list of protein sequence strings; the list has length parallel_chains. | required |
Returns:
Name | Type | Description |
---|---|---|
ohs | torch.FloatTensor | One-hot tensor of shape [parallel_chains, seq_len, vocab_size]. |
decode(ohs: torch.Tensor) -> List[str]
Convert one-hot tensors back to a list of string sequences with a space between each amino acid.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ohs | torch.Tensor | One-hot tensor of shape [parallel_chains, seq_len, vocab_size]. | required |
Returns:
Name | Type | Description |
---|---|---|
seqs | List[str] | A list of protein sequence strings; the list has length parallel_chains. |
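The encode and decode round trip documented for `OneHotTokenizer` can be sketched without the library. The functions `encode` and `decode` below are hypothetical stand-ins that operate on nested lists rather than `torch` tensors, and the vocab order is assumed. They only illustrate the documented behaviour (spaces ignored on input, equal-length sequences, space-separated output), not the library's implementation.

```python
from typing import Dict, List

OneHot = List[List[List[float]]]  # [parallel_chains][seq_len][vocab_size]


def encode(seqs: List[str], vocab: Dict[str, int]) -> OneHot:
    """One-hot encode sequences, ignoring spaces between amino acids."""
    cleaned = [seq.replace(" ", "") for seq in seqs]
    if len({len(seq) for seq in cleaned}) > 1:
        raise ValueError("all sequences must have the same length")
    ohs = []
    for seq in cleaned:
        rows = []
        for aa in seq:
            row = [0.0] * len(vocab)
            row[vocab[aa]] = 1.0  # mark the character's index
            rows.append(row)
        ohs.append(rows)
    return ohs


def decode(ohs: OneHot, vocab: Dict[str, int]) -> List[str]:
    """Map each one-hot row back to its character, joined by single spaces."""
    chars = {idx: aa for aa, idx in vocab.items()}
    return [" ".join(chars[row.index(max(row))] for row in oh) for oh in ohs]


vocab = {aa: i for i, aa in enumerate("ACDE")}  # assumed index order
ohs = encode(["A C E", "DDA"], vocab)
print(decode(ohs, vocab))
```

Note that the decoded strings are space-separated regardless of whether the input contained spaces, so a round trip normalizes `"DDA"` to `"D D A"` in this sketch.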