Tokenizers
Each expert requires a tokenizer. The tokenizer tells EvoProtGrad which amino acid ordering the expert model was trained to use, and it is called internally to convert a list of strings (the variants) into Torch tensors for the expert model.
- We provide a default tokenizer class, OneHotTokenizer, that can be used to convert a list of strings of amino acids into Torch one-hot tensors in the canonical order.
- HuggingFace experts expect a tokenizer of type PreTrainedTokenizerBase to be provided.
- You can define your own custom tokenizer class by subclassing evo_prot_grad.common.tokenizers.ExpertTokenizer and implementing the __call__ and decode methods. See OneHotTokenizer for an example.
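As a rough illustration of what the __call__ and decode methods are responsible for, here is a minimal sketch of a one-hot tokenizer. It is not the library's OneHotTokenizer and does not subclass ExpertTokenizer: the class name SketchOneHotTokenizer is hypothetical, and plain Python lists stand in for Torch tensors so the example has no dependencies. The constructor signature is likewise an assumption made for this sketch.

```python
from typing import List

# Canonical amino acid ordering used throughout this page.
CANONICAL_ALPHABET = list("ACDEFGHIKLMNPQRSTVWY")


class SketchOneHotTokenizer:
    """Hypothetical stand-in showing the __call__/decode contract.

    Nested lists of floats are used in place of Torch tensors.
    """

    def __init__(self, alphabet: List[str] = CANONICAL_ALPHABET):
        self.alphabet = list(alphabet)
        # Map each amino acid to its column in the one-hot encoding.
        self.index = {aa: i for i, aa in enumerate(self.alphabet)}

    def __call__(self, seqs: List[str]) -> List[List[List[float]]]:
        """Encode each sequence as a (length x alphabet size) one-hot matrix."""
        batch = []
        for seq in seqs:
            rows = []
            for aa in seq:
                row = [0.0] * len(self.alphabet)
                row[self.index[aa]] = 1.0
                rows.append(row)
            batch.append(rows)
        return batch

    def decode(self, one_hots: List[List[List[float]]]) -> List[str]:
        """Invert __call__ by taking the argmax of every position."""
        return [
            "".join(self.alphabet[row.index(max(row))] for row in rows)
            for rows in one_hots
        ]


tokenizer = SketchOneHotTokenizer()
encoded = tokenizer(["ACD", "WY"])
decoded = tokenizer.decode(encoded)
```

A real custom tokenizer would return whatever tensor type the expert model consumes; the point here is only that decode should invert __call__ for valid sequences.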
Canonicalizing amino acid sequence order
A subtlety with combining different one-hot protein sequence models is that each model may use a different amino acid alphabet. For example, one model may use the alphabet 'ACDEFGHIKLMNPQRSTVWY' while another may have extra <start> and <end> tokens or use a different order. If the entries of the one-hot encoded protein sequences for each model do not align, we cannot sum their gradients together for gradient-based MCMC.
To address this, we define a "canonical" amino acid ordering, 'ACDEFGHIKLMNPQRSTVWY', and canonicalize the one-hot encoded sequences for each expert model to this ordering. Each expert internally computes and maintains a binary matrix that maps one-hot encoded tensors between the canonical ordering and the ordering used by the expert model (the binary matrix computation code is in evo_prot_grad.common.utils.expert_alphabet_to_canonical). This matrix is applied to Torch one-hot tensors via matrix multiplication in a differentiable manner.
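The idea of a binary alignment matrix can be sketched as follows. This is an illustration under stated assumptions, not the code in evo_prot_grad.common.utils: the helper names (alphabet_to_canonical_matrix, matmul, transpose, one_hot) and the example expert alphabet are hypothetical, and plain Python lists are used instead of Torch tensors, so differentiability is not demonstrated here.

```python
from typing import List

CANONICAL = list("ACDEFGHIKLMNPQRSTVWY")


def alphabet_to_canonical_matrix(expert_alphabet: List[str]) -> List[List[float]]:
    """Binary matrix of shape (len(expert_alphabet), len(CANONICAL)).

    Entry [i][j] is 1.0 when expert token i is canonical amino acid j.
    Tokens with no canonical counterpart (e.g. <start>) get an all-zero row.
    """
    return [[1.0 if tok == aa else 0.0 for aa in CANONICAL] for tok in expert_alphabet]


def matmul(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    """Plain-Python matrix product, standing in for a tensor matmul."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: List[List[float]]) -> List[List[float]]:
    return [list(col) for col in zip(*m)]


def one_hot(seq: str, alphabet: List[str]) -> List[List[float]]:
    return [[1.0 if tok == aa else 0.0 for tok in alphabet] for aa in seq]


# Hypothetical expert alphabet: two special tokens, then reversed amino acid order.
expert_alphabet = ["<start>", "<end>"] + list(reversed(CANONICAL))
M = alphabet_to_canonical_matrix(expert_alphabet)

expert_oh = one_hot("ACW", expert_alphabet)       # 3 x 22, expert ordering
canonical_oh = matmul(expert_oh, M)               # 3 x 20, canonical ordering
round_trip = matmul(canonical_oh, transpose(M))   # back to 3 x 22
```

Because the mapping is a single matrix product, the same operation applied to Torch tensors lets gradients flow through it, which is what allows gradients from experts with different alphabets to be summed in the canonical ordering.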