evo_prot_grad.experts
Expert
evo_prot_grad.experts.base_experts.Expert
Bases: ABC
Defines a common interface for any type of expert.
__init__(temperature: float, model: nn.Module, vocab: Dict, scoring_strategy: str, device: str = 'cpu')
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `temperature` | `float` | Hyperparameter for re-scaling this expert in the Product of Experts. | required |
| `model` | `nn.Module` | The model to use for the expert. | required |
| `vocab` | `Dict` | The vocabulary for the expert. | required |
| `scoring_strategy` | `str` | The approach used to score mutations with this expert. | required |
| `device` | `str` | The device to use for the expert. | `'cpu'` |
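For intuition on `temperature`: in a Product of Experts, each expert's score is re-scaled by its temperature before the scores are combined into the composite objective the sampler follows. A sketch of that combination, in our own notation rather than the library's:

```latex
% Product of k experts: expert i assigns a score s_i(x) to sequence x and is
% re-weighted by its temperature T_i. The composite (unnormalized) log-density
% followed by the sampler is a temperature-weighted sum of per-expert scores:
\[
  \log p_{\mathrm{PoE}}(x) \;\propto\; \sum_{i=1}^{k} T_i \, s_i(x)
\]
```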
_get_last_one_hots() -> torch.Tensor
abstractmethod
Abstract method that returns the one-hot tensors most recently passed as input to this expert.
The one-hot tensors are cached by, and retrieved from, an `evo_prot_grad.common.embeddings.OneHotEmbedding` module, which each expert is configured to use.
Warning
This assumes that the desired one-hot tensors are the last tensors passed as input to the expert. If the expert is called twice, this will return the one-hot tensors from the second call. This behavior is intended to address the fact that some experts take lists of strings as input and internally convert them into one-hot tensors.
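A minimal, self-contained sketch of the caching pattern described above (illustrative only; the library's `OneHotEmbedding` may differ in its exact interface):

```python
import torch
import torch.nn as nn

class CachingOneHotEmbedding(nn.Module):
    """Embeds token ids by one-hot encoding them first, caching the one-hots."""

    def __init__(self, embedding: nn.Embedding):
        super().__init__()
        self.weight = embedding.weight        # [vocab_size, embed_dim], shared with the model
        self.latest_one_hots = None           # cache read back by _get_last_one_hots

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        one_hots = nn.functional.one_hot(input_ids, num_classes=self.weight.shape[0]).float()
        one_hots.requires_grad_(True)         # gradients can flow back to the one-hots
        self.latest_one_hots = one_hots       # overwritten on every forward pass (see Warning)
        return one_hots @ self.weight         # same output as a standard embedding lookup

# An expert's _get_last_one_hots would then simply return `latest_one_hots`
# from the embedding layer it swapped in.
```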
init_wildtype(wt_seq: str) -> None
Set the one-hot encoded wildtype sequence for this expert.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `wt_seq` | `str` | The wildtype sequence. | required |
tokenize(inputs: List[str]) -> Any
abstractmethod
Tokenizes a list of protein sequences.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequences. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `tokens` | `Any` | Tokenized sequences, in whatever format the expert requires. |
get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]
abstractmethod
Abstract method that wraps the forward pass of the expert's model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequences. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `oh` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
| `model_preds` | `torch.Tensor` | Of shape [parallel_chains, *]. |
__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]
abstractmethod
Returns the expert score for a batch of protein sequences, along with the one-hot encoded input sequences with respect to which a gradient can be computed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequence strings of len [parallel_chains]. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `oh` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
| `expert_score` | `torch.Tensor` | Of shape [parallel_chains]. |
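Putting the interface together, a hedged sketch of a custom `Expert` sub-class, assuming only the constructor arguments and abstract methods documented above (the class name is hypothetical and the method bodies are deliberately left as stubs):

```python
from typing import Any, Dict, List, Tuple

import torch
import torch.nn as nn

from evo_prot_grad.experts.base_experts import Expert


class MyCustomExpert(Expert):
    """Hypothetical expert illustrating the four methods a sub-class must provide."""

    def __init__(self, temperature: float, model: nn.Module, vocab: Dict,
                 scoring_strategy: str, device: str = 'cpu'):
        super().__init__(temperature, model, vocab, scoring_strategy, device)

    def _get_last_one_hots(self) -> torch.Tensor:
        # Return the one-hots cached by the expert's OneHotEmbedding layer
        # (see the caching sketch above).
        ...

    def tokenize(self, inputs: List[str]) -> Any:
        # Convert protein strings into whatever format the wrapped model expects.
        ...

    def get_model_output(self, inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]:
        # Run the model's forward pass; return (one_hots, model_preds).
        ...

    def __call__(self, inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]:
        # Combine get_model_output with the scoring strategy; return (one_hots, expert_score).
        ...
```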
ProteinLMExpert
evo_prot_grad.experts.base_experts.ProteinLMExpert
Bases: Expert
An expert for protein language models (pLMs).
Assumes the pLM predicts a logit score for each amino acid.
Implements the abstract methods `get_model_output` and `__call__`.
Create a sub-class of this class to add a new HuggingFace pLM expert; a hedged sketch of such a sub-class appears at the end of this class's entry below.
__init__(temperature: float, model: nn.Module, vocab: Dict, scoring_strategy: str, device: str)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `temperature` | `float` | Hyperparameter for re-scaling this expert in the Product of Experts. | required |
| `model` | `nn.Module` | The model to use for the expert. | required |
| `vocab` | `Dict` | The vocabulary for the expert. | required |
| `scoring_strategy` | `str` | The approach used to score mutations with this expert. | required |
| `device` | `str` | The device to use for the expert. | required |
get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]
Returns the one-hot sequences and logits for each amino acid in the input sequence.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequence strings of len [parallel_chains]. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `x_oh` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
| `logits` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]
Returns the one-hot sequences and expert score. Assumes the pLM predicts a logit score for each amino acid.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequence strings of len [parallel_chains]. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `oh` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
| `expert_score` | `torch.Tensor` | Of shape [parallel_chains]. |
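A hedged sketch of what a new HuggingFace pLM sub-class might look like, following the pattern of the BERT, causal-LM, and ESM experts below. The checkpoint name, the embedding attribute path, and the `one_hots` accessor on `OneHotEmbedding` are assumptions for illustration, not guaranteed API:

```python
from typing import List

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

from evo_prot_grad.common.embeddings import OneHotEmbedding
from evo_prot_grad.experts.base_experts import ProteinLMExpert


class MyMaskedLMExpert(ProteinLMExpert):
    """Hypothetical expert for a masked protein LM hosted on the HuggingFace Hub."""

    def __init__(self, temperature: float, scoring_strategy: str, device: str = 'cpu'):
        model = AutoModelForMaskedLM.from_pretrained('my-org/my-protein-lm')  # hypothetical checkpoint
        tokenizer = AutoTokenizer.from_pretrained('my-org/my-protein-lm')
        super().__init__(temperature, model, tokenizer.get_vocab(), scoring_strategy, device)
        self.tokenizer = tokenizer
        # Swap the input embedding for a OneHotEmbedding so one-hots are cached.
        # The attribute path depends on the architecture (cf. BERTExpert and EsmExpert below).
        self.model.base_model.embeddings.word_embeddings = OneHotEmbedding(
            self.model.base_model.embeddings.word_embeddings
        )

    def _get_last_one_hots(self) -> torch.Tensor:
        # Retrieve the cached one-hots from the swapped-in embedding layer
        # (accessor name assumed; check OneHotEmbedding's interface).
        return self.model.base_model.embeddings.word_embeddings.one_hots

    def tokenize(self, inputs: List[str]):
        # Assumes the base class stores the constructor's `device` as self.device.
        return self.tokenizer(inputs, return_tensors='pt').to(self.device)
```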
BERTExpert
evo_prot_grad.experts.bert_expert.BERTExpert
Bases: ProteinLMExpert
Expert sub-class for BERT-style HuggingFace protein language models.
Implements the abstract methods `_get_last_one_hots` and `tokenize`.
Swaps out the `BertForMaskedLM.bert.embeddings.word_embeddings` layer for an `evo_prot_grad.common.embeddings.OneHotEmbedding` layer.
__init__(temperature: float, scoring_strategy: str, model: Optional[nn.Module] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, device: str = 'cpu')
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `temperature` | `float` | Temperature for sampling from the expert. | required |
| `scoring_strategy` | `str` | Approach for scoring variants that the expert will use. | required |
| `model` | `nn.Module` | The model to use for the expert. | `None` |
| `tokenizer` | `PreTrainedTokenizerBase` | The tokenizer to use for the expert. | `None` |
| `device` | `str` | The device to use for the expert. | `'cpu'` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If either `model` or `tokenizer` is provided without the other. |
_get_last_one_hots() -> torch.Tensor
Returns the one-hot tensors most recently passed as input.
Returns:

| Name | Type | Description |
|---|---|---|
| `one_hots` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
tokenize(inputs) -> BatchEncoding
Convert inputs to a format suitable for the model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequence strings of len [parallel_chains]. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `batch_encoding` | `BatchEncoding` | A BatchEncoding object. |
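A hedged usage sketch. The `Rostlab/prot_bert` checkpoint, the space-separated residue format, and the `'mutant_marginal'` scoring strategy are assumptions for illustration; this reference does not list the valid `scoring_strategy` values or a default BERT checkpoint:

```python
from transformers import BertForMaskedLM, BertTokenizer

from evo_prot_grad.experts.bert_expert import BERTExpert

# Load a BERT-style protein LM (checkpoint name is an assumption, not a library default).
model = BertForMaskedLM.from_pretrained('Rostlab/prot_bert')
tokenizer = BertTokenizer.from_pretrained('Rostlab/prot_bert', do_lower_case=False)

expert = BERTExpert(
    temperature=1.0,
    scoring_strategy='mutant_marginal',  # assumed value; consult the library for valid options
    model=model,
    tokenizer=tokenizer,
    device='cpu',
)

wt = 'M K T A Y I A K Q R'               # residues space-separated for ProtBERT-style tokenizers
expert.init_wildtype(wt)
one_hots, score = expert([wt])           # shapes: [1, seq_len, vocab_size] and [1]
```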
CausalLMExpert
evo_prot_grad.experts.causallm_expert.CausalLMExpert
Bases: ProteinLMExpert
Expert sub-class for autoregressive (causal) HuggingFace protein language models.
Implements the abstract methods `_get_last_one_hots` and `tokenize`.
Swaps out the `AutoModelForCausalLM.transformer.embedding` layer for an `evo_prot_grad.common.embeddings.OneHotEmbedding` layer.
__init__(temperature: float, scoring_strategy: str, model: Optional[nn.Module] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, device: str = 'cpu')
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `temperature` | `float` | Temperature for sampling from the expert. | required |
| `scoring_strategy` | `str` | Approach for scoring variants that the expert will use. | required |
| `model` | `nn.Module` | The model to use for the expert. Defaults to AutoModelForCausalLM from lightonai/RITA_s. | `None` |
| `tokenizer` | `PreTrainedTokenizerBase` | The tokenizer to use for the expert. Defaults to AutoTokenizer from lightonai/RITA_s. | `None` |
| `device` | `str` | The device to use for the expert. Defaults to 'cpu'. | `'cpu'` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If either `model` or `tokenizer` is provided without the other. |
_get_last_one_hots()
Returns the one-hot tensors most recently passed as input.
tokenize(inputs: List[str]) -> BatchEncoding
Convert inputs to a format suitable for the model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequence strings of len [parallel_chains]. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `batch_encoding` | `BatchEncoding` | A BatchEncoding object. |
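A hedged usage sketch. Per the table above, leaving `model` and `tokenizer` as `None` loads the `lightonai/RITA_s` defaults; the `'mutant_marginal'` scoring strategy and the space-separated input format are assumptions. The last line illustrates the point of returning one-hots: the expert score can be differentiated with respect to them.

```python
import torch

from evo_prot_grad.experts.causallm_expert import CausalLMExpert

expert = CausalLMExpert(
    temperature=1.0,
    scoring_strategy='mutant_marginal',  # assumed value; consult the library for valid options
    device='cpu',                        # model and tokenizer default to lightonai/RITA_s
)

seqs = ['M K T A Y I A K Q R', 'M K T A Y I A K Q R']  # [parallel_chains] sequences
expert.init_wildtype(seqs[0])
one_hots, score = expert(seqs)           # shapes: [2, seq_len, vocab_size] and [2]

# Gradient of the summed expert score w.r.t. the one-hot inputs, as used by
# gradient-guided samplers in a Product of Experts.
grads = torch.autograd.grad(score.sum(), one_hots)[0]
```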
EsmExpert
evo_prot_grad.experts.esm_expert.EsmExpert
Bases: ProteinLMExpert
Expert sub-class for HuggingFace protein language models from the ESM family.
Implements the abstract methods `_get_last_one_hots` and `tokenize`.
Swaps out the `EsmForMaskedLM.esm.embeddings.word_embeddings` layer for an `evo_prot_grad.common.embeddings.OneHotEmbedding` layer.
__init__(temperature: float, scoring_strategy: str, model: Optional[nn.Module] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, device: str = 'cpu')
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `temperature` | `float` | Temperature for sampling from the expert. | required |
| `scoring_strategy` | `str` | Approach for scoring variants that the expert will use. | required |
| `model` | `nn.Module` | The model to use for the expert. Defaults to EsmForMaskedLM from facebook/esm2_t6_8M_UR50D. | `None` |
| `tokenizer` | `PreTrainedTokenizerBase` | The tokenizer to use for the expert. Defaults to AutoTokenizer from facebook/esm2_t6_8M_UR50D. | `None` |
| `device` | `str` | The device to use for the expert. Defaults to 'cpu'. | `'cpu'` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If either `model` or `tokenizer` is provided without the other. |
_get_last_one_hots() -> torch.Tensor
Returns the one-hot tensors most recently passed as input.
tokenize(inputs: List[str]) -> BatchEncoding
Convert inputs to a format suitable for the model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequence strings of len [parallel_chains]. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `batch_encoding` | `BatchEncoding` | A BatchEncoding object. |
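A hedged usage sketch. The table above names `facebook/esm2_t6_8M_UR50D` as the default checkpoint, so loading it explicitly (as here) should be equivalent to passing `None` for both `model` and `tokenizer`; the scoring strategy and input format remain assumptions:

```python
from transformers import AutoTokenizer, EsmForMaskedLM

from evo_prot_grad.experts.esm_expert import EsmExpert

model = EsmForMaskedLM.from_pretrained('facebook/esm2_t6_8M_UR50D')
tokenizer = AutoTokenizer.from_pretrained('facebook/esm2_t6_8M_UR50D')

expert = EsmExpert(
    temperature=1.0,
    scoring_strategy='mutant_marginal',  # assumed value; consult the library for valid options
    model=model,
    tokenizer=tokenizer,
    device='cpu',
)

batch = expert.tokenize(['M K T A Y I A K Q R'])   # BatchEncoding, per the table above
one_hots, score = expert(['M K T A Y I A K Q R'])  # shapes: [1, seq_len, vocab_size] and [1]
```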
AttributeExpert
evo_prot_grad.experts.base_experts.AttributeExpert
Bases: Expert
Interface for experts trained (typically with supervised learning)
to predict an attribute (e.g., activity or stability) from one-hot encoded sequences.
Implements the abstract methods `tokenize`, `get_model_output`, and `__call__`.
__init__(temperature: float, model: nn.Module, scoring_strategy: str, device: str, tokenizer: Optional[tokenizers.ExpertTokenizer] = None)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `temperature` | `float` | Hyperparameter for re-scaling this expert in the Product of Experts. | required |
| `model` | `nn.Module` | The model to use for the expert. | required |
| `scoring_strategy` | `str` | The approach used to score mutations with this expert. | required |
| `tokenizer` | `ExpertTokenizer` | The tokenizer to use for the expert. | `None` |
| `device` | `str` | The device to use for the expert. | required |
tokenize(inputs: List[str])
Tokenizes a list of protein sequences.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequences. | required |
get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]
Returns both the one-hot encoded inputs and the model's predictions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequence strings of len [parallel_chains]. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `x_oh` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
| `attribute_values` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
_get_last_one_hots() -> torch.Tensor
Returns the one-hot tensors most recently passed as input.
__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequence strings of len [parallel_chains]. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `x_oh` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
| `score` | `torch.Tensor` | Of shape [parallel_chains]. |
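To make the interface concrete, a hedged sketch of the kind of attribute model this expert wraps: an `nn.Module` that maps one-hot encoded sequences of shape [parallel_chains, seq_len, vocab_size] to a scalar attribute prediction per sequence. The architecture and names are illustrative only:

```python
import torch
import torch.nn as nn

class ToyAttributeModel(nn.Module):
    """Toy supervised regressor from one-hot sequences to a scalar attribute."""

    def __init__(self, seq_len: int, vocab_size: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                            # [B, L, V] -> [B, L * V]
            nn.Linear(seq_len * vocab_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                    # predicted attribute (e.g., stability)
        )

    def forward(self, x_oh: torch.Tensor) -> torch.Tensor:
        return self.net(x_oh)                        # [B, 1]
```

Such a module could be passed as `model` to an `AttributeExpert` sub-class like the `OneHotDownstreamRegressionExpert` documented below.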
EVCouplingsExpert
evo_prot_grad.experts.evcouplings_expert.EVCouplingsExpert
Bases: Expert
Expert class for EVCouplings Potts models. The EVCouplings library uses the canonical alphabet by default.
Implements the abstract methods `_get_last_one_hots`, `tokenize`, `get_model_output`, and `__call__`.
__init__(temperature: float, scoring_strategy: str, model: potts.EVCouplings, device: str, tokenizer: Optional[OneHotTokenizer] = None)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `temperature` | `float` | Temperature for sampling from the expert. | required |
| `scoring_strategy` | `str` | Approach for scoring variants that the expert will use. | required |
| `model` | `potts.EVCouplings` | The model to use for the expert. | required |
| `device` | `str` | The device to use for the expert. | required |
| `tokenizer` | `Optional[OneHotTokenizer]` | The tokenizer to use for the expert. If None, uses OneHotTokenizer(utils.CANONICAL_ALPHABET, device). | `None` |
__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]
Compute the wildtype-normalized Hamiltonian expert score.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequence strings of len [parallel_chains]. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `oh` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
| `expert_score` | `torch.Tensor` | Of shape [parallel_chains]. |
init_wildtype(wt_seq: str) -> None
Set the one-hot encoded wildtype sequence for this expert.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `wt_seq` | `str` | The wildtype sequence. | required |
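For intuition on the wildtype-normalized Hamiltonian score returned by `__call__`: a Potts model assigns each sequence a Hamiltonian built from single-site fields and pairwise couplings, and the expert reports the variant's Hamiltonian relative to the wildtype registered via `init_wildtype`. A sketch in our own notation, not taken verbatim from the EVCouplings library:

```latex
% Potts Hamiltonian of a sequence x, with single-site fields h_i and pairwise
% couplings J_ij, and the wildtype-normalized score for a variant x relative
% to the wildtype x^wt set via init_wildtype:
\[
  H(x) = \sum_{i} h_i(x_i) + \sum_{i < j} J_{ij}(x_i, x_j),
  \qquad
  s(x) = H(x) - H(x^{\mathrm{wt}})
\]
```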
OneHotDownstreamExpert
evo_prot_grad.experts.onehot_downstream_regression_expert.OneHotDownstreamRegressionExpert
Bases: AttributeExpert
Basic one-hot regression expert.
__init__(temperature: float, scoring_strategy: str, model: Module, device: str, tokenizer: Optional[OneHotTokenizer] = None)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `temperature` | `float` | Temperature for sampling from the expert. | required |
| `scoring_strategy` | `str` | Approach for scoring variants that the expert will use. | required |
| `model` | `Module` | The model to use for the expert. | required |
| `device` | `str` | The device to use for the expert. | required |
| `tokenizer` | `Optional[OneHotTokenizer]` | The tokenizer to use for the expert. If None, a OneHotTokenizer will be constructed. Defaults to None. | `None` |
init_wildtype(wt_seq: str) -> None
Set the one-hot encoded wildtype sequence for this expert.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `wt_seq` | `str` | The wildtype sequence. | required |
tokenize(inputs: List[str])
Tokenizes a list of protein sequences.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequences. | required |
get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]
Returns both the one-hot encoded inputs and the model's predictions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequence strings of len [parallel_chains]. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `x_oh` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
| `attribute_values` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `List[str]` | A list of protein sequence strings of len [parallel_chains]. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `x_oh` | `torch.Tensor` | Of shape [parallel_chains, seq_len, vocab_size]. |
| `score` | `torch.Tensor` | Of shape [parallel_chains]. |
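A hedged end-to-end sketch wiring a small one-hot regression model into this expert. The `'attribute_value'` scoring strategy, the space-separated residue format, and the toy model are assumptions for illustration:

```python
import torch.nn as nn

from evo_prot_grad.experts.onehot_downstream_regression_expert import (
    OneHotDownstreamRegressionExpert,
)

seq_len, vocab_size = 10, 20                    # 20 = canonical amino-acid alphabet

# Toy regression head over flattened one-hot sequences (illustrative only).
toy_model = nn.Sequential(
    nn.Flatten(),                               # [B, seq_len, vocab_size] -> [B, seq_len * vocab_size]
    nn.Linear(seq_len * vocab_size, 1),         # predicted attribute (e.g., stability) per sequence
)

expert = OneHotDownstreamRegressionExpert(
    temperature=1.0,
    scoring_strategy='attribute_value',         # assumed value; consult the library for valid options
    model=toy_model,
    device='cpu',
    tokenizer=None,                             # per the table above, a OneHotTokenizer is constructed
)

wt = 'M K T A Y I A K Q R'                      # space-separated residues (format is an assumption)
expert.init_wildtype(wt)
one_hots, score = expert([wt])                  # shapes: [1, seq_len, vocab_size] and [1]
```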