evo_prot_grad.experts


Expert

evo_prot_grad.experts.base_experts.Expert

Bases: ABC

Defines a common interface for any type of expert.

__init__(temperature: float, model: nn.Module, vocab: Dict, scoring_strategy: str, device: str = 'cpu')

Parameters:

Name Type Description Default
temperature float

Hyperparameter for re-scaling this expert in the Product of Experts.

required
model nn.Module

The model to use for the expert.

required
vocab Dict

The vocabulary for the expert.

required
scoring_strategy str

The approach used to score mutations with this expert.

required
device str

The device to use for the expert.

'cpu'
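To make the role of `temperature` more concrete, the following is a minimal sketch of how per-expert scores might be combined in a Product of Experts. It assumes each expert returns a log-space score and that `temperature` acts as a multiplicative weight on that score. The function name `product_of_experts_score` and the weighting convention are illustrative assumptions, not taken from the library.

```python
from typing import List, Sequence


def product_of_experts_score(
    expert_scores: Sequence[Sequence[float]],
    temperatures: Sequence[float],
) -> List[float]:
    """Combine per-expert log-scores into one score per chain.

    expert_scores[k][c] is the score of expert k for chain c.
    Assumption: each temperature scales its expert's log-score, so a
    product of (unnormalized) densities becomes a weighted sum of logs.
    """
    if len(expert_scores) != len(temperatures):
        raise ValueError("need exactly one temperature per expert")
    n_chains = len(expert_scores[0])
    combined = [0.0] * n_chains
    for scores, temp in zip(expert_scores, temperatures):
        for c, s in enumerate(scores):
            combined[c] += temp * s
    return combined


# Two hypothetical experts scoring three parallel chains.
poe = product_of_experts_score(
    expert_scores=[[0.5, -1.0, 2.0], [1.0, 1.0, -2.0]],
    temperatures=[1.0, 0.5],
)
```

Under this convention, lowering an expert's temperature reduces its influence on the combined score.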
_get_last_one_hots() -> torch.Tensor abstractmethod

Abstract method to be defined, which implements how the one-hot tensors most recently passed as input to this expert can be returned.

The one-hot tensors are cached and accessed from an evo_prot_grad.common.embeddings.OneHotEmbedding module, which we configure each expert to use.

Warning

This assumes that the desired one-hot tensors are the last tensors passed as input to the expert. If the expert is called twice, this returns the one-hot tensors from the second call. This design addresses the fact that some experts take lists of strings as input and internally convert them into one-hot tensors.
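The caching behaviour described in the warning can be illustrated with a small, dependency-free stand-in. `ToyOneHotEmbedding` is a hypothetical class written for this example using nested lists. It is not the library's `OneHotEmbedding` module and only mimics the "remember the most recent input" behaviour.

```python
from typing import List, Optional

OneHots = List[List[List[float]]]  # [parallel_chains, seq_len, vocab_size]


class ToyOneHotEmbedding:
    """Hypothetical stand-in that one-hot encodes token ids and caches the result."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size
        self.last_one_hots: Optional[OneHots] = None

    def __call__(self, token_ids: List[List[int]]) -> OneHots:
        one_hots = [
            [[1.0 if v == tok else 0.0 for v in range(self.vocab_size)] for tok in seq]
            for seq in token_ids
        ]
        # Overwrite the cache: only the most recent input is remembered.
        self.last_one_hots = one_hots
        return one_hots


embedding = ToyOneHotEmbedding(vocab_size=3)
embedding([[0, 1]])  # first call
embedding([[2, 2]])  # second call replaces the cached one-hots
cached = embedding.last_one_hots
```

After the second call, `cached` holds only the encoding of `[[2, 2]]`, which is the situation the warning describes.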

init_wildtype(wt_seq: str) -> None

Set the one-hot encoded wildtype sequence for this expert.

Parameters:

Name Type Description Default
wt_seq str

The wildtype sequence.

required
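As a rough illustration of what "one-hot encoded wildtype sequence" means, the sketch below encodes a sequence against a vocabulary and stores it on a toy object. The three-letter `TOY_VOCAB`, the helper `one_hot_encode`, the class `ToyExpert`, and the attribute name `wt_oh` are all assumptions made for this example.

```python
from typing import Dict, List, Optional

# Hypothetical three-letter vocabulary; real experts use a full amino-acid vocabulary.
TOY_VOCAB: Dict[str, int] = {"A": 0, "C": 1, "D": 2}


def one_hot_encode(seq: str, vocab: Dict[str, int]) -> List[List[float]]:
    """Return a [seq_len, vocab_size] one-hot matrix for a single sequence."""
    matrix = []
    for residue in seq:
        row = [0.0] * len(vocab)
        row[vocab[residue]] = 1.0  # raises KeyError for residues outside the vocab
        matrix.append(row)
    return matrix


class ToyExpert:
    """Illustrative holder for the wildtype encoding (not the library class)."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.vocab = vocab
        self.wt_oh: Optional[List[List[float]]] = None

    def init_wildtype(self, wt_seq: str) -> None:
        # Encode once and keep the result for later wildtype-relative scoring.
        self.wt_oh = one_hot_encode(wt_seq, self.vocab)


expert = ToyExpert(TOY_VOCAB)
expert.init_wildtype("ACD")
```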
tokenize(inputs: List[str]) -> Any abstractmethod

Tokenizes a list of protein sequences.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequences.

required

Returns:

Name Type Description
tokens Any

Tokenized sequences in whatever format the expert requires.
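Because the return type is `Any`, each expert decides what "tokenized" means. One simple possibility is sketched below: map every residue to an integer id and right-pad to a common length. The `<pad>` token, the toy vocabulary, and the padding convention are assumptions for illustration only.

```python
from typing import Dict, List

# Hypothetical vocabulary with an explicit padding token.
TOY_VOCAB: Dict[str, int] = {"<pad>": 0, "A": 1, "C": 2, "D": 3}


def tokenize(inputs: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Map each sequence to token ids, right-padding to the longest sequence."""
    max_len = max(len(seq) for seq in inputs)
    pad_id = vocab["<pad>"]
    return [
        [vocab[residue] for residue in seq] + [pad_id] * (max_len - len(seq))
        for seq in inputs
    ]


tokens = tokenize(["ACD", "DA"], TOY_VOCAB)
```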

get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor] abstractmethod

Abstract method to be defined, which wraps around the forward pass of the expert's model.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequences.

required

Returns:

Name Type Description
oh torch.Tensor

of shape [parallel_chains, seq_len, vocab_size]

model_preds torch.Tensor

of shape [parallel_chains, *].

__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor] abstractmethod

Return the expert score for a batch of protein sequences as well as the one-hot encoded input sequences for which a gradient can be computed.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequence strings of len [parallel_chains].

required

Returns:

Name Type Description
oh torch.Tensor

of shape [parallel_chains, seq_len, vocab_size]

expert_score torch.Tensor

of shape [parallel_chains]
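The shape contract of `__call__` can be checked with a small helper. The sketch below uses nested lists as stand-ins for tensors, and `check_expert_output` is a hypothetical function written for this example, not part of the library.

```python
from typing import List, Sequence, Tuple


def check_expert_output(
    inputs: Sequence[str],
    one_hots: List[List[List[float]]],
    scores: List[float],
    vocab_size: int,
) -> Tuple[int, int, int]:
    """Validate nested-list stand-ins for the (oh, expert_score) pair."""
    parallel_chains = len(inputs)
    if len(one_hots) != parallel_chains or len(scores) != parallel_chains:
        raise ValueError("expected one one-hot matrix and one score per input")
    seq_len = len(one_hots[0])
    for matrix in one_hots:
        if len(matrix) != seq_len or any(len(row) != vocab_size for row in matrix):
            raise ValueError("one-hots must be [parallel_chains, seq_len, vocab_size]")
    return parallel_chains, seq_len, vocab_size


shape = check_expert_output(
    inputs=["AC", "CA"],
    one_hots=[[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]],
    scores=[0.0, -1.2],
    vocab_size=2,
)
```

The helper deliberately does not compare `seq_len` with the string length, since a tokenizer may add special tokens.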

ProteinLMExpert

evo_prot_grad.experts.base_experts.ProteinLMExpert

Bases: Expert

An expert for protein language models (pLMs). Assumes the pLM predicts a logit score for each amino acid. Implements abstract methods get_model_output and __call__.

Create a sub-class of this class to add a new HuggingFace pLM expert.

__init__(temperature: float, model: nn.Module, vocab: Dict, scoring_strategy: str, device: str)

Parameters:

Name Type Description Default
temperature float

Hyperparameter for re-scaling this expert in the Product of Experts.

required
model nn.Module

The model to use for the expert.

required
vocab Dict

The vocab to use for the expert.

required
scoring_strategy str

The approach used to score mutations with this expert.

required
device str

The device to use for the expert.

required
get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]

Returns the one-hot sequences and logits for each amino acid in the input sequence.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequence strings of len [parallel_chains].

required

Returns:

Name Type Description
x_oh torch.Tensor

of shape [parallel_chains, seq_len, vocab_size]

logits torch.Tensor

of shape [parallel_chains, seq_len, vocab_size]

__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]

Returns the one-hot sequences and expert score. Assumes the pLM predicts a logit score for each amino acid.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequence strings of len [parallel_chains].

required

Returns:

Name Type Description
oh torch.Tensor

of shape [parallel_chains, seq_len, vocab_size]

expert_score torch.Tensor

of shape [parallel_chains]
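How per-residue logits turn into a single score per sequence depends on `scoring_strategy`, which this page does not enumerate. As one possible strategy, the sketch below sums the log-probabilities of the encoded residues and subtracts the same quantity for the wildtype. The function names and the choice of strategy are assumptions for illustration, and nested lists stand in for tensors.

```python
import math
from typing import List

Matrix = List[List[float]]  # [seq_len, vocab_size]


def log_softmax(row: List[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def sequence_log_prob(one_hot: Matrix, logits: Matrix) -> float:
    """Sum the log-probabilities that the logits assign to the encoded residues."""
    total = 0.0
    for oh_row, logit_row in zip(one_hot, logits):
        log_probs = log_softmax(logit_row)
        total += sum(o * lp for o, lp in zip(oh_row, log_probs))
    return total


def toy_expert_score(variant_oh: Matrix, wt_oh: Matrix, logits: Matrix) -> float:
    """Assumed strategy: variant log-probability minus wildtype log-probability."""
    return sequence_log_prob(variant_oh, logits) - sequence_log_prob(wt_oh, logits)


logits = [[2.0, 0.0], [0.0, 0.0]]      # [seq_len=2, vocab_size=2]
wt_oh = [[1.0, 0.0], [1.0, 0.0]]
variant_oh = [[0.0, 1.0], [1.0, 0.0]]  # mutation at position 0
score = toy_expert_score(variant_oh, wt_oh, logits)
```

Here the variant swaps the residue with logit 2.0 for the one with logit 0.0 at position 0, so its score relative to the wildtype is negative.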

BERTExpert

evo_prot_grad.experts.bert_expert.BERTExpert

Bases: ProteinLMExpert

Expert sub-class for BERT-style HuggingFace protein language models. Implements abstract methods _get_last_one_hots and tokenize. Swaps out the BertForMaskedLM.bert.embeddings.word_embeddings layer for an evo_prot_grad.common.embeddings.OneHotEmbedding layer.

__init__(temperature: float, scoring_strategy: str, model: Optional[nn.Module] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, device: str = 'cpu')

Parameters:

Name Type Description Default
temperature float

Temperature for sampling from the expert.

required
scoring_strategy str

Approach for scoring variants that the expert will use.

required
model nn.Module

The model to use for the expert.

None
tokenizer PreTrainedTokenizerBase

The tokenizer to use for the expert.

None
device str

The device to use for the expert.

'cpu'

Raises:

Type Description
ValueError

If either model or tokenizer is not specified.

_get_last_one_hots() -> torch.Tensor

Returns the one-hot tensors most recently passed as input.

Returns:

Name Type Description
one_hots torch.Tensor

of shape [parallel_chains, seq_len, vocab_size]

tokenize(inputs) -> BatchEncoding

Convert inputs to a format suitable for the model.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequence strings of len [parallel_chains].

required

Returns:

Name Type Description
batch_encoding BatchEncoding

A BatchEncoding object.

CausalLMExpert

evo_prot_grad.experts.causallm_expert.CausalLMExpert

Bases: ProteinLMExpert

Expert sub-class for autoregressive (causal) HuggingFace protein language models. Implements abstract methods _get_last_one_hots and tokenize. Swaps out the AutoModelForCausalLM.transformer.embedding layer for an evo_prot_grad.common.embeddings.OneHotEmbedding layer.

__init__(temperature: float, scoring_strategy: str, model: Optional[nn.Module] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, device: str = 'cpu')

Parameters:

Name Type Description Default
temperature float

Temperature for sampling from the expert.

required
scoring_strategy str

Approach for scoring variants that the expert will use.

required
model nn.Module

The model to use for the expert. Defaults to AutoModelForCausalLM from lightonai/RITA_s.

None
tokenizer PreTrainedTokenizerBase

The tokenizer to use for the expert. Defaults to AutoTokenizer from lightonai/RITA_s.

None
device str

The device to use for the expert. Defaults to 'cpu'.

'cpu'

Raises:

Type Description
ValueError

If only one of model and tokenizer is specified. Pass both, or pass neither to use the defaults.

_get_last_one_hots()

Returns the one-hot tensors most recently passed as input.

tokenize(inputs: List[str]) -> BatchEncoding

Convert inputs to a format suitable for the model.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequence strings of len [parallel_chains].

required

Returns:

Name Type Description
batch_encoding BatchEncoding

A BatchEncoding object.

EsmExpert

evo_prot_grad.experts.esm_expert.EsmExpert

Bases: ProteinLMExpert

Expert sub-class for HuggingFace protein language models from the ESM family. Implements abstract methods _get_last_one_hots and tokenize. Swaps out the EsmForMaskedLM.esm.embeddings.word_embeddings layer for an evo_prot_grad.common.embeddings.OneHotEmbedding layer.

__init__(temperature: float, scoring_strategy: str, model: Optional[nn.Module] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, device: str = 'cpu')

Parameters:

Name Type Description Default
temperature float

Temperature for sampling from the expert.

required
scoring_strategy str

Approach for scoring variants that the expert will use.

required
model nn.Module

The model to use for the expert. Defaults to EsmForMaskedLM from facebook/esm2_t6_8M_UR50D.

None
tokenizer PreTrainedTokenizerBase

The tokenizer to use for the expert. Defaults to AutoTokenizer from facebook/esm2_t6_8M_UR50D.

None
device str

The device to use for the expert. Defaults to 'cpu'.

'cpu'

Raises:

Type Description
ValueError

If only one of model and tokenizer is specified. Pass both, or pass neither to use the defaults.

_get_last_one_hots() -> torch.Tensor

Returns the one-hot tensors most recently passed as input.

tokenize(inputs: List[str]) -> BatchEncoding

Convert inputs to a format suitable for the model.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequence strings of len [parallel_chains].

required

Returns:

Name Type Description
batch_encoding BatchEncoding

A BatchEncoding object.

AttributeExpert

evo_prot_grad.experts.base_experts.AttributeExpert

Bases: Expert

Interface for experts trained (typically with supervised learning) to predict an attribute (e.g., activity or stability) from one-hot encoded sequences. Implements abstract methods tokenize, get_model_output, __call__.

__init__(temperature: float, model: nn.Module, scoring_strategy: str, device: str, tokenizer: Optional[tokenizers.ExpertTokenizer] = None)

Parameters:

Name Type Description Default
temperature float

Hyperparameter for re-scaling this expert in the Product of Experts.

required
model nn.Module

The model to use for the expert.

required
scoring_strategy str

The approach used to score mutations with this expert.

required
tokenizer ExpertTokenizer

The tokenizer to use for the expert.

None
device str

The device to use for the expert.

required
tokenize(inputs: List[str])

Tokenizes a list of protein sequences.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequences.

required
get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]

Returns both the one-hot encoded inputs and the model's predictions.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequence strings of len [parallel_chains].

required

Returns:

Name Type Description
x_oh torch.Tensor

of shape [parallel_chains, seq_len, vocab_size]

attribute_values torch.Tensor

of shape [parallel_chains, *]

_get_last_one_hots() -> torch.Tensor

Returns the one-hot tensors most recently passed as input.

__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequence strings of len [parallel_chains].

required

Returns:

Name Type Description
x_oh torch.Tensor

of shape [parallel_chains, seq_len, vocab_size]

score torch.Tensor

of shape [parallel_chains]


EVCouplingsExpert

evo_prot_grad.experts.evcouplings_expert.EVCouplingsExpert

Bases: Expert

Expert class for EVCouplings Potts models. The EVCouplings library uses the canonical alphabet by default.

Implements abstract methods _get_last_one_hots, tokenize, get_model_output, __call__.

__init__(temperature: float, scoring_strategy: str, model: potts.EVCouplings, device: str, tokenizer: Optional[OneHotTokenizer] = None)

Parameters:

Name Type Description Default
temperature float

Temperature for sampling from the expert.

required
scoring_strategy str

Approach for scoring variants that the expert will use.

required
model potts.EVCouplings

The model to use for the expert.

required
device str

The device to use for the expert.

required
tokenizer Optional[OneHotTokenizer]

The tokenizer to use for the expert. If None, uses OneHotTokenizer(utils.CANONICAL_ALPHABET, device).

None
__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]

Compute the wildtype-normalized Hamiltonian expert score.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequence strings of len [parallel_chains].

required

Returns:

Name Type Description
oh torch.Tensor

of shape [parallel_chains, seq_len, vocab_size]

expert_score torch.Tensor

of shape [parallel_chains]
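A wildtype-normalized Hamiltonian score can be sketched with a tiny Potts model. The parameters below (`fields`, `couplings`, a two-letter alphabet) are made up for illustration. The sign convention and the normalization as a simple difference, H(x) - H(wt), are assumptions rather than a description of the EVCouplings implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy Potts parameters over a 2-letter alphabet and 2 positions.
ALPHABET: Dict[str, int] = {"A": 0, "C": 1}
fields: List[List[float]] = [[0.5, -0.5], [0.0, 1.0]]  # h_i(a)
couplings: Dict[Tuple[int, int], List[List[float]]] = {
    (0, 1): [[1.0, 0.0], [0.0, 2.0]],                  # J_ij(a, b) for i < j
}


def hamiltonian(seq: str) -> float:
    """H(x) = sum_i h_i(x_i) + sum_{i<j} J_ij(x_i, x_j)."""
    idx = [ALPHABET[residue] for residue in seq]
    energy = sum(fields[i][a] for i, a in enumerate(idx))
    for (i, j), j_ij in couplings.items():
        energy += j_ij[idx[i]][idx[j]]
    return energy


def wildtype_normalized_score(seq: str, wt_seq: str) -> float:
    """Score of a variant relative to the wildtype: H(x) - H(wt)."""
    return hamiltonian(seq) - hamiltonian(wt_seq)


score = wildtype_normalized_score("CC", wt_seq="AA")
```

With this normalization the wildtype itself always scores 0, so variant scores read directly as changes relative to the wildtype.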

init_wildtype(wt_seq: str) -> None

Set the one-hot encoded wildtype sequence for this expert.

Parameters:

Name Type Description Default
wt_seq str

The wildtype sequence.

required

OneHotDownstreamRegressionExpert

evo_prot_grad.experts.onehot_downstream_regression_expert.OneHotDownstreamRegressionExpert

Bases: AttributeExpert

Basic one-hot regression expert.
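To show what a one-hot regression model could look like, the sketch below predicts a scalar attribute as a linear function of the one-hot encoding, producing one value per chain. The weights, bias, and the function `linear_predict` are hypothetical. The actual model passed to this expert can be any module that accepts one-hot inputs.

```python
from typing import List

Matrix = List[List[float]]  # [seq_len, vocab_size]


def linear_predict(one_hot: Matrix, weights: Matrix, bias: float) -> float:
    """Predict a scalar attribute as bias + sum_{i,a} weights[i][a] * one_hot[i][a]."""
    return bias + sum(
        w * x
        for w_row, x_row in zip(weights, one_hot)
        for w, x in zip(w_row, x_row)
    )


weights = [[0.25, -0.5], [1.0, 0.0]]  # hypothetical fitted coefficients
bias = 0.5
batch = [
    [[1.0, 0.0], [1.0, 0.0]],         # chain 0
    [[0.0, 1.0], [0.0, 1.0]],         # chain 1
]
attribute_values = [linear_predict(x_oh, weights, bias) for x_oh in batch]
```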

__init__(temperature: float, scoring_strategy: str, model: Module, device: str, tokenizer: Optional[OneHotTokenizer] = None)

Parameters:

Name Type Description Default
temperature float

Temperature for sampling from the expert.

required
scoring_strategy str

Approach for scoring variants that the expert will use.

required
model Module

The model to use for the expert.

required
device str

The device to use for the expert.

required
tokenizer Optional[OneHotTokenizer]

The tokenizer to use for the expert. If None, a OneHotTokenizer will be constructed. Defaults to None.

None
init_wildtype(wt_seq: str) -> None

Set the one-hot encoded wildtype sequence for this expert.

Parameters:

Name Type Description Default
wt_seq str

The wildtype sequence.

required
tokenize(inputs: List[str])

Tokenizes a list of protein sequences.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequences.

required
get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]

Returns both the one-hot encoded inputs and the model's predictions.

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequence strings of len [parallel_chains].

required

Returns:

Name Type Description
x_oh torch.Tensor

of shape [parallel_chains, seq_len, vocab_size]

attribute_values torch.Tensor

of shape [parallel_chains, *]

__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]

Parameters:

Name Type Description Default
inputs List[str]

A list of protein sequence strings of len [parallel_chains].

required

Returns:

Name Type Description
x_oh torch.Tensor

of shape [parallel_chains, seq_len, vocab_size]

score torch.Tensor

of shape [parallel_chains]