evo_prot_grad.experts

Expert

`evo_prot_grad.experts.base_experts.Expert`

Bases: ABC

Defines a common interface for any type of expert.

`__init__(temperature: float, model: nn.Module, vocab: Dict, scoring_strategy: str, device: str = 'cpu')`

Parameters:

Name	Type	Description	Default
`temperature`	`float`	Hyperparameter for re-scaling this expert in the Product of Experts.	required
`model`	`nn.Module`	The model to use for the expert.	required
`vocab`	`Dict`	The vocabulary for the expert.	required
`scoring_strategy`	`str`	The approach used to score mutations with this expert.	required
`device`	`str`	The device to use for the expert.	`'cpu'`
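
To make the role of `temperature` concrete, the sketch below combines per-expert scores as a temperature-weighted sum in log space, which corresponds to a product of the experts' unnormalized densities. The weighting rule and the helper name `combine_expert_scores` are assumptions made for illustration; they are not taken from the library's source, and plain Python lists stand in for tensors.

```python
from typing import List, Sequence


def combine_expert_scores(
    scores_per_expert: Sequence[Sequence[float]],
    temperatures: Sequence[float],
) -> List[float]:
    """Temperature-weighted sum of expert scores, one total per chain.

    scores_per_expert[i][c] is the log-space score that expert i assigns
    to chain c. Adding log-scores multiplies the underlying densities.
    """
    if len(scores_per_expert) != len(temperatures):
        raise ValueError("need exactly one temperature per expert")
    n_chains = len(scores_per_expert[0])
    totals = [0.0] * n_chains
    for scores, temp in zip(scores_per_expert, temperatures):
        if len(scores) != n_chains:
            raise ValueError("all experts must score the same chains")
        for chain, score in enumerate(scores):
            totals[chain] += temp * score
    return totals


# Two hypothetical experts scoring three parallel chains.
totals = combine_expert_scores(
    [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]],
    temperatures=[1.0, 2.0],
)
```

Under this assumption, raising one expert's temperature increases its influence on the combined score relative to the others.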

`_get_last_one_hots() -> torch.Tensor` `abstractmethod`

Abstract method that subclasses must define. It returns the one-hot tensors most recently passed as input to this expert.

The one-hot tensors are cached in, and read back from, an evo_prot_grad.common.embeddings.OneHotEmbedding module, which each expert is configured to use.

Warning

This assumes that the desired one-hot tensors are the last tensors passed as input to the expert. If the expert is called twice, this will return the one-hot tensors from the second call. This is intended to address the issue that some experts take lists of strings as input and internally convert them into one-hot tensors.
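
The caching behavior described in the warning can be sketched with a toy encoder that remembers only its most recent output. This is a dependency-free illustration: `CachingOneHotEncoder` is a hypothetical class, it takes strings rather than tensors, and it is not the library's `OneHotEmbedding`.

```python
from typing import Dict, List, Optional

OneHots = List[List[List[float]]]  # [parallel_chains][seq_len][vocab_size]


class CachingOneHotEncoder:
    """Toy stand-in for a layer that remembers the last one-hots it produced."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.vocab = vocab
        self.last_one_hots: Optional[OneHots] = None

    def __call__(self, seqs: List[str]) -> OneHots:
        one_hots: OneHots = []
        for seq in seqs:
            rows = []
            for aa in seq:
                row = [0.0] * len(self.vocab)
                row[self.vocab[aa]] = 1.0
                rows.append(row)
            one_hots.append(rows)
        # Overwrite the cache: only the most recent call is kept.
        self.last_one_hots = one_hots
        return one_hots


encoder = CachingOneHotEncoder({"A": 0, "C": 1, "D": 2})
encoder(["AC"])  # first call
encoder(["DA"])  # second call replaces the cached one-hots
cached = encoder.last_one_hots
```

After the second call, `cached` holds the encoding of `"DA"` only, which is why the cache must be read before the expert is called again.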

`init_wildtype(wt_seq: str) -> None`

Set the one-hot encoded wildtype sequence for this expert.

Parameters:

Name	Type	Description	Default
`wt_seq`	`str`	The wildtype sequence.	required

`tokenize(inputs: List[str]) -> Any` `abstractmethod`

Tokenizes a list of protein sequences.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequences.	required

Returns:

Name	Type	Description
`tokens`	`Any`	The tokenized sequences, in whatever format the expert requires.
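
As a rough illustration of what a `tokenize` implementation might do, the sketch below maps each residue to an integer id through a vocabulary dictionary. The alphabet ordering, the `build_vocab` helper, and the list-of-lists return type are assumptions; real experts may return tensors or HuggingFace encodings instead.

```python
from typing import Dict, List

# The 20 standard amino acids; this ordering is an arbitrary choice.
CANONICAL_AAS = "ACDEFGHIKLMNPQRSTVWY"


def build_vocab(alphabet: str = CANONICAL_AAS) -> Dict[str, int]:
    """Map each residue letter to its position in the alphabet."""
    return {aa: idx for idx, aa in enumerate(alphabet)}


def tokenize(inputs: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Convert each sequence into a list of integer token ids."""
    tokens = []
    for seq in inputs:
        unknown = sorted(set(seq) - set(vocab))
        if unknown:
            raise ValueError(f"residues not in vocab: {unknown}")
        tokens.append([vocab[aa] for aa in seq])
    return tokens


vocab = build_vocab()
token_ids = tokenize(["ACD", "WYA"], vocab)
```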

`get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]` `abstractmethod`

Abstract method that subclasses must define. It wraps the forward pass of the expert's model.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequences.	required

Returns:

Name	Type	Description
`oh`	`torch.Tensor`	of shape [parallel_chains, seq_len, vocab_size]
`model_preds`	`torch.Tensor`	of shape [parallel_chains, *].

`__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]` `abstractmethod`

Return the expert score for a batch of protein sequences as well as the one-hot encoded input sequences for which a gradient can be computed.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequence strings of len [parallel_chains].	required

Returns:

Name	Type	Description
`oh`	`torch.Tensor`	of shape [parallel_chains, seq_len, vocab_size]
`expert_score`	`torch.Tensor`	of shape [parallel_chains]
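
The overall shape of the interface can be mirrored in a small, dependency-free analogue. In the sketch below, `ToyExpert` and `AlanineCountExpert` are hypothetical classes that only imitate the method names and return shapes documented above, using nested lists in place of `torch.Tensor`; they do not subclass or reproduce the library's `Expert`.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

OneHots = List[List[List[float]]]  # [parallel_chains][seq_len][vocab_size]


class ToyExpert(ABC):
    """Hypothetical analogue of the expert interface."""

    def __init__(self, temperature: float, vocab: Dict[str, int]) -> None:
        self.temperature = temperature
        self.vocab = vocab

    @abstractmethod
    def tokenize(self, inputs: List[str]) -> OneHots:
        """Turn sequences into the format the expert consumes."""

    @abstractmethod
    def __call__(self, inputs: List[str]) -> Tuple[OneHots, List[float]]:
        """Return the one-hots and one score per chain."""


class AlanineCountExpert(ToyExpert):
    """Scores each sequence by the number of alanines it contains."""

    def tokenize(self, inputs: List[str]) -> OneHots:
        size = len(self.vocab)
        return [
            [[1.0 if self.vocab[aa] == v else 0.0 for v in range(size)] for aa in seq]
            for seq in inputs
        ]

    def __call__(self, inputs: List[str]) -> Tuple[OneHots, List[float]]:
        oh = self.tokenize(inputs)
        ala = self.vocab["A"]
        # Summing the alanine column of the one-hots counts alanines.
        scores = [sum(row[ala] for row in seq) for seq in oh]
        return oh, scores


expert = AlanineCountExpert(temperature=1.0, vocab={"A": 0, "C": 1})
oh, scores = expert(["AAC", "CCA"])
```

Because the abstract methods are declared with `abstractmethod`, instantiating `ToyExpert` directly raises `TypeError`; only subclasses that implement every abstract method can be constructed.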

ProteinLMExpert

`evo_prot_grad.experts.base_experts.ProteinLMExpert`

Bases: Expert

An expert for protein language models (pLMs). Assumes the pLM predicts a logit score for each amino acid. Implements abstract methods get_model_output and __call__.

Create a sub-class of this class to add a new HuggingFace pLM expert.

`__init__(temperature: float, model: nn.Module, vocab: Dict, scoring_strategy: str, device: str)`

Parameters:

Name	Type	Description	Default
`temperature`	`float`	Hyperparameter for re-scaling this expert in the Product of Experts.	required
`model`	`nn.Module`	The model to use for the expert.	required
`vocab`	`Dict`	The vocab to use for the expert.	required
`scoring_strategy`	`str`	The approach used to score mutations with this expert.	required
`device`	`str`	The device to use for the expert.	required

`get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]`

Returns the one-hot sequences and logits for each amino acid in the input sequence.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequence strings of len [parallel_chains].	required

Returns:

Name	Type	Description
`x_oh`	`torch.Tensor`	of shape [parallel_chains, seq_len, vocab_size]
`logits`	`torch.Tensor`	of shape [parallel_chains, seq_len, vocab_size]

`__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]`

Returns the one-hot sequences and expert score. Assumes the pLM predicts a logit score for each amino acid.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequence strings of len [parallel_chains].	required

Returns:

Name	Type	Description
`oh`	`torch.Tensor`	of shape [parallel_chains, seq_len, vocab_size]
`expert_score`	`torch.Tensor`	of shape [parallel_chains]
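
One plausible way to turn per-residue logits into a single score per sequence is to sum the log-softmax probability of each observed residue and subtract the same quantity for the wildtype. The sketch below does this for one chain with nested lists in place of tensors. The scoring rule and the function names are assumptions for illustration; the score actually computed depends on the expert's `scoring_strategy`.

```python
import math
from typing import List

Matrix = List[List[float]]  # [seq_len][vocab_size]


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def sequence_log_likelihood(one_hot: Matrix, logits: Matrix) -> float:
    """Sum of log-probabilities of the residues selected by the one-hots."""
    total = 0.0
    for oh_row, logit_row in zip(one_hot, logits):
        log_probs = log_softmax(logit_row)
        total += sum(o * lp for o, lp in zip(oh_row, log_probs))
    return total


def wildtype_normalized_score(variant_oh: Matrix, wt_oh: Matrix, logits: Matrix) -> float:
    """Variant log-likelihood minus wildtype log-likelihood under the same logits."""
    return sequence_log_likelihood(variant_oh, logits) - sequence_log_likelihood(wt_oh, logits)


# Two positions, two-letter vocabulary; position 1 favors residue 0 by 3:1.
logits = [[0.0, 0.0], [math.log(3.0), 0.0]]
wt_oh = [[1.0, 0.0], [0.0, 1.0]]
variant_oh = [[1.0, 0.0], [1.0, 0.0]]
score = wildtype_normalized_score(variant_oh, wt_oh, logits)
```

Here the variant swaps the disfavored residue at position 1 for the favored one, so the score is positive and equal to log(3); scoring the wildtype against itself gives zero.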

BERTExpert

`evo_prot_grad.experts.bert_expert.BERTExpert`

Bases: ProteinLMExpert

Expert sub-class for BERT-style HuggingFace protein language models. Implements abstract methods _get_last_one_hots and tokenize. Swaps out the BertForMaskedLM.bert.embeddings.word_embeddings layer for an evo_prot_grad.common.embeddings.OneHotEmbedding layer.
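
One reason an index-based embedding lookup can be replaced by a layer that accepts one-hot inputs is that multiplying a one-hot row by the embedding matrix selects the same row an index lookup would, while keeping the input real-valued so that gradients can be taken with respect to it. The sketch below checks that equivalence with plain Python lists. It does not reproduce the internals of `OneHotEmbedding`; `embed_one_hot` is a hypothetical helper.

```python
from typing import List

Matrix = List[List[float]]


def embed_one_hot(one_hot: Matrix, weight: Matrix) -> Matrix:
    """Matrix product: one_hot [seq_len, vocab] times weight [vocab, dim]."""
    vocab_size = len(weight)
    dim = len(weight[0])
    return [
        [sum(row[v] * weight[v][d] for v in range(vocab_size)) for d in range(dim)]
        for row in one_hot
    ]


# A tiny embedding table: 3 tokens, 2 dimensions.
weight = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
token_ids = [2, 0]

# One-hot encode the token ids.
one_hot = [[1.0 if v == t else 0.0 for v in range(3)] for t in token_ids]

via_lookup = [weight[t] for t in token_ids]
via_matmul = embed_one_hot(one_hot, weight)
```

Both routes produce the same embeddings, but only the matrix-product route exposes a continuous input over the vocabulary dimension.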

`__init__(temperature: float, scoring_strategy: str, model: Optional[nn.Module] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, device: str = 'cpu')`

Parameters:

Name	Type	Description	Default
`temperature`	`float`	Temperature for sampling from the expert.	required
`scoring_strategy`	`str`	Approach for scoring variants that the expert will use.	required
`model`	`nn.Module`	The model to use for the expert.	`None`
`tokenizer`	`PreTrainedTokenizerBase`	The tokenizer to use for the expert.	`None`
`device`	`str`	The device to use for the expert.	`'cpu'`

Raises:

Type	Description
`ValueError`	If either `model` or `tokenizer` is not specified.

`_get_last_one_hots() -> torch.Tensor`

Returns the one-hot tensors most recently passed as input.

Returns:

Name	Type	Description
`one_hots`	`torch.Tensor`	of shape [parallel_chains, seq_len, vocab_size]

`tokenize(inputs) -> BatchEncoding`

Convert inputs to a format suitable for the model.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequence strings of len [parallel_chains].	required

Returns:

Name	Type	Description
`batch_encoding`	`BatchEncoding`	A BatchEncoding object.

CausalLMExpert

`evo_prot_grad.experts.causallm_expert.CausalLMExpert`

Bases: ProteinLMExpert

Expert sub-class for autoregressive (causal) HuggingFace protein language models. Implements abstract methods _get_last_one_hots and tokenize. Swaps out the AutoModelForCausalLM.transformer.embedding layer for an evo_prot_grad.common.embeddings.OneHotEmbedding layer.

`__init__(temperature: float, scoring_strategy: str, model: Optional[nn.Module] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, device: str = 'cpu')`

Parameters:

Name	Type	Description	Default
`temperature`	`float`	Temperature for sampling from the expert.	required
`scoring_strategy`	`str`	Approach for scoring variants that the expert will use.	required
`model`	`nn.Module`	The model to use for the expert. Defaults to AutoModelForCausalLM from lightonai/RITA_s.	`None`
`tokenizer`	`PreTrainedTokenizerBase`	The tokenizer to use for the expert. Defaults to AutoTokenizer from lightonai/RITA_s.	`None`
`device`	`str`	The device to use for the expert.	`'cpu'`

Raises:

Type	Description
`ValueError`	If only one of `model` and `tokenizer` is specified. Pass both, or omit both to use the defaults.
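
Since both `model` and `tokenizer` are documented with defaults, one reading of this error condition is that the two arguments must be supplied together or omitted together. The sketch below shows that validation pattern in isolation. It is an assumption about the intended behavior rather than the library's code: `resolve_model_and_tokenizer` and `load_defaults` are hypothetical names, and strings stand in for real model and tokenizer objects.

```python
from typing import Any, Callable, Optional, Tuple


def resolve_model_and_tokenizer(
    model: Optional[Any],
    tokenizer: Optional[Any],
    load_defaults: Callable[[], Tuple[Any, Any]],
) -> Tuple[Any, Any]:
    """Return a matching (model, tokenizer) pair or raise ValueError."""
    if model is None and tokenizer is None:
        # Neither given: fall back to the default pair.
        return load_defaults()
    if model is None or tokenizer is None:
        # Exactly one given: the pair may not match, so refuse.
        raise ValueError("`model` and `tokenizer` must be specified together")
    return model, tokenizer


def load_defaults() -> Tuple[str, str]:
    """Placeholder for loading a pretrained model and its tokenizer."""
    return "default-model", "default-tokenizer"


model, tokenizer = resolve_model_and_tokenizer(None, None, load_defaults)

try:
    resolve_model_and_tokenizer("my-model", None, load_defaults)
except ValueError as err:
    error_message = str(err)
```

Rejecting a half-specified pair avoids silently combining a custom model with a default tokenizer whose vocabulary may not match.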

`_get_last_one_hots()`

Returns the one-hot tensors most recently passed as input.

`tokenize(inputs: List[str]) -> BatchEncoding`

Convert inputs to a format suitable for the model.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequence strings of len [parallel_chains].	required

Returns:

Name	Type	Description
`batch_encoding`	`BatchEncoding`	A BatchEncoding object.

EsmExpert

`evo_prot_grad.experts.esm_expert.EsmExpert`

Bases: ProteinLMExpert

Expert base class for HuggingFace protein language models from the ESM family. Implements abstract methods _get_last_one_hots and tokenize. Swaps out the EsmForMaskedLM.esm.embeddings.word_embeddings layer for an evo_prot_grad.common.embeddings.OneHotEmbedding layer.

`__init__(temperature: float, scoring_strategy: str, model: Optional[nn.Module] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, device: str = 'cpu')`

Parameters:

Name	Type	Description	Default
`temperature`	`float`	Temperature for sampling from the expert.	required
`scoring_strategy`	`str`	Approach for scoring variants that the expert will use.	required
`model`	`nn.Module`	The model to use for the expert. Defaults to EsmForMaskedLM from facebook/esm2_t6_8M_UR50D.	`None`
`tokenizer`	`PreTrainedTokenizerBase`	The tokenizer to use for the expert. Defaults to AutoTokenizer from facebook/esm2_t6_8M_UR50D.	`None`
`device`	`str`	The device to use for the expert.	`'cpu'`

Raises:

Type	Description
`ValueError`	If only one of `model` and `tokenizer` is specified. Pass both, or omit both to use the defaults.

`_get_last_one_hots() -> torch.Tensor`

Returns the one-hot tensors most recently passed as input.

`tokenize(inputs: List[str]) -> BatchEncoding`

Convert inputs to a format suitable for the model.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequence strings of len [parallel_chains].	required

Returns:

Name	Type	Description
`batch_encoding`	`BatchEncoding`	A BatchEncoding object.

AttributeExpert

`evo_prot_grad.experts.base_experts.AttributeExpert`

Bases: Expert

Interface for experts trained (typically with supervised learning) to predict an attribute (e.g., activity or stability) from one-hot encoded sequences. Implements abstract methods tokenize, get_model_output, __call__.

`__init__(temperature: float, model: nn.Module, scoring_strategy: str, device: str, tokenizer: Optional[tokenizers.ExpertTokenizer] = None)`

Parameters:

Name	Type	Description	Default
`temperature`	`float`	Hyperparameter for re-scaling this expert in the Product of Experts.	required
`model`	`nn.Module`	The model to use for the expert.	required
`scoring_strategy`	`str`	The approach used to score mutations with this expert.	required
`tokenizer`	`ExpertTokenizer`	The tokenizer to use for the expert.	`None`
`device`	`str`	The device to use for the expert.	required

`tokenize(inputs: List[str])`

Tokenizes a list of protein sequences.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequences.	required

`get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]`

Returns both the onehot-encoded inputs and model's predictions.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequence strings of len [parallel_chains].	required

Returns:

Name	Type	Description
`x_oh`	`torch.Tensor`	of shape [parallel_chains, seq_len, vocab_size]
`attribute_values`	`torch.Tensor`	of shape [parallel_chains, *]

`_get_last_one_hots() -> torch.Tensor`

`__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]`

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequence strings of len [parallel_chains].	required

Returns:

Name	Type	Description
`x_oh`	`torch.Tensor`	of shape [parallel_chains, seq_len, vocab_size]
`score`	`torch.Tensor`	of shape [parallel_chains]
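
Returning the one-hot inputs alongside the score matters because the score can then be differentiated with respect to them. For a model that is linear in the one-hots, that gradient is simply the weight matrix, so the predicted effect of any single substitution can be read off directly. The sketch below illustrates this for one chain. The linear model and the names `predict_attribute` and `best_single_mutation` are assumptions for illustration, not part of the library.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # [seq_len][vocab_size]


def predict_attribute(one_hot: Matrix, weights: Matrix, bias: float = 0.0) -> float:
    """Linear attribute model: bias plus the sum of selected weights."""
    total = bias
    for oh_row, w_row in zip(one_hot, weights):
        total += sum(o * w for o, w in zip(oh_row, w_row))
    return total


def best_single_mutation(one_hot: Matrix, weights: Matrix) -> Tuple[int, int, float]:
    """Return (position, new residue id, predicted change) of the best substitution.

    For a linear model the gradient w.r.t. one_hot[pos][aa] is weights[pos][aa],
    so the change from residue `current` to `aa` is the difference of two weights.
    """
    best = (0, 0, float("-inf"))
    for pos, (oh_row, w_row) in enumerate(zip(one_hot, weights)):
        current = oh_row.index(1.0)
        for aa, w in enumerate(w_row):
            if aa == current:
                continue
            delta = w - w_row[current]
            if delta > best[2]:
                best = (pos, aa, delta)
    return best


weights = [[0.0, 1.0, -1.0], [0.5, 0.0, 2.0]]
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pos, aa, delta = best_single_mutation(one_hot, weights)
```

For nonlinear models the gradient gives only a first-order estimate of each substitution's effect, so the exact equality shown here no longer holds.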

EVCouplingsExpert

`evo_prot_grad.experts.evcouplings_expert.EVCouplingsExpert`

Bases: Expert

Expert class for EVCouplings Potts models. The EVCouplings library uses the canonical amino-acid alphabet by default.

Implements abstract methods _get_last_one_hots, tokenize, get_model_output, __call__.

`__init__(temperature: float, scoring_strategy: str, model: potts.EVCouplings, device: str, tokenizer: Optional[OneHotTokenizer] = None)`

Parameters:

Name	Type	Description	Default
`temperature`	`float`	Temperature for sampling from the expert.	required
`scoring_strategy`	`str`	Approach for scoring variants that the expert will use.	required
`model`	`potts.EVCouplings`	The model to use for the expert.	required
`device`	`str`	The device to use for the expert.	required
`tokenizer`	`Optional[OneHotTokenizer]`	The tokenizer to use for the expert. If None, uses OneHotTokenizer(utils.CANONICAL_ALPHABET, device).	`None`

`__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]`

Compute the wildtype-normalized Hamiltonian expert score.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequence strings of len [parallel_chains].	required

Returns:

Name	Type	Description
`oh`	`torch.Tensor`	of shape [parallel_chains, seq_len, vocab_size]
`expert_score`	`torch.Tensor`	of shape [parallel_chains]
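
A wildtype-normalized Hamiltonian can be illustrated with a tiny Potts model. In the standard Potts form, the Hamiltonian of a sequence is the sum of per-site fields plus pairwise couplings, and normalizing by the wildtype means subtracting the wildtype's Hamiltonian. The sketch below uses nested lists and a dictionary of coupling matrices; the data layout, the function names, and the sign convention (higher is better) are assumptions, not the EVCouplings data structures.

```python
from typing import Dict, List, Tuple

Fields = List[List[float]]                            # h[i][a]
Couplings = Dict[Tuple[int, int], List[List[float]]]  # J[(i, j)][a][b], i < j


def hamiltonian(seq: List[int], h: Fields, J: Couplings) -> float:
    """Sum of site fields h_i(x_i) plus pairwise couplings J_ij(x_i, x_j)."""
    energy = sum(h[i][a] for i, a in enumerate(seq))
    for (i, j), coupling in J.items():
        energy += coupling[seq[i]][seq[j]]
    return energy


def wildtype_normalized_hamiltonian(
    seq: List[int], wt: List[int], h: Fields, J: Couplings
) -> float:
    """Hamiltonian of the variant minus that of the wildtype."""
    return hamiltonian(seq, h, J) - hamiltonian(wt, h, J)


# Two positions, two residue types.
h = [[0.0, 1.0], [0.5, 0.0]]
J = {(0, 1): [[0.0, 0.25], [0.5, 0.0]]}
wt = [0, 0]
variant = [1, 0]
score = wildtype_normalized_hamiltonian(variant, wt, h, J)
```

With these numbers the wildtype's Hamiltonian is 0.5 and the variant's is 2.0, so the normalized score is 1.5; the wildtype itself always scores zero.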

`init_wildtype(wt_seq: str) -> None`

Set the one-hot encoded wildtype sequence for this expert.

Parameters:

Name	Type	Description	Default
`wt_seq`	`str`	The wildtype sequence.	required

OneHotDownstreamRegressionExpert

`evo_prot_grad.experts.onehot_downstream_regression_expert.OneHotDownstreamRegressionExpert`

Bases: AttributeExpert

Basic one-hot regression expert.

`__init__(temperature: float, scoring_strategy: str, model: Module, device: str, tokenizer: Optional[OneHotTokenizer] = None)`

Parameters:

Name	Type	Description	Default
`temperature`	`float`	Temperature for sampling from the expert.	required
`scoring_strategy`	`str`	Approach for scoring variants that the expert will use.	required
`model`	`Module`	The model to use for the expert.	required
`device`	`str`	The device to use for the expert.	required
`tokenizer`	`Optional[OneHotTokenizer]`	The tokenizer to use for the expert. If None, a OneHotTokenizer will be constructed.	`None`

`init_wildtype(wt_seq: str) -> None`

Set the one-hot encoded wildtype sequence for this expert.

Parameters:

Name	Type	Description	Default
`wt_seq`	`str`	The wildtype sequence.	required

`tokenize(inputs: List[str])`

Tokenizes a list of protein sequences.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequences.	required

`get_model_output(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]`

Returns both the onehot-encoded inputs and model's predictions.

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequence strings of len [parallel_chains].	required

Returns:

Name	Type	Description
`x_oh`	`torch.Tensor`	of shape [parallel_chains, seq_len, vocab_size]
`attribute_values`	`torch.Tensor`	of shape [parallel_chains, *]

`__call__(inputs: List[str]) -> Tuple[torch.Tensor, torch.Tensor]`

Parameters:

Name	Type	Description	Default
`inputs`	`List[str]`	A list of protein sequence strings of len [parallel_chains].	required

Returns:

Name	Type	Description
`x_oh`	`torch.Tensor`	of shape [parallel_chains, seq_len, vocab_size]
`score`	`torch.Tensor`	of shape [parallel_chains]
