Skip to content

evo_prot_grad.common.sampler

DirectedEvolution

evo_prot_grad.common.sampler.DirectedEvolution

Main class for plug and play directed evolution with gradient-based discrete MCMC.

__init__(experts: List[Expert], parallel_chains: int, n_steps: int, max_mutations: int, output: str = 'last', preserved_regions: Optional[List[Tuple[int, int]]] = None, wt_protein: Optional[str] = None, wt_fasta: Optional[str] = None, verbose: bool = False, random_seed: Optional[int] = None)

Parameters:

Name Type Description Default
experts List[Expert]

List of experts

required
parallel_chains int

number of parallel chains

required
n_steps int

number of steps to run directed evolution

required
max_mutations int

maximum mutation distance from WT, disable by setting to -1.

required
output str

output type, either 'best', 'last' or 'all'. Default is 'last'.

'last'
preserved_regions List[Tuple[int, int]]

list of tuples of (start, end) of preserved regions. Default is None.

None
wt_protein str

wt sequence as a string. Must provide one of wt_protein or wt_fasta.

None
wt_fasta str

path to fasta file containing wt sequence. Must provide one of wt_protein or wt_fasta.

None
verbose bool

whether to print verbose output. Default is False.

False
random_seed int

random seed for reproducibility. Default is None.

None

Raises:

Type Description
ValueError

if n_steps < 1.

ValueError

if neither wt_protein nor wt_fasta is provided.

ValueError

if a fasta file is passed to wt_protein argument.

ValueError

if output is not one of 'best', 'last' or 'all'.

ValueError

if no experts are provided.

ValueError

if any of the preserved regions are < 1 amino acid long.

reset()

Initialize the parallel chains of protein sequences.

_prepare_results(variants, scores, n_seqs_to_keep = None)

Prepare the results by sorting and selecting the top sequences.

Parameters:

Name Type Description Default
variants List[str]

The list of sequence variants. Shape is (n_steps, parallel_chains).

required
scores np.ndarray

The scores for the sequence variants. Shape is (n_steps, parallel_chains).

required
n_seqs_to_keep int

Number of sequences to keep. Default is None (keep all).

None

Returns:

Type Description
pd.DataFrame

DataFrame of results.

_get_variants_and_scores() -> Tuple[List[str], np.ndarray]

Get the variants and scores based on the output type.

_product_of_experts(inputs: List[str]) -> Tuple[List[torch.Tensor], torch.Tensor]

Compute the product of experts. Computes each expert score, multiplies it by the expert temperature, and aggregates the scores by summation.

Parameters:

Name Type Description Default
inputs List[str]

list of protein sequences of len [parallel_chains]

required

Returns:

Name Type Description
ohs List[torch.Tensor]

list of one-hot encoded sequences of len [parallel_chains]

PoE torch.Tensor

product of experts score of shape [parallel_chains]

_compute_gradients(ohs: List[torch.Tensor], PoE: torch.Tensor) -> torch.Tensor

Compute the gradients of the product of experts with respect to the one-hots. We put each expert's amino acid alphabet, used to construct one-hot inputs, in a canonical order before summing gradients together.

Parameters:

Name Type Description Default
ohs List[torch.Tensor]

tensor one-hot embeddings of shape [parallel_chains, seq_len, vocab_size]. List is of length # experts

required
PoE torch.Tensor

product of experts score of shape [parallel_chains]

required

Returns:

Name Type Description
grads torch.Tensor

gradients of the product of experts with respect to the one-hots.

save_results(csv_filename: str, variants: Optional[List[str]] = None, scores: Optional[np.ndarray] = None, n_seqs_to_keep: int = 10000) -> None

Save the output sequences and scores to a CSV file. Also saves the params used to run the sampler in a _params.txt file.

Parameters:

Name Type Description Default
csv_filename str

Filename for saving the results. Ends in .csv.

required
variants list of list of str

The list of sequence variants.

None
scores torch.Tensor

The scores for the sequence variants.

None
n_seqs_to_keep int

Number of sequences to keep in the results. Default is 10000.

10000
__call__() -> Tuple[List[str], np.ndarray]

Run the gradient-based MCMC sampler.

Returns:

Name Type Description
variants List[str]

list of protein sequences

scores np.ndarray

the product of expert scores for the variants