# Trying out EvoProtGrad
First, import our library:
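```python
import evo_prot_grad
```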
Create a `ProtBERT` expert from a pretrained 🤗 HuggingFace protein language model (PLM) using `evo_prot_grad.get_expert`:

```python
prot_bert_expert = evo_prot_grad.get_expert(
    'bert',
    scoring_strategy = 'mutant_marginal',
    temperature = 1.0,
    device = 'cuda')
```
The default model for the `bert` expert in EvoProtGrad is `Rostlab/prot_bert`. Normally, we would also need to specify the model and tokenizer; when using a default PLM expert, we automatically pull these from the HuggingFace Hub. The `temperature` parameter rescales the expert scores and can be used to trade off the importance of different experts. For protein language models like `prot_bert`, we have implemented two scoring strategies: `pseudolikelihood_ratio` and `mutant_marginal`. The `pseudolikelihood_ratio` strategy computes the ratio of the "pseudo" log-likelihoods of the wild-type and mutant sequences (when the PLM is a masked language model, this is not the exact log-likelihood).
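For example, to score variants with the pseudolikelihood ratio instead, switch the `scoring_strategy`:

```python
# Same default ProtBERT expert, scored by the pseudolikelihood ratio
# of wild type vs. mutant instead of the mutant marginal.
plr_expert = evo_prot_grad.get_expert(
    'bert',
    scoring_strategy = 'pseudolikelihood_ratio',
    temperature = 1.0,
    device = 'cuda')
```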
Then, we create an instance of `DirectedEvolution` and run the search, returning a list of the best variant per Markov chain (as measured by the `prot_bert` expert):
```python
variants, scores = evo_prot_grad.DirectedEvolution(
    wt_fasta = 'test/gfp.fasta',   # path to wild type fasta file
    output = 'best',               # return best, last, all variants
    experts = [prot_bert_expert],  # list of experts to compose
    parallel_chains = 1,           # number of parallel chains to run
    n_steps = 20,                  # number of MCMC steps per chain
    max_mutations = 10,            # maximum number of mutations per variant
    verbose = True                 # print debug info to command line
)()
```
This class implements PPDE, the gradient-based discrete MCMC sampler introduced in our paper.
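With `output = 'best'`, each chain contributes its single best variant. A minimal way to inspect the results (a sketch, assuming `variants` holds sequence strings and `scores` their corresponding expert scores):

```python
# Print each chain's best variant alongside its score.
for variant, score in zip(variants, scores):
    print(f'score: {score}, variant: {variant}')
```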
## Specifying the model and tokenizer
To load a HuggingFace expert with a specific model and tokenizer, provide them as arguments to `evo_prot_grad.get_expert`:
```python
from transformers import AutoTokenizer, EsmForMaskedLM

esm2_expert = evo_prot_grad.get_expert(
    'esm',
    model = EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D"),
    tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D"),
    scoring_strategy = 'mutant_marginal',
    temperature = 1.0,
    device = 'cuda')
```
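The resulting expert drops into `DirectedEvolution` just like the default one; for example, a sketch reusing the settings from the first example:

```python
# Run directed evolution with the ESM-2 expert in place of ProtBERT.
variants, scores = evo_prot_grad.DirectedEvolution(
    wt_fasta = 'test/gfp.fasta',
    output = 'best',
    experts = [esm2_expert],
    parallel_chains = 1,
    n_steps = 20,
    max_mutations = 10
)()
```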
## Composing 2+ Experts
You can compose multiple experts by passing a list of them to `DirectedEvolution`. As an example, we provide a ConvNet-based expert on the HuggingFace Hub that predicts the fluorescence of GFP variants:
```python
import evo_prot_grad
from transformers import AutoModel

prot_bert_expert = evo_prot_grad.get_expert(
    'bert',
    scoring_strategy = 'mutant_marginal',
    temperature = 1.0,
    device = 'cuda')

# onehot_downstream_regression experts predict a downstream scalar property
# from a one-hot encoding of the protein sequence
fluorescence_expert = evo_prot_grad.get_expert(
    'onehot_downstream_regression',
    temperature = 1.0,
    scoring_strategy = 'attribute_value',
    model = AutoModel.from_pretrained('NREL/avGFP-fluorescence-onehot-cnn',
                                      trust_remote_code=True),
    device = 'cuda')

variants, scores = evo_prot_grad.DirectedEvolution(
    wt_fasta = 'test/gfp.fasta',
    output = 'best',
    experts = [prot_bert_expert, fluorescence_expert],
    parallel_chains = 1,
    n_steps = 100,
    max_mutations = 10
)()
```
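Since `temperature` rescales each expert's score, it is the natural knob for weighting one expert against another when composing them. A sketch, assuming (per the rescaling described earlier) that a larger temperature gives an expert more influence:

```python
# Up-weight the fluorescence expert relative to the PLM expert.
# Assumption: temperature rescales the expert's score, so a larger
# value increases that expert's influence on the composed objective.
fluorescence_expert = evo_prot_grad.get_expert(
    'onehot_downstream_regression',
    temperature = 2.0,  # weight this expert 2x (assumed semantics)
    scoring_strategy = 'attribute_value',
    model = AutoModel.from_pretrained('NREL/avGFP-fluorescence-onehot-cnn',
                                      trust_remote_code=True),
    device = 'cuda')
```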