Touching up documentation
Patrick Emami committed Jul 10, 2024
1 parent ad8958b commit c74f33d
Showing 3 changed files with 21 additions and 18 deletions.
docs/getting_started/Trying_out_EvoProtGrad.md (10 changes: 6 additions & 4 deletions)
@@ -7,9 +7,9 @@ import evo_prot_grad
Create a `ProtBERT` expert from a pretrained 🤗 HuggingFace protein language model (PLM) using `evo_prot_grad.get_expert`:

```python
-prot_bert_expert = evo_prot_grad.get_expert('bert', temperature = 1.0, device = 'cuda')
+prot_bert_expert = evo_prot_grad.get_expert('bert', scoring_strategy = 'mutant_marginal', temperature = 1.0, device = 'cuda')
```
-The default BERT-style PLM in `EvoProtGrad` is `Rostlab/prot_bert`. Normally, we would need to also specify the model and tokenizer. When using a default PLM expert, we automatically pull these from the HuggingFace Hub. The temperature parameter rescales the expert scores and can be used to trade off the importance of different experts. For masked language models like `prot_bert`, we score variant sequences with the sum of amino acid log probabilities by default.
+The default BERT-style PLM in `EvoProtGrad` is `Rostlab/prot_bert`. Normally, we would also need to specify the model and tokenizer. When using a default PLM expert, we automatically pull these from the HuggingFace Hub. The temperature parameter rescales the expert scores and can be used to trade off the importance of different experts. For protein language models like `prot_bert`, we have implemented two scoring strategies: `pseudolikelihood_ratio` and `mutant_marginal`. The `pseudolikelihood_ratio` strategy scores a variant by the difference between the "pseudo" log-likelihoods (not the exact log-likelihood when the protein language model is a *masked* language model) of the mutant and wild-type sequences.
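
The precise definition is not shown in this diff; as a sketch, assuming the standard pseudo-log-likelihood (the sum of per-position log probabilities the model assigns to a sequence), the `pseudolikelihood_ratio` score of a mutant $X^{\text{mut}}$ of length $L$ relative to the wild type $X^{\text{wt}}$ would be:

$$
s(X^{\text{mut}}) = \sum_{i=1}^{L} \log p\left(x_i^{\text{mut}} \mid X^{\text{mut}}\right) - \sum_{i=1}^{L} \log p\left(x_i^{\text{wt}} \mid X^{\text{wt}}\right).
$$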

Then, we create an instance of `DirectedEvolution` and run the search, returning a list of the best variant per Markov chain (as measured by the `prot_bert` expert):
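
The code block itself is collapsed in this diff. Based on the repository's README, the call looks roughly like the following sketch (the FASTA path and hyperparameter values are illustrative):

```python
variants, scores = evo_prot_grad.DirectedEvolution(
    wt_fasta = 'test/gfp.fasta',   # path to a wild-type FASTA file (illustrative)
    output = 'best',               # return the best variant per chain
    experts = [prot_bert_expert],  # list of experts to compose
    parallel_chains = 1,           # number of parallel MCMC chains
    n_steps = 20,                  # number of MCMC steps per chain
    max_mutations = 10,            # maximum number of mutations per variant
    verbose = True
)()
```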

@@ -29,7 +29,7 @@ This class implements PPDE, the gradient-based discrete MCMC sampler introduced

### Specifying the model and tokenizer

-To load a HuggingFace expert with a specific model and tokenizer, provide them as arguments to `get_expert`:
+To load a HuggingFace expert with a specific model and tokenizer, provide them as arguments to [`evo_prot_grad.get_expert`](https://nrel.github.io/EvoProtGrad/api/evo_prot_grad/#get_expert):

```python
from transformers import AutoTokenizer, EsmForMaskedLM
@@ -38,6 +38,7 @@ esm2_expert = evo_prot_grad.get_expert(
'esm',
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D"),
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D"),
+scoring_strategy = 'mutant_marginal',
temperature = 1.0,
device = 'cuda')
```
@@ -51,13 +52,14 @@ You can compose multiple experts by passing them to `DirectedEvolution`:
import evo_prot_grad
from transformers import AutoModel

-prot_bert_expert = evo_prot_grad.get_expert('bert', temperature = 1.0, device = 'cuda')
+prot_bert_expert = evo_prot_grad.get_expert('bert', scoring_strategy = 'mutant_marginal', temperature = 1.0, device = 'cuda')

# 'onehot_downstream_regression' experts predict a downstream scalar property
# from a one-hot encoding of the protein sequence
fluorescence_expert = evo_prot_grad.get_expert(
'onehot_downstream_regression',
temperature = 1.0,
+scoring_strategy = 'attribute_value',
model = AutoModel.from_pretrained('NREL/avGFP-fluorescence-onehot-cnn',
trust_remote_code=True),
device = 'cuda')
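
The rest of this snippet is collapsed in the diff above. A minimal sketch of composing both experts, assuming the same `DirectedEvolution` arguments as in the earlier example:

```python
variants, scores = evo_prot_grad.DirectedEvolution(
    wt_fasta = 'test/gfp.fasta',                        # illustrative path
    output = 'best',
    experts = [prot_bert_expert, fluorescence_expert],  # product of both experts
    parallel_chains = 1,
    n_steps = 20,
    max_mutations = 10
)()
```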
docs/getting_started/experts.md (27 changes: 13 additions & 14 deletions)
@@ -33,33 +33,33 @@ $$
\log P(X) = \log F(X) + \lambda \log G(X) - \log Z.
$$

In `EvoProtGrad`, the "score" of each expert corresponds to either $\log F(X)$ or $\log G(X)$ here. In most cases, we interpret the scalar output of a neural network as the score.
In `EvoProtGrad`, the **score**"** of each expert corresponds to either $\log F(X)$ or $\log G(X)$ here. In most cases, we interpret the scalar output of a neural network as the score.
The magic of the Product of Experts formulation is that it enables us to compose arbitrary numbers of experts, essentially allowing us to "plug and play" with different experts to guide the search.

In actuality, instead of just searching for a protein variant that maximizes $P(X)$, `EvoProtGrad` uses gradient-based discrete MCMC to *sample* from $P(X)$.
MCMC is necessary for sampling from $P(X)$ because it is impractical to compute the partition function $Z$ exactly.
Uniquely to `EvoProtGrad`, as long as *all* experts are *differentiable*, our sampler can use the gradient of $\log F(X) + \lambda \log G(X)$ with respect to the one-hot protein $X$ to identify the most promising mutation to apply to $X$, which vastly speeds up MCMC convergence.
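
To make the idea concrete, here is a minimal sketch (not `EvoProtGrad`'s actual sampler code) of how a gradient with respect to a one-hot sequence can rank candidate single-site mutations via a first-order Taylor approximation:

```python
import torch

def mutation_scores(one_hot_seq: torch.Tensor, experts) -> torch.Tensor:
    """Approximate the change in total expert score for every single-site
    mutation. one_hot_seq: (L, 20) one-hot tensor with requires_grad=True.
    experts: callables mapping a one-hot sequence to a scalar score."""
    total_score = sum(expert(one_hot_seq) for expert in experts)
    (grad,) = torch.autograd.grad(total_score, one_hot_seq)
    # First-order estimate: flipping position i to amino acid j changes the
    # score by roughly grad[i, j] minus the gradient at the current residue.
    current = (grad * one_hot_seq).sum(dim=-1, keepdim=True)
    return grad - current  # (L, 20) matrix of approximate score deltas
```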

-## 🤗 HuggingFace Transformers
+## 🤗 HuggingFace Protein Language Models (PLMs)

`EvoProtGrad` provides a convenient interface for defining and using experts from the HuggingFace Hub.
-To use pretrained PLMs from the HuggingFace Hub with gradient-based discrete MCMC, we swap out each transformer's token embedding layer for a custom [one-hot token embedding layer](https://nrel.github.io/EvoProtGrad/api/common/embeddings/#onehotembedding). This enables us to compute and access gradients with respect to one-hot input protein sequences.
+In detail, we modify pretrained PLMs from the HuggingFace Hub for use with gradient-based discrete MCMC by hot-swapping the Transformer's token embedding layer for a custom [one-hot embedding layer](https://nrel.github.io/EvoProtGrad/api/common/embeddings/#onehotembedding). This enables us to compute and access gradients with respect to one-hot protein sequences.
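
A minimal sketch of this idea, assuming PyTorch (this is not the library's exact implementation):

```python
import torch
import torch.nn as nn

class OneHotEmbedding(nn.Module):
    """Replace an nn.Embedding lookup with an explicit one-hot matmul so
    gradients can flow back to the one-hot encoded input sequence."""
    def __init__(self, embedding: nn.Embedding):
        super().__init__()
        self.weight = embedding.weight  # (vocab_size, hidden_dim)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        one_hot = nn.functional.one_hot(
            input_ids, num_classes=self.weight.shape[0]
        ).to(self.weight.dtype)
        one_hot.requires_grad_(True)
        self.one_hot = one_hot  # cache so the sampler can read its .grad
        return one_hot @ self.weight  # numerically equal to the lookup
```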

-We provide a baseclass `evo_prot_grad.experts.base_experts.ProteinLMExpert` which is subclassed to support various types of HuggingFace PLMs. Currently, we provide three subclasses for
+We provide a base class `evo_prot_grad.experts.base_experts.ProteinLMExpert` which can be subclassed to support various types of HuggingFace PLMs. Currently, we provide three subclasses:

- BERT-style PLMs (`evo_prot_grad.experts.bert_expert.BertExpert`)
- CausalLM-style PLMs (`evo_prot_grad.experts.causallm_expert.CausalLMExpert`)
- ESM-style PLMs (`evo_prot_grad.experts.esm_expert.EsmExpert`)

-Each HuggingFace PLM expert has to specify the model and tokenizer to use. Defaults for each type of PLM are provided.
+To instantiate EvoProtGrad ProteinLMExperts, we provide a simple function [`evo_prot_grad.get_expert`](https://nrel.github.io/EvoProtGrad/api/evo_prot_grad/#get_expert). The name of the expert, the variant scoring strategy, and the temperature for scaling the expert score must be provided. We provide defaults for the other arguments to `get_expert`.

-For example, an ESM2 expert can be instantiated with `evo_prot_grad.get_expert` with only:
+For example, an ESM2 expert can be instantiated with:

```python
-esm2_expert = evo_prot_grad.get_expert('esm', temperature = 1.0, device = 'cuda')
+esm2_expert = evo_prot_grad.get_expert('esm', scoring_strategy = 'mutant_marginal', temperature = 1.0, device = 'cuda')
```

-using the default model `EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")` and tokenizer `AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")`.
+which uses the default ESM2 model `EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")` and tokenizer `AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")`.

To load the ESM2 expert with a specific model and tokenizer, provide them as arguments to `get_expert`:

@@ -70,6 +70,7 @@ esm2_expert = evo_prot_grad.get_expert(
'esm',
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D"),
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D"),
+scoring_strategy = 'mutant_marginal',
temperature = 1.0,
device = 'cuda')
```
@@ -94,6 +95,7 @@ evcouplings_model = EVCouplings(
evcouplings_expert = get_expert(
'evcouplings',
temperature = 1.0,
+scoring_strategy = 'attribute_value',
model = evcouplings_model)
```

@@ -110,6 +112,7 @@ onehotcnn_model = AutoModel.from_pretrained(
regression_expert = get_expert(
'onehot_downstream_regression',
temperature = 1.0,
+scoring_strategy = 'attribute_value',
model = onehotcnn_model)
```

@@ -125,16 +128,12 @@ onehotcnn_model.load_state_dict(torch.load('onehotcnn.pt'))
regression_expert = get_expert(
'onehot_downstream_regression',
temperature = 1.0,
+scoring_strategy = 'attribute_value',
model = onehotcnn_model)
```

## Choosing the Expert Temperature

The expert temperature $\lambda$ controls the relative importance of the expert in the Product of Experts. By default it is set to 1.

-!!! note
-
-    By default, the expert's score for a variant is normalized by the wild type score, i.e., we subtract the wild type score from the variant score. This is to ensure that each expert in the Product of Experts is centered around 0. If you want to use the raw expert score, set `use_without_wildtype = True` when instantiating the expert.
-
-If using wild type centering, then we recommend first trying temperatures of 1.0 for each $\lambda$.
+We recommend first trying temperatures of 1.0 for each $\lambda$, and checking whether each expert's scores are within the same order of magnitude as the other experts'. If one expert's scores are much larger or smaller than the others', adjust its temperature to rebalance the experts.
We describe a simple heuristic for selecting $\lambda$ via a grid search using a small dataset of labeled variants in Section 5.2 of our [paper](https://doi.org/10.1088/2632-2153/accacd).
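
A hypothetical sketch of that heuristic (the helpers `plm_score` and `property_score`, and the variables `variants` and `labels`, are illustrative stand-ins, not part of the EvoProtGrad API): score the labeled variants at several temperatures and keep the value whose combined scores correlate best with the labels.

```python
from scipy.stats import spearmanr

best_temperature, best_rho = None, float('-inf')
for temperature in [0.1, 0.5, 1.0, 2.0, 10.0]:
    # Combined Product-of-Experts score in log space: log F + lambda * log G
    combined = [plm_score(v) + temperature * property_score(v) for v in variants]
    rho, _ = spearmanr(combined, labels)
    if rho > best_rho:
        best_temperature, best_rho = temperature, rho
```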
evo_prot_grad/__init__.py (2 changes: 2 additions & 0 deletions)
@@ -32,6 +32,8 @@ def get_expert(expert_name: str,
expert_name = 'esm',
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t36_3B_UR50D"),
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t36_3B_UR50D"),
+scoring_strategy = 'mutant_marginal',
temperature = 1.0,
device = 'cuda'
)
```
