ESM-2 Info

Zeeshan Siddiqui

Oct 18, 2023

6 min read

This page explains how to use ESM-2 to generate embeddings, contact maps, attention weights, and per-residue logits, and how to access these capabilities with the BioLM API.

Model Background

ESM-2 is a transformer-based protein language model that achieves state-of-the-art performance across diverse protein modeling applications compared to previous models like ESM-1b and ESM-1v. As described by Lin et al. (2022), “The resulting ESM-2 model family significantly outperforms previously state-of-the-art ESM-1b (a ∼650 million parameter model) at a comparable number of parameters, and on structure prediction benchmarks it also outperforms other recent protein language models.” ESM-2 was pretrained with a masked language modeling objective on UniRef data, sampling UniRef90 member sequences across UniRef50 clusters to maximize the diversity seen during training. The model family also scales far beyond its predecessors: checkpoints range from 8 million to 15 billion parameters, whereas ESM-1b and ESM-1v are each single ~650-million-parameter, 33-layer models. To enable training at this scale, Lin et al. (2022) utilized a multi-node setup with batch sizes of up to 3.2 million tokens, exploiting the capability of transformer models to leverage large batches. Architecturally, ESM-2 remains a standard dense transformer encoder, but it replaces the learned absolute positional embeddings of earlier ESM models with rotary position embeddings. These scaling and training improvements enable ESM-2 to produce superior sequence representations compared to previous models like ESM-1v, providing new state-of-the-art capabilities for predictive modeling tasks in protein science.

Embeddings

Large language models (LLMs) like ESM-2 can generate informative, fixed-length feature representations of protein sequences. These embeddings encode relevant biological properties and can be used to make classification, regression, and other predictions, such as identifying protein folds. Embeddings are vector representations that can be extracted and utilized as inputs for a variety of downstream predictive modeling tasks, as an alternative to standard one-hot sequence encodings.

In biology, feature engineering is often heavily tailored to each application. Embeddings from pretrained language models, by contrast, provide broadly useful representations across a multitude of applications, such as predicting toxicity, functional likelihood, thermodynamic properties, structure, and more.

The BioLM API provides easy access to ESM-2 for generating insightful protein embeddings from experimental sequences and data. The service accelerates tasks like sequence similarity detection and therapeutic antibody design, simplifying the transition from protein sequence data to actionable insights. By cost-optimizing the infrastructure used to compute a variety of protein embeddings, our API lowers the barriers to leveraging advanced LLMs and accelerates the development of predictive tools built on protein sequences.
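
For illustration, here is a minimal sketch of computing ESM-2 embeddings (and predicted residue-residue contacts) locally with the open-source fair-esm package, which exposes the same model the API serves; see the BioLM documentation for the exact endpoint and request format. The sequence, checkpoint choice, and mean-pooling strategy below are illustrative assumptions, not prescriptions.

```python
# Minimal sketch: per-sequence ESM-2 embeddings with the open-source
# `fair-esm` package (pip install fair-esm). Illustrative only.
import torch
import esm

# Load the 650M-parameter ESM-2 checkpoint and its tokenizer ("alphabet").
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()  # disable dropout for deterministic inference
batch_converter = alphabet.get_batch_converter()

# An arbitrary example sequence.
data = [("example", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    # repr_layers=[33] requests the final-layer hidden states;
    # return_contacts=True also returns predicted residue-residue contacts.
    out = model(tokens, repr_layers=[33], return_contacts=True)

token_reps = out["representations"][33]  # (batch, tokens, 1280)
contacts = out["contacts"]               # (batch, L, L) contact probabilities

# Mean-pool over residues (excluding BOS/EOS) for a fixed-length embedding.
L = len(strs[0])
embedding = token_reps[0, 1 : L + 1].mean(dim=0)  # shape (1280,)
```

The resulting fixed-length vector can be fed directly to a downstream classifier or regressor in place of a one-hot encoding.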

Masked Predictions

ESM-2 was trained with a masked language modeling objective: predicting the most likely amino acid at a masked position. This capability can be used to predict the likelihood of amino acids at every position, or at specific positions of interest, during protein sequence design or modification.
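
As an illustration, the sketch below masks a single position and reads off the model's amino acid probabilities at that site, again using the open-source fair-esm package; the sequence and position are arbitrary assumptions.

```python
# Minimal sketch: masked-position amino acid prediction with ESM-2.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
pos = 10  # 0-based sequence position to mask (arbitrary example)

_, _, tokens = batch_converter([("example", seq)])
tokens[0, pos + 1] = alphabet.mask_idx  # +1 accounts for the BOS token

with torch.no_grad():
    logits = model(tokens)["logits"]  # (1, tokens, vocab)

probs = torch.softmax(logits[0, pos + 1], dim=-1)

# Rank the 20 standard amino acids by predicted likelihood at the masked site.
aa_probs = {aa: probs[alphabet.get_idx(aa)].item() for aa in "ACDEFGHIKLMNPQRSTVWY"}
print(sorted(aa_probs.items(), key=lambda kv: -kv[1])[:5])
```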

Applications of ESM-2

The powerful protein sequence embeddings generated by ESM-2 have wide-ranging applications in protein science. They can aid in predicting protein-protein interactions and designing proteins with specified binding capabilities. Additionally, ESM-2 embeddings facilitate functional annotation of uncharacterized or novel proteins, expanding knowledge of the protein universe. The embeddings can also be leveraged to anticipate the effects mutations have on protein function and stability, critical for protein design efforts (a scoring sketch follows the list below). In drug discovery, they assist target identification by revealing structural and functional similarities with known drug targets. Finally, the high-dimensional sequence representations from ESM-2 expedite comparative analysis of proteins by illuminating conserved domains and regions of interest. This is pivotal for elucidating evolutionary relationships and shared functional attributes among protein families.

  • Enzyme engineering (enzyme optimization, transfer learning, directed evolution).

  • Antibody engineering (machine learning models applied on antibody embeddings may predict affinity, expression, stability without lab assays).

  • Protein-protein interaction design (embeddings can be used to engineer proteins that interact with specific targets, such as cellular signaling proteins).

  • Membrane protein design.
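
For the mutation-effect use case above, one widely used recipe is a masked-marginal score: mask the mutated position and compare the model's log-probabilities of the mutant and wildtype residues. The sketch below, using the open-source fair-esm package, is illustrative; the helper name, sequence, and mutation are assumptions for the example.

```python
# Minimal sketch: masked-marginal mutation scoring with ESM-2.
# Positive scores suggest the model prefers the mutant residue.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

def mutation_score(seq: str, pos: int, wt: str, mut: str) -> float:
    """Log-likelihood ratio log P(mut) - log P(wt) at a masked position (0-based)."""
    assert seq[pos] == wt, "wildtype residue does not match the sequence"
    _, _, tokens = batch_converter([("example", seq)])
    tokens[0, pos + 1] = alphabet.mask_idx  # +1 accounts for the BOS token
    with torch.no_grad():
        log_probs = torch.log_softmax(model(tokens)["logits"][0, pos + 1], dim=-1)
    return (log_probs[alphabet.get_idx(mut)] - log_probs[alphabet.get_idx(wt)]).item()

seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
print(mutation_score(seq, 4, "R", "A"))  # scores the substitution R5A
```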

BioLM Benefits

  • Always-on, auto-scaling, GPU-backed APIs (Status Page); highly scalable parallelization.

  • Save money on infrastructure, GPU costs, and development time.

  • Quickly integrate multiple embeddings and ESM-2 features into your workflows.

  • Use our Chat Agents and other Web Apps to interact with bio-LLMs using no code.