The field of Computational Protein Engineering is advancing at an amazing pace. Here is a summary of the types of models, some examples, and what they try to predict.
P(structure | sequence) Models - Structure Prediction Models
Given a sequence of amino acids, these models attempt to predict the three-dimensional structure that the protein will fold into. This is a slightly under-specified problem because conditions in the cell, chaperones, and other nearby proteins have a big impact on how a protein ends up folding. There is no question, however, that these models have revolutionized our understanding of proteins and cell biology.
Proteins don’t live alone, so these models have also been extended / hacked to support protein complexes. A common technique here is to concatenate multiple sequences together, separated by a linker of roughly 20-50 residues (amino acids, not base pairs) composed of glycine or proline.
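The linker trick above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's implementation: the function name, the default linker length of 25, the all-glycine linker, and the two example sequences are all assumptions made up for the example.

```python
def join_chains_with_linker(chains, linker_length=25, linker_residue="G"):
    """Concatenate chain sequences into one pseudo-single-chain sequence.

    Returns the joined sequence plus the (start, end) index range of each
    original chain, so the linker residues can be stripped from the
    predicted structure afterwards. The end index is exclusive.
    """
    if not chains:
        raise ValueError("need at least one chain")
    linker = linker_residue * linker_length
    joined = linker.join(chains)

    spans = []
    start = 0
    for chain in chains:
        end = start + len(chain)
        spans.append((start, end))
        start = end + linker_length  # skip over the linker
    return joined, spans


# Hypothetical two-chain complex (sequences are made up for illustration).
chain_a = "MKTAYIAKQR"
chain_b = "GSHMLEDPVA"
sequence, spans = join_chains_with_linker([chain_a, chain_b], linker_length=25)
```

The joined `sequence` is what would be handed to a single-chain structure predictor, and `spans` records where each real chain sits so the linker coordinates can be discarded from the output.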
MSA Search Required
In this author’s naive opinion, the big differentiator in approach between structure prediction models is whether or not the model requires a multiple sequence alignment (MSA) as input. The first models (e.g. AlphaFold2) were MSA-based: they take a target sequence and search it against UniRef90 or another sequence database to find several hundred related sequences. These sequences are aligned, and the alignment is then used as input to the model. The goal of this search is to capture evolutionarily conserved regions, the thought being that conserved parts of the sequence correspond to conserved parts of the 3D structure.
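To make the "conserved regions" idea concrete, here is a small sketch that scores each column of a toy alignment by how often its most common residue appears. The alignment is made up, and the scoring rule is an assumption chosen for simplicity. Real MSA-based models consume the whole alignment rather than a summary score like this.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, per alignment column, the fraction of sequences that carry
    the most common residue. Gaps ("-") never count as a match."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]
        if not residues:
            scores.append(0.0)  # column is all gaps
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Toy alignment of four hypothetical homologs (made up for illustration).
toy_msa = [
    "MKTAYIAK",
    "MKSAYLAK",
    "MRTAY-AK",
    "MKTAFIAR",
]
scores = column_conservation(toy_msa)
```

Columns scoring 1.0 are perfectly conserved across the toy homologs. The column containing the gap scores lowest.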
AlphaFold2
- Release Date: July 2021
- Creators: DeepMind
- Original Paper: Highly accurate protein structure prediction with AlphaFold
AlphaFold2 (there was an earlier AlphaFold 1 that didn’t work nearly as well) is arguably one of the greatest scientific advances of recent times. It blew all the other competitors out of the water at CASP14, the 14th round of the biennial Critical Assessment of Protein Structure Prediction competition, and kicked off the recent renaissance in protein prediction (and design).
The AlphaFold2 network processes information through two “tracks”: an MSA representation and a pair representation. It also introduces the concept of “recycling”, where the network’s outputs are fed back in as inputs for another pass. This improves the quality of the predictions, but it also introduced issues related to the network not being fully “differentiable”, a problem that later models and approaches attempt to ameliorate.
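The control flow of recycling can be sketched with a toy stand-in for the network. Everything here is hypothetical: `toy_model` is a made-up one-line update rather than anything from AlphaFold2, and the default of three recycles is only an illustrative choice. As I understand the AlphaFold2 paper, gradients are stopped between recycling iterations during training, which ties in with the differentiability caveat above.

```python
def toy_model(sequence_features, previous_output):
    """Stand-in for one forward pass of a network (NOT AlphaFold2 itself):
    moves the previous estimate halfway toward the input features."""
    return [
        prev + 0.5 * (feat - prev)
        for feat, prev in zip(sequence_features, previous_output)
    ]


def predict_with_recycling(sequence_features, num_recycles=3):
    """Run the model once, then feed its output back in num_recycles times."""
    output = [0.0] * len(sequence_features)  # first pass sees an all-zero "previous output"
    for _ in range(num_recycles + 1):
        output = toy_model(sequence_features, output)
    return output


features = [1.0, 2.0]
no_recycling = predict_with_recycling(features, num_recycles=0)
with_recycling = predict_with_recycling(features, num_recycles=3)
```

In this toy, each recycled pass halves the remaining error, so `with_recycling` ends up closer to `features` than `no_recycling` does.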
RoseTTAFold
- Release Date: August 2021
- Creators: Minkyung Baek at the Baker Lab at University of Washington
- Original Paper: Accurate prediction of protein structures and interactions using a three-track neural network
RoseTTAFold is similar to AlphaFold2, but has a “three-track” network rather than AlphaFold2’s “two-track” network. Like AlphaFold2, RoseTTAFold has a track for one-dimensional (1D) sequence-level information and a track for 2D distance-map-level information. It also introduces a third track at the 3D coordinate level.
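One way to picture the three tracks is by the shape of the data each one carries for a protein of length L. The sketch below is only a mental model, not RoseTTAFold’s actual data structures. The class name, the feature sizes `d1` and `d2`, and the single (x, y, z) position per residue are all simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass
class ThreeTrackState:
    seq_1d: list     # L x d1: one feature vector per residue
    pair_2d: list    # L x L x d2: one feature vector per residue pair
    coords_3d: list  # L x 3: one (x, y, z) position per residue


def init_state(sequence, d1=4, d2=2):
    """Build zero-filled placeholders with the right shapes (d1, d2 are made up)."""
    n = len(sequence)
    return ThreeTrackState(
        seq_1d=[[0.0] * d1 for _ in range(n)],
        pair_2d=[[[0.0] * d2 for _ in range(n)] for _ in range(n)],
        coords_3d=[[0.0, 0.0, 0.0] for _ in range(n)],
    )


def shapes(state):
    """Report the dimensions of each track."""
    n = len(state.seq_1d)
    return (
        (n, len(state.seq_1d[0])),
        (n, len(state.pair_2d[0]), len(state.pair_2d[0][0])),
        (n, len(state.coords_3d[0])),
    )


state = init_state("MKTAY")
track_shapes = shapes(state)
```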
In addition, I believe the key advantage of RoseTTAFold, and what makes it useful for integration into protein engineering workflows, is that it is fully “differentiable”. As a result, it is used as the structure prediction sub-model in RFdiffusion (see below).
No MSA
The multiple sequence alignment (MSA) based models were the first to crack the structure prediction problem with accuracies that made their results useful to the larger scientific community. This “evolutionary” information proved really helpful, largely because in 2021 there were only ~200k solved 3D protein structures that could be used as training data. That is tiny compared to the training sets used by deep learning models (like GPT) in other domains such as text generation and machine translation. As a result, these initial models used MSAs and what I would describe as more rationally designed embeddings and data pre-processing to constrain the prediction problem and make it tractable with limited data. Non-MSA models skip the alignment search and predict structure from the single input sequence, which typically makes them dramatically faster than the MSA-based models.
OmegaFold
- Release Date: July 2022
- Original Paper: High-resolution de novo structure prediction from primary sequence
Personally, I discovered OmegaFold after ESMFold, but it seems its authors were the first to propose a non-MSA structure prediction network.
ESMFold
- Release Date: March 2023
- Creators: Facebook Research
- Paper: Evolutionary-scale prediction of atomic-level protein structure with a language model
- Free use & api: ESM Atlas
- Run it in Google Colab
ESMFold predicts structure from a single sequence, with no MSA as input. This makes the model dramatically faster and saves a huge amount of pre-computation and database searching for homologous sequences. In contrast to the two- and three-track networks of AlphaFold2 and RoseTTAFold, ESMFold is built on a transformer protein language model (ESM) that is trained directly on protein sequences.
TODO for the author: I’d like to better understand the pros/cons and differences of OmegaFold and ESMFold. Why do tools like RFdiffusion still use RoseTTAFold?
Rosetta - Historically interesting
- Release Date: 1997
- Creators: Baker Lab at University of Washington
- Paper:
Long before the deep learning craze, the Baker Lab and others were tackling protein structure prediction and engineering using Monte Carlo sampling. They directed the search and constrained the search space using a number of scoring functions consisting of physics-based and knowledge-based terms. It seems they first applied their methods as far back as the CASP3 challenge. The first paper was published in 1997. The package turned into more of a library that could be extended for many tasks adjacent to protein structure prediction. There was also an offshoot in Rosetta@home. Rosetta eventually became part of the Rosetta Consortium, which involves multiple academic groups around the world, but the Baker group moved on into deep learning, which is where RoseTTAFold came from.
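For a feel of what “Monte Carlo sampling guided by a scoring function” means, here is a generic Metropolis-style search over a handful of angles. It is a textbook sketch, not Rosetta’s actual protocol. The scoring function, its minimum at 60 degrees, the step size, the temperature, and the function names are all made up for illustration.

```python
import math
import random


def toy_score(angles):
    """Hypothetical scoring function: lower is better, best when every angle is 60."""
    return sum((a - 60.0) ** 2 for a in angles) / len(angles)


def metropolis_search(score_fn, start, steps=2000, step_size=10.0,
                      temperature=1.0, seed=0):
    """Randomly perturb one angle at a time and keep the best state seen."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    current = list(start)
    current_score = score_fn(current)
    best, best_score = list(current), current_score

    for _ in range(steps):
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.uniform(-step_size, step_size)
        candidate_score = score_fn(candidate)
        delta = candidate_score - current_score

        # Metropolis criterion: always accept improvements, and accept a
        # worse move with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = list(current), current_score

    return best, best_score


start = [0.0] * 5
best, best_score = metropolis_search(toy_score, start)
```

Occasionally accepting a worse move is what lets this kind of search climb out of local minima. In a real protein setting, the moves and the scoring terms are far richer than in this toy.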
Work in Progress Below!
I have a lot more to learn about this topic. I will update below as I better understand other tasks in protein design.