A Deep Unsupervised Model for Protein Design

Principal Investigators:
Dr. Noelia Ferruz
Project Manager:
Dr. Noelia Ferruz
HPC Platform used:
Project ID:
Date published:
Dr. Steffen Schmidt, Prof. Birte Höcker
1. Introduction
The design of new functional proteins could help tackle many of the problems humankind faces today, but it has so far proven very challenging [1]. Analogies between protein sequences and human languages have long been noted, and their most prominent similarities have been described. Given the tremendous success of Natural Language Processing (NLP) methods in recent years, their application to protein research opens a fresh perspective, shifting from the current energy-function-centered paradigm to an unsupervised learning approach based entirely on sequences. To explore this opportunity further, we have pre-trained a generative language model on the entire protein sequence space. We find that our language model, ProtGPT2, effectively speaks the protein language and can generate de novo sequences with natural properties in a matter of seconds.

2. Methods
ProtGPT2 is a Transformer model based on the decoder of the original Transformer architecture [3,4]. The training dataset is UniRef50 (2021_04), which contains 49.8 million sequences (split 90/10 into training and evaluation sets) tokenized with the BPE algorithm. Pre-training took four days on 128 A100 GPUs, using Adam optimization (β1 = 0.9, β2 = 0.999), a learning rate of 1e-03, and 65,536 tokens per batch (with a batch size per device of 8). Parallelism during training was handled with DeepSpeed.
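To make the tokenization step more concrete, the following is a minimal sketch of how BPE learns merges from amino-acid sequences. It is a toy illustration, not the tokenizer used for ProtGPT2: the sequences are made up, and the function names `merge_pair` and `learn_bpe_merges`, the tie-breaking rule, and the number of merges are assumptions chosen for the example.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with the fused token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` BPE merges from a list of protein sequences."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent adjacent pair; ties are broken
        # lexicographically so the result is deterministic (an assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges, corpus


# Made-up toy sequences, for illustration only.
toy_sequences = ["MKVLAAKVL", "KVLMKVL", "AAKVLA"]
merges, tokenized = learn_bpe_merges(toy_sequences, num_merges=3)
print(merges)
print(tokenized)
```

Frequently co-occurring residues are fused into longer tokens, so a sequence is represented by fewer tokens than it has residues, while concatenating the tokens still recovers the original sequence.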

3. Results
3.1. ProtGPT2 encodes de novo globular proteins
ProtGPT2 generates sequences related to natural ones, as shown by identity vs length plots (Figure 2). With an average identity of 48% to natural sequences, ProtGPT2 sequences are novel and only distantly related to the natural protein space. Nevertheless, ProtGPT2 sequences show α-helical, coil and β-sheet contents (48.64%, 39.70% and 11.66%) in line with those of natural sequences (45.19%, 41.87% and 12.93%). Furthermore, the analysis of disorder and secondary structure content revealed a similar proportion of structured elements in the ProtGPT2-generated sequences (87.59%) and in natural sequences (88.40%).
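Composition figures of this kind can be obtained by pooling per-residue secondary structure labels over a dataset. The sketch below assumes three-state annotations (H for helix, E for strand, C for coil) from some prediction tool; the source does not specify the tool or label format, and the function name `secondary_structure_content` and the example strings are hypothetical.

```python
from collections import Counter


def secondary_structure_content(ss_strings):
    """Return the percentage of helix (H), strand (E) and coil (C) residues,
    pooled over a collection of per-residue secondary-structure strings."""
    counts = Counter()
    for ss in ss_strings:
        counts.update(ss)
    total = sum(counts[label] for label in "HEC")
    if total == 0:
        raise ValueError("no H/E/C labels found")
    return {label: 100.0 * counts[label] / total for label in "HEC"}


# Made-up three-state annotations for two short sequences.
example = ["CCHHHHHHCC", "CEEEECCHHH"]
content = secondary_structure_content(example)
print(content)
```

Running the same function on a generated and a natural dataset gives two composition profiles that can be compared directly, as in the percentages reported above.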

3.2. ProtGPT2 extends the boundaries of the current protein space
We predicted the structures of 10,000 generated ProtGPT2 sequences using AlphaFold [6]. In de novo protein design, it is essential that sequences fold into stable, ordered structures. We observe that the mean pLDDT of the dataset is 63.2 when taking the best-scoring structure per sequence and 59.6 when averaging across all five predictions per sequence, suggesting that ProtGPT2 structures are well-folded. Examples of ProtGPT2 structures are shown in Figure 3. ProtGPT2 can design the exceptionally challenging cases of beta structures and membrane proteins (Fig. 3a,b). The most remarkable property of ProtGPT2 sequences, however, is that they resemble the complexity of natural proteins, with multifaceted surfaces capable of accommodating interacting molecules and substrates, thus paving the way for functionalization (Fig. 3c).
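The two dataset-level pLDDT summaries reported above differ only in how the five predictions per sequence are aggregated. A minimal sketch, assuming the mean pLDDT of each predicted model is already available per sequence; the function name `summarize_plddt`, the input layout, and the example scores are hypothetical.

```python
from statistics import mean


def summarize_plddt(plddt_per_sequence):
    """Summarize a dataset of structure predictions.

    `plddt_per_sequence` maps a sequence id to the mean pLDDT of each
    predicted model for that sequence (five models in the setup above).
    """
    best = [max(scores) for scores in plddt_per_sequence.values()]
    averaged = [mean(scores) for scores in plddt_per_sequence.values()]
    return {
        "mean_best": mean(best),     # best-scoring model per sequence
        "mean_all": mean(averaged),  # average over all models per sequence
    }


# Made-up scores for two sequences, five models each.
scores = {
    "seq_a": [70.0, 60.0, 65.0, 55.0, 50.0],
    "seq_b": [40.0, 45.0, 50.0, 35.0, 30.0],
}
summary = summarize_plddt(scores)
print(summary)
```

Because the best model per sequence is never worse than the per-sequence average, `mean_best` is always at least as large as `mean_all`, which matches the ordering of the two values reported above.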

4. Conclusions
The design of tailored proteins has enormous potential to solve biomedical and environmental problems but remains a challenging task. Given the similarities between natural languages and proteins, the application of NLP methods opens a new paradigm in protein research. We have trained an unsupervised autoregressive language model, ProtGPT2, that effectively speaks the protein language. Despite its extensive training time, ProtGPT2 only needs to be trained once and can be freely used by the community. ProtGPT2 can generate, in seconds, de novo proteins that are distant from natural proteins while preserving their qualities, which constitutes an enormous step towards the efficient design of proteins.
1.    Lechner, H., Ferruz, N. & Höcker, B. Strategies for designing non-natural enzymes and binders. Current Opinion in Chemical Biology 47, 67–76 (2018).
2.    Ferruz, N. & Höcker, B. Towards Controllable Protein design with Conditional Transformers. arXiv 2201.07338 (2022) doi:10.48550/arXiv.2201.07338.
3.    Vaswani, A. et al. Attention is all you need. in Advances in Neural Information Processing Systems 5999–6009 (2017).
4.    Radford, A. et al. Language Models are Unsupervised Multitask Learners. (2019).
5.    Ferruz, N., Schmidt, S. & Höcker, B. A deep unsupervised language model for protein design. bioRxiv 2022.03.09.483666 (2022) doi:10.1101/2022.03.09.483666.
6.    Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

Institute / Institutes:
Institut für Biochemie/NWIII AG Höcker
Universität Bayreuth
Figure 2: Pairwise sequence identities vs alignment length for the natural (A), ProtGPT2 (B) and random (C) datasets against the UniClust30 database. The HSSP curve shown in each plot is used as a reference to compare all three datasets. While natural (A) and ProtGPT2 (B) sequences show similar percentages below the curve, 93% of the sequences in the random dataset (C) do not have significantly similar sequences. Further details in [5].
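For readers unfamiliar with the HSSP curve, the sketch below evaluates it as a length-dependent identity threshold. It assumes the commonly used formulation by Rost (1999); the source does not state which variant was used for Figure 2, and the function names `hssp_threshold` and `is_above_curve` are hypothetical.

```python
import math


def hssp_threshold(alignment_length, n=0.0):
    """Percent identity on the HSSP curve for a given alignment length.

    Assumed formulation (Rost, 1999); `n` shifts the curve by a fixed
    number of percentage points.
    """
    if alignment_length <= 0:
        raise ValueError("alignment length must be positive")
    if alignment_length <= 11:
        base = 100.0
    elif alignment_length <= 450:
        exponent = -0.32 * (1.0 + math.exp(-alignment_length / 1000.0))
        base = 480.0 * alignment_length ** exponent
    else:
        base = 19.5
    return n + base


def is_above_curve(identity_percent, alignment_length):
    """True if an alignment lies above the (unshifted) HSSP curve."""
    return identity_percent > hssp_threshold(alignment_length)


# Short alignments need a much higher identity to count as significant.
for length in (20, 100, 300, 600):
    print(length, round(hssp_threshold(length), 1))
```

Under this formulation, a pairwise alignment counts as significantly similar only if its identity lies above the curve at its alignment length, which is why short alignments require much higher identity than long ones.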