gersteinlab/GenAI4Drug

A Survey of Generative AI for de novo Drug Design

Update: our paper has been accepted by Briefings in Bioinformatics!

Repository for the survey paper "A Survey of Generative AI for de novo Drug Design: New Frontiers in Molecule and Protein Generation".

Xiangru Tang1*, Howard Dai1*, Elizabeth Knight1*, Yunyang Li1, Fang Wu2, Tianxiao Li1, Mark Gerstein1

1. Yale University; 2. Stanford University
(*: Equal Contribution)

Table of Contents

[**] denotes appendix sections.

| Section  | Subsection                | Datasets | Metrics | Models |
|----------|---------------------------|----------|---------|--------|
| Molecule | Target-Agnostic Generation | Datasets | Metrics | Models |
| Molecule | Target-Aware Generation    | Datasets | Metrics | Models |
| Molecule | Conformation Generation**  | Datasets | Metrics | Models |
| Protein  | Representation Learning**  | Datasets | -       | Models |
| Protein  | Structure Prediction       | Datasets | Metrics | Models |
| Protein  | Sequence Generation        | Datasets | Metrics | Models |
| Protein  | Backbone Design            | Datasets | Metrics | Models |
| Antibody | Representation Learning**  | Datasets | -       | Models |
| Antibody | Structure Prediction**     | Datasets | Metrics | Models |
| Antibody | CDR Generation**           | Datasets | Metrics | Models |
| Peptide  | Misc. Tasks**              | -        | -       | Models |

Cite us

@article{tang2024survey,
  title={A survey of generative {AI} for de novo drug design: new frontiers in molecule and protein generation},
  author={Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark},
  journal={Briefings in Bioinformatics},
  volume={25},
  number={4},
  year={2024},
  publisher={Oxford Academic}
}

Overview of Topics

An overview of topics covered in our paper. Sections highlighted in blue can be found in the main text, while purple sections are extended sections found in the appendix.


[Figure: generative AI for drug design]

Molecule

Target-Agnostic Generation

Datasets

  • Quantum chemistry structures and properties of 134 kilo molecules (QM9)
    Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, O. Anatole von Lilienfeld
    Scientific Data (2014)

  • GEOM, energy-annotated molecular conformations for property prediction and molecular generation (GEOM)
    Simon Axelrod, Rafael Gómez-Bombarelli
    Scientific Data (2022)

Metrics

  • Quantifying the chemical beauty of drugs (QED)
    G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, Andrew L Hopkins
    Nature Chemistry (2012)

Models

  • Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules (CVAE)
    Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, Alán Aspuru-Guzik
    ACS Central Science (2018)

  • Grammar Variational Autoencoder (GVAE)
    Matt J. Kusner, Brooks Paige, José Miguel Hernández-Lobato
    ICML 2017

  • Syntax-Directed Variational Autoencoder for Structured Data (SD-VAE)
    Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, Le Song
    ICLR 2018

  • Junction Tree Variational Autoencoder for Molecular Graph Generation (JT-VAE)
    Wengong Jin, Regina Barzilay, Tommi Jaakkola
    ICML 2018

  • E(n) Equivariant Normalizing Flows (E-NF)
    Victor Garcia Satorras, Emiel Hoogeboom, Fabian Fuchs, Ingmar Posner, Max Welling
    NeurIPS 2021

  • Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules (G-SchNet)
    Niklas Gebauer, Michael Gastegger, Kristof Schütt
    NeurIPS 2019

  • Equivariant Diffusion for Molecule Generation in 3D (EDM)
    Emiel Hoogeboom, Víctor Garcia Satorras, Clément Vignac, Max Welling
    ICML 2022

  • Geometry-Complete Diffusion for 3D Molecule Generation and Optimization (GCDM)
    Alex Morehead, Jianlin Cheng
    arXiv:2302.04313 (2023)

  • MDM: Molecular Diffusion Model for 3D Molecule Generation (MDM)
    Lei Huang, Hengtong Zhang, Tingyang Xu, Ka-Chun Wong
    AAAI 2023

  • Geometric Latent Diffusion Models for 3D Molecule Generation (GeoLDM)
    Minkai Xu, Alexander S Powers, Ron O. Dror, Stefano Ermon, Jure Leskovec
    ICML 2023

  • Learning Joint 2D & 3D Diffusion Models for Complete Molecule Generation (JODO)
    Han Huang, Leilei Sun, Bowen Du, Weifeng Lv
    arXiv:2305.12347 (2023)

  • MiDi: Mixed Graph and 3D Denoising Diffusion for Molecule Generation (MiDi)
    Clement Vignac, Nagham Osman, Laura Toni, Pascal Frossard
    arXiv:2302.09048 (2023)

Target-Aware Generation

Datasets

  • Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design (CrossDocked2020)
    Paul G. Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B. Iovanisci, Ian Snyder, David R. Koes
    ACS JCIM 2020

  • ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery (ZINC20)
    John J. Irwin, Khanh G. Tang, Jennifer Young, Chinzorig Dandarchuluun, Benjamin R. Wong, Munkhzul Khurelbaatar, Yurii S. Moroz, John Mayfield, Roger A. Sayle
    ACS JCIM 2020

  • Binding MOAD (Mother Of All Databases) (Binding MOAD)
    Liegi Hu, Mark L. Benson, Richard D. Smith, Michael G. Lerner, Heather A. Carlson
    Proteins 2005

Metrics

  • AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading (AutoDock Vina)
    Oleg Trott, Arthur J. Olson
    JCC 2010

  • Quantifying the chemical beauty of drugs (QED)
    G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, Andrew L Hopkins
    Nature Chemistry (2012)

  • Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions (SAScore)
    Peter Ertl, Ansgar Schuffenhauer
    Journal of Cheminformatics 2009

Models

  • DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins (DrugGPT)
    Yuesen Li, Chengyi Gao, Xin Song, Xiangyu Wang, Yungang Xu, Suxia Han
    bioRxiv (2023)

  • Generating 3D Molecular Structures Conditional on a Receptor Binding Site with Deep Generative Models (LiGAN)
    Tomohide Masuda, Matthew Ragoza, David Ryan Koes
    arXiv:2010.14442 (2020)

  • Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets (Pocket2Mol)
    Xingang Peng, Shitong Luo, Jiaqi Guan, Qi Xie, Jian Peng, Jianzhu Ma
    ICML 2022

  • A 3D Generative Model for Structure-Based Drug Design
    Shitong Luo, Jiaqi Guan, Jianzhu Ma, Jian Peng
    NeurIPS 2021

  • 3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction (TargetDiff)
    Jiaqi Guan, Wesley Wei Qian, Xingang Peng, Yufeng Su, Jian Peng, Jianzhu Ma
    ICLR 2023

  • Structure-based Drug Design with Equivariant Diffusion Models (DiffSBDD)
    Arne Schneuing, Yuanqi Du, Charles Harris, Arian Jamasb, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Liò, Carla Gomes, Max Welling, Michael Bronstein, Bruno Correia
    arXiv:2210.13695 (2022)

Conformation Generation (appendix)

Datasets

  • GEOM, energy-annotated molecular conformations for property prediction and molecular generation (GEOM)
    Simon Axelrod, Rafael Gómez-Bombarelli
    Scientific Data 2022

  • SchNet: A continuous-filter convolutional neural network for modeling quantum interactions (ISO17)
    Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, Klaus-Robert Müller
    NeurIPS 2017

Metrics

  • Learning Neural Generative Dynamics for Molecular Conformation Generation (Coverage, Matching)
    Minkai Xu, Shitong Luo, Yoshua Bengio, Jian Peng, Jian Tang
    ICLR 2021

Models

  • Molecular Geometry Prediction using a Deep Generative Graph Neural Network (CVGAE)
    Elman Mansimov, Omar Mahmood, Seokho Kang, Kyunghyun Cho
    Scientific Reports 2019

  • A Generative Model for Molecular Distance Geometry (GraphDG)
    Gregor N. C. Simm, Jose Miguel Hernandez-Lobato
    ICML 2020

  • Learning Neural Generative Dynamics for Molecular Conformation Generation (CGCF)
    Minkai Xu, Shitong Luo, Yoshua Bengio, Jian Peng, Jian Tang
    ICLR 2021

  • GeoMol: Torsional Geometric Generation of Molecular 3D Conformer Ensembles (GeoMol)
    Octavian Ganea, Lagnajit Pattanaik, Connor Coley, Regina Barzilay, Klavs Jensen, William Green, Tommi Jaakkola
    NeurIPS 2021

  • Learning Gradient Fields for Molecular Conformation Generation (ConfGF)
    Chence Shi, Shitong Luo, Minkai Xu, Jian Tang
    ICML 2021

  • Predicting Molecular Conformation via Dynamic Graph Score Matching (DGSM)
    Shitong Luo, Chence Shi, Minkai Xu, Jian Tang
    NeurIPS 2021

  • GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation (GeoDiff)
    Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, Jian Tang
    ICLR 2022

Protein

Representation Learning (appendix)

Datasets

  • UniProt: the Universal Protein knowledgebase (UniProt)
    Rolf Apweiler, Amos Bairoch, Cathy H. Wu, Winona C. Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, Maria J. Martin, Darren A. Natale, Claire O'Donovan, Nicole Redaschi, Lai-Su L. Yeh
    Nucleic Acids Research 2004

  • OntoProtein: Protein Pretraining With Gene Ontology Embedding (ProteinKG)
    Ningyu Zhang, Zhen Bi, Xiaozhuan Liang, Siyuan Cheng, Haosen Hong, Shumin Deng, Jiazhang Lian, Qiang Zhang, Huajun Chen
    ICLR 2022

  • The Protein Data Bank (PDB)
    Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, Philip E. Bourne
    Nucleic Acids Research 2000

  • AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models (AlphaFoldDB)
    Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John Jumper, Ellen Clancy, Richard Green, Ankur Vora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard Kleywegt, Ewan Birney, Demis Hassabis, Sameer Velankar
    Nucleic Acids Research 2022

  • Pfam: The protein families database in 2021 (Pfam)
    Jaina Mistry, Sara Chuguransky, Lowri Williams, Matloob Qureshi, Gustavo A Salazar, Erik L L Sonnhammer, Silvio C E Tosatto, Lisanna Paladin, Shriya Raj, Lorna J Richardson, Robert D Finn, Alex Bateman
    Nucleic Acids Research 2021

Models

  • Unified rational protein engineering with sequence-based deep representation learning (UniRep)
    Ethan C. Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, George M. Church
    Nature Methods 2019

  • ProtTrans: Toward understanding the language of life through self-supervised learning (ProtBERT)
    Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, and Burkhard Rost
    IEEE TPAMI 2021

  • Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences (ESM-1b)
    Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, Rob Fergus
    PNAS 2021

  • MSA Transformer (MSA Transformer)
    Roshan M Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, Alexander Rives
    ICML 2021

  • Retrieved Sequence Augmentation for Protein Representation Learning (RSA)
    Chang Ma, Haiteng Zhao, Lin Zheng, Jiayi Xin, Qintong Li, Lijun Wu, Zhihong Deng, Yang Lu, Qi Liu, Lingpeng Kong
    bioRxiv (2023)

  • OntoProtein: Protein Pretraining With Gene Ontology Embedding (OntoProtein)
    Ningyu Zhang, Zhen Bi, Xiaozhuan Liang, Siyuan Cheng, Haosen Hong, Shumin Deng, Jiazhang Lian, Qiang Zhang, Huajun Chen
    ICLR 2022

  • Protein Representation Learning via Knowledge Enhanced Primary Structure Modeling (KeAP)
    Hong-Yu Zhou, Yunxiang Fu, Zhicheng Zhang, Cheng Bian, Yizhou Yu
    bioRxiv (2023)

  • Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures (IEConv)
    Pedro Hermosilla, Marco Schäfer, Matěj Lang, Gloria Fackelmann, Pere Pau Vázquez, Barbora Kozlíková, Michael Krone, Tobias Ritschel, Timo Ropinski
    ICLR 2021

  • Structure-based protein function prediction using graph convolutional networks (DeepFRI)
    Vladimir Gligorijević, P. Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Daniel Berenberg, Tommi Vatanen, Chris Chandler, Bryn C. Taylor, Ian M. Fisk, Hera Vlamakis, Ramnik J. Xavier, Rob Knight, Kyunghyun Cho, Richard Bonneau
    Nature Communications 2021

  • Protein Representation Learning by Geometric Structure Pretraining (GearNET)
    Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, Jian Tang
    arXiv:2203.06125 (2022)

Structure Prediction

Datasets

  • The Protein Data Bank (PDB)
    Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, Philip E. Bourne
    Nucleic Acids Research 2000

  • Critical assessment of methods of protein structure prediction (CASP)—Round XIV (CASP14)
    Andriy Kryshtafovych, Torsten Schwede, Maya Topf, Krzysztof Fidelis, John Moult
    Proteins 2021

  • Continuous Automated Model EvaluatiOn (CAMEO) complementing the critical assessment of structure prediction in CASP12 (CAMEO)
    Jürgen Haas, Alessandro Barbato, Dario Behringer, Gabriel Studer, Steven Roth, Martino Bertoni, Khaled Mostaguir, Rafal Gumienny, Torsten Schwede
    Proteins 2017

Metrics

  • LGA: a method for finding 3D similarities in protein structures (GDT-TS)
    Adam Zemla
    Nucleic Acids Research 2003

  • Scoring function for automated assessment of protein structure template quality (TM-score)
    Yang Zhang, Jeffrey Skolnick
    Proteins 2004

  • lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests (lDDT)
    Valerio Mariani, Marco Biasini, Alessandro Barbato, Torsten Schwede
    Bioinformatics 2013

Models

  • Highly accurate protein structure prediction with AlphaFold (AlphaFold)
    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, Demis Hassabis
    Nature 2021

  • The trRosetta server for fast and accurate protein structure prediction (trRosetta)
    Zongyang Du, Hong Su, Wenkai Wang, Lisha Ye, Hong Wei, Zhenling Peng, Ivan Anishchenko, David Baker, Jianyi Yang
    Nature Protocols 2021

  • Accurate prediction of protein structures and interactions using a three-track neural network (RoseTTAFold)
    Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N. Kinch, R. Dustin Schaeffer, Claudia Millán, Hahnbeom Park, Carson Adams, Caleb R. Glassman, Andy DeGiovanni, Jose H. Pereira, Andria V. Rodrigues, Alberdina A. van Dijk, Ana C. Ebrecht, Diederik J. Opperman, Theo Sagmeister, Christoph Buhlheller, Tea Pavkov-Keller, Manoj K. Rathinaswamy, Udit Dalwadi, Calvin K. Yip, John E. Burke, K. Christopher Garcia, Nick V. Grishin, Paul D. Adams, Randy J. Read, David Baker
    Science 2021

  • Evolutionary-scale prediction of atomic-level protein structure with a language model (ESMFold)
    Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, Alexander Rives
    Science 2023

  • EigenFold: Generative Protein Structure Prediction with Diffusion Models (EigenFold)
    Bowen Jing, Ezra Erives, Peter Pao-Huang, Gabriele Corso, Bonnie Berger, Tommi Jaakkola
    arXiv:2304.02198 (2023)

Sequence Generation

Datasets

  • The Protein Data Bank (PDB)
    Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, Philip E. Bourne
    Nucleic Acids Research 2000

  • UniProt: the Universal Protein knowledgebase (UniRef/UniParc)
    Rolf Apweiler, Amos Bairoch, Cathy H. Wu, Winona C. Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, Maria J. Martin, Darren A. Natale, Claire O'Donovan, Nicole Redaschi, Lai-Su L. Yeh
    Nucleic Acids Research 2004

  • CATH: comprehensive structural and functional annotations for genome sequences (CATH)
    Ian Sillitoe, Tony E. Lewis, Alison Cuff, Sayoni Das, Paul Ashford, Natalie L. Dawson, Nicholas Furnham, Roman A. Laskowski, David Lee, Jonathan G. Lees, Sonja Lehtinen, Romain A. Studer, Janet Thornton, Christine A. Orengo
    Nucleic Acids Research 2015

  • Direct prediction of profiles of sequences compatible to a protein structure by neural networks with fragment-based local and energy-based nonlocal profiles (TS500)
    Zhixiu Li, Yuedong Yang, Eshel Faraggi, Jian Zhan, and Yaoqi Zhou
    Proteins 2014

Models

  • ProteinVAE: Variational AutoEncoder for Translational Protein Design (ProteinVAE)
    Suyue Lyu, Shahin Sowlati-Hashjin, Michael Garton
    bioRxiv (2023)

  • ProT-VAE: Protein Transformer Variational AutoEncoder for Functional Protein Design (ProT-VAE)
    Emre Sevgen, Joshua Moller, Adrian Lange, John Parker, Sean Quigley, Jeff Mayer, Poonam Srivastava, Sitaram Gayatri, David Hosfield, Maria Korshunova, Micha Livne, Michelle Gill, Rama Ranganathan, Anthony B. Costa, Andrew L. Ferguson
    bioRxiv (2023)

  • Expanding functional protein sequence spaces using generative adversarial networks (ProteinGAN)
    Donatas Repecka, Vykintas Jauniskis, Laurynas Karpus, Elzbieta Rembeza, Irmantas Rokaitis, Jan Zrimec, Simona Poviloniene, Audrius Laurynenas, Sandra Viknander, Wissam Abuajwa, Otto Savolainen, Rolandas Meskys, Martin K. M. Engqvist, Aleksej Zelezniak
    Nature Machine Intelligence (2021)

  • Fast and flexible protein design using deep graph neural networks (ProteinSolver)
    Alexey Strokach, David Becerra, Carles Corbi-Verge, Albert Perez-Riba, Philip M. Kim
    Cell Systems 2020

  • PiFold: Toward effective and efficient protein inverse folding (PiFold)
    Zhangyang Gao, Cheng Tan, Stan Z. Li
    ICLR 2023

  • Protein sequence design with a learned potential
    Namrata Anand, Raphael Eguchi, Irimpan I. Mathews, Carla P. Perez, Alexander Derry, Russ B. Altman, Po-Ssu Huang
    Nature Communications 2022

  • Rotamer-free protein sequence design based on deep learning and self-consistency (ABACUS-R)
    Yufeng Liu, Lu Zhang, Weilun Wang, Min Zhu, Chenchen Wang, Fudong Li, Jiahai Zhang, Houqiang Li, Quan Chen, Haiyan Liu
    Nature Computational Science 2022

  • ProRefiner: an entropy-based refining strategy for inverse protein folding with global graph attention (ProRefiner)
    Xinyi Zhou, Guangyong Chen, Junjie Ye, Ercheng Wang, Jun Zhang, Cong Mao, Zhanwei Li, Jianye Hao, Xingxu Huang, Jin Tang, Pheng Ann Heng
    Nature Communications 2023

  • Graphormer supervised de novo protein design method and function validation (GPD)
    Junxi Mu, Zhengxin Li, Bo Zhang, Qi Zhang, Jamshed Iqbal, Abdul Wadood, Ting Wei, Yan Feng, Hai-Feng Chen
    Briefings in Bioinformatics 2024

  • Learning from Protein Structure with Geometric Vector Perceptrons (GVP-GNN)
    Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael John Lamarre Townshend, Ron Dror
    ICLR 2021

  • Learning inverse folding from millions of predicted structures (ESM-IF1)
    Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, Alexander Rives
    ICML 2022

  • Robust deep learning–based protein sequence design using ProteinMPNN (ProteinMPNN)
    J Dauparas, I Anishchenko, N Bennett, H Bai, R J Ragotte, L F Milles, B I M Wicky, A Courbet, R J de Haas, N Bethel, P J Y Leung, T F Huddy, S Pellock, D Tischer, F Chan, B Koepnick, H Nguyen, A Kang, B Sankaran, A K Bera, N P King, D Baker
    Science 2022

Backbone Design

Datasets

  • The Protein Data Bank (PDB)
    Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, Philip E. Bourne
    Nucleic Acids Research 2000

  • AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models (AlphaFoldDB)
    Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John Jumper, Ellen Clancy, Richard Green, Ankur Vora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard Kleywegt, Ewan Birney, Demis Hassabis, Sameer Velankar
    Nucleic Acids Research 2022

  • SCOP: A structural classification of proteins database for the investigation of sequences and structures (SCOP)
    Alexey G. Murzin, Steven E. Brenner, Tim Hubbard, Cyrus Chothia
    JMB 1995

  • SCOPe: improvements to the structural classification of proteins – extended database to facilitate variant interpretation and machine learning (SCOPe)
    John-Marc Chandonia, Lindsey Guan, Shiangyi Lin, Changhua Yu, Naomi K Fox, Steven E Brenner
    Nucleic Acids Research 2022

  • CATH: comprehensive structural and functional annotations for genome sequences (CATH)
    Ian Sillitoe, Tony E. Lewis, Alison Cuff, Sayoni Das, Paul Ashford, Natalie L. Dawson, Nicholas Furnham, Roman A. Laskowski, David Lee, Jonathan G. Lees, Sonja Lehtinen, Romain A. Studer, Janet Thornton, Christine A. Orengo
    Nucleic Acids Research 2015

Metrics

  • Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem (scTM)
    Brian L. Trippe, Jason Yim, Doug Tischer, David Baker, Tamara Broderick, Regina Barzilay, Tommi Jaakkola
    ICLR 2023

Models

  • Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem (ProtDiff)
    Brian L. Trippe, Jason Yim, Doug Tischer, David Baker, Tamara Broderick, Regina Barzilay, Tommi Jaakkola
    ICLR 2023

  • Protein structure generation via folding diffusion (FoldingDiff)
    Kevin E. Wu, Kevin K. Yang, Rianne van den Berg, Sarah Alamdari, James Y. Zou, Alex X. Lu, Ava P. Amini
    Nature Communications 2024

  • A Latent Diffusion Model for Protein Structure Generation (LatentDiff)
    Cong Fu, Keqiang Yan, Limei Wang, Wing Yee Au, Michael McThrow, Tao Komikado, Koji Maruhashi, Kanji Uchino, Xiaoning Qian, Shuiwang Ji
    LoG 2023

  • Generating Novel, Designable, and Diverse Protein Structures by Equivariantly Diffusing Oriented Residue Clouds (Genie)
    Yeqing Lin, Mohammed AlQuraishi
    arXiv:2301.12485 (2023)

  • SE(3) diffusion model with application to protein backbone generation (FrameDiff)
    Jason Yim, Brian L. Trippe, Valentin De Bortoli, Emile Mathieu, Arnaud Doucet, Regina Barzilay, Tommi Jaakkola
    ICML 2023

  • De novo design of protein structure and function with RFdiffusion (RFdiffusion)
    Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Nikita Hanikel, Samuel J. Pellock, Alexis Courbet, William Sheffler, Jue Wang, Preetham Venkatesh, Isaac Sappington, Susana Vázquez Torres, Anna Lauko, Valentin De Bortoli, Emile Mathieu, Sergey Ovchinnikov, Regina Barzilay, Tommi S. Jaakkola, Frank DiMaio, Minkyung Baek, David Baker
    Nature 2023

  • Protein Language Model Supervised Precise and Efficient Protein Backbone Design Method (GPDL)
    Bo Zhang, Kexin Liu, Zhuoqi Zheng, Yunfeiyang Liu, Junxi Mu, Ting Wei, Hai-Feng Chen
    bioRxiv (2023)

  • Joint Design of Protein Sequence and Structure based on Motifs (GeoPro)
    Zhenqiao Song, Yunlong Zhao, Yufei Song, Wenxian Shi, Yang Yang, Lei Li
    arXiv:2310.02546 (2023)

  • An all-atom protein generative model (Protpardelle)
    Alexander E. Chu, Lucy Cheng, Gina El Nesr, Minkai Xu, Po-Ssu Huang
    bioRxiv (2023)

  • Protein Sequence and Structure Co-Design with Equivariant Translation (ProtSeed)
    Chence Shi, Chuanrui Wang, Jiarui Lu, Bozitao Zhong, Jian Tang
    ICLR 2023

Antibody

Representation Learning (appendix)

Datasets

  • Observed Antibody Space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences (OAS)
    Tobias H. Olsen, Fergus Boyles, Charlotte M. Deane
    Protein Science 2022

Models

  • Antibody Representation Learning for Drug Discovery (BERTTransformer)
    Lin Li, Esther Gupta, John Spaeth, Leslie Shing, Tristan Bepler, Rajmonda Sulo Caceres
    arXiv:2210.02881 (2022)

  • Deciphering antibody affinity maturation with language models and weakly supervised learning (AntiBERTy)
    Jeffrey A. Ruffolo, Jeffrey J. Gray, Jeremias Sulam
    arXiv:2112.07782 (2021)

  • Deciphering the language of antibodies using self-supervised learning (AntiBERTa)
    Jinwoo Leem, Laura S. Mitchell, James H.R. Farmery, Justin Barton, Jacob D. Galson
    Patterns 2022

  • AbLang: an antibody language model for completing antibody sequences (AbLang)
    Tobias H Olsen, Iain H Moal, Charlotte M Deane
    Bioinformatics Advances 2022

  • Pre-training with A rational approach for antibody (PARA)
    Xiangrui Gao, Changling Cao, Lipeng Lai
    bioRxiv (2023)

Structure Prediction (appendix)

Datasets

  • SAbDab: the structural antibody database (SAbDab)
    James Dunbar, Konrad Krawczyk, Jinwoo Leem, Terry Baker, Angelika Fuchs, Guy Georges, Jiye Shi, Charlotte M. Deane
    Nucleic Acids Research 2014

  • RosettaAntibodyDesign (RAbD): A general framework for computational antibody design (RAbD)
    Jared Adolf-Bryfogle, Oleks Kalyuzhniy, Michael Kubitz, Brian D. Weitzner, Xiaozhen Hu, Yumiko Adachi, William R. Schief, Roland L. Dunbrack, Jr.
    PLOS Computational Biology 2018

Metrics

  • Improved prediction of antibody VL–VH orientation (OCD)
    Nicholas A. Marze, Sergey Lyskov, Jeffrey J. Gray
    PEDS 2016

Models

  • tFold-Ab: Fast and Accurate Antibody Structure Prediction without Sequence Homologs (tFold-Ab)
    Jiaxiang Wu, Fandi Wu, Biaobin Jiang, Wei Liu, Peilin Zhao
    bioRxiv (2022)

  • xTrimoABFold: De novo Antibody Structure Prediction without MSA (xTrimoABFold)
    Yining Wang, Xumeng Gong, Shaochuan Li, Bing Yang, YiWu Sun, Chuan Shi, Yangang Wang, Cheng Yang, Hui Li, Le Song
    arXiv:2212.00735 (2022)

  • ImmuneBuilder: Deep-Learning models for predicting the structures of immune proteins (ABodyBuilder)
    Brennan Abanades, Wing Ki Wong, Fergus Boyles, Guy Georges, Alexander Bujotzek, Charlotte M. Deane
    Communications Biology 2023

  • ABlooper: fast accurate antibody CDR loop structure prediction with accuracy estimation (ABlooper)
    Brennan Abanades, Guy Georges, Alexander Bujotzek, Charlotte M Deane
    Bioinformatics 2022

  • Geometric potentials from deep learning improve prediction of CDR H3 loop structures (DeepH3)
    Jeffrey A Ruffolo, Carlos Guerra, Sai Pooja Mahajan, Jeremias Sulam, Jeffrey J Gray
    Bioinformatics 2020

  • Simple End-to-end Deep Learning Model for CDR-H3 Loop Structure Prediction (SimpleDH3)
    Natalia Zenkova, Ekaterina Sedykh, Tatiana Shugaeva, Vladislav Strashko, Timofei Ermak, Aleksei Shpilman
    arXiv:2111.10656 (2021)

  • Antibody structure prediction using interpretable deep learning (DeepAB)
    Jeffrey A Ruffolo, Jeremias Sulam, Jeffrey J Gray
    Patterns 2021

  • Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies (IgFold)
    Jeffrey A Ruffolo, Lee-Shin Chu, Sai Pooja Mahajan, Jeffrey J Gray
    Nature Communications 2023

CDR Generation (appendix)

Datasets

  • SAbDab: the structural antibody database (SAbDab)
    James Dunbar, Konrad Krawczyk, Jinwoo Leem, Terry Baker, Angelika Fuchs, Guy Georges, Jiye Shi, Charlotte M. Deane
    Nucleic Acids Research 2014

  • RosettaAntibodyDesign (RAbD): A general framework for computational antibody design (RAbD)
    Jared Adolf-Bryfogle, Oleks Kalyuzhniy, Michael Kubitz, Brian D. Weitzner, Xiaozhen Hu, Yumiko Adachi, William R. Schief, Roland L. Dunbrack, Jr.
    PLOS Computational Biology 2018

  • SKEMPI 2.0: an updated benchmark of changes in protein-protein binding energy, kinetics and thermodynamics upon mutation (SKEMPI)
    Justina Jankauskaite, Brian Jiménez-García, Justas Dapkunas, Juan Fernández-Recio, Iain H Moal
    Bioinformatics 2019

Metrics

  • Scoring function for automated assessment of protein structure template quality (TM-score)
    Yang Zhang, Jeffrey Skolnick
    Proteins 2004

Models

  • In silico proof of principle of machine learning-based antibody design at unconstrained scale
    Rahmad Akbar, Philippe A. Robert, Cédric R. Weber, Michael Widrich, Robert Frank, Milena Pavlović, Lonneke Scheffer, Maria Chernigovskaya, Igor Snapkov, Andrei Slabodkin, Brij Bhushan Mehta, Enkelejda Miho, Fridtjof Lund-Johansen, Jan Terje Andersen, Sepp Hochreiter, Ingrid Hobæk Haff, Günter Klambauer, Geir Kjetil Sandve, Victor Greiff
    mAbs 2022, https://www.tandfonline.com/doi/full/10.1080/19420862.2022.2031482

  • Iterative Refinement Graph Neural Network for Antibody Sequence-Structure Co-design (RefineGNN)
    Wengong Jin, Jeremy Wohlwend, Regina Barzilay, Tommi Jaakkola
    ICLR 2022

  • Conditional Antibody Design as 3D Equivariant Graph Translation (MEAN)
    Xiangzhe Kong, Wenbing Huang, Yang Liu
    ICLR 2023

  • Cross-Gate MLP with Protein Complex Invariant Embedding is A One-Shot Antibody Designer (ADesigner)
    Cheng Tan, Zhangyang Gao, Lirong Wu, Jun Xia, Jiangbin Zheng, Xihong Yang, Yue Liu, Bozhen Hu, Stan Z. Li
    AAAI 2024

  • Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models for Protein Structures (DiffAb)
    Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, Jianzhu Ma
    NeurIPS 2022

  • Deep Learning for Flexible and Site-Specific Protein Docking and Design (DockGPT)
    Matt McPartlon, Jinbo Xu
    bioRxiv (2023)

  • Antibody-Antigen Docking and Design via Hierarchical Equivariant Refinement (HERN)
    Wengong Jin, Regina Barzilay, Tommi Jaakkola
    ICML 2022

  • End-to-End Full-Atom Antibody Design (dyMEAN)
    Xiangzhe Kong, Wenbing Huang, Yang Liu
    ICML 2023

Peptide

Misc. Tasks (appendix)

Models

  • A Multi-Modal Contrastive Diffusion Model for Therapeutic Peptide Generation (MMCD)
    Yongkang Wang, Xuan Liu, Feng Huang, Zhankun Xiong, Wen Zhang
    AAAI 2024

  • PepGB: Facilitating peptide drug discovery via graph neural networks (PepGB)
    Yipin Lei, Xu Wang, Meng Fang, Han Li, Xiang Li, Jianyang Zeng
    arXiv:2401.14665 (2024)

  • PepHarmony: A Multi-View Contrastive Learning Framework for Integrated Sequence and Structure-Based Peptide Encoding (PepHarmony)
    Ruochi Zhang, Haoran Wu, Chang Liu, Huaping Li, Yuqian Wu, Kewei Li, Yifan Wang, Yifan Deng, Jiahui Chen, Fengfeng Zhou, Xin Gao
    arXiv:2401.11360 (2024)

  • PEFT-SP: Parameter-Efficient Fine-Tuning on Large Protein Language Models Improves Signal Peptide Prediction (PEFT-SP)
    Shuai Zeng, Duolin Wang, Dong Xu
    bioRxiv (2023)

  • AdaNovo: Adaptive De Novo Peptide Sequencing with Conditional Mutual Information (AdaNovo)
    Jun Xia, Shaorong Chen, Jingbo Zhou, Tianze Ling, Wenjie Du, Sizhe Liu, Stan Z. Li
    arXiv:2403.07013 (2024)