Flexible Protein-Protein Docking Benchmark (FD1.0)

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14004827

Description:

To assess how well different methods perform flexible protein-protein docking, a docking dataset must include not only the structures of the heterodimers but also those of the unbound monomers. Existing datasets such as DB5.5 and AB-Benchmark, while useful, are relatively limited in scale. In contrast, the Database of Interacting Protein Structures (DIPS) contains up to 42,826 binary protein complex structures but lacks the unbound-state structures of the monomers. This limitation restricts its applicability to evaluating rigid docking models rather than flexible ones. Consequently, the impact of large-scale docking datasets on flexible protein-protein docking methods has not been thoroughly explored.

To address this gap, we introduce the Flexible Protein-Protein Docking Benchmark (FD1.0), which, to our knowledge, is currently the largest dataset dedicated to flexible protein-protein docking. By providing a large and well-characterized dataset, FD1.0 aims to foster innovation in the development of flexible docking algorithms. It allows researchers to rigorously test and refine their methods, facilitating more accurate predictions of protein interactions, which are essential for understanding biological functions and designing therapeutic interventions.

In our analysis of the DIPS dataset, we identified several critical issues: (1) Multiple three-dimensional structures correspond to a single protein sequence, which introduces substantial noise and hinders fair comparison among baselines, especially for models that rely on 3D structural data. (2) The DIPS training set consists primarily of homo-multimers and therefore fails to capture the full diversity of interface types. Moreover, protein-protein docking predictions are most valuable for elucidating mechanisms of protein-protein interactions (PPIs), which predominantly involve heterodimers. Homomers, often synthesized directly rather than through docking, do not accurately represent typical PPI scenarios. (3) A significant number of docking cases in DIPS involve the interaction of one polymeric protein with another, further complicating the dataset.

As a cornerstone of a flexible docking dataset, it is imperative to acquire the structures of protein monomers in their unbound state. Specifically, this can be achieved through protein structure prediction methods, such as AlphaFold2, and by aggregating structural data from sources including electron microscopy.

Additionally, in light of the deficiencies of the DIPS dataset, we established several guidelines for constructing FD1.0: (1) Each protein monomer is associated with a unique three-dimensional structure, reducing dataset noise. (2) We ensured that the similarity score (as determined by MMSeq) between docking monomers does not exceed 0.6, thereby filtering out homodimeric pairs from the dataset. (3) A certain proportion of cases in the DIPS dataset actually involve docking of two protein multimers. Current methods for predicting multimeric structures, such as AlphaFold Multimer, still do not achieve satisfactory results (AlphaFold3's license prohibits its use for docking purposes), whereas current methods for predicting monomeric structures have reached a high level of accuracy. Therefore, unlike DIPS, we filtered out such cases so that each docking instance involves only protein monomers, safeguarding the quality of the dataset.

By adhering to these standardized construction criteria and collecting, cleaning, and organizing data from various sources, including the Protein Data Bank and existing datasets, we compiled 3,721 entries. Following the DIPS division ratio, these entries were divided into training, validation, and test sets of 3,546, 98, and 77 entries, respectively.
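The three construction guidelines and the reported split sizes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the `DockingCase` record, its field names, and the helpers `passes_guidelines` and `unique_structures` are hypothetical, the similarity score is assumed to be precomputed on a 0 to 1 scale, and "first structure seen" is an assumed tie-break for choosing one structure per monomer. Only the 0.6 threshold and the split counts come from the description above.

```python
from dataclasses import dataclass

# Threshold stated in the dataset description (guideline 2).
MAX_SIMILARITY = 0.6


@dataclass(frozen=True)
class DockingCase:
    """Hypothetical record for one candidate docking pair."""
    receptor_id: str
    ligand_id: str
    receptor_chains: int  # number of chains in the receptor partner
    ligand_chains: int    # number of chains in the ligand partner
    similarity: float     # assumed precomputed sequence similarity in [0, 1]


def passes_guidelines(case: DockingCase) -> bool:
    """Apply guidelines (2) and (3): dissimilar partners, monomers only."""
    monomers_only = case.receptor_chains == 1 and case.ligand_chains == 1
    dissimilar = case.similarity <= MAX_SIMILARITY
    return monomers_only and dissimilar


def unique_structures(records):
    """Guideline (1): keep one structure per monomer.

    `records` is an iterable of (monomer_id, structure_path) pairs.
    Keeping the first structure seen is an assumption for illustration.
    """
    chosen = {}
    for monomer_id, path in records:
        chosen.setdefault(monomer_id, path)
    return chosen


# Split sizes reported in the description.
SPLIT_SIZES = {"train": 3546, "validation": 98, "test": 77}
TOTAL = sum(SPLIT_SIZES.values())
SPLIT_FRACTIONS = {name: size / TOTAL for name, size in SPLIT_SIZES.items()}


if __name__ == "__main__":
    candidates = [
        DockingCase("P1", "P2", 1, 1, 0.30),  # kept
        DockingCase("P3", "P3b", 1, 1, 0.95),  # dropped: near-identical pair
        DockingCase("P4", "P5", 2, 1, 0.10),  # dropped: multimeric partner
    ]
    kept = [c for c in candidates if passes_guidelines(c)]
    print([(c.receptor_id, c.ligand_id) for c in kept])
    for name, fraction in SPLIT_FRACTIONS.items():
        print(f"{name}: {SPLIT_SIZES[name]} entries ({fraction:.1%})")
```

The split sizes sum to the 3,721 entries reported, which corresponds to roughly 95.3% training, 2.6% validation, and 2.1% test.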

Created:

2024-10-31
