A dataset for predicting protein-protein interactions in humans
Source: DataONE; updated 2025-09-16; indexed 2025-09-20
Download link:
https://search.dataone.org/view/sha256:a476b477047115ef6eab24b9b8d2ca4c5173ea5a2e0af1fb1a725d1577af6096
Resource description:
Protein-protein interactions (PPIs) are fundamental to biological function. While recent advances in coevolutionary analysis and deep learning (DL)-based structure prediction have enabled large-scale PPI identification in bacterial and yeast proteomes, their application to the more complex human proteome has remained limited. To address this challenge, we 1) enhanced coevolutionary signals by generating 7-fold deeper multiple sequence alignments (MSAs) from 30 petabytes of unassembled genomic data, and 2) developed a new DL model trained on augmented datasets of domain-domain interactions derived from 200 million predicted protein structures. These improvements led to a 4-fold increase in the performance of our de novo PPI prediction pipeline for human proteins. We systematically screened around 190 million human protein pairs and predicted 17,849 high-confidence PPIs at an estimated precision of 90%, including 3,631 interactions not previously detected by experimental methods. The resu...

# A dataset for predicting protein-protein interactions in humans
Dataset DOI: [10.5061/dryad.15dv41p84](https://doi.org/10.5061/dryad.15dv41p84)
## Description of the data and file structure
### **protein_omicMSAs.tar.gz (17 GB)**
These MSAs are in an A3M-like format. Compared to the standard A3M format, we inserted an additional sequence at the beginning, named "mask," to indicate the alignment quality at each position. In this "mask," an asterisk (*) indicates high-quality positions, and a dash (-) indicates low-quality positions (these are poorly conserved and thus cannot be reliably assembled from genomic data). We recommend using only the high-quality positions (marked with *), as we did in our work. Insertions relative to the human (query) sequence are represented by lowercase letters. Each sequence corresponds to one draft genome or genomic dataset, and the NCBI accession number of the genome or dataset is used to name the sequence in the header. We also include the taxonomic information of...
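The masking scheme described above could be applied roughly as follows. This is a minimal sketch, not the authors' code, and it rests on several assumptions that the README excerpt does not confirm: that the first record's header starts with `mask`, that the mask has one character per match column (that is, per query position, with lowercase insertions excluded), and that each record is laid out like ordinary A3M. The function names and the tiny example alignment, including the accession `GCA_000000000.1`, are hypothetical.

```python
from typing import Iterator, List, Tuple


def read_a3m_records(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from A3M/FASTA-style text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_high_quality_columns(text: str) -> List[Tuple[str, str]]:
    """Return the non-mask records restricted to columns marked '*'."""
    records = list(read_a3m_records(text))
    # Assumption: the first record is the mask and its header starts with "mask".
    if not records or not records[0][0].startswith("mask"):
        raise ValueError("expected the first record to be the mask")
    mask = records[0][1]
    keep = [i for i, flag in enumerate(mask) if flag == "*"]

    filtered = []
    for header, seq in records[1:]:
        # Lowercase letters are insertions relative to the query; dropping
        # them leaves one character per match column.
        columns = "".join(c for c in seq if not c.islower())
        # Assumption: the mask covers match columns only.
        if len(columns) != len(mask):
            raise ValueError(f"{header}: length does not match the mask")
        filtered.append((header, "".join(columns[i] for i in keep)))
    return filtered


# Hypothetical four-column alignment: column 3 is marked low quality.
example = """>mask
**-*
>query
MKTA
>GCA_000000000.1
MkR-A
"""

result = keep_high_quality_columns(example)
```

In this toy input the third column is dropped from every sequence and the lowercase insertion `k` is removed before the mask is applied. If the real files lay out the mask differently, the length check raises an error rather than silently misaligning columns.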
Created:
2025-09-17



