
A dataset for predicting protein-protein interactions in humans

Source: DataONE. Updated 2025-09-16; indexed 2025-09-20.
Download link:
https://search.dataone.org/view/sha256:a476b477047115ef6eab24b9b8d2ca4c5173ea5a2e0af1fb1a725d1577af6096
Description:
Protein-protein interactions (PPIs) are fundamental to biological function. While recent advances in coevolutionary analysis and deep learning (DL)-based structure prediction have enabled large-scale PPI identification in bacterial and yeast proteomes, their application to the more complex human proteome has remained limited. To address this challenge, we 1) enhanced coevolutionary signals by generating 7-fold deeper multiple sequence alignments (MSAs) from 30 petabytes of unassembled genomic data, and 2) developed a new DL model trained on augmented datasets of domain-domain interactions derived from 200 million predicted protein structures. These improvements led to a 4-fold increase in the performance of our de novo PPI prediction pipeline for human proteins. We systematically screened around 190 million human protein pairs and predicted 17,849 high-confidence PPIs at an estimated precision of 90%, including 3,631 interactions not previously detected by experimental methods. The resu...

# A dataset for predicting protein-protein interactions in humans

Dataset DOI: [10.5061/dryad.15dv41p84](https://doi.org/10.5061/dryad.15dv41p84)

## Description of the data and file structure

### **protein_omicMSAs.tar.gz (17 GB)**

These MSAs are in an A3M-like format. Compared to the standard A3M format, we inserted an additional sequence at the beginning, named “mask,” to indicate the alignment quality at each position. In this “mask,” an asterisk (`*`) indicates high-quality positions, and a dash (`-`) indicates low-quality positions (these are poorly conserved and thus cannot be reliably assembled from genomic data). We recommend using only the high-quality positions (marked with `*`), as we did in our work. Insertions relative to the human (query) sequence are represented by lowercase letters. Each sequence corresponds to one draft genome or genomic dataset, and the NCBI accession number of the genome or dataset is used to name the sequence in the header.

We also include the taxonomic information of...
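
The mask convention described above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the first record's header is exactly `>mask`, that the mask has one character per position of the human (query) sequence, and that lowercase insertions occupy no mask column, as in standard A3M. The function names, the `query` header, and the accession `GCA_000000000.1` in the toy input are made up for illustration.

```python
from typing import List, Tuple


def read_masked_a3m(text: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Parse A3M-like text and return (mask, [(name, sequence), ...]).

    Assumption: the first record is named 'mask', as the description states.
    """
    records: List[Tuple[str, str]] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].strip()
            # Use the first whitespace-separated token as the record name.
            name = header.split()[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    if not records or records[0][0] != "mask":
        raise ValueError("expected the first record to be named 'mask'")
    return records[0][1], records[1:]


def keep_high_quality(mask: str, seq: str) -> str:
    """Drop lowercase insertions, then keep only columns marked '*'."""
    aligned = "".join(c for c in seq if not c.islower())
    if len(aligned) != len(mask):
        raise ValueError("sequence and mask lengths differ after removing insertions")
    return "".join(c for c, m in zip(aligned, mask) if m == "*")


# Toy input (hypothetical names and sequences, not taken from the dataset).
example = """>mask
**-*
>query
ACDE
>GCA_000000000.1
A-fgDE
"""

mask, sequences = read_masked_a3m(example)
filtered = [(name, keep_high_quality(mask, seq)) for name, seq in sequences]
```

In the toy input, the third column is masked out and the lowercase `fg` insertion is discarded, so every filtered row has the same length as the number of `*` positions in the mask.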
Created:
2025-09-17