
Functional Protein Mining with Conformal Guarantees

Source: NIAID Data Ecosystem. Indexed 2026-05-02.
Download link:
https://zenodo.org/records/14272215
Resource description:
Datasets and files associated with Functional Protein Mining with Conformal Guarantees.

Abstract

Molecular structure prediction and homology detection provide a promising path to discovering new protein function and evolutionary relationships. However, current approaches lack statistical reliability assurances, limiting their practical utility for selecting proteins for further experimental and in-silico characterization. To address this challenge, we introduce a novel approach to protein search leveraging principles from conformal prediction, offering a framework that ensures statistical guarantees with user-specified risk and provides calibrated probabilities (rather than raw ML scores) for any protein search model. Our method (1) lets users select from many biologically relevant loss metrics (e.g., false discovery rate) and assigns reliable functional probabilities for annotating genes of unknown function; (2) achieves state-of-the-art performance in enzyme classification without training new models; and (3) robustly and rapidly pre-filters proteins for computationally intensive structural alignment algorithms. Our framework enhances the reliability of protein homology detection and enables the discovery of new proteins with likely desirable functional properties.
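To make the idea of a user-specified risk level concrete, here is a minimal sketch of conformal risk control for a similarity threshold. It is an illustration under stated assumptions, not the authors' implementation. It controls the false negative rate, which is monotone in the threshold. The false discovery rate named in the abstract is not generally monotone in the threshold, so this sketch does not cover it. The function names `fnr_per_query` and `calibrate_threshold`, the array layout (one row per calibration query, one column per lookup protein), and the threshold grid are all hypothetical.

```python
import numpy as np


def fnr_per_query(scores, is_match, threshold):
    """False negative rate for each calibration query at a given threshold.

    scores:   (n_queries, n_lookup) similarity scores (assumed layout)
    is_match: (n_queries, n_lookup) boolean ground-truth matches
    """
    retrieved = scores >= threshold
    positives = is_match.sum(axis=1)
    missed = (is_match & ~retrieved).sum(axis=1)
    # Queries with no true match contribute zero loss.
    return np.where(positives > 0, missed / np.maximum(positives, 1), 0.0)


def calibrate_threshold(scores, is_match, alpha, grid):
    """Largest grid threshold whose corrected empirical risk is <= alpha.

    Uses the conformal risk control bound for a loss bounded by 1:
        n / (n + 1) * mean_loss + 1 / (n + 1) <= alpha
    Returns None if no threshold on the grid satisfies the bound.
    """
    n = scores.shape[0]
    best = None
    for t in np.sort(grid):  # ascending: risk can only grow as t increases
        risk = fnr_per_query(scores, is_match, t).mean()
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            best = t
        else:
            break
    return best
```

Given `scores` and `is_match` built from a held-out calibration split, a call such as `calibrate_threshold(scores, is_match, alpha=0.1, grid=np.linspace(-1, 1, 201))` returns the most selective threshold that still meets the bound. Hits for new queries are then the lookup proteins scoring at or above it.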
Description of files

afdb_embeddings_protein_vec.npy: embeddings generated with Protein-Vec for the clustered AFDB (AlphaFold Database)
AFDB_sequences.fasta: FASTA sequences for the clustered AFDB
SCOPe_multidomain_embeddings_protein_vec.npy: embeddings generated with Protein-Vec for the SCOPe multidomain proteins detailed in the DALI prefiltering section
SCOPe_multidomain.fasta: FASTA sequences for the SCOPe multidomain candidate proteins
new_proteins_after_cutoff.npy: list of proteins after the date cutoff, used for exchangeability tests
dali_multidomain_results_csv_small.zip: zip file of the results of the DALI search of all multidomain SCOPe proteins against the clustered AFDB
uniprotkb_AND_reviewed_true_2023_07_03.tsv: UniProtKB metadata for proteins used in the genes of unknown function section
new_protein_embeddings.npy: Protein-Vec embeddings of proteins after the date cutoff, used in data generation
lookup_embeddings_meta_data.tsv: embedding metadata for UniProt genes
lookup_embeddings.npy: UniProt data embedded with Protein-Vec
pfam_new_proteins.npy: dict of 100000 vs. all new proteins that includes metadata on whether a match exists
scope_supplement.zip: all data and files pertaining to the SCOPe hierarchical risk supplement
clean_selection.zip: all data and files pertaining to the section on improved enzyme classification via selection
ec_supplement.zip: all data and files pertaining to the EC hierarchical risk supplement
jcvi_syn30_unknown_gene_hits.csv: hits for the genes of unknown function for JCVI Syn3.0 that meet the low (10%) false discovery rate criterion
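The embedding files listed above can be used for nearest-neighbor search. The sketch below shows one plausible way to do that with cosine similarity. It assumes each embedding file holds a 2D float array with one row per protein, which the record does not state. Synthetic arrays stand in for the real files so the block runs without a download. The function name `top_k_cosine` is hypothetical.

```python
import numpy as np


def top_k_cosine(queries, lookup, k=5):
    """Return indices and cosine similarities of the k nearest lookup rows.

    queries: (n_queries, dim) array; lookup: (n_lookup, dim) array.
    Both layouts are assumptions about the .npy files, not documented facts.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    ref = lookup / np.linalg.norm(lookup, axis=1, keepdims=True)
    sims = q @ ref.T
    # Sort each row by descending similarity and keep the first k columns.
    idx = np.argsort(-sims, axis=1)[:, :k]
    return idx, np.take_along_axis(sims, idx, axis=1)


# Synthetic stand-ins. With the real data one would instead try, for example:
#   lookup = np.load("lookup_embeddings.npy")
#   queries = np.load("new_protein_embeddings.npy")
rng = np.random.default_rng(0)
lookup = rng.normal(size=(100, 16))
queries = lookup[[3, 42]] + 0.01 * rng.normal(size=(2, 16))

neighbor_idx, neighbor_sim = top_k_cosine(queries, lookup, k=5)
```

Files described as a list or dict, such as pfam_new_proteins.npy, are probably stored as pickled objects. If so, `np.load` needs `allow_pickle=True` to read them, which should only be enabled for files from a trusted source.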
Created: 2024-12-05