
Proteome-wide Prediction of the Functional Impact of Missense Variants with ProteoCast

NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/10995109
Description:
This dataset contains mutation effect predictions for 22,169 Drosophila melanogaster protein isoforms, classifying over 293 million amino acid substitutions as neutral, uncertain, or impactful. The predictions were generated using the evolution-based GEMME model (E. Laine et al. MBE 2019) with multiple sequence alignments (MSAs) from the highly efficient ColabFold protocol (M. Mirdita et al. NatMet 2022, Abakarova et al. GBE 2023). To ensure reliability, we provide global (per-protein) and local (per-residue) confidence metrics, since the predictions are sensitive to the input MSA quality.

Predictions were validated using natural polymorphisms from the Drosophila Genetic Reference Panel (DGRP) and Drosophila Evolution over Space and Time (DEST2) datasets, as well as FlyBase's developmentally lethal and hypomorphic mutations. Additionally, the dataset includes sensitivity data for post-translational modifications (PTMs) and short linear motifs (SLiMs), aiding functional site identification. All this data can be visualized at proteocast.ijm.fr.

Readme:

Drosophila_ProteoCast.tar.gz - this archive contains ProteoCast predictions and analysis for each unique proteoform, with folder names corresponding to IDs listed in the mapping_database.csv file. A detailed description of the folder structure and contents can be found in ReadMe.txt.

data.tar.gz - this archive contains the data used in this study, sourced from FlyBase, DGRP2, and DEST2.

Dmel6.44PredictionsRecap.csv - this summary file provides detailed information for each proteoform, all FlyBase protein IDs (FBpp_ID) included. It contains the following data:
    Identifiers: FlyBase protein ID (FBpp_ID), protein symbol (Protein_symbol), gene ID (FBgn_ID), transcript ID (FBtr_ID), and UniProt ID if available (UniProt_ID).
    Protein Characteristics: Sequence length (Length) and whether the proteoform is representative (Representative_FBpp).
    MSA and GEMME Predictions: Fraction of observed mutations (F_obs) and number of sequences (Nb_seq_MSA) in the ColabFold MSA, presence or absence of GEMME predictions (GEMME_prediction), and global confidence score (GlobalConfidence).
    Mutation Classification: Thresholds for defining mutations as neutral, uncertain, or impactful (GMM3_uncertain, GMM3_impactful).
    Genomic Information: DNA strand (Strand) and exon coordinates (Exons_coordinates).
    Structural Data: 3D structure file name if available (Structure_3D_file, Structure_3D).
    Mutation Counts: Number of analyzed mutations and affected residues, labelled as lethal, hypomorphic on FlyBase, or from the DEST2 and DGRP datasets (n_Lethal, n_Lethal_res, n_Hypomorphic, n_Hypomorphic_res, n_DEST2, n_DEST2_res, n_DGRP, n_DGRP_res, n_DEST_DGRP_union, n_DEST_DGRP_union_res).

csv.tar.gz - this archive contains the files generated in this study. A detailed description of the folder structure and contents can be found in ReadMe.txt.
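As a rough illustration of how the per-proteoform thresholds in Dmel6.44PredictionsRecap.csv might be applied, the sketch below reads the GMM3_uncertain and GMM3_impactful columns and assigns a class to a few scores. Several things here are assumptions rather than facts from the dataset: that lower (more negative) GEMME scores indicate a stronger predicted effect, that the thresholds are inclusive upper bounds, and that proteoforms without predictions have empty threshold cells. The sample rows, their IDs, and the helper names classify and load_thresholds are invented for illustration. ReadMe.txt in the archive remains the authoritative description.

```python
import csv
import io

# Hypothetical excerpt of the summary file. The column names come from the
# description above; the IDs and values are made up for this example.
SAMPLE = """FBpp_ID,GMM3_uncertain,GMM3_impactful
FBpp_example_1,-1.5,-3.0
FBpp_example_2,,
"""


def classify(score, uncertain_thr, impactful_thr):
    """Assign a class to one substitution score.

    Assumption: lower scores mean a stronger predicted effect, so
    impactful_thr <= uncertain_thr, and both bounds are inclusive.
    """
    if score <= impactful_thr:
        return "impactful"
    if score <= uncertain_thr:
        return "uncertain"
    return "neutral"


def load_thresholds(handle):
    """Map FBpp_ID -> (GMM3_uncertain, GMM3_impactful), skipping empty cells."""
    thresholds = {}
    for row in csv.DictReader(handle):
        unc, imp = row["GMM3_uncertain"], row["GMM3_impactful"]
        if unc and imp:  # assumed: empty cells mean no prediction available
            thresholds[row["FBpp_ID"]] = (float(unc), float(imp))
    return thresholds


thresholds = load_thresholds(io.StringIO(SAMPLE))
unc, imp = thresholds["FBpp_example_1"]
labels = [classify(s, unc, imp) for s in (-0.2, -2.0, -4.5)]
print(labels)
```

To run this against the real file, replace io.StringIO(SAMPLE) with an open file handle, for example open("Dmel6.44PredictionsRecap.csv", newline=""), after checking the actual column formats.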
Created:
2025-02-17