Protein language model embeddings and predictions for the fly proteome (FlyBase)

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://zenodo.org/record/6322183

Description:

Residue and sequence embeddings of the fly (Drosophila melanogaster) proteome (FlyBase, organism Drosophila melanogaster, downloaded on 2022.03.01), computed with bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3). To open the embeddings file, please see the notebook at https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html. The embeddings are indexed by integers according to the mapping file (mapping_file.csv) in this dataset. All other result files share the same mapping (for instance, index "0" in the variation prediction results returns the results for the sequence "FBpp0304622").

Additionally:
- Sequence-level predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)
- Residue-level three-state secondary structure prediction (alpha helix, sheet or other) using models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)
- Residue-level prediction of conservation (in 9 states) and of variation effect (from 0 [no effect] to 1 [effect]) using VESPAl (https://doi.org/10.1007/s00439-021-02411-y)

Files included:
- dmel-all-translation-r6.44.fasta --> FASTA-formatted sequences of Drosophila melanogaster from FlyBase.
- mapping_file.csv --> A CSV file mapping the identifiers used in the following files (from 0 to 30737) to the identifiers in the FlyBase FASTA file (dmel-all-translation-r6.44.fasta).
- DSSP3_fly_ProtT5Sec.fasta --> Secondary structure predictions in three states for each residue of each protein in dmel-all-translation-r6.44.fasta. "H" stands for Helix; "E" stands for Sheet; "C" stands for Other.
- subcell_fly_LA_ProtT5.csv --> Subcellular location (10 states) and membrane-boundness (2 states) for each protein in dmel-all-translation-r6.44.fasta.
- embeddings_file.h5 --> Per-residue embeddings of the sequences in dmel-all-translation-r6.44.fasta. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape L x 1024, with L being the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the "original_id" attribute. See https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.
- reduced_embeddings_file.h5 --> Per-sequence embeddings of the sequences in dmel-all-translation-r6.44.fasta, obtained by mean-pooling the residue embeddings along the length dimension of the protein sequence. Each dataset in the .h5 file represents a protein sequence and contains a vector of size 1024, so every sequence has the same dimension regardless of its length.
- conspred_probs.h5 --> Conservation probability (softmax) predictions in 9 classes for the sequences in dmel-all-translation-r6.44.fasta. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape 9 x L, with L being the length of the protein sequence and 9 the number of conservation classes (index 0 = very variable; index 8 = very conserved).
- vespal_SAVeffect_fly.zip --> Zipped .h5 file of variation effect predictions for the sequences in dmel-all-translation-r6.44.fasta, on a scale from 0 (neutral) to 1 (effect); -1 indicates a substitution to the wild-type (WT) residue. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape 20 x L, with L being the length of the protein sequence and 20 the number of possible residue substitutions, in the amino acid order "ALGVSREDTIPKFQNYMHWC" (index 0 = substitution to "A", index 1 = substitution to "L", and so on).
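The index conventions described for the variation and conservation files can be illustrated with a minimal Python sketch. It is not part of the dataset or of bio_embeddings. The helper names (load_matrix, substitution_score, conservation_classes) and the 1-based residue position are assumptions made for illustration. It also assumes that every .h5 file keys its datasets by the mapping index as a string, and that matrices are stored with the shapes given in the description (20 x L and 9 x L); checking the shape of a loaded matrix before relying on this is advisable.

```python
# Substitution order quoted in the dataset description (row index of the 20 x L matrix).
AA_ORDER = "ALGVSREDTIPKFQNYMHWC"


def load_matrix(h5_path, index):
    """Read one protein's matrix from an .h5 file, keyed by its mapping index.

    Assumption: datasets are keyed by the integer index as a string ("0", "1", ...).
    """
    import h5py  # third-party; imported lazily so the helpers below work without it

    with h5py.File(h5_path, "r") as handle:
        dataset = handle[str(index)]
        # "original_id" is documented for the embedding files; it may be absent elsewhere.
        original_id = dataset.attrs.get("original_id")
        return original_id, dataset[()]


def substitution_score(matrix, position, target_aa):
    """Score for mutating the residue at 1-based `position` to `target_aa`.

    `matrix` is indexed as matrix[substitution][residue], i.e. 20 x L.
    A value of -1 means `target_aa` is the wild-type residue at that position.
    """
    row = AA_ORDER.index(target_aa)  # raises ValueError for unknown letters
    return float(matrix[row][position - 1])


def conservation_classes(probs):
    """Most probable conservation class per residue from a 9 x L matrix.

    Class 0 = very variable, class 8 = very conserved.
    """
    n_classes = len(probs)
    length = len(probs[0])
    return [
        max(range(n_classes), key=lambda c: probs[c][pos])
        for pos in range(length)
    ]
```

With the files unpacked locally, a call such as `load_matrix("conspred_probs.h5", 0)` would be expected to return the matrix for the sequence that mapping_file.csv lists under index 0. The name of the .h5 file inside vespal_SAVeffect_fly.zip is not stated in the description, so it has to be checked after unzipping.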

Created:

2022-03-11
