electricsheepafrica/APED-African-Protein-Engineering-Dataset
Source: Hugging Face | Updated: 2025-12-16 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/APED-African-Protein-Engineering-Dataset
Description:
APED (African Protein Engineering Dataset), Thermophile Module, is a curated dataset of protein structures from African thermophilic organisms, designed for machine learning applications in protein engineering and heat-stable biologic design. It contains 500 proteins with 19 ML-ready features, 3 novel protein designs, and 40 full design candidate sequences. The proteins come from thermophilic organisms found in African extreme environments, such as Ethiopian hot springs, Lake Magadi in Kenya, Lac Abbé in Djibouti, and Algerian thermal springs. The documented features include the UniProt accession, amino acid sequence, AlphaFold confidence (`mean_plddt`), α-helix content (`helix_fraction`), β-sheet content (`sheet_fraction`), and more. The 3 novel thermostable protein backbone designs were generated with an RFdiffusion → ProteinMPNN → AlphaFold pipeline, and all have pLDDT scores above 90. The dataset can be loaded with Hugging Face's `load_dataset` function or by reading the Parquet file directly. Planned future work includes a Pathogen Module with protein structures from African strains of Plasmodium falciparum (malaria), Mycobacterium africanum (TB), and HIV-1 subtype C variants, intended for structure-based drug design.
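The two loading routes mentioned in the description can be sketched roughly as follows. This is a minimal sketch, not the dataset authors' own code: the split name `"train"` and the Parquet path argument are assumptions (the listing gives neither), and the helper names `load_aped`, `load_aped_parquet`, `high_confidence`, and `mean_of` are hypothetical. Only the repository ID and the column names `mean_plddt`, `helix_fraction`, and `sheet_fraction` come from the description.

```python
# Repository ID as given on the listing page.
DATASET_ID = "electricsheepafrica/APED-African-Protein-Engineering-Dataset"


def load_aped(split="train"):
    """Load the dataset from the Hugging Face Hub.

    Assumption: the data lives in a split called "train"; check the
    dataset card if this raises an error. Requires `pip install datasets`.
    """
    from datasets import load_dataset  # lazy import: only needed for this route
    return load_dataset(DATASET_ID, split=split)


def load_aped_parquet(path):
    """Read a locally downloaded Parquet file into a pandas DataFrame.

    `path` is whatever file you downloaded; the listing does not name it.
    Requires pandas plus a Parquet engine such as pyarrow.
    """
    import pandas as pd  # lazy import: only needed for this route
    return pd.read_parquet(path)


def high_confidence(rows, min_plddt=90.0):
    """Keep rows whose AlphaFold `mean_plddt` is at least `min_plddt`.

    `rows` can be any iterable of dict-like records, for example the
    object returned by `load_aped()`.
    """
    return [row for row in rows if row["mean_plddt"] >= min_plddt]


def mean_of(rows, column):
    """Average a numeric column such as `helix_fraction` over `rows`."""
    values = [row[column] for row in rows]
    if not values:
        raise ValueError("no rows to average")
    return sum(values) / len(values)
```

A typical session, under the same assumptions, would call `rows = load_aped()`, then `confident = high_confidence(rows)`, and compare `mean_of(confident, "helix_fraction")` with `mean_of(confident, "sheet_fraction")`. The helpers take plain dict-like records so they work the same way on a Hub dataset and on `load_aped_parquet(path).to_dict("records")`.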
Provider:
electricsheepafrica



