protein-docs

Hugging Face | Updated 2026-03-19 | Indexed 2026-03-20

Download link:

https://huggingface.co/datasets/timodonnell/protein-docs

Description:

This dataset, titled 'Protein Documents (Parquet)', stores structured text documents that encode protein residue sequences and 3D contact maps derived from predicted structures in AlphaFold Database v4, saved as Parquet files. Each row holds one document together with the metadata for a single protein. The dataset is sourced from 'timodonnell/afdb-24M' and 'timodonnell/afdb-1.6M' and provides two document generation schemes: 'deterministic-positives-only' (approximately 24 million documents) and 'random-3-bins' (approximately 1.68 million documents). Fields include the document text, AFDB entry ID, UniProt accession number, NCBI taxonomy ID, organism name, global pLDDT confidence score, and sequence length. The train/validation/test split (98/1/1) is leakage-resistant: it is assigned by hashing structure clusters. The dataset is suited to tasks such as protein structure prediction and protein language modeling, and is licensed under CC BY 4.0.
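The listing does not say which hash function or cluster identifiers the dataset uses, so the following is only a minimal sketch of how a hash-based 98/1/1 split over structure clusters could work. The function name `assign_split`, the use of SHA-256, and the example cluster IDs are all assumptions made for illustration, not the dataset author's method.

```python
import hashlib


def assign_split(cluster_id: str) -> str:
    """Map a structure-cluster ID to a split name.

    Hypothetical helper: SHA-256 and the 100-bucket scheme are assumptions.
    The hash is deterministic, so a cluster always lands in the same split.
    """
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, reduced to 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < 98:
        return "train"        # buckets 0..97, about 98% of clusters
    if bucket == 98:
        return "validation"   # bucket 98, about 1% of clusters
    return "test"             # bucket 99, about 1% of clusters


if __name__ == "__main__":
    # Made-up (protein ID, cluster ID) pairs, purely for illustration.
    proteins = [
        ("protein_1", "cluster_A"),
        ("protein_2", "cluster_A"),
        ("protein_3", "cluster_B"),
    ]
    for protein_id, cluster_id in proteins:
        print(protein_id, cluster_id, assign_split(cluster_id))
```

Because the split depends only on the cluster ID, every protein in a cluster falls into the same split, so structurally similar proteins cannot straddle the train and evaluation sets.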

Created:

2026-03-13