ConvergeBio/uniref50

Name: ConvergeBio/uniref50
Creator: ConvergeBio
Published: 2026-03-30 14:19:07
License: no description provided

Hugging Face · Updated 2026-03-30 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/ConvergeBio/uniref50

Resource description:

---
language: en
license: cc-by-4.0
tags:
- biology
- protein
- protein-sequences
- uniref
- uniref50
- proteomics
- bioinformatics
pretty_name: UniRef50
size_categories:
- 10M<n<100M
task_categories:
- feature-extraction
configs:
- config_name: default
  data_files:
  - split: train
    path: "train-*.parquet"
dataset_info:
  features:
  - name: id
    dtype: string
  - name: name
    dtype: string
  - name: updated
    dtype: string
  - name: member_count
    dtype: int32
  - name: common_taxon
    dtype: string
  - name: common_taxon_id
    dtype: int32
  - name: seed_id
    dtype: string
  - name: go_mf
    sequence: string
  - name: go_bp
    sequence: string
  - name: go_cc
    sequence: string
  - name: member_ids
    sequence: string
  - name: rep_member_id
    dtype: string
  - name: rep_member_id_type
    dtype: string
  - name: rep_organism
    dtype: string
  - name: rep_organism_tax_id
    dtype: int32
  - name: rep_protein_name
    dtype: string
  - name: rep_accessions
    sequence: string
  - name: rep_uniparc_id
    dtype: string
  - name: rep_uniref90_id
    dtype: string
  - name: rep_uniref100_id
    dtype: string
  - name: rep_is_seed
    dtype: bool
  - name: sequence
    dtype: large_string
  - name: sequence_length
    dtype: int32
  - name: sequence_crc64
    dtype: string
  - name: sequence_xxh128
    dtype: string
  splits:
  - name: train
    num_examples: 60315044
  download_size: 18113417832
---

# UniRef50

Complete [UniRef50](https://www.uniprot.org/uniref?query=identity:0.5) dataset from UniProt, converted from XML to sharded Parquet. UniRef50 clusters sequences at 50% identity, which makes it the most aggressively deduplicated UniRef tier. It is well suited to training protein language models and building diverse, non-redundant sequence sets.

**Part of the [ConvergeBio Protein Database Collection](https://huggingface.co/collections/ConvergeBio/protein-database)**: see also [UniRef90](https://huggingface.co/datasets/ConvergeBio/uniref90), [UniRef100](https://huggingface.co/datasets/ConvergeBio/uniref100), and [UniClust30](https://huggingface.co/datasets/ConvergeBio/uniclust30).

## Dataset Summary

| | |
|---|---|
| **Clusters** | 60,315,044 |
| **Shards** | 130 |
| **Compressed size** | ~17 GB (zstd) |
| **Sequence lengths** | 11 – 49,499 aa (median 189, mean 287) |
| **Members per cluster** | 1 – 321,476 (median 1, mean 8.7) |
| **GO annotation coverage** | MF 18.6% · BP 12.0% · CC 12.4% |
| **Updated range** | 2006-10-31 to 2026-01-28 |

## Schema

Each row represents one UniRef50 cluster with its representative sequence and metadata.

| Column | Type | Description |
|--------|------|-------------|
| `id` | `string` | Cluster identifier (e.g. `UniRef50_P12345`) |
| `name` | `string` | Cluster name from UniProt |
| `updated` | `string` | Last update date (`YYYY-MM-DD`) |
| `member_count` | `int32` | Number of sequences in the cluster |
| `common_taxon` | `string` | Lowest common taxon across members |
| `common_taxon_id` | `int32` | NCBI Taxonomy ID of common taxon |
| `seed_id` | `string` | ID of the seed sequence |
| `go_mf` | `list<string>` | GO Molecular Function terms (`GO:XXXXXXX`) |
| `go_bp` | `list<string>` | GO Biological Process terms |
| `go_cc` | `list<string>` | GO Cellular Component terms |
| `member_ids` | `list<string>` | All member sequence IDs |
| `rep_member_id` | `string` | Representative member ID |
| `rep_member_id_type` | `string` | ID type (e.g. `UniProtKB ID`, `UniParc ID`) |
| `rep_organism` | `string` | Source organism of representative |
| `rep_organism_tax_id` | `int32` | NCBI Taxonomy ID of representative organism |
| `rep_protein_name` | `string` | Protein name of representative |
| `rep_accessions` | `list<string>` | UniProtKB accessions of representative |
| `rep_uniparc_id` | `string` | UniParc ID of representative |
| `rep_uniref90_id` | `string` | Child UniRef90 cluster ID |
| `rep_uniref100_id` | `string` | Child UniRef100 cluster ID |
| `rep_is_seed` | `bool` | Whether the representative is the seed sequence |
| `sequence` | `large_string` | Representative protein sequence (uppercase amino acid alphabet) |
| `sequence_length` | `int32` | Length of the sequence in residues |
| `sequence_crc64` | `string` | CRC64 checksum from UniProt (hex) |
| `sequence_xxh128` | `string` | xxHash-128 of the sequence (hex, computed at build time) |

## Usage

```python
from datasets import load_dataset

# Stream without downloading everything
ds = load_dataset("ConvergeBio/uniref50", streaming=True)
for row in ds["train"]:
    print(row["id"], row["sequence_length"])
    break

# Or load fully
ds = load_dataset("ConvergeBio/uniref50")
```

## Data Processing

- **Source:** `uniref50.xml.gz` from the [UniProt FTP](https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/)
- **Parsing:** Streaming XML parse with `lxml.etree.iterparse`, multi-process for throughput
- **Integrity:** xxHash-128 computed per sequence; CRC64 preserved from source XML
- **Validation:** Passed all tiers: schema conformance, zero null/empty sequences, xxHash roundtrip, CRC64 format, GO term format, member ID consistency, and field-by-field comparison against source XML
- **Format:** Sharded Parquet with zstd compression

## Source & Citation

UniRef is produced by the [UniProt Consortium](https://www.uniprot.org/):

> Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, UniProt Consortium.
> "UniRef clusters: a comprehensive and scalable alternative for improving
> sequence similarity searches." *Bioinformatics* 31(6):926–932 (2015).
> [doi:10.1093/bioinformatics/btu739](https://doi.org/10.1093/bioinformatics/btu739)

## About

Built by [Converge Bio](https://converge-bio.com), a company accelerating drug discovery with generative AI. Converge Bio develops foundation models for protein engineering, antibody design, and gene expression optimization, powering its computational lab products ConvergeAB, ConvergeGEO, and ConvergeCELL.

## License

UniProt data is available under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
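
## Example: spot-checking rows

The schema table and the validation bullet in this card describe several per-row formats (`UniRef50_…` identifiers, `YYYY-MM-DD` dates, `GO:XXXXXXX` terms, uppercase sequences, and a length column). The sketch below shows one way a consumer might re-check those properties on rows they stream. It is not the validator used to build the dataset: the function name `check_row`, the regular expressions, and the example row are all assumptions made for illustration, and the row values are invented rather than taken from the data.

```python
import re

# Assumed formats, inferred from the examples in the schema table.
CLUSTER_ID = re.compile(r"UniRef50_\S+")
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
GO_TERM = re.compile(r"GO:\d{7}")
SEQUENCE = re.compile(r"[A-Z]+")


def check_row(row):
    """Return a list of problems found in one cluster row (empty if none)."""
    problems = []
    if not CLUSTER_ID.fullmatch(row["id"]):
        problems.append("id does not look like UniRef50_<accession>")
    if not DATE.fullmatch(row["updated"]):
        problems.append("updated is not YYYY-MM-DD")

    seq = row["sequence"] or ""
    if not seq:
        problems.append("sequence is empty")
    elif not SEQUENCE.fullmatch(seq):
        problems.append("sequence is not uppercase letters only")
    if row["sequence_length"] != len(seq):
        problems.append("sequence_length does not match len(sequence)")

    # Each GO column is a (possibly empty) list of term strings.
    for column in ("go_mf", "go_bp", "go_cc"):
        for term in row[column]:
            if not GO_TERM.fullmatch(term):
                problems.append(f"{column} has malformed term {term!r}")
    return problems


# Hypothetical row with only the columns that check_row reads.
example = {
    "id": "UniRef50_P12345",
    "updated": "2026-01-28",
    "sequence": "MKTAYIAKQR",
    "sequence_length": 10,
    "go_mf": ["GO:0005524"],
    "go_bp": [],
    "go_cc": [],
}
print(check_row(example))  # prints []
```

Because `check_row` only indexes into a mapping, it can be applied unchanged to the dictionaries yielded by the streaming loop in the Usage section, for example on the first few thousand rows before committing to a full download.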

