ConvergeBio/uniref50
Hugging Face · updated 2026-03-30 · indexed 2026-04-05
Download link: https://hf-mirror.com/datasets/ConvergeBio/uniref50
Resource description:
---
language: en
license: cc-by-4.0
tags:
- biology
- protein
- protein-sequences
- uniref
- uniref50
- proteomics
- bioinformatics
pretty_name: UniRef50
size_categories:
- 10M<n<100M
task_categories:
- feature-extraction
configs:
- config_name: default
  data_files:
  - split: train
    path: "train-*.parquet"
dataset_info:
  features:
  - name: id
    dtype: string
  - name: name
    dtype: string
  - name: updated
    dtype: string
  - name: member_count
    dtype: int32
  - name: common_taxon
    dtype: string
  - name: common_taxon_id
    dtype: int32
  - name: seed_id
    dtype: string
  - name: go_mf
    sequence: string
  - name: go_bp
    sequence: string
  - name: go_cc
    sequence: string
  - name: member_ids
    sequence: string
  - name: rep_member_id
    dtype: string
  - name: rep_member_id_type
    dtype: string
  - name: rep_organism
    dtype: string
  - name: rep_organism_tax_id
    dtype: int32
  - name: rep_protein_name
    dtype: string
  - name: rep_accessions
    sequence: string
  - name: rep_uniparc_id
    dtype: string
  - name: rep_uniref90_id
    dtype: string
  - name: rep_uniref100_id
    dtype: string
  - name: rep_is_seed
    dtype: bool
  - name: sequence
    dtype: large_string
  - name: sequence_length
    dtype: int32
  - name: sequence_crc64
    dtype: string
  - name: sequence_xxh128
    dtype: string
  splits:
  - name: train
    num_examples: 60315044
  download_size: 18113417832
---
# UniRef50
Complete [UniRef50](https://www.uniprot.org/uniref?query=identity:0.5) dataset from UniProt, converted from XML to sharded Parquet. UniRef50 clusters sequences at a 50% sequence-identity threshold, which makes it the most aggressively deduplicated UniRef tier and a good fit for training protein language models and building diverse, non-redundant sequence sets.
**Part of the [ConvergeBio Protein Database Collection](https://huggingface.co/collections/ConvergeBio/protein-database)** — see also [UniRef90](https://huggingface.co/datasets/ConvergeBio/uniref90), [UniRef100](https://huggingface.co/datasets/ConvergeBio/uniref100), and [UniClust30](https://huggingface.co/datasets/ConvergeBio/uniclust30).
## Dataset Summary
| | |
|---|---|
| **Clusters** | 60,315,044 |
| **Shards** | 130 |
| **Compressed size** | ~17 GB (zstd) |
| **Sequence lengths** | 11 – 49,499 aa (median 189, mean 287) |
| **Members per cluster** | 1 – 321,476 (median 1, mean 8.7) |
| **GO annotation coverage** | MF 18.6% · BP 12.0% · CC 12.4% |
| **Updated range** | 2006-10-31 to 2026-01-28 |
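For planning disk space and per-shard memory, the figures above can be turned into rough per-shard numbers. The sketch below is only arithmetic on the card's own values (cluster count, shard count, and the `download_size` from the front matter); it assumes shards are roughly equal in size, which the card does not state.

```python
# Figures copied from the dataset card. Equal-sized shards are an assumption.
NUM_CLUSTERS = 60_315_044
NUM_SHARDS = 130
DOWNLOAD_BYTES = 18_113_417_832


def size_summary(num_rows, num_shards, num_bytes):
    """Derive rough per-shard and total sizes from the card's headline numbers."""
    return {
        "rows_per_shard": num_rows / num_shards,
        "mb_per_shard": num_bytes / num_shards / 1e6,
        "total_gb": num_bytes / 1e9,      # decimal gigabytes
        "total_gib": num_bytes / 2**30,   # binary gibibytes
    }


summary = size_summary(NUM_CLUSTERS, NUM_SHARDS, DOWNLOAD_BYTES)
for key, value in summary.items():
    print(f"{key}: {value:,.2f}")
```

The "~17 GB" in the table appears to match the binary (GiB) figure rather than the decimal one, so budget slightly more than 18 × 10⁹ bytes for a full download.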
## Schema
Each row represents one UniRef50 cluster with its representative sequence and metadata.
| Column | Type | Description |
|--------|------|-------------|
| `id` | `string` | Cluster identifier (e.g. `UniRef50_P12345`) |
| `name` | `string` | Cluster name from UniProt |
| `updated` | `string` | Last update date (`YYYY-MM-DD`) |
| `member_count` | `int32` | Number of sequences in the cluster |
| `common_taxon` | `string` | Lowest common taxon across members |
| `common_taxon_id` | `int32` | NCBI Taxonomy ID of common taxon |
| `seed_id` | `string` | ID of the seed sequence |
| `go_mf` | `list<string>` | GO Molecular Function terms (`GO:XXXXXXX`) |
| `go_bp` | `list<string>` | GO Biological Process terms |
| `go_cc` | `list<string>` | GO Cellular Component terms |
| `member_ids` | `list<string>` | All member sequence IDs |
| `rep_member_id` | `string` | Representative member ID |
| `rep_member_id_type` | `string` | ID type (e.g. `UniProtKB ID`, `UniParc ID`) |
| `rep_organism` | `string` | Source organism of representative |
| `rep_organism_tax_id` | `int32` | NCBI Taxonomy ID of representative organism |
| `rep_protein_name` | `string` | Protein name of representative |
| `rep_accessions` | `list<string>` | UniProtKB accessions of representative |
| `rep_uniparc_id` | `string` | UniParc ID of representative |
| `rep_uniref90_id` | `string` | Child UniRef90 cluster ID |
| `rep_uniref100_id` | `string` | Child UniRef100 cluster ID |
| `rep_is_seed` | `bool` | Whether the representative is the seed sequence |
| `sequence` | `large_string` | Representative protein sequence (uppercase amino acid alphabet) |
| `sequence_length` | `int32` | Length of the sequence in residues |
| `sequence_crc64` | `string` | CRC64 checksum from UniProt (hex) |
| `sequence_xxh128` | `string` | xxHash-128 of the sequence (hex, computed at build time) |
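The column descriptions above imply a few cheap per-row sanity checks. The following is a minimal sketch using only the standard library; it is not the validation code used to build the dataset. The hex widths (16 digits for CRC64, 32 for xxHash-128) are assumptions derived from the bit widths, and the example row contains made-up values.

```python
import re

GO_TERM = re.compile(r"GO:\d{7}")          # format given in the schema table
CRC64_HEX = re.compile(r"[0-9A-Fa-f]{16}")   # assumption: 64 bits as 16 hex digits
XXH128_HEX = re.compile(r"[0-9A-Fa-f]{32}")  # assumption: 128 bits as 32 hex digits


def check_row(row):
    """Return a list of problems found in one row (an empty list means none)."""
    problems = []
    if not str(row.get("id", "")).startswith("UniRef50_"):
        problems.append("id does not start with 'UniRef50_'")
    seq = row.get("sequence") or ""
    if not seq:
        problems.append("sequence is null or empty")
    elif not (seq.isalpha() and seq.isupper()):
        problems.append("sequence is not uppercase letters only")
    if row.get("sequence_length") != len(seq):
        problems.append("sequence_length != len(sequence)")
    if not CRC64_HEX.fullmatch(row.get("sequence_crc64") or ""):
        problems.append("sequence_crc64 is not 16 hex digits")
    if not XXH128_HEX.fullmatch(row.get("sequence_xxh128") or ""):
        problems.append("sequence_xxh128 is not 32 hex digits")
    for column in ("go_mf", "go_bp", "go_cc"):
        for term in row.get(column) or []:
            if not GO_TERM.fullmatch(term):
                problems.append(f"{column}: malformed GO term {term!r}")
    return problems


# Hypothetical row for illustration; none of these values come from the dataset.
EXAMPLE_ROW = {
    "id": "UniRef50_P12345",
    "sequence": "MKTAYIAKQR",
    "sequence_length": 10,
    "sequence_crc64": "0123456789ABCDEF",
    "sequence_xxh128": "0" * 32,
    "go_mf": ["GO:0003677"],
    "go_bp": [],
    "go_cc": [],
}
print(check_row(EXAMPLE_ROW))
```

Recomputing the xxHash-128 value would additionally need a third-party hashing package and knowledge of the exact byte encoding used at build time, so it is left out of this sketch.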
## Usage
```python
from datasets import load_dataset

# Stream without downloading everything
ds = load_dataset("ConvergeBio/uniref50", streaming=True)
for row in ds["train"]:
    print(row["id"], row["sequence_length"])
    break

# Or load fully (downloads all shards, roughly 17 GB)
ds = load_dataset("ConvergeBio/uniref50")
```
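A common next step is to stream a length-filtered subset into FASTA. The sketch below builds on the streaming call shown above; the helper names (`to_fasta`, `filter_by_length`, `export_subset`), the output file name, and the length bounds are illustrative choices, not part of the dataset or the `datasets` library.

```python
from itertools import islice


def to_fasta(row, width=60):
    """Format one cluster row as a FASTA record with wrapped sequence lines."""
    name = row.get("rep_protein_name") or ""
    header = f">{row['id']} {name} n={row['member_count']}"
    seq = row["sequence"]
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header, *lines]) + "\n"


def filter_by_length(rows, min_len, max_len):
    """Yield only rows whose representative sequence length is within bounds."""
    for row in rows:
        if min_len <= row["sequence_length"] <= max_len:
            yield row


def export_subset(path="uniref50_subset.fasta", limit=1000, min_len=50, max_len=1024):
    """Stream the dataset and write the first `limit` matching rows to `path`."""
    from datasets import load_dataset  # imported here so the helpers work without it

    ds = load_dataset("ConvergeBio/uniref50", split="train", streaming=True)
    with open(path, "w") as handle:
        for row in islice(filter_by_length(ds, min_len, max_len), limit):
            handle.write(to_fasta(row))
```

Calling `export_subset()` starts streaming from the Hub. Because the filter is a plain generator, the helpers can also be tried on any iterable of dicts with the same keys.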
## Data Processing
- **Source:** `uniref50.xml.gz` from the [UniProt FTP](https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/)
- **Parsing:** Streaming XML parse with `lxml.etree.iterparse`, multi-process for throughput
- **Integrity:** xxHash-128 computed per sequence; CRC64 preserved from source XML
- **Validation:** Passed all tiers — schema conformance, zero null/empty sequences, xxHash roundtrip, CRC64 format, GO term format, member ID consistency, and field-by-field comparison against source XML
- **Format:** Sharded Parquet with zstd compression
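The streaming-parse step can be sketched as follows. This is not the build pipeline: it swaps in the standard library's `xml.etree.ElementTree.iterparse` for `lxml.etree.iterparse`, runs in a single process, and extracts only a handful of fields. The XML namespace, the element and attribute names, and the tiny inline sample are assumptions modelled on the UniRef XML layout and should be checked against the real file.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://uniprot.org/uniref}"  # assumption: namespace of the UniRef XML

# Made-up, heavily simplified sample; real entries carry many more properties.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<UniRef50 xmlns="http://uniprot.org/uniref">
  <entry id="UniRef50_DEMO1" updated="2020-01-01">
    <name>Cluster: demo protein</name>
    <property type="member count" value="2"/>
    <representativeMember>
      <dbReference type="UniProtKB ID" id="DEMO1_HUMAN"/>
      <sequence length="6" checksum="0123456789ABCDEF">ACDEFG</sequence>
    </representativeMember>
  </entry>
  <entry id="UniRef50_DEMO2" updated="2021-06-15">
    <name>Cluster: another demo protein</name>
    <property type="member count" value="1"/>
    <representativeMember>
      <dbReference type="UniParc ID" id="UPI0000000001"/>
      <sequence length="4" checksum="FEDCBA9876543210">MKTA</sequence>
    </representativeMember>
  </entry>
</UniRef50>
"""


def iter_clusters(stream):
    """Yield one dict per <entry>, clearing each element after it is consumed."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != NS + "entry":
            continue
        # Direct children only, so nested dbReference properties are ignored.
        props = {p.get("type"): p.get("value") for p in elem.findall(NS + "property")}
        rep = elem.find(NS + "representativeMember")
        seq = rep.find(NS + "sequence")
        yield {
            "id": elem.get("id"),
            "updated": elem.get("updated"),
            "name": elem.findtext(NS + "name"),
            "member_count": int(props["member count"]),
            "rep_member_id": rep.find(NS + "dbReference").get("id"),
            "sequence": "".join(seq.text.split()),
            "sequence_length": int(seq.get("length")),
            "sequence_crc64": seq.get("checksum"),
        }
        elem.clear()  # drop the entry's children to keep memory use low


for cluster in iter_clusters(io.BytesIO(SAMPLE_XML)):
    print(cluster["id"], cluster["sequence_length"])
```

For the real source file, the stream would come from something like `gzip.open("uniref50.xml.gz", "rb")`. With the standard-library parser the root element still accumulates emptied `entry` nodes, so a production pipeline would also need to detach processed entries from their parent.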
## Source & Citation
UniRef is produced by the [UniProt Consortium](https://www.uniprot.org/):
> Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, UniProt Consortium.
> "UniRef clusters: a comprehensive and scalable alternative for improving
> sequence similarity searches." *Bioinformatics* 31(6):926–932 (2015).
> [doi:10.1093/bioinformatics/btu739](https://doi.org/10.1093/bioinformatics/btu739)
## About
Built by [Converge Bio](https://converge-bio.com) — accelerating drug discovery with generative AI. Converge Bio develops foundation models for protein engineering, antibody design, and gene expression optimization, powering its computational lab products ConvergeAB, ConvergeGEO, and ConvergeCELL.
## License
UniProt data is available under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).