ConvergeBio/uniref90
Source: Hugging Face · updated 2026-03-30 · indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/ConvergeBio/uniref90
Resource description:
---
language: en
license: cc-by-4.0
tags:
- biology
- protein
- protein-sequences
- uniref
- uniref90
- proteomics
- bioinformatics
pretty_name: UniRef90
size_categories:
- 100M<n<1B
task_categories:
- feature-extraction
configs:
- config_name: default
  data_files:
  - split: train
    path: "train-*.parquet"
dataset_info:
  features:
  - name: id
    dtype: string
  - name: name
    dtype: string
  - name: updated
    dtype: string
  - name: member_count
    dtype: int32
  - name: common_taxon
    dtype: string
  - name: common_taxon_id
    dtype: int32
  - name: seed_id
    dtype: string
  - name: go_mf
    sequence: string
  - name: go_bp
    sequence: string
  - name: go_cc
    sequence: string
  - name: member_ids
    sequence: string
  - name: rep_member_id
    dtype: string
  - name: rep_member_id_type
    dtype: string
  - name: rep_organism
    dtype: string
  - name: rep_organism_tax_id
    dtype: int32
  - name: rep_protein_name
    dtype: string
  - name: rep_accessions
    sequence: string
  - name: rep_uniparc_id
    dtype: string
  - name: rep_uniref50_id
    dtype: string
  - name: rep_uniref100_id
    dtype: string
  - name: rep_is_seed
    dtype: bool
  - name: sequence
    dtype: large_string
  - name: sequence_length
    dtype: int32
  - name: sequence_crc64
    dtype: string
  - name: sequence_xxh128
    dtype: string
  splits:
  - name: train
    num_examples: 188848220
  download_size: 55377760927
---
# UniRef90
Complete [UniRef90](https://www.uniprot.org/uniref?query=identity:0.9) dataset from UniProt, converted from XML to sharded Parquet. UniRef90 clusters sequences at 90% identity, so each cluster is represented by a single sequence; this reduces redundancy while retaining broad sequence coverage.
**Part of the [ConvergeBio Protein Database Collection](https://huggingface.co/collections/ConvergeBio/protein-database)** — see also [UniRef100](https://huggingface.co/datasets/ConvergeBio/uniref100), [UniRef50](https://huggingface.co/datasets/ConvergeBio/uniref50), and [UniClust30](https://huggingface.co/datasets/ConvergeBio/uniclust30).
## Dataset Summary
| | |
|---|---|
| **Clusters** | 188,848,220 |
| **Shards** | 386 |
| **Compressed size** | ~52 GiB, i.e. ~55 GB (zstd) |
| **Sequence lengths** | 11 – 49,499 aa (median 266, mean 351) |
| **Members per cluster** | 1 – 62,973 (median 1, mean 2.8) |
| **GO annotation coverage** | MF 23.6% · BP 15.4% · CC 15.5% |
| **Updated range** | 2006-10-31 to 2026-01-28 |
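For planning downloads, the summary figures can be turned into rough per-shard numbers. The sketch below only does arithmetic on the values stated in this card; it assumes shards are roughly equal in size, which the card does not state, and the function name `shard_estimates` is made up for illustration.

```python
# Figures copied from this dataset card.
NUM_CLUSTERS = 188_848_220
NUM_SHARDS = 386
DOWNLOAD_BYTES = 55_377_760_927


def shard_estimates(num_rows: int, num_shards: int, total_bytes: int) -> dict:
    """Return average rows and megabytes per shard, plus total size in GiB.

    Assumes evenly sized shards, which is only an approximation.
    """
    return {
        "rows_per_shard": num_rows / num_shards,
        "mb_per_shard": total_bytes / num_shards / 1e6,
        "total_gib": total_bytes / 2**30,
    }


est = shard_estimates(NUM_CLUSTERS, NUM_SHARDS, DOWNLOAD_BYTES)
print(f"~{est['rows_per_shard']:,.0f} rows and ~{est['mb_per_shard']:.0f} MB per shard")
print(f"total ~{est['total_gib']:.1f} GiB")
```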
## Schema
Each row represents one UniRef90 cluster with its representative sequence and metadata.
| Column | Type | Description |
|--------|------|-------------|
| `id` | `string` | Cluster identifier (e.g. `UniRef90_P12345`) |
| `name` | `string` | Cluster name from UniProt |
| `updated` | `string` | Last update date (`YYYY-MM-DD`) |
| `member_count` | `int32` | Number of sequences in the cluster |
| `common_taxon` | `string` | Lowest common taxon across members |
| `common_taxon_id` | `int32` | NCBI Taxonomy ID of common taxon |
| `seed_id` | `string` | ID of the seed sequence |
| `go_mf` | `list<string>` | GO Molecular Function terms (`GO:XXXXXXX`) |
| `go_bp` | `list<string>` | GO Biological Process terms |
| `go_cc` | `list<string>` | GO Cellular Component terms |
| `member_ids` | `list<string>` | All member sequence IDs |
| `rep_member_id` | `string` | Representative member ID |
| `rep_member_id_type` | `string` | ID type (e.g. `UniProtKB ID`, `UniParc ID`) |
| `rep_organism` | `string` | Source organism of representative |
| `rep_organism_tax_id` | `int32` | NCBI Taxonomy ID of representative organism |
| `rep_protein_name` | `string` | Protein name of representative |
| `rep_accessions` | `list<string>` | UniProtKB accessions of representative |
| `rep_uniparc_id` | `string` | UniParc ID of representative |
| `rep_uniref50_id` | `string` | Parent UniRef50 cluster ID |
| `rep_uniref100_id` | `string` | Child UniRef100 cluster ID |
| `rep_is_seed` | `bool` | Whether the representative is the seed sequence |
| `sequence` | `large_string` | Representative protein sequence (uppercase amino acid alphabet) |
| `sequence_length` | `int32` | Length of the sequence in residues |
| `sequence_crc64` | `string` | CRC64 checksum from UniProt (hex) |
| `sequence_xxh128` | `string` | xxHash-128 of the sequence (hex, computed at build time) |
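As an illustration of how the columns fit together, a row can be turned into a FASTA record. This is a minimal sketch: the helper `row_to_fasta` is hypothetical, the header layout is an assumption loosely modelled on UniRef FASTA headers, and the example row contains invented values rather than real data.

```python
def row_to_fasta(row: dict, width: int = 60) -> str:
    """Format one dataset row (a dict keyed by column name) as a FASTA record.

    The header layout is an assumption; adjust it to what your tools expect.
    """
    header = (
        f">{row['id']} {row['name']} "
        f"n={row['member_count']} "
        f"Tax={row['common_taxon']} TaxID={row['common_taxon_id']} "
        f"RepID={row['rep_member_id']}"
    )
    seq = row["sequence"]
    # Wrap the sequence into fixed-width lines.
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return header + "\n" + "\n".join(lines) + "\n"


# Invented example row; only the columns used by the helper are present.
example_row = {
    "id": "UniRef90_DEMO1",
    "name": "Cluster: demo protein",
    "member_count": 3,
    "common_taxon": "Bacteria",
    "common_taxon_id": 2,
    "rep_member_id": "DEMO1_ECOLI",
    "sequence": "MKT" * 30,  # 90 residues
}
print(row_to_fasta(example_row), end="")
```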
## Usage
```python
from datasets import load_dataset

# Stream rows without downloading the full dataset
ds = load_dataset("ConvergeBio/uniref90", streaming=True)
for row in ds["train"]:
    print(row["id"], row["sequence_length"])
    break  # stop after the first row

# Or download and load the full dataset (tens of GB)
ds = load_dataset("ConvergeBio/uniref90")
```
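When streaming, it is often useful to keep only rows that match a length window or that carry GO annotations. The following is a sketch of one way to do that with a plain generator; `select_rows` and its default thresholds are illustrative choices, not part of the dataset or the `datasets` library, and the demo rows are invented.

```python
from typing import Iterable, Iterator


def select_rows(
    rows: Iterable[dict],
    min_len: int = 50,
    max_len: int = 1024,
    require_go_mf: bool = False,
) -> Iterator[dict]:
    """Lazily yield rows whose sequence length and GO coverage match."""
    for row in rows:
        if not (min_len <= row["sequence_length"] <= max_len):
            continue
        # go_mf may be an empty list (or None) for unannotated clusters.
        if require_go_mf and not row["go_mf"]:
            continue
        yield row


# Invented rows standing in for streamed records.
demo_rows = [
    {"id": "UniRef90_DEMO1", "sequence_length": 30, "go_mf": ["GO:0005524"]},
    {"id": "UniRef90_DEMO2", "sequence_length": 266, "go_mf": []},
    {"id": "UniRef90_DEMO3", "sequence_length": 351, "go_mf": ["GO:0005524"]},
]
kept = [r["id"] for r in select_rows(demo_rows, require_go_mf=True)]
print(kept)

# With the real dataset (requires network access):
# from itertools import islice
# from datasets import load_dataset
# ds = load_dataset("ConvergeBio/uniref90", split="train", streaming=True)
# for row in islice(select_rows(ds, require_go_mf=True), 5):
#     print(row["id"], row["sequence_length"], row["go_mf"])
```

Because the generator is lazy, it never holds more than one row in memory, which matters for a dataset with roughly 189 million rows.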
## Data Processing
- **Source:** `uniref90.xml.gz` from the [UniProt FTP](https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/)
- **Parsing:** Streaming XML parse with `lxml.etree.iterparse`, multi-process for throughput
- **Integrity:** xxHash-128 computed per sequence; CRC64 preserved from source XML
- **Validation:** Passed all tiers — schema conformance, zero null/empty sequences, xxHash roundtrip, CRC64 format, GO term format, member ID consistency, and field-by-field comparison against source XML
- **Format:** Sharded Parquet with zstd compression
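Some of the validation checks listed above can be approximated on the consumer side. The sketch below is not the pipeline's actual validation code; `check_row` is a hypothetical helper, and the assumption that the hex digests are zero-padded to 16 characters (CRC64) and 32 characters (xxHash-128) is inferred from the bit widths rather than stated in the card.

```python
import re

GO_TERM = re.compile(r"GO:\d{7}")
HEX16 = re.compile(r"[0-9A-Fa-f]{16}")  # assumed width for a 64-bit digest
HEX32 = re.compile(r"[0-9A-Fa-f]{32}")  # assumed width for a 128-bit digest


def check_row(row: dict) -> list:
    """Return a list of human-readable problems found in one row."""
    problems = []
    seq = row["sequence"] or ""
    if not seq:
        problems.append("empty sequence")
    elif not (seq.isalpha() and seq.isupper()):
        problems.append("sequence is not uppercase letters only")
    if row["sequence_length"] != len(seq):
        problems.append("sequence_length does not match len(sequence)")
    if not row["id"].startswith("UniRef90_"):
        problems.append("id lacks the UniRef90_ prefix")
    for col in ("go_mf", "go_bp", "go_cc"):
        for term in row[col] or []:
            if not GO_TERM.fullmatch(term):
                problems.append(f"{col}: malformed GO term {term!r}")
    if not HEX16.fullmatch(row["sequence_crc64"]):
        problems.append("sequence_crc64 is not 16 hex characters")
    if not HEX32.fullmatch(row["sequence_xxh128"]):
        problems.append("sequence_xxh128 is not 32 hex characters")
    return problems


# Invented row used only to exercise the checks.
good_row = {
    "id": "UniRef90_DEMO1",
    "sequence": "MKTAYIAKQR",
    "sequence_length": 10,
    "go_mf": ["GO:0005524"],
    "go_bp": [],
    "go_cc": [],
    "sequence_crc64": "0123456789ABCDEF",
    "sequence_xxh128": "0123456789abcdef0123456789abcdef",
}
print(check_row(good_row))
```

Recomputing the xxHash-128 itself would need the third-party `xxhash` package and knowledge of how the sequence was encoded at build time, so it is left out here.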
## Source & Citation
UniRef is produced by the [UniProt Consortium](https://www.uniprot.org/):
> Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, UniProt Consortium.
> "UniRef clusters: a comprehensive and scalable alternative for improving
> sequence similarity searches." *Bioinformatics* 31(6):926–932 (2015).
> [doi:10.1093/bioinformatics/btu739](https://doi.org/10.1093/bioinformatics/btu739)
## About
Built by [Converge Bio](https://converge-bio.com) — accelerating drug discovery with generative AI. Converge Bio develops foundation models for protein engineering, antibody design, and gene expression optimization, powering its computational lab products ConvergeAB, ConvergeGEO, and ConvergeCELL.
## License
UniProt data is available under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
Provider:
ConvergeBio



