ConvergeBio/uniref100
Hugging Face · Updated 2026-03-30 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/ConvergeBio/uniref100
Resource description:
---
language: en
license: cc-by-4.0
tags:
- biology
- protein
- protein-sequences
- uniref
- uniref100
- proteomics
- bioinformatics
pretty_name: UniRef100
size_categories:
- 100M<n<1B
task_categories:
- feature-extraction
configs:
- config_name: default
  data_files:
  - split: train
    path: "train-*.parquet"
dataset_info:
  features:
  - name: id
    dtype: string
  - name: name
    dtype: string
  - name: updated
    dtype: string
  - name: member_count
    dtype: int32
  - name: common_taxon
    dtype: string
  - name: common_taxon_id
    dtype: int32
  - name: seed_id
    dtype: string
  - name: go_mf
    sequence: string
  - name: go_bp
    sequence: string
  - name: go_cc
    sequence: string
  - name: member_ids
    sequence: string
  - name: rep_member_id
    dtype: string
  - name: rep_member_id_type
    dtype: string
  - name: rep_organism
    dtype: string
  - name: rep_organism_tax_id
    dtype: int32
  - name: rep_protein_name
    dtype: string
  - name: rep_accessions
    sequence: string
  - name: rep_uniparc_id
    dtype: string
  - name: rep_uniref50_id
    dtype: string
  - name: rep_uniref90_id
    dtype: string
  - name: rep_is_seed
    dtype: bool
  - name: sequence
    dtype: large_string
  - name: sequence_length
    dtype: int32
  - name: sequence_crc64
    dtype: string
  - name: sequence_xxh128
    dtype: string
  splits:
  - name: train
    num_examples: 475217233
  download_size: 142794964632
---
# UniRef100
Complete [UniRef100](https://www.uniprot.org/uniref?query=identity:1.0) dataset from UniProt, converted from XML to sharded Parquet. UniRef100 contains every unique protein sequence in UniProtKB plus selected UniParc records, providing the most comprehensive non-identical sequence resource available.
**Part of the [ConvergeBio Protein Database Collection](https://huggingface.co/collections/ConvergeBio/protein-database)** — see also [UniRef90](https://huggingface.co/datasets/ConvergeBio/uniref90), [UniRef50](https://huggingface.co/datasets/ConvergeBio/uniref50), and [UniClust30](https://huggingface.co/datasets/ConvergeBio/uniclust30).
## Dataset Summary
| | |
|---|---|
| **Clusters** | 475,217,233 |
| **Shards** | 970 |
| **Compressed size** | ~133 GiB (about 142.8 GB; zstd) |
| **Sequence lengths** | 2 – 49,499 aa (median 311, mean 392) |
| **Members per cluster** | 1 – 15,375 (median 1, mean 1.1) |
| **GO annotation coverage** | MF 38.5% · BP 26.1% · CC 25.1% |
| **Updated range** | 2006-10-31 to 2026-01-28 |
## Schema
Each row represents one UniRef100 cluster with its representative sequence and metadata.
| Column | Type | Description |
|--------|------|-------------|
| `id` | `string` | Cluster identifier (e.g. `UniRef100_P12345`) |
| `name` | `string` | Cluster name from UniProt |
| `updated` | `string` | Last update date (`YYYY-MM-DD`) |
| `member_count` | `int32` | Number of sequences in the cluster |
| `common_taxon` | `string` | Lowest common taxon across members |
| `common_taxon_id` | `int32` | NCBI Taxonomy ID of common taxon |
| `seed_id` | `string` | ID of the seed sequence |
| `go_mf` | `list<string>` | GO Molecular Function terms (`GO:XXXXXXX`) |
| `go_bp` | `list<string>` | GO Biological Process terms |
| `go_cc` | `list<string>` | GO Cellular Component terms |
| `member_ids` | `list<string>` | All member sequence IDs |
| `rep_member_id` | `string` | Representative member ID |
| `rep_member_id_type` | `string` | ID type (e.g. `UniProtKB ID`, `UniParc ID`) |
| `rep_organism` | `string` | Source organism of representative |
| `rep_organism_tax_id` | `int32` | NCBI Taxonomy ID of representative organism |
| `rep_protein_name` | `string` | Protein name of representative |
| `rep_accessions` | `list<string>` | UniProtKB accessions of representative |
| `rep_uniparc_id` | `string` | UniParc ID of representative |
| `rep_uniref50_id` | `string` | Parent UniRef50 cluster ID |
| `rep_uniref90_id` | `string` | Parent UniRef90 cluster ID |
| `rep_is_seed` | `bool` | Whether the representative is the seed sequence |
| `sequence` | `large_string` | Representative protein sequence (uppercase amino acid alphabet) |
| `sequence_length` | `int32` | Length of the sequence in residues |
| `sequence_crc64` | `string` | CRC64 checksum from UniProt (hex) |
| `sequence_xxh128` | `string` | xxHash-128 of the sequence (hex, computed at build time) |
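
The column descriptions above imply a few per-row invariants that can be checked cheaply. The following is a minimal sketch, not the project's own validator: `check_row` is a hypothetical helper, and it assumes that rows arrive as plain dicts keyed by the column names in the table and that GO columns may be empty or `None` for unannotated clusters.

```python
import re

GO_TERM = re.compile(r"GO:\d{7}")          # format given in the schema table
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")    # YYYY-MM-DD
HEX = re.compile(r"[0-9A-Fa-f]+")          # checksums are described as hex


def check_row(row):
    """Return a list of problems found in one row (empty list if none)."""
    problems = []
    if not row["id"].startswith("UniRef100_"):
        problems.append("id lacks the UniRef100_ prefix")
    if not DATE.fullmatch(row["updated"]):
        problems.append("updated is not YYYY-MM-DD")

    seq = row["sequence"]
    if not seq or not seq.isalpha() or not seq.isupper():
        problems.append("sequence is empty or not uppercase letters")
    if len(seq) != row["sequence_length"]:
        problems.append("sequence_length does not match len(sequence)")

    # Assumption: unannotated clusters carry an empty list or None.
    for col in ("go_mf", "go_bp", "go_cc"):
        for term in row[col] or []:
            if not GO_TERM.fullmatch(term):
                problems.append(f"{col} contains malformed term {term!r}")

    for col in ("sequence_crc64", "sequence_xxh128"):
        if not HEX.fullmatch(row[col]):
            problems.append(f"{col} is not a hex string")
    return problems
```

Calling `check_row(row)` on rows drawn from a streamed split gives a quick sanity check before a long training or indexing job.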
## Usage
```python
from datasets import load_dataset

# Stream without downloading everything
ds = load_dataset("ConvergeBio/uniref100", streaming=True)
for row in ds["train"]:
    print(row["id"], row["sequence_length"])
    break

# Or load fully (downloads all 970 shards)
ds = load_dataset("ConvergeBio/uniref100")
```
```
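
A common next step is to filter the stream and export sequences as FASTA. The sketch below is illustrative only: `iter_filtered` and `to_fasta` are hypothetical helpers that work on any iterable of row dicts, and the FASTA header is a simplified `id name` line rather than the full header format used in UniProt's own FASTA releases.

```python
def iter_filtered(rows, max_length=1024, taxon_id=None):
    """Yield rows whose sequence fits max_length and, optionally, one taxon."""
    for row in rows:
        if row["sequence_length"] > max_length:
            continue
        if taxon_id is not None and row["rep_organism_tax_id"] != taxon_id:
            continue
        yield row


def to_fasta(row, width=60):
    """Format one row as a FASTA record with lines wrapped at `width` residues."""
    header = f">{row['id']} {row['name']}"
    seq = row["sequence"]
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header, *lines]) + "\n"
```

With the streaming dataset from the snippet above, `iter_filtered(ds["train"], max_length=512, taxon_id=9606)` would lazily yield clusters whose representative comes from human (NCBI taxon 9606), and each row can then be passed to `to_fasta` and written to a file.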
## Data Processing
- **Source:** `uniref100.xml.gz` from the [UniProt FTP](https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/)
- **Parsing:** Streaming XML parse with `lxml.etree.iterparse`, multi-process for throughput
- **Integrity:** xxHash-128 computed per sequence; CRC64 preserved from source XML
- **Validation:** Passed all tiers — schema conformance, zero null/empty sequences, xxHash roundtrip, CRC64 format, GO term format, member ID consistency, and field-by-field comparison against source XML
- **Format:** Sharded Parquet with zstd compression
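
The streaming parse step can be sketched as follows. This is not the actual build pipeline: it swaps in the standard library's `xml.etree.ElementTree.iterparse` in place of `lxml`, runs in a single process, and extracts only a handful of fields. The namespace, element names, and attribute names (`entry`, `property`, `representativeMember`, `sequence`, `checksum`) are assumptions about the UniRef XML layout and should be checked against the real file; `iter_entries` is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://uniprot.org/uniref}"  # assumed namespace of the UniRef XML


def iter_entries(stream):
    """Yield one dict per <entry> element, parsed incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != NS + "entry":
            continue
        seq = elem.find(f"{NS}representativeMember/{NS}sequence")
        if seq is None:
            elem.clear()
            continue
        # Entry-level <property type=... value=...> children (assumed layout).
        props = {p.get("type"): p.get("value")
                 for p in elem.findall(NS + "property")}
        # Sequence text may be wrapped across lines; strip all whitespace.
        sequence = "".join((seq.text or "").split())
        yield {
            "id": elem.get("id"),
            "updated": elem.get("updated"),
            "member_count": int(props.get("member count", 0)),
            "sequence": sequence,
            "sequence_length": len(sequence),
            "sequence_crc64": seq.get("checksum"),
        }
        elem.clear()  # drop the entry's children and text once it is consumed
```

In practice the stream would be something like `gzip.open("uniref100.xml.gz", "rb")`. A production version would also need to detach processed entries from the root element, since cleared elements otherwise accumulate over hundreds of millions of entries.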
## Source & Citation
UniRef is produced by the [UniProt Consortium](https://www.uniprot.org/):
> Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, UniProt Consortium.
> "UniRef clusters: a comprehensive and scalable alternative for improving
> sequence similarity searches." *Bioinformatics* 31(6):926–932 (2015).
> [doi:10.1093/bioinformatics/btu739](https://doi.org/10.1093/bioinformatics/btu739)
## About
Built by [Converge Bio](https://converge-bio.com) — accelerating drug discovery with generative AI. Converge Bio develops foundation models for protein engineering, antibody design, and gene expression optimization, powering its computational lab products ConvergeAB, ConvergeGEO, and ConvergeCELL.
## License
UniProt data is available under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
Provided by:
ConvergeBio



