
ConvergeBio/uniref50

Source: Hugging Face · Updated 2026-03-30 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/ConvergeBio/uniref50
Resource description:
---
language: en
license: cc-by-4.0
tags:
- biology
- protein
- protein-sequences
- uniref
- uniref50
- proteomics
- bioinformatics
pretty_name: UniRef50
size_categories:
- 10M<n<100M
task_categories:
- feature-extraction
configs:
- config_name: default
  data_files:
  - split: train
    path: "train-*.parquet"
dataset_info:
  features:
  - name: id
    dtype: string
  - name: name
    dtype: string
  - name: updated
    dtype: string
  - name: member_count
    dtype: int32
  - name: common_taxon
    dtype: string
  - name: common_taxon_id
    dtype: int32
  - name: seed_id
    dtype: string
  - name: go_mf
    sequence: string
  - name: go_bp
    sequence: string
  - name: go_cc
    sequence: string
  - name: member_ids
    sequence: string
  - name: rep_member_id
    dtype: string
  - name: rep_member_id_type
    dtype: string
  - name: rep_organism
    dtype: string
  - name: rep_organism_tax_id
    dtype: int32
  - name: rep_protein_name
    dtype: string
  - name: rep_accessions
    sequence: string
  - name: rep_uniparc_id
    dtype: string
  - name: rep_uniref90_id
    dtype: string
  - name: rep_uniref100_id
    dtype: string
  - name: rep_is_seed
    dtype: bool
  - name: sequence
    dtype: large_string
  - name: sequence_length
    dtype: int32
  - name: sequence_crc64
    dtype: string
  - name: sequence_xxh128
    dtype: string
  splits:
  - name: train
    num_examples: 60315044
  download_size: 18113417832
---

# UniRef50

Complete [UniRef50](https://www.uniprot.org/uniref?query=identity:0.5) dataset from UniProt, converted from XML to sharded Parquet. UniRef50 clusters sequences at 50% identity, providing the most aggressively deduplicated UniRef tier, ideal for training protein language models and building diverse, non-redundant sequence sets.

**Part of the [ConvergeBio Protein Database Collection](https://huggingface.co/collections/ConvergeBio/protein-database)**; see also [UniRef90](https://huggingface.co/datasets/ConvergeBio/uniref90), [UniRef100](https://huggingface.co/datasets/ConvergeBio/uniref100), and [UniClust30](https://huggingface.co/datasets/ConvergeBio/uniclust30).

## Dataset Summary

| Statistic | Value |
|---|---|
| **Clusters** | 60,315,044 |
| **Shards** | 130 |
| **Compressed size** | ~17 GB (zstd) |
| **Sequence lengths** | 11 – 49,499 aa (median 189, mean 287) |
| **Members per cluster** | 1 – 321,476 (median 1, mean 8.7) |
| **GO annotation coverage** | MF 18.6% · BP 12.0% · CC 12.4% |
| **Updated range** | 2006-10-31 to 2026-01-28 |

## Schema

Each row represents one UniRef50 cluster with its representative sequence and metadata.

| Column | Type | Description |
|--------|------|-------------|
| `id` | `string` | Cluster identifier (e.g. `UniRef50_P12345`) |
| `name` | `string` | Cluster name from UniProt |
| `updated` | `string` | Last update date (`YYYY-MM-DD`) |
| `member_count` | `int32` | Number of sequences in the cluster |
| `common_taxon` | `string` | Lowest common taxon across members |
| `common_taxon_id` | `int32` | NCBI Taxonomy ID of common taxon |
| `seed_id` | `string` | ID of the seed sequence |
| `go_mf` | `list<string>` | GO Molecular Function terms (`GO:XXXXXXX`) |
| `go_bp` | `list<string>` | GO Biological Process terms |
| `go_cc` | `list<string>` | GO Cellular Component terms |
| `member_ids` | `list<string>` | All member sequence IDs |
| `rep_member_id` | `string` | Representative member ID |
| `rep_member_id_type` | `string` | ID type (e.g. `UniProtKB ID`, `UniParc ID`) |
| `rep_organism` | `string` | Source organism of representative |
| `rep_organism_tax_id` | `int32` | NCBI Taxonomy ID of representative organism |
| `rep_protein_name` | `string` | Protein name of representative |
| `rep_accessions` | `list<string>` | UniProtKB accessions of representative |
| `rep_uniparc_id` | `string` | UniParc ID of representative |
| `rep_uniref90_id` | `string` | Child UniRef90 cluster ID |
| `rep_uniref100_id` | `string` | Child UniRef100 cluster ID |
| `rep_is_seed` | `bool` | Whether the representative is the seed sequence |
| `sequence` | `large_string` | Representative protein sequence (uppercase amino acid alphabet) |
| `sequence_length` | `int32` | Length of the sequence in residues |
| `sequence_crc64` | `string` | CRC64 checksum from UniProt (hex) |
| `sequence_xxh128` | `string` | xxHash-128 of the sequence (hex, computed at build time) |

## Usage

```python
from datasets import load_dataset

# Stream without downloading everything
ds = load_dataset("ConvergeBio/uniref50", streaming=True)
for row in ds["train"]:
    print(row["id"], row["sequence_length"])
    break

# Or load fully
ds = load_dataset("ConvergeBio/uniref50")
```

## Data Processing

- **Source:** `uniref50.xml.gz` from the [UniProt FTP](https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/)
- **Parsing:** Streaming XML parse with `lxml.etree.iterparse`, multi-process for throughput
- **Integrity:** xxHash-128 computed per sequence; CRC64 preserved from source XML
- **Validation:** Passed all tiers: schema conformance, zero null/empty sequences, xxHash roundtrip, CRC64 format, GO term format, member ID consistency, and field-by-field comparison against source XML
- **Format:** Sharded Parquet with zstd compression

## Source & Citation

UniRef is produced by the [UniProt Consortium](https://www.uniprot.org/):

> Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, UniProt Consortium.
> "UniRef clusters: a comprehensive and scalable alternative for improving
> sequence similarity searches." *Bioinformatics* 31(6):926–932 (2015).
> [doi:10.1093/bioinformatics/btu739](https://doi.org/10.1093/bioinformatics/btu739)

## About

Built by [Converge Bio](https://converge-bio.com), accelerating drug discovery with generative AI. Converge Bio develops foundation models for protein engineering, antibody design, and gene expression optimization, powering its computational lab products ConvergeAB, ConvergeGEO, and ConvergeCELL.

## License

UniProt data is available under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
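
## Example: Filtering Streamed Rows

The schema and streaming usage described above could be combined roughly as follows to select representative sequences, for instance when assembling a training set. This is a minimal sketch, not part of the dataset's build pipeline: the helper names (`is_self_consistent`, `keep_row`, `stream_filtered`), the length thresholds, and the example row are all illustrative assumptions. The consistency check also assumes that `sequence_length` equals `len(sequence)` and that every `id` starts with `UniRef50_`, as the column descriptions suggest.

```python
from typing import Any, Dict, Iterator

Row = Dict[str, Any]


def is_self_consistent(row: Row) -> bool:
    """Cheap per-row checks inferred from the schema descriptions (assumptions)."""
    seq = row["sequence"]
    return (
        row["id"].startswith("UniRef50_")
        and bool(seq)
        and row["sequence_length"] == len(seq)
        and row["member_count"] >= 1
    )


def keep_row(
    row: Row,
    min_len: int = 50,      # hypothetical lower bound, tune per use case
    max_len: int = 1024,    # hypothetical upper bound, e.g. a model context limit
    require_go_mf: bool = False,
) -> bool:
    """Return True if the row passes the length and optional GO-MF filters."""
    if not min_len <= row["sequence_length"] <= max_len:
        return False
    # go_mf is a list of GO terms; treat a missing or empty list as unannotated.
    if require_go_mf and not row.get("go_mf"):
        return False
    return True


def stream_filtered(limit: int = 5) -> Iterator[Row]:
    """Yield up to `limit` filtered rows without downloading the full dataset."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset("ConvergeBio/uniref50", split="train", streaming=True)
    yield from ds.filter(keep_row).take(limit)


# Hypothetical row with only the fields the helpers read; values are made up.
example_row: Row = {
    "id": "UniRef50_P12345",
    "sequence": "MKTAYIAKQRQI",
    "sequence_length": 12,
    "member_count": 1,
    "go_mf": [],
}

if __name__ == "__main__":
    print(is_self_consistent(example_row))
    print(keep_row(example_row, min_len=10, max_len=20))
```

Calling `stream_filtered()` requires network access and the `datasets` package; the two predicate helpers are plain functions over row dictionaries, so they can be tested offline or reused with a fully loaded dataset.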

Provider:
ConvergeBio