ConvergeBio/uniclust30

Name: ConvergeBio/uniclust30
Creator: ConvergeBio
Published: 2026-03-30 14:19:08
License: CC BY-SA 4.0 (per the dataset card)

Source: Hugging Face · Updated 2026-03-30 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/ConvergeBio/uniclust30


Description:

---
language: en
license: cc-by-sa-4.0
tags:
- biology
- protein
- protein-sequences
- uniclust
- uniclust30
- uniref30
- msa
- multiple-sequence-alignment
- proteomics
- bioinformatics
pretty_name: UniClust30 (UniRef30)
size_categories:
- 10M<n<100M
task_categories:
- feature-extraction
configs:
- config_name: default
  data_files:
  - split: train
    path: "train-*.parquet"
dataset_info:
  features:
  - name: cluster_id
    dtype: string
  - name: representative_id
    dtype: string
  - name: sequence
    dtype: large_string
  - name: sequence_length
    dtype: int32
  - name: sequence_xxh128
    dtype: string
  - name: num_aligned
    dtype: int32
  - name: a3m
    dtype: large_string
  - name: member_count
    dtype: int32
  - name: member_ids
    sequence: string
  splits:
  - name: train
    num_examples: 36293491
---

# UniClust30 (UniRef30)

Complete [UniClust30 / UniRef30](https://uniclust.mmseqs.com/) dataset (release 2023_02) from the Söding Lab, converted from HH-suite A3M format to sharded Parquet. UniClust30 clusters UniProt sequences at 30% identity and includes precomputed multiple sequence alignments (MSAs), which are widely used as input for protein structure prediction (AlphaFold, ColabFold) and protein language model pretraining.

**Part of the [ConvergeBio Protein Database Collection](https://huggingface.co/collections/ConvergeBio/protein-database).** See also [UniRef50](https://huggingface.co/datasets/ConvergeBio/uniref50), [UniRef90](https://huggingface.co/datasets/ConvergeBio/uniref90), and [UniRef100](https://huggingface.co/datasets/ConvergeBio/uniref100).

## Dataset Summary

| Property | Value |
|---|---|
| **Clusters** | 36,293,491 |
| **Shards** | 629 |
| **Release** | 2023_02 |
| **Includes** | Precomputed A3M multiple sequence alignments per cluster |

## Schema

Each row represents one UniClust30 cluster with its representative sequence, MSA, and membership information.

| Column | Type | Description |
|--------|------|-------------|
| `cluster_id` | `string` | Cluster identifier (UniRef30 accession) |
| `representative_id` | `string` | UniProt accession of the representative sequence |
| `sequence` | `large_string` | Representative protein sequence (uppercase amino acid alphabet) |
| `sequence_length` | `int32` | Length of the representative sequence in residues |
| `sequence_xxh128` | `string` | xxHash-128 of the sequence (hex, computed at build time) |
| `num_aligned` | `int32` | Number of sequences in the A3M multiple sequence alignment |
| `a3m` | `large_string` | Full A3M-formatted MSA for the cluster |
| `member_count` | `int32` | Number of cluster members (from mapping file) |
| `member_ids` | `list<string>` | All member UniProt accessions |

## Usage

```python
from datasets import load_dataset

# Stream without downloading everything
ds = load_dataset("ConvergeBio/uniclust30", streaming=True)
for row in ds["train"]:
    print(row["cluster_id"], row["sequence_length"], row["num_aligned"])
    break

# Or load fully
ds = load_dataset("ConvergeBio/uniclust30")
```

## Data Processing

- **Source:** HH-suite ffindex/ffdata A3M database and `uniref_mapping.tsv.gz` from [uniclust.mmseqs.com](https://uniclust.mmseqs.com/)
- **Parsing:** Direct ffindex/ffdata binary reads; membership from mapping TSV
- **Integrity:** xxHash-128 computed per sequence; A3M representative sequence verified against extracted sequence
- **Validation:** Passed all tiers: schema conformance, zero null/empty sequences, xxHash roundtrip, A3M format checks, A3M–sequence consistency, member ID mapping verification, and field-by-field comparison against source ffindex/ffdata
- **Format:** Sharded Parquet with zstd compression

## Source & Citation

UniClust30 is produced by the [Söding Lab](https://www.mpinat.mpg.de/soeding):

> Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M.
> "Uniclust databases of clustered and deeply annotated protein sequences and
> alignments." *Nucleic Acids Res.* 45(D1):D170–D176 (2017).
> [doi:10.1093/nar/gkw1081](https://doi.org/10.1093/nar/gkw1081)

## About

Built by [Converge Bio](https://converge-bio.com), accelerating drug discovery with generative AI. Converge Bio develops foundation models for protein engineering, antibody design, and gene expression optimization, powering its computational lab products ConvergeAB, ConvergeGEO, and ConvergeCELL.

## License

UniClust30 data is available under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
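
## Example: Parsing the `a3m` Column

The `a3m` column described in the schema holds a whole alignment as one string, so most downstream uses need it split into individual records first. The following is a minimal sketch, not the dataset authors' own tooling. It assumes the standard A3M conventions (`>` header lines, lowercase letters marking insertions relative to the query, `-` marking gaps). The helper names `parse_a3m` and `to_aligned_rows` are hypothetical, and the toy alignment is made up for illustration.

```python
import string

# Translation table that deletes lowercase letters, which A3M uses for
# insertions relative to the query (assumption: standard A3M conventions).
_DROP_LOWERCASE = str.maketrans("", "", string.ascii_lowercase)


def parse_a3m(a3m_text):
    """Split an A3M string into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in a3m_text.splitlines():
        line = line.strip()
        # Skip blank lines and '#' comment/name lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_aligned_rows(records):
    """Drop lowercase insertions so every row has the query's length."""
    return [(h, s.translate(_DROP_LOWERCASE)) for h, s in records]


# Toy alignment (made up for illustration, not taken from the dataset).
example = ">query\nMKVGLA\n>hit1\nMKaVG-A\n>hit2\n-KVGLA\n"
records = parse_a3m(example)
aligned = to_aligned_rows(records)
for header, seq in aligned:
    print(header, seq)
```

With a real row from the streaming loop in the Usage section, the call would be `parse_a3m(row["a3m"])`. Comparing `len(records)` with `row["num_aligned"]` is a reasonable sanity check, though whether that count includes the representative sequence is not stated in the card and should be verified on a few rows.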

Provider:

ConvergeBio
