timodonnell/afdb-1.6M

Name: timodonnell/afdb-1.6M
Creator: timodonnell
Published: 2026-03-19 15:15:39
License: cc-by-4.0 (as declared in the dataset card)

Hugging Face | Updated 2026-03-19 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/timodonnell/afdb-1.6M

Description:

---
license: cc-by-4.0
task_categories:
- text-generation
language:
- en
tags:
- protein-structure
- alphafold
- structural-biology
size_categories:
- 1M<n<10M
---

# AFDB-1.6M: One Representative Structure Per Structural Cluster

A deduplicated subset of [AFDB-24M](https://huggingface.co/datasets/timodonnell/afdb-24M), containing approximately 1.6 million AlphaFold Database predicted protein structures, one per structural cluster.

## How This Dataset Was Created

This dataset was derived from [AFDB-24M](https://huggingface.co/datasets/timodonnell/afdb-24M) using the following procedure:

1. All ~24 million rows across 12,005 shards were scanned.
2. Rows were grouped by `struct_cluster_id` (structural cluster representative from AFDB Foldseek clustering).
3. For each unique `struct_cluster_id`, the single row with the **highest `global_plddt`** (global mean pLDDT confidence score) was selected.
4. The selected rows were written into new Parquet shards (2,000 rows each, ZSTD level 12 compression).

This yields approximately 1.6 million entries, one high-confidence representative per 3D structural fold cluster.

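The selection in steps 2 and 3 can be illustrated with a minimal single-pass sketch. This is not the script used to build the dataset. The function name `pick_cluster_representatives`, the toy entry IDs, and the tie-breaking rule (the first row seen wins) are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def pick_cluster_representatives(rows: Iterable[dict]) -> List[dict]:
    """Keep, for each struct_cluster_id, the row with the highest global_plddt.

    Rows are consumed one at a time, so they could be streamed shard by shard.
    Ties keep the first row seen (an assumption, not documented behavior).
    """
    best: Dict[str, dict] = {}
    for row in rows:
        key = row["struct_cluster_id"]
        current = best.get(key)
        if current is None or row["global_plddt"] > current["global_plddt"]:
            best[key] = row
    return list(best.values())


# Hypothetical toy rows: two structural clusters with two members each.
toy_rows = [
    {"entry_id": "AF-P1-F1", "struct_cluster_id": "AF-P1-F1", "global_plddt": 82.5},
    {"entry_id": "AF-P2-F1", "struct_cluster_id": "AF-P1-F1", "global_plddt": 91.0},
    {"entry_id": "AF-P3-F1", "struct_cluster_id": "AF-P3-F1", "global_plddt": 77.25},
    {"entry_id": "AF-P4-F1", "struct_cluster_id": "AF-P3-F1", "global_plddt": 74.0},
]

reps = pick_cluster_representatives(toy_rows)
print([r["entry_id"] for r in reps])
```

Only the current best row per cluster is held in memory. For the full-scale scan, the `cif_content` column would likely be read in a second pass, after the winning entry IDs are known.
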
## Dataset Summary

| Property | Value |
|----------|-------|
| Source | [AFDB-24M](https://huggingface.co/datasets/timodonnell/afdb-24M) |
| Total entries | ~1.6M (one per `struct_cluster_id`) |
| Selection criterion | Highest `global_plddt` per structural cluster |
| Format | Apache Parquet, ZSTD compressed (level 12) |
| Splits | train (98%), val (1%), test (1%), inherited from AFDB-24M |

## Schema

Each Parquet file contains a flat table with the following columns (same schema as AFDB-24M):

| Column | Type | Description |
|--------|------|-------------|
| `entry_id` | `string` | AFDB entry ID (e.g., `AF-A0A1C0V126-F1`) |
| `uniprot_accession` | `string` | UniProt accession (e.g., `A0A1C0V126`) |
| `tax_id` | `int64` | NCBI taxonomy ID |
| `organism_name` | `string` | Scientific name of the organism |
| `global_plddt` | `float32` | Global mean pLDDT confidence score (70–100) |
| `seq_len` | `int32` | Sequence length in residues |
| `seq_cluster_id` | `string` | AFDB50 sequence cluster representative entry ID |
| `struct_cluster_id` | `string` | Structural cluster representative entry ID |
| `split` | `string` | `train`, `val`, or `test` |
| `gcs_uri` | `string` | Original GCS URI |
| `cif_content` | `string` | Complete raw mmCIF file text |

## Usage

### Loading with PyArrow

```python
import pyarrow.parquet as pq

table = pq.read_table("shard_000000.parquet")
print(table.schema)
print(f"{len(table)} rows")
```

### Loading with Pandas

```python
import pandas as pd

df = pd.read_parquet("shard_000000.parquet")
print(df[["entry_id", "organism_name", "global_plddt", "seq_len", "split"]].head())
```

### Parsing Structures with Gemmi

```python
import gemmi
import pyarrow.parquet as pq

# Load one shard and take the mmCIF text of its first row
table = pq.read_table("shard_000000.parquet")
cif_text = table.column("cif_content")[0].as_py()

# Parse the mmCIF text and build a structure from its single data block
doc = gemmi.cif.read_string(cif_text)
structure = gemmi.make_structure_from_block(doc.sole_block())
model = structure[0]  # first model
chain = model[0]  # first chain in that model
print(f"{len(chain)} residues")
```

## Data Source and License

- **AlphaFold Database** structures are provided by DeepMind and EMBL-EBI under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
- **Cluster files** are from the [Steinegger lab](https://afdb-cluster.steineggerlab.workers.dev/), based on Foldseek clustering of AFDB v4 (Version 3 clusters).

### Citation

If you use this dataset, please cite the AlphaFold Database:

```bibtex
@article{varadi2022alphafold,
  title={AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models},
  author={Varadi, Mihaly and Anyango, Stephen and Deshpande, Mandar and others},
  journal={Nucleic Acids Research},
  volume={50},
  number={D1},
  pages={D439--D444},
  year={2022},
  doi={10.1093/nar/gkab1061}
}
```

And the AFDB cluster resource:

```bibtex
@article{barrio2024clustering,
  title={Clustering predicted structures at the scale of the known protein universe},
  author={Barrio-Hernandez, Inigo and Yeo, Jimin and Jänes, Jürgen and others},
  journal={Nature},
  volume={622},
  pages={637--645},
  year={2023},
  doi={10.1038/s41586-023-06510-w}
}
```

Provided by:

timodonnell
